The Genesis of OmniSight AI: Bringing Reasoning to the Visual World
The Inspiration
The inspiration for OmniSight AI came from a simple observation of modern industrial environments: we have more cameras than ever, yet safety monitoring remains a reactive, human-dependent task. Traditional computer vision can tell you that an object exists, but it cannot explain why that object poses a risk in a specific context. When Google announced the Gemini 3 API with its advanced "Thinking Mode" and multimodal reasoning, I saw an opportunity to move from simple detection to true spatial intelligence.
How I Built It
OmniSight AI was built using a "Vibe Coding" approach, starting with rapid prototyping in Google AI Studio. The architecture is designed to leverage the modularity of the Gemini 3 Pro model:
- The Reasoning Core: I utilized Gemini 3's Thinking Mode (High) to act as the primary logic layer. This allows the system to process a video frame and "deliberate" on the scene before providing an output.
- Multimodal Tooling: I integrated the Code Execution tool to handle spatial mathematics. For example, if the AI detects a forklift too close to a pedestrian, it calculates the estimated distance between them and renders a risk-level chart using Matplotlib.
- Real-World Grounding: To solve the problem of hallucinated regulations, I used Google Search Grounding to pull live safety standards (like ISO or OSHA) based on the specific equipment identified in the image.
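To make the spatial-math step in the Multimodal Tooling bullet concrete, here is a minimal sketch of the kind of calculation the Code Execution tool could run. It assumes both objects have already been located on the warehouse floor in metres and uses a plain Euclidean distance; the `Point` type, the `risk_level` helper, and the 3 m / 6 m thresholds are hypothetical placeholders, not values from the project or from any safety standard.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A position on the warehouse floor, in metres (assumed units)."""
    x: float
    y: float


def euclidean_distance(a: Point, b: Point) -> float:
    """Straight-line distance between two floor positions."""
    return math.hypot(b.x - a.x, b.y - a.y)


def risk_level(distance_m: float, danger_m: float = 3.0, caution_m: float = 6.0) -> str:
    """Map a distance to a coarse risk label.

    The thresholds are illustrative assumptions only.
    """
    if distance_m < danger_m:
        return "HIGH"
    if distance_m < caution_m:
        return "MEDIUM"
    return "LOW"


forklift = Point(0.0, 0.0)
pedestrian = Point(3.0, 4.0)
distance = euclidean_distance(forklift, pedestrian)
print(f"{distance:.1f} m -> {risk_level(distance)}")
```

The label returned by `risk_level` is the sort of value that could then be plotted per frame in a Matplotlib chart.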
What I Learned
Building this project taught me that the future of AI isn't just about faster chat; it's about agency. I learned how to orchestrate Context Caching to keep facility blueprints in the model's "short-term memory," which reduced latency and made the application feel like a real-time monitor rather than a slow analysis tool.
Challenges Faced
The biggest challenge was "Visual Noise." In a busy warehouse, thousands of objects move at once. Initially, the model would trigger too many alerts. I solved this by refining the System Instructions to prioritize "high-consequence anomalies." I also had to navigate the complexity of mapping 2D image coordinates to 3D space for accurate distance calculations, which required iterative prompting and fine-tuning the Code Execution logic.
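One common way to approach the 2D-to-3D mapping is to project each pixel onto a flat floor with a pinhole camera model. The sketch below illustrates that general technique and is not necessarily the logic used in OmniSight AI. It assumes a flat floor, a camera with no roll or lens distortion, and known values for focal length, principal point, mounting height, and downward pitch; the function name and every parameter value are hypothetical.

```python
import math


def pixel_to_ground(u, v, *, focal_px, cx, cy, cam_height_m, pitch_rad):
    """Project image pixel (u, v) onto the floor plane.

    Returns (x, y) in metres, where x is to the camera's right and y is
    the horizontal distance in front of the camera. Assumes a pinhole
    camera mounted cam_height_m above a flat floor and tilted down by
    pitch_rad, with image v increasing downward.
    """
    # Normalised ray direction in camera coordinates (right, down, forward = 1).
    x_c = (u - cx) / focal_px
    y_c = (v - cy) / focal_px

    # Downward component of the ray in world coordinates; it must be
    # positive for the ray to hit the floor.
    down = y_c * math.cos(pitch_rad) + math.sin(pitch_rad)
    if down <= 1e-9:
        raise ValueError("pixel is at or above the horizon; ray never reaches the floor")

    # Scale the ray so that it descends exactly cam_height_m.
    t = cam_height_m / down
    ground_x = t * x_c
    ground_y = t * (math.cos(pitch_rad) - y_c * math.sin(pitch_rad))
    return ground_x, ground_y


# Hypothetical 1280x720 camera, 3 m above the floor, tilted 45 degrees down.
camera = dict(focal_px=1000.0, cx=640.0, cy=360.0,
              cam_height_m=3.0, pitch_rad=math.radians(45))

forklift_xy = pixel_to_ground(640, 360, **camera)
pedestrian_xy = pixel_to_ground(840, 360, **camera)
separation_m = math.dist(forklift_xy, pedestrian_xy)
print(f"estimated separation: {separation_m:.2f} m")
```

Working with floor coordinates instead of raw pixels keeps distance estimates comparable across the frame, since objects farther from the camera cover fewer pixels per metre.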