The Genesis of OmniSight AI: Bringing Reasoning to the Visual World

The Inspiration

The inspiration for OmniSight AI came from a simple observation of modern industrial environments: we have more cameras than ever, yet safety monitoring remains a reactive, human-dependent task. Traditional computer vision can tell you that an object exists, but it cannot explain why that object poses a risk in a specific context. When Google announced the Gemini 3 API with its advanced "Thinking Mode" and multimodal reasoning, I saw an opportunity to move from simple detection to true spatial intelligence.

How I Built It

OmniSight AI was built using a "Vibe Coding" approach, starting with rapid prototyping in Google AI Studio. The architecture is designed to leverage the modularity of the Gemini 3 Pro model:

  1. The Reasoning Core: I utilized Gemini 3's Thinking Mode (High) to act as the primary logic layer. This allows the system to process a video frame and "deliberate" on the scene before providing an output.
  2. Multimodal Tooling: I integrated the Code Execution tool to handle spatial mathematics. For example, if the AI detects a forklift too close to a pedestrian, it calculates the estimated distance between them and renders a risk-level chart using Matplotlib.

  3. Real-World Grounding: To solve the problem of hallucinated regulations, I used Google Search Grounding to pull live safety standards (like ISO or OSHA) based on the specific equipment identified in the image.
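
To make the spatial-math step in item 2 more concrete, here is a minimal sketch of the kind of calculation the Code Execution tool might run. It assumes the forklift and pedestrian have already been placed on a shared floor coordinate frame in metres, and it uses plain Euclidean distance; the names GroundPoint, distance_m and risk_level, and both thresholds, are hypothetical and are not taken from OmniSight AI itself.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundPoint:
    """Position on the warehouse floor in metres (hypothetical coordinate frame)."""
    x: float
    y: float


def distance_m(a: GroundPoint, b: GroundPoint) -> float:
    """Straight-line (Euclidean) distance between two floor positions."""
    return math.hypot(b.x - a.x, b.y - a.y)


def risk_level(distance: float, warn_at: float = 6.0, danger_at: float = 2.0) -> str:
    """Map a separation distance to a coarse risk label.

    The thresholds are illustrative assumptions, not values from any standard.
    """
    if distance < danger_at:
        return "high"
    if distance < warn_at:
        return "medium"
    return "low"


forklift = GroundPoint(x=4.0, y=1.0)
pedestrian = GroundPoint(x=1.0, y=5.0)
d = distance_m(forklift, pedestrian)
print(f"{d:.1f} m -> {risk_level(d)}")  # prints "5.0 m -> medium"
```

In a real deployment the thresholds would more plausibly come from the grounded safety standards described in item 3 than from hard-coded defaults.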

What I Learned

Building this project taught me that the future of AI isn't just about faster chat; it's about agency. I learned how to orchestrate Context Caching to keep facility blueprints in the model's "short-term memory," which reduced latency and made the application feel like a real-time monitor rather than a slow analysis tool.

Challenges Faced

The biggest challenge was "Visual Noise." In a busy warehouse, thousands of objects move at once. Initially, the model would trigger too many alerts. I solved this by refining the System Instructions to prioritize "high-consequence anomalies." I also had to navigate the complexity of mapping 2D image coordinates to 3D space for accurate distance calculations, which required iterative prompting and fine-tuning the Code Execution logic.
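
One common way to approach the 2D-to-3D mapping problem, when the objects of interest stand on a flat floor, is a planar homography: a 3x3 matrix that maps image pixels to floor coordinates. The sketch below assumes such a matrix is already known from calibration, which requires at least four floor points with known pixel and real-world positions. The function pixel_to_floor and the matrix H_FLOOR are hypothetical, and the matrix values are a toy example (a top-down camera at 1 cm per pixel), not OmniSight AI's actual calibration or method.

```python
import math
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def pixel_to_floor(h: Matrix3, u: float, v: float) -> Tuple[float, float]:
    """Project image pixel (u, v) onto the floor plane using homography h.

    Returns floor coordinates in whatever units h was calibrated in
    (metres in this sketch).
    """
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if abs(w) < 1e-12:
        # The pixel lies on the horizon line, so it never meets the floor.
        raise ValueError("pixel does not intersect the floor plane")
    return x / w, y / w


# Toy homography (assumption): top-down view, 0.01 m per pixel, no perspective.
H_FLOOR = (
    (0.01, 0.0, 0.0),
    (0.0, 0.01, 0.0),
    (0.0, 0.0, 1.0),
)

# Hypothetical pixel positions of the points where each object touches the floor.
forklift_xy = pixel_to_floor(H_FLOOR, 400, 100)
pedestrian_xy = pixel_to_floor(H_FLOOR, 100, 500)
separation = math.dist(forklift_xy, pedestrian_xy)
print(f"separation: {separation:.2f} m")  # prints "separation: 5.00 m"
```

The homography only holds for points on the calibrated plane, so the pixel passed in should be where an object meets the floor (for example, the bottom edge of its bounding box), not its centre.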
