Inspiration

VÖRR was inspired by a well-documented flaw in human perception known as inattentional blindness. Research shows that nearly 2.9 million workplace injuries occur annually, often because an operator "looked but didn't see": their eyes were physically fixed on a danger, but because of cognitive load or distraction, their brain failed to perceive it. We wanted to build a "Third Eye": a safety system that doesn't just watch the environment, but monitors the operator's awareness.

What it does

VÖRR is an Epistemic Safety Engine. It fuses real-time gaze tracking with multimodal AI to detect whether a worker is actually aware of the hazards around them. If VÖRR sees a mismatch, such as an operator reaching for a high-voltage line while looking at their phone, it triggers an immediate, low-latency vocal intervention. It doesn't just track safety; it monitors perception to prevent accidents before they happen.

How we built it

We built VÖRR as a high-performance "Sense-Reason-Act" loop:

Sensing: MediaPipe tracks 468 3D face landmarks to calculate a precise gaze vector.

Reasoning: The gaze data, combined with a 720p video stream, is fed into the Gemini 3.0 Multimodal Live API. The model performs "Epistemic Reasoning" by comparing the user's focus against a technical SOP (Standard Operating Procedure).

Acting: A custom state machine triggers interventions via vocal commands, delivered through low-latency audio streams.

The Epistemic Risk Equation: We model the probability of an accident $\mathcal{P}(A)$ as the divergence between the Hazard state $H$ and the operator's Perceptual Gaze $G$:

$$\mathcal{P}(A) \approx \int H(t) \, [1 - G(t)] \, dt$$
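As a concrete illustration of the sensing step, here is a minimal sketch in Python, assuming the standard MediaPipe Face Mesh API with iris refinement enabled; the landmark indices and the eye-center-to-iris simplification are illustrative choices, not the full production pipeline.

```python
import cv2
import mediapipe as mp
import numpy as np

# refine_landmarks=True adds 10 iris landmarks (indices 468-477) on top of
# the 468 base face-mesh landmarks.
mesh = mp.solutions.face_mesh.FaceMesh(refine_landmarks=True, max_num_faces=1)

def gaze_vector(frame_bgr):
    """Rough 3D gaze direction from one webcam frame, or None if no face."""
    results = mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None
    lm = results.multi_face_landmarks[0].landmark
    # Illustrative simplification: gaze ~ ray from eye center to iris center.
    # 33/133 are one eye's corners; 468 is taken as that eye's iris center.
    eye = np.array([(lm[33].x + lm[133].x) / 2,
                    (lm[33].y + lm[133].y) / 2,
                    (lm[33].z + lm[133].z) / 2])
    iris = np.array([lm[468].x, lm[468].y, lm[468].z])
    v = iris - eye
    return v / (np.linalg.norm(v) + 1e-9)
```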

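In the engine, the integral above discretizes into a per-tick accumulator that the state machine can threshold. A minimal sketch; the scoring scale and the 0.6 threshold are illustrative assumptions, not tuned production values:

```python
def epistemic_risk(hazard, gaze_on_hazard, dt):
    """One discretized term of P(A) ~ integral of H(t) * [1 - G(t)] dt.

    hazard:         H(t) in [0, 1], severity of the current hazard
    gaze_on_hazard: G(t) in [0, 1], how well the gaze covers the hazard
    dt:             seconds since the previous inference tick
    """
    return hazard * (1.0 - gaze_on_hazard) * dt

risk = 0.0
THRESHOLD = 0.6  # assumed tuning value for illustration

def tick(hazard, gaze_on_hazard, dt):
    """Accumulate risk each tick and intervene once it crosses the threshold."""
    global risk
    risk += epistemic_risk(hazard, gaze_on_hazard, dt)
    if risk >= THRESHOLD:
        risk = 0.0
        return "INTERVENE"  # e.g. fire the vocal warning
    return "OBSERVE"
```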
Challenges we ran into

The FPS Mismatch: Aligning a 30 FPS webcam feed with a 2 FPS multimodal inference cycle was tough. We solved it by decoupling the UI from the AI, using a "Visual Ghost" reticle that stays fluid while the deep reasoning happens asynchronously (see the first sketch below).

Zero-Latency Audio: Every millisecond counts when a hand is moving toward a hazard. We stripped the audio stack down to raw sound-device buffers so that Gemini's voice commands play the instant they arrive (see the second sketch below).

Hardware Simulation: Since we were building on a laptop but designing for Smart Glasses, we created a "Mirror Mode" that simulates an outward-facing world view using only a single inward-facing webcam.

Accomplishments that we're proud of

We are incredibly proud of building a system that understands human intent, not just pixels. VÖRR doesn't just see a "tool" or a "hazard"; it understands what the human believes about that tool. We achieved a reasoning latency of under 300 ms, making real-time safety interventions a reality.
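A minimal sketch of the decoupling pattern behind the "Visual Ghost": a slow inference loop writes to shared state while a fast UI loop reads whatever is freshest. The stubbed inference call, the render function, and the timings are illustrative placeholders.

```python
import asyncio
import random

latest_gaze = None  # shared: written by the slow AI loop, read by the fast UI loop

async def run_multimodal_inference():
    # Stand-in for the ~500 ms multimodal call; returns a gaze/hazard verdict.
    await asyncio.sleep(0.5)
    return random.random()

def draw_reticle(gaze):
    # Stand-in for the fluid reticle render; never blocks on the AI.
    pass

async def inference_loop():
    """~2 FPS: deep reasoning updates the shared state whenever it finishes."""
    global latest_gaze
    while True:
        latest_gaze = await run_multimodal_inference()

async def ui_loop():
    """~30 FPS: renders from the freshest state available, without waiting."""
    while True:
        draw_reticle(latest_gaze)
        await asyncio.sleep(1 / 30)

async def main():
    await asyncio.gather(inference_loop(), ui_loop())

if __name__ == "__main__":
    asyncio.run(main())
```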

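And a minimal sketch of the stripped-down audio path, assuming the sounddevice library; the 24 kHz 16-bit mono PCM format is an assumption about the voice stream, not a confirmed spec:

```python
import sounddevice as sd

SAMPLE_RATE = 24_000  # assumed PCM format of the model's voice output

# RawOutputStream writes bytes straight to the device buffer: no file
# decoding, no intermediate player, minimal added latency.
stream = sd.RawOutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16")
stream.start()

def play_chunk(pcm_bytes: bytes):
    """Push one chunk of 16-bit mono PCM to the speaker as soon as it arrives."""
    stream.write(pcm_bytes)
```

Keeping the stream open and writing raw PCM as it arrives avoids decoder and player startup costs, which is where most perceived latency hides.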
What we learned

We learned that the Multimodal Live API is the future of industrial safety. Traditional computer vision can tell you if a worker has a hard hat on, but it can't tell you if they've noticed a gas leak. Gemini 3.0 allowed us to create an "Epistemic Agent" that acts as a partner to the operator, bridging the gap between physical reality and human awareness.

What's next for VÖRR

The next step for VÖRR is hardware parity. We plan to port the engine to AR smart glasses (like the Ray-Ban Meta or HoloLens) to provide true heads-up protection. We also intend to implement a "Safety Mesh" in which multiple VÖRR units share a common map of hazards across a factory, protecting entire teams through collective epistemic intelligence.
