ARgus

Inspiration

Workplace safety inspections are still done with clipboards and cameras: slow, manual, and reactive. We asked: what if an AI agent could watch alongside you in real time and speak up when it spots danger? With Gemini's new Live API enabling persistent bidirectional audio/video streaming, we saw the chance to build something that wasn't possible before.

What it does

ARgus is a real-time AI safety inspection agent powered by Gemini Live. Point any camera at a workspace and ARgus continuously analyzes the scene, announces hazards aloud, overlays AR annotations, and tracks incidents over time — all hands-free via voice commands.

Key capabilities:

Persistent bidirectional streaming — JPEG frames at 1 fps + PCM audio at 16 kHz flow to Gemini Live, with sub-second spoken responses back
20 industry inspection modules — construction, healthcare, kitchen, electrical, warehouse, and 15 more — hot-swappable mid-session by voice
Three adaptive UI modes — Smartphone (full-screen camera), CCTV (multi-feed grid with dashboard), and AR/Headset (near-invisible, pure voice control)
Temporal reasoning engine — SPRT-based statistical tracking across frames: hazards are confirmed, escalated, or auto-resolved over time, not just one-shot detections
Voice-first UX — say "argus" to activate, ask questions, switch modules, generate reports, all without touching the screen
Multi-format report export — PDF, Word, JSON, CSV, HTML with structured findings, timestamps, and rule citations
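
To make the temporal reasoning idea concrete, here is a minimal sketch of a sequential probability ratio test (SPRT) applied to per-frame detections. It is an illustration under assumptions, not the ARgus implementation: the type and function names are hypothetical, and the rates (a 20% false-alarm rate, an 80% hit rate, 5% error targets) are example values rather than the project's tuning.

```go
package main

import (
	"fmt"
	"math"
)

// Decision is the outcome of the test after each observed frame.
type Decision string

const (
	Pending   Decision = "pending"
	Confirmed Decision = "confirmed"
	Dismissed Decision = "dismissed"
)

// SPRT accumulates a log-likelihood ratio for "hazard present" (H1)
// against "hazard absent" (H0) across frames.
type SPRT struct {
	llr      float64 // running log-likelihood ratio
	stepHit  float64 // added when the frame reports the hazard
	stepMiss float64 // added when the frame does not report it
	upper    float64 // accept H1 at or above this value
	lower    float64 // accept H0 at or below this value
}

// NewSPRT builds a test from assumed per-frame detection rates:
// p0 = P(detect | no hazard), p1 = P(detect | hazard),
// alpha = tolerated false-confirm rate, beta = tolerated miss rate.
func NewSPRT(p0, p1, alpha, beta float64) *SPRT {
	return &SPRT{
		stepHit:  math.Log(p1 / p0),
		stepMiss: math.Log((1 - p1) / (1 - p0)),
		upper:    math.Log((1 - beta) / alpha),
		lower:    math.Log(beta / (1 - alpha)),
	}
}

// Observe folds one frame's detection result into the running ratio
// and reports whether either threshold has been crossed.
func (s *SPRT) Observe(detected bool) Decision {
	if detected {
		s.llr += s.stepHit
	} else {
		s.llr += s.stepMiss
	}
	switch {
	case s.llr >= s.upper:
		return Confirmed
	case s.llr <= s.lower:
		return Dismissed
	default:
		return Pending
	}
}

func main() {
	t := NewSPRT(0.2, 0.8, 0.05, 0.05)
	for i, hit := range []bool{true, true, true} {
		fmt.Printf("frame %d: %s\n", i+1, t.Observe(hit))
	}
}
```

With these example rates a single detection never raises an alert on its own: the third consecutive hit crosses the upper threshold, while one hit followed by a run of misses drifts down to dismissal. A caller would typically reset or retire the tracker once a decision is reached.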

How we built it

Backend (Go): A WebSocket server on Cloud Run orchestrates the Gemini Live session. An agent controller manages the vision pipeline (frame sampling/buffering), rule engine (module loading), temporal engine (SPRT confidence accumulation with incident lifecycle), and report builders. We use the official google.golang.org/genai SDK for bidirectional streaming with gemini-2.5-flash-native-audio.
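
As a rough illustration of the frame sampling/buffering step, the sketch below throttles incoming frames to a fixed interval and keeps only the most recent few. All names here (Frame, FrameSampler, Offer, Recent) are hypothetical, timestamps are plain milliseconds to keep the example self-contained, and the real pipeline may be structured quite differently.

```go
package main

import "fmt"

// Frame is one captured image with its capture time in milliseconds.
type Frame struct {
	AtMs int64
	JPEG []byte
}

// FrameSampler accepts at most one frame per interval and retains
// the newest `capacity` accepted frames.
type FrameSampler struct {
	intervalMs int64
	lastMs     int64
	seen       bool
	capacity   int
	buf        []Frame
}

// NewFrameSampler expects capacity >= 1.
func NewFrameSampler(intervalMs int64, capacity int) *FrameSampler {
	return &FrameSampler{intervalMs: intervalMs, capacity: capacity}
}

// Offer returns true if the frame was accepted, false if it arrived
// too soon after the previously accepted frame.
func (s *FrameSampler) Offer(f Frame) bool {
	if s.seen && f.AtMs-s.lastMs < s.intervalMs {
		return false
	}
	s.seen = true
	s.lastMs = f.AtMs
	if len(s.buf) == s.capacity {
		// Buffer is full: shift left to drop the oldest frame.
		copy(s.buf, s.buf[1:])
		s.buf = s.buf[:len(s.buf)-1]
	}
	s.buf = append(s.buf, f)
	return true
}

// Recent returns a copy of the buffered frames, oldest first.
func (s *FrameSampler) Recent() []Frame {
	return append([]Frame(nil), s.buf...)
}

func main() {
	s := NewFrameSampler(1000, 2) // 1 fps, keep the last two frames
	for _, at := range []int64{0, 300, 1000, 2000} {
		fmt.Printf("t=%dms accepted=%v\n", at, s.Offer(Frame{AtMs: at}))
	}
	fmt.Println("buffered frames:", len(s.Recent()))
}
```

Dropping frames at the sampler rather than downstream keeps the upload rate predictable regardless of how fast the camera delivers images.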

Frontend (Next.js + TypeScript): A React app with Tailwind CSS that auto-detects device context. AudioWorklet captures PCM audio on a dedicated thread with voice activity detection. Glassmorphic AR overlays annotate hazards with severity-coded brackets. Theme-aware liquid glass UI adapts to dark/light mode.

Infrastructure: Cloud Run with WebSocket session affinity, Cloud Build for CI/CD, Artifact Registry for containers, and Secret Manager for API keys.

Challenges we ran into

Gemini Live session management — handling GoAway disconnects mid-inspection required serializing active incidents and re-injecting temporal context into the new session seamlessly
Audio pipeline — resampling 24 kHz Gemini PCM output to the browser's native sample rate without glitches, plus managing the interplay between native audio responses and the Web Speech API
False positive filtering — raw frame-by-frame detection produced too many alerts, which led us to build the SPRT temporal reasoning engine with confidence thresholds
Multi-device UX — making the same codebase work for phone, desktop CCTV monitoring, and AR headsets required three distinct UI paradigms sharing one state machine
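
The session handoff described in the first challenge can be sketched as a snapshot-and-restore step. This is an assumption-laden illustration: the Incident fields, the function names, and the wording of the context message are all hypothetical, and the call that actually sends the message to the new Gemini Live session is deliberately left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Incident is a hypothetical record of one tracked hazard.
type Incident struct {
	ID            string  `json:"id"`
	Hazard        string  `json:"hazard"`
	Severity      string  `json:"severity"`
	State         string  `json:"state"`
	Evidence      float64 `json:"evidence"` // accumulated confidence score
	FirstSeenUnix int64   `json:"first_seen_unix"`
}

// Snapshot serializes active incidents before the old session closes.
func Snapshot(incidents []Incident) ([]byte, error) {
	return json.Marshal(incidents)
}

// Restore rebuilds the incident list once a new session is open.
func Restore(data []byte) ([]Incident, error) {
	var out []Incident
	err := json.Unmarshal(data, &out)
	return out, err
}

// ContextPrompt renders the incidents as a short text summary that
// could be sent as the first turn of the replacement session.
func ContextPrompt(incidents []Incident) string {
	if len(incidents) == 0 {
		return "Resuming inspection. No active incidents."
	}
	var b strings.Builder
	b.WriteString("Resuming inspection. Active incidents:\n")
	for _, in := range incidents {
		fmt.Fprintf(&b, "- [%s] %s (%s)\n", in.Severity, in.Hazard, in.State)
	}
	return b.String()
}

func main() {
	active := []Incident{{
		ID: "inc-1", Hazard: "exposed wiring", Severity: "high",
		State: "confirmed", Evidence: 3.2, FirstSeenUnix: 1700000000,
	}}
	data, err := Snapshot(active)
	if err != nil {
		panic(err)
	}
	restored, err := Restore(data)
	if err != nil {
		panic(err)
	}
	fmt.Print(ContextPrompt(restored))
}
```

Keeping the snapshot as plain JSON means the same payload can feed both the new session's context and the report builders.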

What we learned

Gemini Live's native audio mode already handles interruption and turn-taking; layering custom interruption logic on top made things worse, not better
Statistical methods (SPRT) are far more effective than simple cooldown timers for streaming hazard detection
Voice-first design forces you to make every feature accessible without a screen, which paradoxically improves the visual UI too

What's next for ARgus

WebXR integration for true AR headset overlays
Multi-camera orchestration for enterprise CCTV deployments
Custom module builder so organizations can define their own inspection rulesets
Offline mode with on-device inference for areas without connectivity

Built With

audioworklet
go
google-cloud-build
google-cloud-run
google-gemini-live-api
next.js
react
speech
tailwind-css
typescript
web
websockets

Updates

Robert Georges jr started this project — Mar 15, 2026 03:02 PM EDT

