Inspiration

Workplace safety inspections are still done with clipboards and cameras — slow, manual, and reactive. We asked: what if an AI agent could watch alongside you in real-time and speak up when it spots danger? With Gemini's new Live API enabling persistent bidirectional audio/video streaming, we saw the chance to build something that wasn't possible before.

What it does

ARgus is a real-time AI safety inspection agent powered by Gemini Live. Point any camera at a workspace and ARgus continuously analyzes the scene, announces hazards aloud, overlays AR annotations, and tracks incidents over time — all hands-free via voice commands.

Key capabilities:

  • Persistent bidirectional streaming — JPEG frames at 1 fps + PCM audio at 16 kHz flow to Gemini Live, with sub-second spoken responses back
  • 20 industry inspection modules — construction, healthcare, kitchen, electrical, warehouse, and 15 more — hot-swappable mid-session by voice
  • Three adaptive UI modes — Smartphone (full-screen camera), CCTV (multi-feed grid with dashboard), and AR/Headset (near-invisible, pure voice control)
  • Temporal reasoning engine — SPRT-based statistical tracking across frames: hazards are confirmed, escalated, or auto-resolved over time, not just one-shot detections
  • Voice-first UX — say "argus" to activate, ask questions, switch modules, generate reports, all without touching the screen
  • Multi-format report export — PDF, Word, JSON, CSV, HTML with structured findings, timestamps, and rule citations

How we built it

Backend (Go): A WebSocket server on Cloud Run orchestrates the Gemini Live session. An agent controller manages the vision pipeline (frame sampling/buffering), rule engine (module loading), temporal engine (SPRT confidence accumulation with incident lifecycle), and report builders. We use the official google.golang.org/genai SDK for bidirectional streaming with gemini-2.5-flash-native-audio.

Frontend (Next.js + TypeScript): A React app with Tailwind CSS that auto-detects device context. AudioWorklet captures PCM audio on a dedicated thread with voice activity detection. Glassmorphic AR overlays annotate hazards with severity-coded brackets. Theme-aware liquid glass UI adapts to dark/light mode.

Infrastructure: Cloud Run with WebSocket session affinity, Cloud Build for CI/CD, Artifact Registry for containers, and Secret Manager for API keys.

Challenges we ran into

  • Gemini Live session management — handling GoAway disconnects mid-inspection required serializing active incidents and re-injecting temporal context into the new session seamlessly
  • Audio pipeline — resampling 24 kHz Gemini PCM output to the browser's native sample rate without glitches, plus managing the interplay between native audio responses and the Web Speech API
  • False positive filtering — raw frame-by-frame detection produced too many alerts, which led us to build the SPRT temporal reasoning engine with confidence thresholds
  • Multi-device UX — making the same codebase work for phone, desktop CCTV monitoring, and AR headsets required three distinct UI paradigms sharing one state machine

What we learned

  • Gemini Live's native audio mode handles interruption and turn-taking natively — fighting it with custom interruption logic makes things worse, not better
  • Statistical methods (SPRT) are far more effective than simple cooldown timers for streaming hazard detection
  • Voice-first design forces you to make every feature accessible without a screen, which paradoxically improves the visual UI too

What's next for ARgus

  • WebXR integration for true AR headset overlays
  • Multi-camera orchestration for enterprise CCTV deployments
  • Custom module builder so organizations can define their own inspection rulesets
  • Offline mode with on-device inference for areas without connectivity

Built With

Share this project:

Updates