SightLine — Your AI Eyes, Always On

Inspiration

Every morning, 253 million people with visual impairments wake up and navigate a world designed for the sighted. A simple task — crossing a street, reading a menu, recognizing a friend — becomes a complex challenge requiring constant mental effort.

We asked ourselves: What if AI could be a pair of always-on eyes?

Not a screen reader. Not a text-to-speech converter. A companion — one that understands your surroundings in real time, speaks naturally, remembers the people you know, and adapts its level of detail to exactly what you need at any given moment.

That question became SightLine: a multimodal AI assistant that uses Gemini Live API's native audio streaming to deliver continuous, context-aware scene understanding for blind and low-vision users — directly through their iPhone and Apple Watch.

The key insight was that existing assistive tools are reactive (tap a button, get a description), but real-world navigation is continuous. You need to know about the car approaching from your left before you step off the curb, not after you ask. SightLine bridges that gap with proactive, always-on intelligence.


What it does

SightLine is a real-time AI companion that runs on iPhone (with optional Apple Watch), streaming camera video, audio conversation, and sensor data to a backend powered by the Gemini Live API. The system provides:

Real-Time Scene Understanding

  • Continuous camera analysis using Gemini 3.1 Pro vision — identifies hazards, reads signs, describes spatial layouts using clock positions ("a person at your 2 o'clock, about 3 meters away")
  • Adaptive Level of Detail (LOD) — automatically adjusts verbosity based on context:
    • LOD 1 (Safety): While walking fast or in traffic — only hazard alerts
    • LOD 2 (Navigation): Normal walking — spatial layout, landmarks, signage
    • LOD 3 (Exploration): Standing still — full scene narrative with atmosphere

Natural Voice Conversation

  • Gemini Live API (gemini-live-2.5-flash-native-audio) provides sub-second bidirectional audio — users can interrupt, ask follow-up questions, and have natural dialogue
  • Barge-in detection with echo cancellation ensures the user can always interrupt mid-sentence
  • Session resumption — close the app, come back later, pick up where you left off

Face Recognition

  • Register family and friends — snap 3–5 photos, and SightLine recognizes them in real time
  • InsightFace ArcFace generates 512-D embeddings stored in Firestore — raw images are never saved
  • Silent identification — face IDs are injected into the conversation context without interrupting speech
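
For illustration, a minimal matching sketch in Python — assuming the stored embeddings are L2-normalized so cosine similarity reduces to a dot product; the function name and the 0.45 cutoff are ours, not the production values:

```python
import numpy as np

def identify_face(query: np.ndarray, library: dict[str, np.ndarray],
                  cutoff: float = 0.45) -> str | None:
    """Return the registered person whose 512-D embedding best matches `query`."""
    best_name, best_score = None, cutoff
    for name, ref in library.items():
        score = float(np.dot(query, ref))   # cosine similarity for unit vectors
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```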

Intelligent Navigation

  • Turn-by-turn walking directions with slope warnings (ADA threshold detection: >8% grade)
  • Accessibility overlays — tactile paving, wheelchair ramps, audio traffic signals from OpenStreetMap
  • Destination preview — Street View image analyzed by the vision agent before you arrive
  • Clock-position directions — "Turn at your 10 o'clock" instead of "turn left"
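
For illustration, here is how a relative bearing could map to a clock position (our own Python sketch; the production iOS code may round differently):

```python
def clock_position(relative_bearing_deg: float) -> str:
    """Map a relative bearing (0 deg = straight ahead, clockwise) to a clock face."""
    hour = round((relative_bearing_deg % 360) / 30) % 12
    return f"{hour or 12} o'clock"

# clock_position(60)  -> "2 o'clock"
# clock_position(300) -> "10 o'clock"
```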

Long-Term Memory

  • Remembers people, places, and facts across sessions using vector-indexed memory (Gemini Embedding 2048-D + Firestore)
  • Three memory layers: Episodic (7-day half-life), Semantic (90-day), Procedural (999-day)
  • User-controlled — "Remember that David works at the cafe", "Forget what I just told you"
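
A hedged sketch of how vector similarity and half-life decay might combine into a retrieval score (the backend's actual ranking function may weigh these differently):

```python
import time

HALF_LIFE_DAYS = {"episodic": 7, "semantic": 90, "procedural": 999}

def memory_score(similarity: float, layer: str, created_at: float) -> float:
    """Weight a retrieved memory by vector similarity and age-based decay."""
    age_days = (time.time() - created_at) / 86_400
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS[layer])
    return similarity * decay
```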

Text Recognition (OCR)

  • Menu reading with item-price parsing, sign reading, document scanning
  • Powered by Gemini 3 Flash for low-latency text extraction

Sensor Fusion

  • Apple Watch: Real-time heart rate, wrist IMU (pitch/roll/yaw), SpO₂, noise exposure
  • iPhone: GPS, compass heading, step cadence, ambient noise, weather (WeatherKit)
  • CoreML depth estimation: Monocular depth via Depth Anything V2 — obstacle distance without LiDAR
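
To show how a dense depth map can collapse into coarse obstacle cues, here is an assumed reduction in Python (the real pipeline runs in Swift on the Neural Engine; the quadrant boundaries and aggregation are illustrative):

```python
import numpy as np

def quadrant_depths(depth: np.ndarray) -> dict[str, float]:
    """Collapse a dense depth map into four quadrant-level distance cues."""
    h, w = depth.shape
    quads = {
        "upper_left":  depth[:h // 2, :w // 2],
        "upper_right": depth[:h // 2, w // 2:],
        "lower_left":  depth[h // 2:, :w // 2],
        "lower_right": depth[h // 2:, w // 2:],
    }
    # A low percentile lets the nearest obstacle dominate each quadrant.
    return {name: float(np.percentile(q, 10)) for name, q in quads.items()}
```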

How we built it

Architecture

┌─────────────────────────────────────────────────────┐
│                    iOS Client                        │
│  Camera ─ Audio ─ Sensors ─ Watch ─ Depth(CoreML)   │
└──────────────────────┬──────────────────────────────┘
                       │ WebSocket (binary + JSON)
                       ▼
┌─────────────────────────────────────────────────────┐
│              FastAPI Backend (Cloud Run)              │
│                                                      │
│  ┌──────────┐  ┌────────────┐  ┌────────────────┐  │
│  │ Gemini   │  │ Context    │  │ LOD Decision   │  │
│  │ Live API │◄─│ Injection  │◄─│ Engine         │  │
│  │ (ADK)    │  │ Queue      │  │ (Adaptive)     │  │
│  └────┬─────┘  └────────────┘  └────────────────┘  │
│       │                                              │
│  ┌────┴────────────────────────────────────────┐    │
│  │            Sub-Agent Pool                    │    │
│  │  Vision(3.1 Pro) │ OCR(Flash) │ Face(Arc)   │    │
│  └──────────────────────────────────────────────┘    │
│                                                      │
│  ┌──────────────────────────────────────────────┐    │
│  │              Tool Functions                   │    │
│  │  Navigation│Search│Maps│Memory│Accessibility  │    │
│  └──────────────────────────────────────────────┘    │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│     Google Cloud (Firestore + Secret Manager)        │
│  user_profiles/ → face_library (512-D vectors)       │
│                 → memories (2048-D vectors)           │
│                 → entity_graph (people/places)        │
└─────────────────────────────────────────────────────┘

The Gemini Live API Pipeline

The core innovation is how we orchestrate the Gemini Live API with multiple sub-agents:

  1. Audio streams bidirectionally — the user speaks, Gemini responds with native audio at sub-second latency
  2. Camera frames are analyzed asynchronously by a Vision sub-agent (Gemini 3.1 Pro), and results are injected as context into the Live session
  3. A Context Injection Queue with a state machine (IDLE → GENERATING → DRAINING) ensures that vision results, LOD updates, and tool outputs are delivered without interrupting the model mid-sentence
  4. Tool calling (18 functions) is handled via Google ADK — Gemini autonomously decides when to call navigation, search, or memory tools
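
A simplified asyncio sketch of steps 2–3; `vision_agent` and `injection_queue` are hypothetical stand-ins for the Gemini 3.1 Pro sub-agent and the ContextInjectionQueue described under Challenges:

```python
import asyncio

async def vision_loop(frames: asyncio.Queue, vision_agent, injection_queue):
    """Analyze camera frames off the audio hot path and queue results as context."""
    while True:
        jpeg = await frames.get()
        description = await vision_agent.analyze(jpeg)   # async sub-agent call
        await injection_queue.enqueue(f"[VISION ANALYSIS] {description}")
        # The queue flushes into the Live session only when the model is IDLE.
```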

The LOD (Level of Detail) Engine

We built a decision engine that dynamically adjusts the system prompt based on real-time telemetry:

$$ \text{LOD} = f(\text{motion}, \text{noise}, \text{cadence}, \text{space\_type}, \text{user\_profile}) $$

The engine evaluates a priority chain:

  1. Motion state → walking triggers LOD 2, stationary triggers LOD 3
  2. Noise level → loud environments suppress low-priority speech
  3. Space transitions → entering/exiting buildings triggers a detail boost
  4. User preference → verbosity and O&M (Orientation & Mobility) skill level
  5. Explicit override → user can say "more detail" or "quiet"
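
A minimal Python sketch of this priority chain (thresholds such as 75 dB are illustrative, not the production values):

```python
def decide_lod(motion: str, noise_db: float, space_changed: bool,
               user_pref: int | None, override: int | None) -> int:
    """Walk the priority chain in order; later steps refine earlier ones."""
    lod = 3 if motion == "stationary" else 2            # 1. motion state
    if motion in ("walking_fast", "in_traffic"):
        lod = 1                                         #    safety-only
    if noise_db > 75:                                   # 2. loud environment
        lod = max(1, lod - 1)
    if space_changed:                                   # 3. entering/exiting a building
        lod = min(3, lod + 1)
    if user_pref is not None:                           # 4. verbosity preference caps detail
        lod = min(lod, user_pref)
    return override if override is not None else lod    # 5. explicit override wins
```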

Each LOD level controls:

  • Vision resolution and token budget (70 / 560 / 1120 tokens per frame)
  • VAD sensitivity and silence duration thresholds
  • System prompt focus (safety-only → navigation → full narrative)
  • Speech threshold model: \( \text{info\_value} > \text{base} + \text{movement\_penalty} + \text{noise\_penalty} \)

Binary WebSocket Protocol

To minimize latency, we designed a binary protocol with magic bytes:

  • 0x01 prefix → raw PCM audio (eliminates 33% Base64 overhead)
  • 0x02 prefix → JPEG camera frame
  • JSON fallback → control messages, telemetry, tool events
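
On the backend, demultiplexing looks roughly like this (the handler arguments are placeholders, not the real function names):

```python
AUDIO_PCM = 0x01    # raw 16-bit PCM audio chunk
JPEG_FRAME = 0x02   # JPEG camera frame

async def route_message(message: bytes | str, on_audio, on_frame, on_json) -> None:
    """Dispatch one WebSocket message by its magic-byte prefix."""
    if isinstance(message, str):
        await on_json(message)            # control messages, telemetry, tool events
        return
    prefix, payload = message[0], message[1:]
    if prefix == AUDIO_PCM:
        await on_audio(payload)           # forward straight to the Live session
    elif prefix == JPEG_FRAME:
        await on_frame(payload)           # hand off to the vision sub-agent queue
```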

iOS Client Engineering

The iOS app is a SwiftUI application with deep system integration:

  • SharedAudioEngine: A single AVAudioEngine shared between capture and playback enables Apple's hardware Acoustic Echo Cancellation (AEC) — critical for natural conversation
  • Silero VAD: On-device voice activity detection confirms barge-in intent (6 consecutive frames above RMS threshold, 0.75 VAD probability) while filtering AEC echo residual
  • FrameSelector: LOD-based frame rate control (0.5–1 FPS) with pixel-diff deduplication — downsamples to 32×32 grayscale thumbnails and skips static frames (MAD threshold: 5.0)
  • DepthEstimator: CoreML Depth Anything V2 (F16) runs on Neural Engine, providing quadrant-level distance maps without requiring LiDAR hardware
  • Haptic navigation: Core Haptics patterns for directional cues — distinct patterns for left/right/ahead/stop, obstacle proximity intensity scaling, and object-type textures (person=gentle pulse, vehicle=sharp buzz, stairs=rhythmic stepping)
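
The FrameSelector's deduplication check, sketched in Python for clarity even though the client code is Swift:

```python
import numpy as np

def is_static(prev_thumb: np.ndarray, new_thumb: np.ndarray,
              mad_threshold: float = 5.0) -> bool:
    """Skip a frame whose 32x32 grayscale thumbnail barely differs from the last."""
    diff = new_thumb.astype(np.int16) - prev_thumb.astype(np.int16)
    return float(np.mean(np.abs(diff))) < mad_threshold   # mean absolute difference
```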

Infrastructure as Code

Everything is managed with Terraform:

  • Cloud Run v2 (2 vCPU, 2GiB, min 1 instance for zero cold starts)
  • Firestore with vector indexes (512-D face, 2048-D memory, COSINE distance)
  • Secret Manager for API keys
  • IAM service account with least-privilege roles
  • Artifact Registry for Docker images
  • CI/CD via Cloud Build

Challenges we ran into

1. Context Injection Without Interruption

The hardest engineering problem was injecting vision results into an active Gemini Live session without causing overlapping audio responses. When we naively sent a [VISION ANALYSIS] context message while the model was speaking, Gemini would generate a second audio stream that overlapped with the first — producing garbled, incomprehensible output.

Solution: We built a state machine (ContextInjectionQueue) that tracks whether the model is idle, generating, or draining audio. Context is batched in a 400ms window and only flushed when the model reaches IDLE state. A 15-second max-age force-flush prevents stale context, and safety timeouts (5s generating, 8s draining) handle edge cases.
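
A condensed sketch of the batching and flush logic (the real class also handles the safety timeouts and the state transitions driven by Gemini's turn events):

```python
import asyncio
import time
from enum import Enum, auto

class ModelState(Enum):
    IDLE = auto()
    GENERATING = auto()
    DRAINING = auto()

class ContextInjectionQueue:
    BATCH_WINDOW_S = 0.4    # 400 ms batching window
    MAX_AGE_S = 15.0        # force-flush before queued context goes stale

    def __init__(self, send_to_live_session):
        self.state = ModelState.IDLE
        self.pending: list[str] = []
        self.oldest: float | None = None
        self.send = send_to_live_session          # coroutine that injects text context

    async def enqueue(self, context: str) -> None:
        self.pending.append(context)
        self.oldest = self.oldest or time.monotonic()
        await asyncio.sleep(self.BATCH_WINDOW_S)  # let nearby arrivals batch together
        await self.maybe_flush()

    async def maybe_flush(self) -> None:
        stale = self.oldest is not None and time.monotonic() - self.oldest > self.MAX_AGE_S
        if self.pending and (self.state is ModelState.IDLE or stale):
            await self.send("\n".join(self.pending))   # one combined injection
            self.pending, self.oldest = [], None
```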

2. Echo Cancellation + Barge-In

Apple's hardware AEC only works when capture and playback share a single AVAudioEngine. But distinguishing genuine user speech from AEC residual echo was a nightmare — the barge-in detector kept triggering on the model's own voice leaking through the microphone.

Solution: Multi-layer filtering: (1) RMS threshold gate (0.12), (2) AEC residual filter (<0.02 RMS), (3) Silero VAD probability threshold (0.75), (4) 150ms guard window after each model audio chunk, (5) requiring 6 consecutive qualifying frames (~600ms) before confirming barge-in.
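
Sketched as a gate chain in Python (the per-frame `rms` and `vad_prob` fields are our assumed inputs from the audio tap and Silero VAD):

```python
def frame_qualifies(frame: dict) -> bool:
    """One microphone frame passes the per-frame gates."""
    if frame["rms"] < 0.02:                    # (2) AEC residual echo: ignore entirely
        return False
    return frame["rms"] >= 0.12 and frame["vad_prob"] >= 0.75   # (1) + (3)

def confirm_barge_in(frames: list[dict], now: float, last_model_audio: float) -> bool:
    """Confirm intent only after 6 consecutive qualifying frames (~600 ms)."""
    if now - last_model_audio < 0.150:                           # (4) guard window
        return False
    return len(frames) >= 6 and all(frame_qualifies(f) for f in frames[-6:])   # (5)
```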

3. numpy 2.x Incompatibility

InsightFace requires numpy <2.0, but OpenCV 4.13+ requires numpy >=2.0. This created a hard dependency conflict.

Solution: Pinned opencv-python-headless==4.10.0.84 and numpy>=1.24,<2.0 — the last compatible combination. Documented as a hard constraint.
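
Roughly as it would appear in a requirements file:

```
# requirements.txt (excerpt)
opencv-python-headless==4.10.0.84
numpy>=1.24,<2.0
insightface
```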

4. Adaptive LOD Without Latency Spikes

Changing the LOD mid-conversation requires rebuilding the system prompt and re-injecting it. But Gemini's VAD settings (silence duration, sensitivity) are locked at session creation — they can't be updated mid-session.

Solution: We decouple behavioral LOD (what the model talks about) from VAD LOD (how long to wait for speech). The system prompt changes dynamically via context injection, while VAD is set conservatively at session start to work across all LOD levels.

5. Face Privacy

We needed face recognition that never stores photos. Users are rightfully concerned about biometric data.

Solution: InsightFace generates a 512-D L2-normalized embedding from each photo during registration, then immediately discards the image. Only the embedding vector is stored in Firestore. Reconstruction from embeddings is computationally infeasible.


Accomplishments that we're proud of

  • Sub-second voice response — Gemini Live API's native audio streaming makes the conversation feel natural, not robotic
  • The LOD engine — a novel approach to adaptive AI verbosity that respects the user's cognitive bandwidth
  • Zero-config local development — Debug builds auto-connect to localhost via Bonjour mDNS, Release builds hit Cloud Run, with zero code changes
  • 18 function tools seamlessly integrated with voice — the user just says "navigate to Starbucks" and everything happens
  • Haptic navigation language — directional patterns that blind users can learn intuitively (left/right/ahead/stop)
  • Memory that persists — vector-indexed long-term memory with user-controlled forget

What we learned

  1. Gemini Live API is transformative for accessibility — native audio streaming with sub-second latency enables truly conversational AI, not the stilted request-response pattern of traditional assistants
  2. Context management is the real engineering — the hard part isn't calling the API; it's orchestrating multiple async information sources (vision, OCR, face, telemetry) into a coherent context without causing audio collisions
  3. Adaptive detail is essential — blind users don't want constant narration; they want the right information at the right time
  4. Sensor fusion creates superpowers — combining GPS, compass, step cadence, heart rate, noise level, and weather creates a rich contextual signal that a single sensor can't provide
  5. Privacy-first face recognition is possible — 512-D embeddings provide excellent recognition accuracy without storing any images

What's next for SightLine

  • Continuous memory extraction — extract memories during the session, not just at the end
  • Indoor navigation — integrate with BLE beacons and indoor mapping for building-level wayfinding
  • Multi-language live translation — real-time sign/menu translation for travelers
  • Community accessibility mapping — crowdsourced accessibility data contributed back to OpenStreetMap
  • Wrist-tap navigation — Apple Watch haptics for turn-by-turn without audio cues
  • Offline fallback — on-device models for basic hazard detection when connectivity is lost

Built With

  • artifact-registry
  • avfoundation
  • cloud-build
  • core-haptics
  • core-location
  • coreml
  • coremotion
  • docker
  • fastapi
  • gemini-3-flash
  • gemini-3.1-pro
  • gemini-embedding-api
  • gemini-live-api
  • google
  • google-address-validation-api
  • google-adk
  • google-cloud-firestore
  • google-cloud-run
  • google-elevation-api
  • google-geocoding-api
  • google-maps-platform
  • google-places
  • google-plus-codes
  • google-routes-api
  • google-secret-manager
  • google-street-view-api
  • healthkit
  • insightface
  • onnx-runtime
  • openstreetmap
  • python
  • silero-vad
  • swift
  • swiftui
  • terraform
  • vertex-ai
  • watchconnectivity
  • weatherkit
  • websocket