SightLine — Your AI Eyes, Always On

Inspiration

Every morning, 253 million people with visual impairments wake up and navigate a world designed for the sighted. A simple task — crossing a street, reading a menu, recognizing a friend — becomes a complex challenge requiring constant mental effort.

We asked ourselves: What if AI could be a pair of always-on eyes?

Not a screen reader. Not a text-to-speech converter. A companion — one that understands your surroundings in real time, speaks naturally, remembers the people you know, and adapts its level of detail to exactly what you need at any given moment.

That question became SightLine: a multimodal AI assistant that uses Gemini Live API's native audio streaming to deliver continuous, context-aware scene understanding for blind and low-vision users — directly through their iPhone and Apple Watch.

The key insight was that existing assistive tools are reactive (tap a button, get a description), but real-world navigation is continuous. You need to know about the car approaching from your left before you step off the curb, not after you ask. SightLine bridges that gap with proactive, always-on intelligence.


What it does

SightLine is a real-time AI companion that runs on iPhone (with optional Apple Watch), streaming camera video, audio conversation, and sensor data to a backend powered by the Gemini Live API. The system provides:

Real-Time Scene Understanding

  • Continuous camera analysis using Gemini 3.1 Pro vision — identifies hazards, reads signs, describes spatial layouts using clock positions ("a person at your 2 o'clock, about 3 meters away")
  • Adaptive Level of Detail (LOD) — automatically adjusts verbosity based on context:
    • LOD 1 (Safety): While walking fast or in traffic — only hazard alerts
    • LOD 2 (Navigation): Normal walking — spatial layout, landmarks, signage
    • LOD 3 (Exploration): Standing still — full scene narrative with atmosphere

Natural Voice Conversation

  • Gemini Live API (gemini-live-2.5-flash-native-audio) provides sub-second bidirectional audio — users can interrupt, ask follow-up questions, and have natural dialogue
  • Barge-in detection with echo cancellation ensures the user can always interrupt mid-sentence
  • Session resumption — close the app, come back later, pick up where you left off

Face Recognition

  • Register family and friends — snap 3–5 photos, and SightLine recognizes them in real time
  • InsightFace ArcFace generates 512-D embeddings stored in Firestore — raw images are never saved
  • Silent identification — face IDs are injected into the conversation context without interrupting speech
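
For illustration, a minimal matching sketch in Python — assuming the stored embeddings are L2-normalized so cosine similarity reduces to a dot product; the function name and the 0.45 cutoff are ours, not the production values:

```python
import numpy as np

def identify_face(query: np.ndarray, library: dict[str, np.ndarray],
                  cutoff: float = 0.45) -> str | None:
    """Return the registered person whose 512-D embedding best matches `query`."""
    best_name, best_score = None, cutoff
    for name, ref in library.items():
        score = float(np.dot(query, ref))   # cosine similarity for unit vectors
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```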

Intelligent Navigation

  • Turn-by-turn walking directions with slope warnings (ADA threshold detection: >8% grade)
  • Accessibility overlays — tactile paving, wheelchair ramps, audio traffic signals from OpenStreetMap
  • Destination preview — Street View image analyzed by the vision agent before you arrive
  • Clock-position directions — "Turn at your 10 o'clock" instead of "turn left"
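
For illustration, here is how a relative bearing could map to a clock position (our own Python sketch; the production iOS code may round differently):

```python
def clock_position(relative_bearing_deg: float) -> str:
    """Map a relative bearing (0 deg = straight ahead, clockwise) to a clock face."""
    hour = round((relative_bearing_deg % 360) / 30) % 12
    return f"{hour or 12} o'clock"

# clock_position(60)  -> "2 o'clock"
# clock_position(300) -> "10 o'clock"
```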

Long-Term Memory

  • Remembers people, places, and facts across sessions using vector-indexed memory (Gemini Embedding 2048-D + Firestore)
  • Three memory layers: Episodic (7-day half-life), Semantic (90-day), Procedural (999-day)
  • User-controlled — "Remember that David works at the cafe", "Forget what I just told you"
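
A hedged sketch of how vector similarity and half-life decay might combine into a retrieval score (the backend's actual ranking function may weigh these differently):

```python
import time

HALF_LIFE_DAYS = {"episodic": 7, "semantic": 90, "procedural": 999}

def memory_score(similarity: float, layer: str, created_at: float) -> float:
    """Weight a retrieved memory by vector similarity and age-based decay."""
    age_days = (time.time() - created_at) / 86_400
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS[layer])
    return similarity * decay
```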

Text Recognition (OCR)

  • Menu reading with item-price parsing, sign reading, document scanning
  • Powered by Gemini 3 Flash for low-latency text extraction

Sensor Fusion

  • Apple Watch: Real-time heart rate, wrist IMU (pitch/roll/yaw), SpO₂, noise exposure
  • iPhone: GPS, compass heading, step cadence, ambient noise, weather (WeatherKit)
  • CoreML depth estimation: Monocular depth via Depth Anything V2 — obstacle distance without LiDAR
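
To show how a dense depth map can collapse into coarse obstacle cues, here is an assumed reduction in Python (the real pipeline runs in Swift on the Neural Engine; the quadrant boundaries and aggregation are illustrative):

```python
import numpy as np

def quadrant_depths(depth: np.ndarray) -> dict[str, float]:
    """Collapse a dense depth map into four quadrant-level distance cues."""
    h, w = depth.shape
    quads = {
        "upper_left":  depth[:h // 2, :w // 2],
        "upper_right": depth[:h // 2, w // 2:],
        "lower_left":  depth[h // 2:, :w // 2],
        "lower_right": depth[h // 2:, w // 2:],
    }
    # A low percentile lets the nearest obstacle dominate each quadrant.
    return {name: float(np.percentile(q, 10)) for name, q in quads.items()}
```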

How we built it

Architecture

┌─────────────────────────────────────────────────────┐
│                    iOS Client                        │
│  Camera ─ Audio ─ Sensors ─ Watch ─ Depth(CoreML)   │
└──────────────────────┬──────────────────────────────┘
                       │ WebSocket (binary + JSON)
                       ▼
┌─────────────────────────────────────────────────────┐
│              FastAPI Backend (Cloud Run)              │
│                                                      │
│  ┌──────────┐  ┌────────────┐  ┌────────────────┐  │
│  │ Gemini   │  │ Context    │  │ LOD Decision   │  │
│  │ Live API │◄─│ Injection  │◄─│ Engine         │  │
│  │ (ADK)    │  │ Queue      │  │ (Adaptive)     │  │
│  └────┬─────┘  └────────────┘  └────────────────┘  │
│       │                                              │
│  ┌────┴────────────────────────────────────────┐    │
│  │            Sub-Agent Pool                    │    │
│  │  Vision(3.1 Pro) │ OCR(Flash) │ Face(Arc)   │    │
│  └──────────────────────────────────────────────┘    │
│                                                      │
│  ┌──────────────────────────────────────────────┐    │
│  │              Tool Functions                   │    │
│  │  Navigation│Search│Maps│Memory│Accessibility  │    │
│  └──────────────────────────────────────────────┘    │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│     Google Cloud (Firestore + Secret Manager)        │
│  user_profiles/ → face_library (512-D vectors)       │
│                 → memories (2048-D vectors)           │
│                 → entity_graph (people/places)        │
└─────────────────────────────────────────────────────┘

The Gemini Live API Pipeline

The core innovation is how we orchestrate the Gemini Live API with multiple sub-agents:

  1. Audio streams bidirectionally — the user speaks, Gemini responds with native audio at sub-second latency
  2. Camera frames are analyzed asynchronously by a Vision sub-agent (Gemini 3.1 Pro), and results are injected as context into the Live session
  3. A Context Injection Queue with a state machine (IDLE → GENERATING → DRAINING) ensures that vision results, LOD updates, and tool outputs are delivered without interrupting the model mid-sentence
  4. Tool calling (18 functions) is handled via Google ADK — Gemini autonomously decides when to call navigation, search, or memory tools
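
A simplified asyncio sketch of steps 2–3; `vision_agent` and `injection_queue` are hypothetical stand-ins for the Gemini 3.1 Pro sub-agent and the ContextInjectionQueue described under Challenges:

```python
import asyncio

async def vision_loop(frames: asyncio.Queue, vision_agent, injection_queue):
    """Analyze camera frames off the audio hot path and queue results as context."""
    while True:
        jpeg = await frames.get()
        description = await vision_agent.analyze(jpeg)   # async sub-agent call
        await injection_queue.enqueue(f"[VISION ANALYSIS] {description}")
        # The queue flushes into the Live session only when the model is IDLE.
```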

The LOD (Level of Detail) Engine

We built a decision engine that dynamically adjusts the system prompt based on real-time telemetry:

$$ \text{LOD} = f(\text{motion}, \text{noise}, \text{cadence}, \text{space\_type}, \text{user\_profile}) $$

The engine evaluates a priority chain:

  1. Motion state → walking triggers LOD 2, stationary triggers LOD 3
  2. Noise level → loud environments suppress low-priority speech
  3. Space transitions → entering/exiting buildings triggers a detail boost
  4. User preference → verbosity and O&M (Orientation & Mobility) skill level
  5. Explicit override → user can say "more detail" or "quiet"
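
A minimal Python sketch of this priority chain (thresholds such as 75 dB are illustrative, not the production values):

```python
def decide_lod(motion: str, noise_db: float, space_changed: bool,
               user_pref: int | None, override: int | None) -> int:
    """Walk the priority chain in order; later steps refine earlier ones."""
    lod = 3 if motion == "stationary" else 2            # 1. motion state
    if motion in ("walking_fast", "in_traffic"):
        lod = 1                                         #    safety-only
    if noise_db > 75:                                   # 2. loud environment
        lod = max(1, lod - 1)
    if space_changed:                                   # 3. entering/exiting a building
        lod = min(3, lod + 1)
    if user_pref is not None:                           # 4. verbosity preference caps detail
        lod = min(lod, user_pref)
    return override if override is not None else lod    # 5. explicit override wins
```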

Each LOD level controls:

  • Vision resolution and token budget (70 / 560 / 1120 tokens per frame)
  • VAD sensitivity and silence duration thresholds
  • System prompt focus (safety-only → navigation → full narrative)
  • Speech threshold model: \( \text{info\_value} > \text{base} + \text{movement\_penalty} + \text{noise\_penalty} \)

Binary WebSocket Protocol

To minimize latency, we designed a binary protocol with magic bytes:

  • 0x01 prefix → raw PCM audio (eliminates 33% Base64 overhead)
  • 0x02 prefix → JPEG camera frame
  • JSON fallback → control messages, telemetry, tool events
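
On the backend, demultiplexing looks roughly like this (the handler arguments are placeholders, not the real function names):

```python
AUDIO_PCM = 0x01    # raw 16-bit PCM audio chunk
JPEG_FRAME = 0x02   # JPEG camera frame

async def route_message(message: bytes | str, on_audio, on_frame, on_json) -> None:
    """Dispatch one WebSocket message by its magic-byte prefix."""
    if isinstance(message, str):
        await on_json(message)            # control messages, telemetry, tool events
        return
    prefix, payload = message[0], message[1:]
    if prefix == AUDIO_PCM:
        await on_audio(payload)           # forward straight to the Live session
    elif prefix == JPEG_FRAME:
        await on_frame(payload)           # hand off to the vision sub-agent queue
```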

iOS Client Engineering

The iOS app is a SwiftUI application with deep system integration:

  • SharedAudioEngine: A single AVAudioEngine shared between capture and playback enables Apple's hardware Acoustic Echo Cancellation (AEC) — critical for natural conversation
  • Silero VAD: On-device voice activity detection confirms barge-in intent (6 consecutive frames above RMS threshold, 0.75 VAD probability) while filtering AEC echo residual
  • FrameSelector: LOD-based frame rate control (0.5–1 FPS) with pixel-diff deduplication — downsamples to 32×32 grayscale thumbnails and skips static frames (MAD threshold: 5.0)
  • DepthEstimator: CoreML Depth Anything V2 (F16) runs on Neural Engine, providing quadrant-level distance maps without requiring LiDAR hardware
  • Haptic navigation: Core Haptics patterns for directional cues — distinct patterns for left/right/ahead/stop, obstacle proximity intensity scaling, and object-type textures (person=gentle pulse, vehicle=sharp buzz, stairs=rhythmic stepping)
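
The FrameSelector's deduplication check, sketched in Python for clarity even though the client code is Swift:

```python
import numpy as np

def is_static(prev_thumb: np.ndarray, new_thumb: np.ndarray,
              mad_threshold: float = 5.0) -> bool:
    """Skip a frame whose 32x32 grayscale thumbnail barely differs from the last."""
    diff = new_thumb.astype(np.int16) - prev_thumb.astype(np.int16)
    return float(np.mean(np.abs(diff))) < mad_threshold   # mean absolute difference
```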

Infrastructure as Code

Everything is managed with Terraform:

  • Cloud Run v2 (2 vCPU, 2GiB, min 1 instance for zero cold starts)
  • Firestore with vector indexes (512-D face, 2048-D memory, COSINE distance)
  • Secret Manager for API keys
  • IAM service account with least-privilege roles
  • Artifact Registry for Docker images
  • CI/CD via Cloud Build

Challenges we ran into

1. Context Injection Without Interruption

The hardest engineering problem was injecting vision results into an active Gemini Live session without causing overlapping audio responses. When we naively sent a [VISION ANALYSIS] context message while the model was speaking, Gemini would generate a second audio stream that overlapped with the first — producing garbled, incomprehensible output.

Solution: We built a state machine (ContextInjectionQueue) that tracks whether the model is idle, generating, or draining audio. Context is batched in a 400ms window and only flushed when the model reaches IDLE state. A 15-second max-age force-flush prevents stale context, and safety timeouts (5s generating, 8s draining) handle edge cases.
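
A condensed sketch of the batching and flush logic (the real class also handles the safety timeouts and the state transitions driven by Gemini's turn events):

```python
import asyncio
import time
from enum import Enum, auto

class ModelState(Enum):
    IDLE = auto()
    GENERATING = auto()
    DRAINING = auto()

class ContextInjectionQueue:
    BATCH_WINDOW_S = 0.4    # 400 ms batching window
    MAX_AGE_S = 15.0        # force-flush before queued context goes stale

    def __init__(self, send_to_live_session):
        self.state = ModelState.IDLE
        self.pending: list[str] = []
        self.oldest: float | None = None
        self.send = send_to_live_session          # coroutine that injects text context

    async def enqueue(self, context: str) -> None:
        self.pending.append(context)
        self.oldest = self.oldest or time.monotonic()
        await asyncio.sleep(self.BATCH_WINDOW_S)  # let nearby arrivals batch together
        await self.maybe_flush()

    async def maybe_flush(self) -> None:
        stale = self.oldest is not None and time.monotonic() - self.oldest > self.MAX_AGE_S
        if self.pending and (self.state is ModelState.IDLE or stale):
            await self.send("\n".join(self.pending))   # one combined injection
            self.pending, self.oldest = [], None
```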

2. Echo Cancellation + Barge-In

Apple's hardware AEC only works when capture and playback share a single AVAudioEngine. But distinguishing genuine user speech from AEC residual echo was a nightmare — the barge-in detector kept triggering on the model's own voice leaking through the microphone.

Solution: Multi-layer filtering: (1) RMS threshold gate (0.12), (2) AEC residual filter (<0.02 RMS), (3) Silero VAD probability threshold (0.75), (4) 150ms guard window after each model audio chunk, (5) requiring 6 consecutive qualifying frames (~600ms) before confirming barge-in.
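
Sketched as a gate chain in Python (the per-frame `rms` and `vad_prob` fields are our assumed inputs from the audio tap and Silero VAD):

```python
def frame_qualifies(frame: dict) -> bool:
    """One microphone frame passes the per-frame gates."""
    if frame["rms"] < 0.02:                    # (2) AEC residual echo: ignore entirely
        return False
    return frame["rms"] >= 0.12 and frame["vad_prob"] >= 0.75   # (1) + (3)

def confirm_barge_in(frames: list[dict], now: float, last_model_audio: float) -> bool:
    """Confirm intent only after 6 consecutive qualifying frames (~600 ms)."""
    if now - last_model_audio < 0.150:                           # (4) guard window
        return False
    return len(frames) >= 6 and all(frame_qualifies(f) for f in frames[-6:])   # (5)
```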

3. numpy 2.x Incompatibility

InsightFace requires numpy <2.0, but OpenCV 4.13+ requires numpy >=2.0. This created a hard dependency conflict.

Solution: Pinned opencv-python-headless==4.10.0.84 and numpy>=1.24,<2.0 — the last compatible combination. Documented as a hard constraint.
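
Roughly as it would appear in a requirements file:

```
# requirements.txt (excerpt)
opencv-python-headless==4.10.0.84
numpy>=1.24,<2.0
insightface
```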

4. Adaptive LOD Without Latency Spikes

Changing the LOD mid-conversation requires rebuilding the system prompt and re-injecting it. But Gemini's VAD settings (silence duration, sensitivity) are locked at session creation — they can't be updated mid-session.

Solution: We decouple behavioral LOD (what the model talks about) from VAD LOD (how long to wait for speech). The system prompt changes dynamically via context injection, while VAD is set conservatively at session start to work across all LOD levels.

5. Face Privacy

We needed face recognition that never stores photos. Users are rightfully concerned about biometric data.

Solution: InsightFace generates a 512-D L2-normalized embedding from each photo during registration, then immediately discards the image. Only the embedding vector is stored in Firestore. Reconstruction from embeddings is computationally infeasible.


Accomplishments that we're proud of

  • Sub-second voice response — Gemini Live API's native audio streaming makes the conversation feel natural, not robotic
  • The LOD engine — a novel approach to adaptive AI verbosity that respects the user's cognitive bandwidth
  • Zero-config local development — Debug builds auto-connect to localhost via Bonjour mDNS, Release builds hit Cloud Run, with zero code changes
  • 18 function tools seamlessly integrated with voice — the user just says "navigate to Starbucks" and everything happens
  • Haptic navigation language — directional patterns that blind users can learn intuitively (left/right/ahead/stop)
  • Memory that persists — vector-indexed long-term memory with user-controlled forget

What we learned

  1. Gemini Live API is transformative for accessibility — native audio streaming with sub-second latency enables truly conversational AI, not the stilted request-response pattern of traditional assistants
  2. Context management is the real engineering — the hard part isn't calling the API; it's orchestrating multiple async information sources (vision, OCR, face, telemetry) into a coherent context without causing audio collisions
  3. Adaptive detail is essential — blind users don't want constant narration; they want the right information at the right time
  4. Sensor fusion creates superpowers — combining GPS, compass, step cadence, heart rate, noise level, and weather creates a rich contextual signal that a single sensor can't provide
  5. Privacy-first face recognition is possible — 512-D embeddings provide excellent recognition accuracy without storing any images

What's next for SightLine

  • Continuous memory extraction — extract memories during the session, not just at the end
  • Indoor navigation — integrate with BLE beacons and indoor mapping for building-level wayfinding
  • Multi-language live translation — real-time sign/menu translation for travelers
  • Community accessibility mapping — crowdsourced accessibility data contributed back to OpenStreetMap
  • Wrist-tap navigation — Apple Watch haptics for turn-by-turn without audio cues
  • Offline fallback — on-device models for basic hazard detection when connectivity is lost

Built With

  • artifact-registry
  • avfoundation
  • cloud-build
  • core-haptics
  • core-location
  • coreml
  • coremotion
  • docker
  • fastapi
  • gemini-3-flash
  • gemini-3.1-pro
  • gemini-embedding-api
  • gemini-live-api
  • google
  • google-address-validation-api
  • google-adk
  • google-cloud-firestore
  • google-cloud-run
  • google-elevation-api
  • google-geocoding-api
  • google-maps-platform
  • google-places
  • google-plus-codes
  • google-routes-api
  • google-secret-manager
  • google-street-view-api
  • healthkit
  • insightface
  • onnx-runtime
  • openstreetmap
  • python
  • silero-vad
  • swift
  • swiftui
  • terraform
  • vertex-ai
  • watchconnectivity
  • weatherkit
  • websocket