Inspiration
Most assistants still rely on text-only chat. We wanted to build something that feels natural and human: a real-time agent you can talk to, interrupt, and show your environment through camera input. The Gemini Live Agent Challenge was the perfect push to build that experience end-to-end.
What it does
Sightline Live is a multimodal live assistant that can:
- listen to microphone input in real time
- process camera frames for visual context
- respond with live voice output and transcripts
- handle interruptions gracefully when users speak over the agent
- keep a clear running transcript of both user and agent turns
This makes it useful for visual tutoring, guided troubleshooting, and real-time contextual support.
How we built it
We built Sightline Live as a full-stack TypeScript app:
- Frontend: React + Vite UI for session controls, transcript, and live status
- Backend: Node.js + TypeScript WebSocket bridge for streaming events
- AI layer: Google GenAI SDK with the Gemini Developer API (Live API) for bidirectional multimodal interaction
- Cloud: Dockerized deployment to Google Cloud Run, secret handling via Secret Manager, and deployment automation with Cloud Build
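The backend's main job is translating Live API server messages into events the frontend can render. A minimal sketch of that routing, with illustrative names and message shapes (not the exact SDK types):

```typescript
// Hypothetical sketch of the bridge's event routing: incoming server
// messages are mapped to typed events the React frontend consumes over
// the WebSocket. Field names here are assumptions for illustration.

type BridgeEvent =
  | { kind: "audio"; data: string } // base64 audio chunk for playback
  | { kind: "transcript"; role: "user" | "agent"; text: string }
  | { kind: "interrupted" } // user spoke over the agent
  | { kind: "turn_complete" };

// Shape loosely modeled on Live API server content; treat as an assumption.
interface ServerMessage {
  audioChunk?: string;
  inputTranscription?: string; // user speech transcript delta
  outputTranscription?: string; // agent speech transcript delta
  interrupted?: boolean;
  turnComplete?: boolean;
}

function routeMessage(msg: ServerMessage): BridgeEvent[] {
  const events: BridgeEvent[] = [];
  // Surface interruptions first so the client can stop playback immediately.
  if (msg.interrupted) events.push({ kind: "interrupted" });
  if (msg.audioChunk) events.push({ kind: "audio", data: msg.audioChunk });
  if (msg.inputTranscription)
    events.push({ kind: "transcript", role: "user", text: msg.inputTranscription });
  if (msg.outputTranscription)
    events.push({ kind: "transcript", role: "agent", text: msg.outputTranscription });
  if (msg.turnComplete) events.push({ kind: "turn_complete" });
  return events;
}
```

Keeping the routing as a pure function means the same logic can be unit-tested without opening a real session.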
Challenges we ran into
- Choosing a compatible Gemini Live model and API configuration
- Fixing audio-modality/session issues in live streams
- Preventing duplicate transcript entries during streaming deltas
- Preserving user turns while still supporting interruption behavior
- Resolving deployment blockers (quota limits, billing linkage, reserved env vars, IAM permissions for secrets)
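The duplicate-transcript problem above came from streaming deltas: appending every delta as a new message produces repeated entries. One way to fix it, sketched with a hypothetical entry shape, is to merge deltas into a single in-progress entry per turn:

```typescript
// Hedged sketch: merge streaming transcription deltas into one in-progress
// transcript entry per speaker turn instead of appending each delta as a
// new message. The entry shape is illustrative, not Sightline's actual model.

interface TranscriptEntry {
  role: "user" | "agent";
  text: string;
  final: boolean; // set when the turn completes
}

function applyDelta(
  entries: TranscriptEntry[],
  role: "user" | "agent",
  delta: string
): TranscriptEntry[] {
  const last = entries[entries.length - 1];
  // Extend the current in-progress entry for this speaker...
  if (last && last.role === role && !last.final) {
    return [...entries.slice(0, -1), { ...last, text: last.text + delta }];
  }
  // ...or start a new entry when the speaker changes or the last turn ended.
  return [...entries, { role, text: delta, final: false }];
}

function finalizeTurn(entries: TranscriptEntry[]): TranscriptEntry[] {
  // Mark the most recent entry as complete so the next delta opens a new one.
  return entries.map((e, i) =>
    i === entries.length - 1 ? { ...e, final: true } : e
  );
}
```

Finalizing on turn completion is also what preserves user turns across interruptions: an interrupted agent turn is closed, not overwritten.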
Accomplishments that we're proud of
- Built a working real-time multimodal live agent (audio in, vision in, voice out)
- Implemented interruption-aware interaction that feels conversational
- Delivered a clean transcript experience with deduplicated streaming message handling
- Designed a polished dark UI without breaking core functionality
- Deployed successfully on Google Cloud with reproducible setup
What we learned
- Real-time agent UX depends heavily on state/event management, not just prompting
- Multimodal systems require careful synchronization between audio, vision, and UI streams
- Cloud reliability comes from correct IAM, secrets, and deployment automation
- Fast iteration with user feedback dramatically improves product quality
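The state/event point above is concrete in interruption handling: when the user barges in, queued audio must be dropped so the agent stops immediately, while the partial agent transcript is kept. A small reducer sketches the idea (state and event names are assumptions, not Sightline's actual implementation):

```typescript
// Illustrative reducer for interruption-aware playback state: on a user
// interrupt, unplayed audio is discarded but the agent's partial transcript
// for the turn is preserved.

type AgentState = {
  speaking: boolean;
  pendingAudio: string[]; // queued audio chunks awaiting playback
  partialAgentText: string; // agent transcript so far this turn
};

type AgentEvent =
  | { type: "agent_audio"; chunk: string; text: string }
  | { type: "user_interrupt" }
  | { type: "turn_complete" };

function reduce(state: AgentState, event: AgentEvent): AgentState {
  switch (event.type) {
    case "agent_audio":
      return {
        speaking: true,
        pendingAudio: [...state.pendingAudio, event.chunk],
        partialAgentText: state.partialAgentText + event.text,
      };
    case "user_interrupt":
      // Stop the agent immediately by dropping unplayed audio, but keep
      // the transcript of what was already said.
      return { ...state, speaking: false, pendingAudio: [] };
    case "turn_complete":
      return { ...state, speaking: false, pendingAudio: [] };
  }
}
```

Centralizing this in one reducer keeps the audio pipeline, transcript, and UI status indicator in sync from a single source of truth.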
What's next for Sightline
- Add mode presets (translator, visual tutor, support agent)
- Improve memory across sessions for continuity
- Expand multilingual voice interaction
- Add richer visual guidance and actionable on-screen suggestions
- Strengthen CI/CD and infrastructure automation for production-scale reliability
Built With
- docker
- google-genai-sdk
- google-cloud
- node.js
- react
- typescript
- webaudio
- websockets