Inspiration

Most assistants still rely on text-only chat. We wanted to build something that feels natural and human: a real-time agent you can talk to, interrupt, and show your environment through camera input. The Gemini Live Agent Challenge was the perfect push to build that experience end-to-end.

What it does

Sightline Live is a multimodal live assistant that can:

  • listen to microphone input in real time
  • process camera frames for visual context
  • respond with live voice output and transcripts
  • handle interruptions gracefully when users speak over the agent
  • keep a clear running transcript of both user and agent turns

This makes it useful for visual tutoring, guided troubleshooting, and real-time contextual support.
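The interruption behavior above can be sketched as a small transcript store. This is an illustrative sketch, not the actual Sightline Live code: the names `TranscriptTurn` and `TranscriptStore` are assumptions. The key idea is that when the user barges in, the agent's in-progress turn is marked interrupted rather than discarded, so the running transcript stays complete.

```typescript
type Role = "user" | "agent";

interface TranscriptTurn {
  role: Role;
  text: string;
  interrupted: boolean;
}

class TranscriptStore {
  private turns: TranscriptTurn[] = [];

  // Append a finished (or starting) turn for either speaker.
  addTurn(role: Role, text: string): void {
    this.turns.push({ role, text, interrupted: false });
  }

  // On barge-in, flag the agent's latest turn as interrupted
  // instead of deleting it, preserving the conversation history.
  markAgentInterrupted(): void {
    const last = this.turns[this.turns.length - 1];
    if (last && last.role === "agent") last.interrupted = true;
  }

  all(): TranscriptTurn[] {
    return this.turns;
  }
}

// Usage: user asks, agent starts answering, user interrupts.
const store = new TranscriptStore();
store.addTurn("user", "What is this part?");
store.addTurn("agent", "That looks like a capacitor on the");
store.markAgentInterrupted();
store.addTurn("user", "Wait, zoom in first.");

console.log(store.all().length);          // 3
console.log(store.all()[1].interrupted);  // true
```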

How we built it

We built Sightline Live as a TypeScript full-stack app:

  • Frontend: React + Vite UI for session controls, transcript, and live status
  • Backend: Node.js + TypeScript WebSocket bridge for streaming events
  • AI layer: Google GenAI SDK with the Gemini Developer API (Live API) for bidirectional multimodal interaction
  • Cloud: Dockerized deployment to Google Cloud Run, secret handling via Secret Manager, and deployment automation with Cloud Build
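A WebSocket bridge like the one above typically relays a small set of typed events between the browser and the AI layer. The envelope below is a hedged sketch of what that protocol might look like; the event kinds and field names are assumptions for illustration, not the real wire format. A tagged union plus a `switch` keeps the routing exhaustive at compile time.

```typescript
// Assumed event envelope for the browser <-> backend bridge.
type BridgeEvent =
  | { kind: "audio_chunk"; base64Pcm: string }       // mic audio chunk
  | { kind: "video_frame"; base64Jpeg: string }      // downsampled camera frame
  | { kind: "transcript_delta"; role: "user" | "agent"; text: string }
  | { kind: "turn_complete" };

interface Handlers {
  onMedia: (e: BridgeEvent) => void; // forwarded to the model stream
  onText: (e: BridgeEvent) => void;  // forwarded to the transcript UI
}

// Route one incoming event and report which path it took.
function routeEvent(ev: BridgeEvent, handlers: Handlers): "media" | "text" {
  switch (ev.kind) {
    case "audio_chunk":
    case "video_frame":
      handlers.onMedia(ev);
      return "media";
    case "transcript_delta":
    case "turn_complete":
      handlers.onText(ev);
      return "text";
  }
}

// Usage:
const noop = () => {};
console.log(routeEvent({ kind: "turn_complete" }, { onMedia: noop, onText: noop })); // text
```

Separating media events from text events this way lets the bridge batch audio/video frames aggressively while delivering transcript updates immediately.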

Challenges we ran into

  • Choosing a compatible Gemini Live model and API configuration
  • Fixing audio-modality/session issues in live streams
  • Preventing duplicate transcript entries during streaming deltas
  • Preserving user turns while still supporting interruption behavior
  • Resolving deployment blockers (quota limits, billing linkage, reserved env vars, IAM permissions for secrets)
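The duplicate-transcript problem above has a common shape: each streaming delta appends a new row instead of extending the row for its turn. One fix, sketched here under the assumption that each turn carries a stable id (the `Entry` type and `applyDelta` helper are illustrative, not the project's actual code), is to key entries by turn id and append delta text to the existing entry.

```typescript
interface Entry {
  turnId: string; // assumed stable id per conversational turn
  text: string;
}

// Apply one streaming delta: extend the matching entry if the turn
// already exists, otherwise create a single new row for it.
function applyDelta(entries: Entry[], turnId: string, delta: string): Entry[] {
  const existing = entries.find((e) => e.turnId === turnId);
  if (existing) {
    existing.text += delta; // same turn: grow in place, no duplicate row
    return entries;
  }
  return [...entries, { turnId, text: delta }];
}

// Usage: three deltas, but only two transcript rows.
let log: Entry[] = [];
log = applyDelta(log, "t1", "Hel");
log = applyDelta(log, "t1", "lo");
log = applyDelta(log, "t2", "Next turn");
console.log(log.length, log[0].text); // 2 Hello
```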

Accomplishments that we're proud of

  • Built a working real-time multimodal live agent (audio in, vision in, voice out)
  • Implemented interruption-aware interaction that feels conversational
  • Delivered a clean transcript experience with deduplicated streaming messages
  • Designed a polished dark UI without breaking core functionality
  • Deployed successfully on Google Cloud with reproducible setup

What we learned

  • Real-time agent UX depends heavily on state/event management, not just prompting
  • Multimodal systems require careful synchronization between audio, vision, and UI streams
  • Cloud reliability comes from correct IAM, secrets, and deployment automation
  • Fast iteration with user feedback dramatically improves product quality

What's next for Sightline

  • Add mode presets (translator, visual tutor, support agent)
  • Improve memory across sessions for continuity
  • Expand multilingual voice interaction
  • Add richer visual guidance and actionable on-screen suggestions
  • Strengthen CI/CD and infrastructure automation for production-scale reliability