Inspiration
Most assistants still rely on text-only chat. We wanted to build something that feels natural and human: a real-time agent you can talk to, interrupt, and show your environment through camera input. The Gemini Live Agent Challenge was the perfect push to build that experience end-to-end.
What it does
Sightline Live is a multimodal live assistant that can:
- listen to microphone input in real time
- process camera frames for visual context
- respond with live voice output and transcripts
- handle interruptions gracefully when users speak over the agent
- keep a clear running transcript of both user and agent turns
This makes it useful for visual tutoring, guided troubleshooting, and real-time contextual support.
How we built it
We built Sightline Live as a full-stack TypeScript app:
- Frontend: React + Vite UI for session controls, transcript, and live status
- Backend: Node.js + TypeScript WebSocket bridge for streaming events
- AI layer: Google GenAI SDK with the Gemini Developer API (Live API) for bidirectional multimodal interaction
- Cloud: Dockerized deployment to Google Cloud Run, secret handling via Secret Manager, and deployment automation with Cloud Build
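The backend's main job is translating Live API server messages into events the frontend can render. A minimal sketch of that routing, with illustrative names and message shapes (not the exact SDK types):

```typescript
// Hypothetical sketch of the bridge's event routing: incoming server
// messages are mapped to typed events the React frontend consumes over
// the WebSocket. Field names here are assumptions for illustration.

type BridgeEvent =
  | { kind: "audio"; data: string } // base64 audio chunk for playback
  | { kind: "transcript"; role: "user" | "agent"; text: string }
  | { kind: "interrupted" } // user spoke over the agent
  | { kind: "turn_complete" };

// Shape loosely modeled on Live API server content; treat as an assumption.
interface ServerMessage {
  audioChunk?: string;
  inputTranscription?: string; // user speech transcript delta
  outputTranscription?: string; // agent speech transcript delta
  interrupted?: boolean;
  turnComplete?: boolean;
}

function routeMessage(msg: ServerMessage): BridgeEvent[] {
  const events: BridgeEvent[] = [];
  // Surface interruptions first so the client can stop playback immediately.
  if (msg.interrupted) events.push({ kind: "interrupted" });
  if (msg.audioChunk) events.push({ kind: "audio", data: msg.audioChunk });
  if (msg.inputTranscription)
    events.push({ kind: "transcript", role: "user", text: msg.inputTranscription });
  if (msg.outputTranscription)
    events.push({ kind: "transcript", role: "agent", text: msg.outputTranscription });
  if (msg.turnComplete) events.push({ kind: "turn_complete" });
  return events;
}
```

Keeping the routing as a pure function means the same logic can be unit-tested without opening a real session.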
Challenges we ran into
- Choosing a compatible Gemini Live model and API configuration
- Fixing audio-modality/session issues in live streams
- Preventing duplicate transcript entries during streaming deltas
- Preserving user turns while still supporting interruption behavior
- Resolving deployment blockers (quota limits, billing linkage, reserved env vars, IAM permissions for secrets)
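The duplicate-transcript problem above came from streaming deltas: appending every delta as a new message produces repeated entries. One way to fix it, sketched with a hypothetical entry shape, is to merge deltas into a single in-progress entry per turn:

```typescript
// Hedged sketch: merge streaming transcription deltas into one in-progress
// transcript entry per speaker turn instead of appending each delta as a
// new message. The entry shape is illustrative, not Sightline's actual model.

interface TranscriptEntry {
  role: "user" | "agent";
  text: string;
  final: boolean; // set when the turn completes
}

function applyDelta(
  entries: TranscriptEntry[],
  role: "user" | "agent",
  delta: string
): TranscriptEntry[] {
  const last = entries[entries.length - 1];
  // Extend the current in-progress entry for this speaker...
  if (last && last.role === role && !last.final) {
    return [...entries.slice(0, -1), { ...last, text: last.text + delta }];
  }
  // ...or start a new entry when the speaker changes or the last turn ended.
  return [...entries, { role, text: delta, final: false }];
}

function finalizeTurn(entries: TranscriptEntry[]): TranscriptEntry[] {
  // Mark the most recent entry as complete so the next delta opens a new one.
  return entries.map((e, i) =>
    i === entries.length - 1 ? { ...e, final: true } : e
  );
}
```

Finalizing on turn completion is also what preserves user turns across interruptions: an interrupted agent turn is closed, not overwritten.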
Accomplishments that we're proud of
- Built a working real-time multimodal live agent (audio in, vision in, voice out)
- Implemented interruption-aware interaction that feels conversational
- Delivered a clean transcript experience with deduplicated streaming message handling
- Designed a polished dark UI without breaking core functionality
- Deployed successfully on Google Cloud with reproducible setup
What we learned
- Real-time agent UX depends heavily on state/event management, not just prompting
- Multimodal systems require careful synchronization between audio, vision, and UI streams
- Cloud reliability comes from correct IAM, secrets, and deployment automation
- Fast iteration with user feedback dramatically improves product quality
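The state/event point above is concrete in interruption handling: when the user barges in, queued audio must be dropped so the agent stops immediately, while the partial agent transcript is kept. A small reducer sketches the idea (state and event names are assumptions, not Sightline's actual implementation):

```typescript
// Illustrative reducer for interruption-aware playback state: on a user
// interrupt, unplayed audio is discarded but the agent's partial transcript
// for the turn is preserved.

type AgentState = {
  speaking: boolean;
  pendingAudio: string[]; // queued audio chunks awaiting playback
  partialAgentText: string; // agent transcript so far this turn
};

type AgentEvent =
  | { type: "agent_audio"; chunk: string; text: string }
  | { type: "user_interrupt" }
  | { type: "turn_complete" };

function reduce(state: AgentState, event: AgentEvent): AgentState {
  switch (event.type) {
    case "agent_audio":
      return {
        speaking: true,
        pendingAudio: [...state.pendingAudio, event.chunk],
        partialAgentText: state.partialAgentText + event.text,
      };
    case "user_interrupt":
      // Stop the agent immediately by dropping unplayed audio, but keep
      // the transcript of what was already said.
      return { ...state, speaking: false, pendingAudio: [] };
    case "turn_complete":
      return { ...state, speaking: false, pendingAudio: [] };
  }
}
```

Centralizing this in one reducer keeps the audio pipeline, transcript, and UI status indicator in sync from a single source of truth.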
What's next for Sightline
- Add mode presets (translator, visual tutor, support agent)
- Improve memory across sessions for continuity
- Expand multilingual voice interaction
- Add richer visual guidance and actionable on-screen suggestions
- Strengthen CI/CD and infrastructure automation for production-scale reliability
Built With
- docker
- google-genai-sdk
- google-cloud
- node.js
- react
- typescript
- webaudio
- websockets