Inspiration
Many hands-on tasks become difficult when help isn’t available at the moment it’s needed. We wanted to explore how live multimodal AI could act as a true “second pair of eyes” — not just answering questions, but actively guiding users in real time while they work.
What it does
Second Pair of Eyes is a real-time, hands-free AI agent that watches what the user sees and listens to what they say, then provides immediate spoken guidance. The agent can be interrupted, redirected, and asked to clarify, making the interaction feel natural and human-like.
How we built it
The agent is built on the Gemini Live API and hosted on Google Cloud. A web interface captures live audio and video in the browser (via the Web Audio API) and streams them to a FastAPI backend running on Cloud Run, which uses the Google GenAI SDK to exchange streaming input and spoken responses with Gemini in real time.
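As a rough illustration of what such a backend configures, here is a minimal sketch of a Live API session config. The field values, model name, and prompt wording are assumptions for illustration, not the project's actual settings; the commented connection call shows the general shape of opening a session with the Google GenAI SDK.

```python
# Hedged sketch: a session config of the kind the Google GenAI SDK's
# Live API accepts. All values here are illustrative assumptions.
LIVE_CONFIG = {
    # Request spoken output so the interaction stays hands-free.
    "response_modalities": ["AUDIO"],
    # Steer the model toward proactive, step-by-step coaching
    # rather than passive question answering.
    "system_instruction": (
        "You are a second pair of eyes. Watch the video feed, listen to "
        "the user, and proactively guide them through the task step by step."
    ),
}

# Opening the session (requires credentials; shown for shape only):
# from google import genai
# client = genai.Client()
# async with client.aio.live.connect(
#     model="gemini-2.0-flash-live-001",  # assumed model name
#     config=LIVE_CONFIG,
# ) as session:
#     ...  # stream audio/video frames in, play audio responses out
```

The system instruction is where "proactive guidance" is encoded: without it, a live session defaults to answering only when asked.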
Challenges
Designing an agent that behaves proactively, rather than like a turn-based chatbot, was a key challenge: the agent has to decide when to speak, not just what to say. Keeping latency low and handling interruptions reliably required careful architectural choices.
Learnings
This project showed us that the future of AI interaction lies in shared context and real-time understanding. When an AI can see, listen, and respond instantly, it becomes a collaborator rather than a tool.
Built With
- fastapi
- google-cloud-run
- google-gemini-live-api
- google-genai-sdk
- html
- javascript
- python
- vertex-ai
- web-audio-api