Inspiration

Many hands-on tasks become difficult when help isn’t available at the moment it’s needed. We wanted to explore how live multimodal AI could act as a true “second pair of eyes” — not just answering questions, but actively guiding users in real time while they work.

What it does

Second Pair of Eyes is a real-time, hands-free AI agent that watches what the user sees and listens to what they say, then provides immediate spoken guidance. The agent can be interrupted, redirected, and asked to clarify, making the interaction feel natural and human-like.

How we built it

The agent is built using Gemini’s live multimodal capabilities and is hosted on Google Cloud. A web interface captures live inputs, which are processed by a cloud-hosted backend using the Google GenAI SDK to generate real-time responses.
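As a concrete illustration of one step in such a pipeline: the browser's Web Audio API delivers Float32 microphone samples, while Gemini's live audio input expects 16-bit little-endian PCM at 16 kHz, so the backend (or frontend) has to convert between the two. A minimal sketch of that conversion — the helper names and naive resampling are ours for illustration, not the project's actual code:

```python
import struct

# Browsers typically deliver Float32 samples in [-1.0, 1.0] via the
# Web Audio API; Gemini's live audio input expects 16-bit little-endian
# PCM at 16 kHz. These helpers (names are illustrative) bridge the two.

def float32_to_pcm16(samples):
    """Clamp Float32 samples to [-1, 1] and pack them as 16-bit LE PCM."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))
        ints.append(int(s * 32767))
    return struct.pack("<" + "h" * len(ints), *ints)

def downsample(samples, src_rate, dst_rate=16000):
    """Naive decimation from the browser's sample rate down to 16 kHz.
    Good enough for a sketch; production code would low-pass filter first."""
    step = src_rate / dst_rate
    out, i = [], 0.0
    while int(i) < len(samples):
        out.append(samples[int(i)])
        i += step
    return out

# Example: one 48 kHz browser chunk becomes a 16 kHz PCM payload
# ready to stream to the Live API session.
chunk = [0.0, 0.5, -0.5] * 160          # 480 samples, ~10 ms at 48 kHz
pcm = float32_to_pcm16(downsample(chunk, 48000))
```

Each resulting `pcm` payload can then be streamed over the live session as realtime audio input.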

Challenges

Designing an agent that behaves proactively — rather than like a simple chatbot — was a key challenge. Ensuring low-latency interaction and handling interruptions reliably required careful architectural choices.
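To make the interruption challenge concrete: when the user speaks over the agent, any queued agent audio has to be discarded immediately, or the user keeps hearing stale guidance. A stripped-down sketch of that barge-in logic — the class name and dict-shaped events are illustrative stand-ins, not the project's actual code:

```python
from collections import deque

class PlaybackQueue:
    """Minimal barge-in handler: buffers agent audio chunks for playback
    and discards everything pending when the user interrupts."""

    def __init__(self):
        self._chunks = deque()
        self.interrupted_count = 0

    def on_server_event(self, event):
        # We model server events as plain dicts for illustration; the key
        # behavior is flushing the queue the moment a barge-in is signaled.
        if event.get("interrupted"):
            self._chunks.clear()          # stop stale guidance immediately
            self.interrupted_count += 1
        elif "audio" in event:
            self._chunks.append(event["audio"])

    def next_chunk(self):
        """Pop the next chunk to play, or None if nothing is pending."""
        return self._chunks.popleft() if self._chunks else None

q = PlaybackQueue()
q.on_server_event({"audio": b"chunk-1"})
q.on_server_event({"audio": b"chunk-2"})
q.on_server_event({"interrupted": True})   # user spoke over the agent
q.on_server_event({"audio": b"chunk-3"})   # fresh response after barge-in
```

After the barge-in, only the post-interruption chunk remains in the queue, which is what makes the interaction feel responsive rather than chatbot-like.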

Learnings

This project showed us that the future of AI interaction lies in shared context and real-time understanding. When an AI can see, listen, and respond instantly, it becomes a collaborator rather than a tool.

Built With

  • fastapi
  • google-cloud-run
  • google-gemini-live-api
  • google-genai-sdk
  • html
  • javascript
  • python
  • vertex-ai
  • web-audio-api