Inspiration

Inspired by the Gemini Live Agent Challenge, we wanted to solve a real-world problem: when you're trying to repair an engine or assemble a complex piece of furniture, your hands are full, and you can't look at a screen. Traditional AI assistants operate on a rigid "ask and wait" model. However, a true coach needs to see what you're doing and, more importantly, proactively interrupt you if you're about to make a critical mistake before you ruin the project.

What it does

AI Help is an interactive trainer with full-duplex communication. You point your device's camera at your workbench and speak naturally to the Main Agent to ask for instructions. While you converse, an asynchronous background agent, the Sentinel, silently processes the incoming video frames. If you pick up the wrong screw or connect a cable where it doesn't belong, the Sentinel detects the visual hazard and immediately interrupts the conversation with a voice alert (e.g., "Wait, don't use the red cable there, use the blue one").

How we built it

We built AI Help using the Google Agent Development Kit (ADK) and the Gemini Live API.

Main Agent: Configured with StreamingMode.BIDI to support bidirectional communication and interruption detection.

Multimodal Streaming: The React frontend uses WebSockets to capture and stream video frames at 2 FPS and PCM audio at 16 kHz simultaneously.

The Sentinel (Streaming Tool): Instead of a traditional request-response tool, we wrote the monitor as an AsyncGenerator function (for example, monitor_for_hazard). This tool continuously pulls frames from the request queue (LiveRequestQueue) and uses vision models to detect errors, issuing an alert only when a critical state change is detected.

Infrastructure: We deployed the backend on Google Cloud Run.
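The Sentinel's alert-on-change behavior can be illustrated with a minimal sketch in plain asyncio. This is not the project's code and does not use the ADK: a standard asyncio.Queue stands in for LiveRequestQueue, toy_classifier is a hypothetical placeholder for the vision-model call, and the None end-of-stream marker is an assumption made for the demo.

```python
import asyncio
from typing import AsyncGenerator, Callable, List, Optional

# Hypothetical stand-in for a vision-model call: maps a frame to a hazard
# description, or None when the frame looks safe.
Classifier = Callable[[bytes], Optional[str]]


async def monitor_for_hazard(
    frames: "asyncio.Queue[Optional[bytes]]",
    classify: Classifier,
) -> AsyncGenerator[str, None]:
    """Yield an alert only when the detected hazard state changes."""
    last_state: Optional[str] = None
    while True:
        frame = await frames.get()
        if frame is None:  # assumed end-of-stream marker for this demo
            return
        state = classify(frame)
        # Stay silent while the same hazard persists; speak on a new one.
        if state is not None and state != last_state:
            yield f"Wait: {state}"
        last_state = state


def toy_classifier(frame: bytes) -> Optional[str]:
    # Toy rule for the demo only; a real system would call a vision model.
    return "wrong cable" if frame.startswith(b"red") else None


async def demo() -> List[str]:
    frames: asyncio.Queue = asyncio.Queue()
    for frame in (b"blue-1", b"red-1", b"red-2", b"blue-2", b"red-3", None):
        frames.put_nowait(frame)
    return [alert async for alert in monitor_for_hazard(frames, toy_classifier)]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the two consecutive "red" frames produce a single alert, and a second alert fires only after the scene has returned to safe and the hazard reappears. Comparing against the previous state is what keeps a persistent hazard from triggering an alert on every frame.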

Challenges we ran into

The biggest challenge was an impedance mismatch: the client sends video and audio at varying rates, while the model requires a single sequential stream, and none of this can block the voice conversation. Resolving it required switching from standard HTTP request-response connections (effectively half-duplex) to WebSockets (full-duplex). We also had to orchestrate concurrent asynchronous upstream and downstream tasks using the ADK's LiveRequestQueue, which acts as a multimodal multiplexer that serializes the audio, video, and tool signals into a single, coherent timeline.
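The multiplexing idea can be sketched without the ADK. In the example below, three producers emit at different rates into one asyncio.Queue, and a single consumer drains it in arrival order. The Message type, the intervals, and the None end-of-stream marker are all illustrative assumptions; the real LiveRequestQueue has its own API, which is not reproduced here.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    kind: str  # "audio", "video", or "tool"
    seq: int   # position within its own source stream


async def producer(kind: str, count: int, interval: float,
                   queue: "asyncio.Queue[Optional[Message]]") -> None:
    """Emit `count` messages of one kind at its own pace."""
    for seq in range(count):
        await asyncio.sleep(interval)
        await queue.put(Message(kind, seq))


async def consumer(queue: "asyncio.Queue[Optional[Message]]",
                   timeline: List[Message]) -> None:
    """Drain the queue strictly one message at a time."""
    while True:
        message = await queue.get()
        if message is None:  # assumed end-of-stream marker
            return
        timeline.append(message)


async def run_session() -> List[Message]:
    queue: "asyncio.Queue[Optional[Message]]" = asyncio.Queue()
    timeline: List[Message] = []
    drain = asyncio.create_task(consumer(queue, timeline))
    # Producers run concurrently and never wait on each other.
    await asyncio.gather(
        producer("audio", 5, 0.010, queue),
        producer("video", 2, 0.025, queue),
        producer("tool", 1, 0.030, queue),
    )
    await queue.put(None)
    await drain
    return timeline


if __name__ == "__main__":
    for message in asyncio.run(run_session()):
        print(message.kind, message.seq)
```

The exact interleaving depends on timing, but each source's messages keep their own order and the consumer sees exactly one timeline, which is the property the model side needs.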

Accomplishments that we're proud of

AI Help doesn't just wait for you to finish speaking. It processes the video stream in parallel and can interrupt its own downstream voice response when it detects an imminent danger at the workbench. The Main Agent stays in control of the conversation while continuous visual monitoring is delegated to the background streaming tool.
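As a rough illustration of interrupting an in-progress voice response, the sketch below races a speaking task against an alert signal and cancels the speech if the alert wins. It uses plain asyncio only; in the actual system, interruption is handled through the Gemini Live API and ADK streaming, so the names speak, respond_with_interrupt, and the "[alert]" marker are hypothetical.

```python
import asyncio
import contextlib
from typing import List


async def speak(chunks: List[str], spoken: List[str]) -> None:
    """Play response chunks one at a time; cancellable between chunks."""
    for chunk in chunks:
        spoken.append(chunk)
        await asyncio.sleep(0.01)  # stands in for audio playback time


async def respond_with_interrupt(chunks: List[str],
                                 alert: asyncio.Event) -> List[str]:
    """Speak until finished, or stop early if the alert fires first."""
    spoken: List[str] = []
    speech = asyncio.create_task(speak(chunks, spoken))
    alarm = asyncio.create_task(alert.wait())
    done, _ = await asyncio.wait(
        {speech, alarm}, return_when=asyncio.FIRST_COMPLETED
    )
    if alarm in done:
        if not speech.done():
            speech.cancel()  # cut off the rest of the response
            with contextlib.suppress(asyncio.CancelledError):
                await speech
        spoken.append("[alert]")
    else:
        alarm.cancel()  # response finished normally; stop waiting
        with contextlib.suppress(asyncio.CancelledError):
            await alarm
    return spoken


async def demo(interrupt: bool) -> List[str]:
    alert = asyncio.Event()
    if interrupt:
        alert.set()  # hazard already detected when the response starts
    return await respond_with_interrupt(["Take", "the", "blue", "cable"], alert)


if __name__ == "__main__":
    print(asyncio.run(demo(interrupt=False)))
    print(asyncio.run(demo(interrupt=True)))
```

Without an alert the full response is spoken. With an alert, the response is cut short and the alert marker is appended in its place.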

What we learned

We learned the full lifecycle of a bidirectional streaming application in the ADK, from session initialization to event loop management. We also discovered the fundamental difference between request-response tools and Streaming Tools: the latter can act as persistent background monitors without breaking conversational immersion.

What's next for AI Help

Memory Bank Integration: We will use Vertex AI's Memory Bank to enable AI Help to remember the user's skill level, the tools they have in their inventory, and the mistakes they made in previous sessions.

Graph RAG in Spanner: We plan to incorporate graph databases to cross-reference complex repair manuals, allowing AI Help to infer advanced diagnostic solutions in real time.

Remote Multi-Agent (A2A) Systems: We will implement the Agent-to-Agent protocol so that AI Help can consult remote specialist agents (e.g., an "Architect Agent" or an "Electrician Agent") if the assembly becomes too complex.
