Echo AI Companion - Project Story

Inspiration

The inspiration for Echo AI Companion comes from a deeply personal yet universal struggle: trying to help my parents and grandparents navigate modern technology. I've been there many times—spending 30 minutes on a phone call just trying to explain how to find a specific button on a website, only for it to end in mutual frustration. As software becomes more complex, the digital divide widens for non-technical users. I wanted to build something that feels less like a sterile software tool and more like a patient, helpful grandchild sitting right next to them. The metaphor of "Cassettes"—a familiar, nostalgic medium for older generations—inspired my approach. Instead of learning complex AI prompt engineering, families can just create or select an "AI Cassette" for a specific task and press play.

What it does

Echo AI Companion is a multimodal, real-time voice and vision agent designed to guide non-technical users through digital tasks step-by-step.

It is divided into two distinct parts:

  1. The Studio (Admin): A simple interface where technical family members can create "AI Cassettes"—custom sets of instructions (like "Help Mom with WhatsApp Web" or "How to print a PDF").
  2. The Receiver (Client): A highly simplified interface for the non-technical user. They simply click "Start Guide" on a cassette.

Once started, Echo establishes a secure WebSocket connection with the Gemini Live 2.5 engine via the Vertex AI BidiService. It captures the user's screen and microphone, acting as a real-time, emotionally intelligent guide. Echo "sees" the screen, speaks slowly in a localized language (like Hinglish), pauses for the user to complete actions, verifies visual success before moving to the next step, and never judges them for making a mistake.

How we built it

I architected Echo with a strict separation between a FastAPI backend (Python) and a Next.js frontend (React/TypeScript).

  • Backend: I utilized FastAPI to serve as the orchestrator. It manages the Cassette metadata and handles the secure authentication handshake with Google Cloud. I integrated the google-auth library to dynamically provision Vertex AI access tokens for the frontend client.
  • Frontend: I built a high-contrast, accessible UI in Next.js. I implemented custom WebSocket logic to capture the browser's MediaDevices streams (microphone via getUserMedia, screen share via getDisplayMedia) and stream the data in real time as base64-encoded PCM audio and JPEG image chunks directly to the gemini-live-2.5-flash-native-audio model.
  • Prompt Engineering: I crafted a complex, chain-of-thought System Prompt that forces the LLM to behave with specific cognitive guardrails: it must mandate visual checks before proceeding, never ask users to overwrite the current active browser tab, and maintain a slow, empathetic pacing.
  • Deployment: The entire stack is containerized with Docker and deployed serverlessly on Google Cloud Run, with images stored in Google Artifact Registry, so the service scales on demand without any servers to manage.
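
The backend's token handshake can be illustrated with a small caching helper. The `fetch_token` callable and the five-minute refresh margin are assumptions for illustration; in production the fetcher would wrap a google-auth credential refresh, while tests can inject a stub:

```python
import time
from typing import Callable

class AccessTokenProvider:
    """Caches a short-lived cloud access token and refreshes it early.

    `fetch_token` is injected (e.g. a google-auth credentials-refresh
    wrapper in production, a stub in tests) and must return a
    (token, expiry_unix_seconds) pair.
    """
    def __init__(self, fetch_token: Callable[[], tuple[str, float]],
                 refresh_margin: float = 300.0):
        self._fetch = fetch_token
        self._margin = refresh_margin
        self._token: str | None = None
        self._expiry = 0.0

    def token(self) -> str:
        # Refresh when the token is missing or within `refresh_margin`
        # of expiry, so the frontend never receives a token that is
        # about to lapse mid-session.
        if self._token is None or time.time() >= self._expiry - self._margin:
            self._token, self._expiry = self._fetch()
        return self._token
```

A FastAPI route can then simply return `provider.token()` to the client, keeping credentials off the frontend entirely.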
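
The cognitive guardrails can also be sketched as a small prompt-assembly helper. The rule wording below paraphrases the behaviors described above and the function is purely illustrative, not the production prompt:

```python
# Hard behavioral rules, paraphrased from the guardrails described above.
GUARDRAILS = [
    "Speak one short, atomic sentence per step, then stop and wait.",
    "Before moving on, visually verify on screen that the previous step succeeded.",
    "Never ask the user to navigate away from or overwrite the current active browser tab.",
    "Keep a slow, warm, empathetic pace; never blame the user for a mistake.",
]

def build_system_prompt(cassette_title: str, steps: list[str],
                        language: str = "Hinglish") -> str:
    """Assemble a guarded system prompt for one cassette (illustrative)."""
    lines = [
        f"You are Echo, a patient guide helping a non-technical user with: {cassette_title}.",
        f"Respond only in {language}.",
        "Hard rules you must never break:",
    ]
    lines += [f"- {rule}" for rule in GUARDRAILS]
    lines.append("Guide the user through these steps, strictly in order:")
    lines += [f"{i}. {step}" for i, step in enumerate(steps, 1)]
    return "\n".join(lines)
```

Each cassette's steps are injected at session start, so the model's behavior is pinned down before the first word is spoken.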

Challenges we ran into

Integrating a real-time, bidirectional, multimodal WebSocket connection in a browser environment was incredibly challenging.

  • Audio processing: Managing raw PCM audio streams in JavaScript, handling sample rate conversions (browser mics default to 44.1kHz or 48kHz, but the API required 16kHz input and returned 24kHz output), and preventing audio feedback loops required precise manipulation of the Web Audio API.
  • Vision throttling: I initially tried sending video frames too quickly, which caused API quota errors and latency. I had to implement a custom canvas-based drawing loop to sample the screen at a stable 1 frame per second.
  • Deployment & Security: Deploying a static Next.js frontend that needed to proxy WebSocket connections to a separate backend via Cloud Run led to tricky caching bugs and IAM permission roadblocks. I had to restructure the Docker builds to use build-time arguments for the API URL and explicitly disable Next.js's aggressive fetch caching to ensure the UI stayed perfectly synced with the backend state.
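
The sample-rate mismatch above can be illustrated with a naive 48 kHz to 16 kHz downsampler. Echo's real pipeline runs in the browser's Web Audio API, and production code would apply an anti-aliasing low-pass filter before decimating; this Python sketch shows only the rate arithmetic:

```python
import array

def downsample_48k_to_16k(pcm48: bytes) -> bytes:
    """Decimate 16-bit mono PCM from 48 kHz to 16 kHz by averaging each
    group of 3 samples (48000 / 16000 == 3).

    A real pipeline would low-pass filter first to avoid aliasing;
    this sketch shows only the decimation step.
    """
    samples = array.array("h")
    samples.frombytes(pcm48)
    out = array.array("h")
    for i in range(0, len(samples) - 2, 3):
        out.append((samples[i] + samples[i + 1] + samples[i + 2]) // 3)
    return out.tobytes()
```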
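
The 1 fps cap can likewise be sketched as a timestamp-based throttle. Echo's real version lives inside the browser's canvas draw loop, so the class and method names here are illustrative; the clock is injectable purely so the logic is testable:

```python
import time

class FrameThrottle:
    """Admit at most one frame per `interval` seconds (1.0 == 1 fps).

    `now` is injectable (defaults to a monotonic clock) so the logic
    can be exercised without waiting on real time.
    """
    def __init__(self, interval: float = 1.0, now=time.monotonic):
        self._interval = interval
        self._now = now
        self._last = float("-inf")

    def should_send(self) -> bool:
        t = self._now()
        if t - self._last >= self._interval:
            self._last = t
            return True
        return False
```

The draw loop calls `should_send()` before capturing a frame to the canvas, so excess frames are dropped before they ever hit the API quota.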

Accomplishments that we're proud of

I am incredibly proud of achieving true "Zero-Click Navigation" for the end-user. Once they click "Start Guide," they never have to touch the Echo UI again. The AI is entirely voice and vision-driven, meaning the user can focus completely on the task at hand (like navigating WhatsApp Web in another tab) while Echo watches and talks them through it.

What we learned

I learned that prompting an LLM for real-time voice interaction is completely different from text generation. I had to extensively tune the system prompt to stop the AI from generating "blocks" of text and instead force it to generate single, atomic sentences, pause, and explicitly wait for visual confirmation. I also learned a great deal about Google Cloud's IAM architecture and the intricacies of the Gemini Live 2.5 Multimodal BidiService.

What's next for Echo AI Companion

My immediate next step is to finalize the Tauri Desktop client. While the web interface is great, a native desktop app that can sit "Always on Top" and interact with native OS elements will make the experience even more seamless. I also plan to build a community "Cassette Store" where families can share highly optimized guides for common and complex tasks (like "Booking flights" or "Redeeming credit card points") so no one has to write an instruction set from scratch. Ultimately, I want to expand language support beyond Hinglish to truly localize digital independence across the globe.

Built With

  • docker
  • fastapi
  • gemini-adk
  • gemini-live-2.5
  • google-artifact-registry
  • google-cloud-build
  • google-cloud-run
  • mediadevices-api
  • multimodal-bidiservice
  • next.js
  • oauth-2.0
  • python
  • rust
  • tailwind-css
  • tauri
  • typescript
  • uvicorn
  • vertex-ai
  • web-audio-api
  • websockets