Inspiration 💡

The core inspiration for FixMate came from a very human struggle: the frustration of trying to follow a YouTube tutorial or a complex repair manual while both hands are busy holding a motherboard, a screwdriver, or a soldering iron.

We realized that when you are in the middle of a physical task, your attention is divided. You can't type, you can't scroll, and you often lack the exact terminology to ask for help (e.g., "What is this tiny blue cylinder next to the big black square?"). We wanted to build an AI that breaks the traditional "chatbot text window" paradigm. We envisioned an AI that doesn't just read what you type, but sees what you see and talks to you like a master technician standing right next to you. The introduction of the Gemini 2.5 Live API and BidiStream made this vision—a real-time, interruptible, multimodal hardware companion—finally possible.

What it does 🛠️

FixMate is a real-time, hands-free visual AI assistant designed for hardware repair, PC building, and complex physical installations.

Imagine you are looking at a complex circuit board. You simply point your camera at it and ask, "FixMate, where do I plug in the front panel audio connector?"

  1. It Sees: FixMate streams video frames continuously to the Gemini 2.5 Flash model, which analyzes the components in your view in near real time.
  2. It Speaks: It responds back with natural, conversational audio, guiding you step-by-step.
  3. It Listens: If FixMate says, "Connect it to the bottom left pins," and you say, "Wait, my motherboard layout is different, I don't see pins there!"—FixMate gracefully interrupts itself, re-evaluates the current video frame, and corrects its guidance in real time.

It bridges the gap between digital AI knowledge and physical world execution.

How we built it 🏗️

FixMate was built with a decoupled, cloud-native architecture optimized for low-latency streaming, laying the groundwork for future integration with AR wearables (like smart glasses).

  • The Brain (Backend): We built a robust Node.js backend deployed on Google Cloud Run. It acts as the critical orchestration layer: it receives binary video chunks via WebSockets from the client, establishes a continuous BidiStream connection with the Gemini 2.5 Flash Live API, and manages the bidirectional flow of video frames going up and natural audio coming down (a minimal relay sketch follows this list).
  • The Eyes & Ears (Frontend): We developed a Progressive Web App (PWA) using React, TypeScript, and Vite. It uses WebRTC and modern browser APIs to capture raw microphone audio and camera frames efficiently (see the frame-capture sketch below). We designed a sleek, cyberpunk-inspired HUD (Heads-Up Display) for a futuristic yet intuitive user experience.
  • Infrastructure: To ensure reproducibility, we used Infrastructure-as-Code (Terraform) and Google Cloud Build, allowing the entire backend to be spun up and scaled with a single command.
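
To make the orchestration concrete, here is a minimal sketch of the relay loop. It assumes the `ws` package and the `@google/genai` SDK's Live surface (`ai.live.connect`, `session.sendRealtimeInput`) as we understand them today; the model id, the JSON envelope between browser and backend, and the error handling are illustrative simplifications, not the production code.

```typescript
import { WebSocketServer, WebSocket } from 'ws';
import { GoogleGenAI, Modality } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const wss = new WebSocketServer({ port: Number(process.env.PORT ?? 8080) });

wss.on('connection', async (client: WebSocket) => {
  // One persistent Live session per connected client keeps the context alive.
  const session = await ai.live.connect({
    model: 'gemini-live-2.5-flash-preview', // assumed model id; check current docs
    config: { responseModalities: [Modality.AUDIO] },
    callbacks: {
      onmessage: (msg) => {
        // Relay Gemini's synthesized audio chunks straight back to the browser.
        for (const part of msg.serverContent?.modelTurn?.parts ?? []) {
          if (part.inlineData?.data) {
            client.send(JSON.stringify({ type: 'audio', data: part.inlineData.data }));
          }
        }
        // Forward the interruption signal so the client can flush its buffers.
        if (msg.serverContent?.interrupted) {
          client.send(JSON.stringify({ type: 'interrupted' }));
        }
      },
      onerror: (e) => console.error('Live session error:', e),
      onclose: () => client.close(),
    },
  });

  // The browser sends JSON envelopes: { type: 'frame' | 'audio', data: <base64> }.
  client.on('message', (raw) => {
    const { type, data } = JSON.parse(raw.toString());
    if (type === 'frame') {
      session.sendRealtimeInput({ video: { data, mimeType: 'image/jpeg' } });
    } else if (type === 'audio') {
      session.sendRealtimeInput({ audio: { data, mimeType: 'audio/pcm;rate=16000' } });
    }
  });

  client.on('close', () => session.close());
});
```

Keeping one Live session per WebSocket pins the conversational context to the client for the lifetime of the connection, which is also how we sidestep part of the state-management challenge described below.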
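
On the client, the frame pipeline reduces to drawing the camera's video track onto a small canvas on a timer and shipping compressed JPEGs up the socket. A minimal sketch follows; the ~1 FPS cadence, 640 px target width, and 0.7 JPEG quality are illustrative choices, and `startFrameLoop` is our hypothetical helper name.

```typescript
// Capture the rear camera and downsample frames before sending them upstream.
// Call from a user gesture so the browser allows camera access and playback.
async function startFrameLoop(socket: WebSocket, fps = 1): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { facingMode: 'environment' },
  });
  const video = document.createElement('video');
  video.muted = true;
  video.srcObject = stream;
  await video.play();

  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d')!;

  setInterval(() => {
    // Downscale to ~640 px wide: plenty of detail for component recognition,
    // at a fraction of the bandwidth of raw frames.
    const scale = 640 / video.videoWidth;
    canvas.width = 640;
    canvas.height = Math.round(video.videoHeight * scale);
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);

    // JPEG at moderate quality; strip the data-URL prefix to get raw base64.
    const dataUrl = canvas.toDataURL('image/jpeg', 0.7);
    const base64 = dataUrl.split(',')[1];
    socket.send(JSON.stringify({ type: 'frame', data: base64 }));
  }, 1000 / fps);
}
```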

Challenges we ran into 🧗‍♂️

Building a truly real-time, bidirectional audio-visual streaming pipeline is inherently complex.

  1. Latency vs. Quality Trade-offs: Sending high-resolution video frames at 30 FPS to an LLM would choke any network. We had to optimize the frontend to downsample and compress video frames, sending periodic keyframes rather than the raw stream (the frame-capture sketch in the previous section shows this throttling in miniature), so the AI gets enough visual data without introducing lag into the voice conversation.
  2. Graceful Interruptions: Handling interruptions naturally was a major hurdle. When a user speaks over the AI, the system must instantly flush the output audio buffers and signal the Gemini model to abandon its current response and pivot to the new context. Synchronizing the WebSocket audio streams with the Live API's interruption events required precise event handling (see the sketch after this list).
  3. State Management across Streams: Keeping the conversational context alive while simultaneously streaming fresh visual data meant carefully maintaining session state between stateless Google Cloud Run instances and the persistent Gemini Live session.
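
To make challenge 2 concrete, here is the client-side shape of our fix as a simplified sketch: every scheduled output chunk is tracked, and when the backend relays Gemini's interruption signal we stop and discard everything that has not played yet. The 24 kHz, 16-bit PCM format matches the Live API's documented audio output; the class name and JSON envelope are ours.

```typescript
// Plays 24 kHz PCM chunks back-to-back and can flush instantly on interrupt.
class AudioPlayer {
  private ctx = new AudioContext({ sampleRate: 24000 });
  private playhead = 0; // when the next chunk should start, in context time
  private pending: AudioBufferSourceNode[] = [];

  enqueue(base64Pcm: string): void {
    // Decode base64 -> Int16 PCM -> Float32 samples in [-1, 1].
    const bytes = Uint8Array.from(atob(base64Pcm), (c) => c.charCodeAt(0));
    const int16 = new Int16Array(bytes.buffer);
    const buffer = this.ctx.createBuffer(1, int16.length, 24000);
    const channel = buffer.getChannelData(0);
    for (let i = 0; i < int16.length; i++) channel[i] = int16[i] / 32768;

    // Schedule gaplessly after whatever is already queued.
    const src = this.ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(this.ctx.destination);
    this.playhead = Math.max(this.playhead, this.ctx.currentTime);
    src.start(this.playhead);
    this.playhead += buffer.duration;
    this.pending.push(src);
    src.onended = () => {
      this.pending = this.pending.filter((s) => s !== src);
    };
  }

  // Called when the server forwards Gemini's `interrupted` signal.
  flush(): void {
    for (const src of this.pending) src.stop();
    this.pending = [];
    this.playhead = this.ctx.currentTime;
  }
}
```

Wiring it up is one switch statement: the socket's `onmessage` handler calls `enqueue` for `'audio'` envelopes and `flush` for `'interrupted'`.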

Accomplishments that we're proud of 🏆

  • True Real-Time Multimodal Interaction: We successfully integrated the brand-new Gemini Live API to create an experience that feels magical—the AI actually looks at what you are doing and converses with you naturally.
  • Seamless Deployment Architecture: We built a robust Terraform and Cloud Build pipeline so that any developer can deploy this complex WebSocket infrastructure to Google Cloud in minutes.
  • The UX/UI: We designed a gorgeous, futuristic HUD that makes the user feel like they are stepping into the future of augmented-reality repair.

What we learned 🧠

  • The Power of BidiStream: We discovered the incredible potential of the Gemini Live API's BidiStream. Moving away from traditional request/response REST APIs to continuous WebSocket streams unlocks entirely new classes of applications.
  • Audio/Video Handling in JS: We significantly leveled up our skills in handling raw MediaStreamTracks, working with AudioContexts, and encoding/decoding binary data in the browser (a minimal mic-capture sketch follows this list).
  • AI as a Companion, Not a Tool: We learned that when you give an AI eyes and a voice, the interaction model shifts from "querying a database" to "collaborating with a partner."
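
As an example of the binary plumbing the second bullet refers to, mic capture boils down to pulling Float32 samples out of an AudioContext, clamping them to 16-bit PCM (the Live API expects 16 kHz, 16-bit mono input), and base64-encoding them onto the socket. This sketch uses the deprecated-but-compact ScriptProcessorNode for brevity; an AudioWorklet is the modern replacement.

```typescript
// Stream 16 kHz, 16-bit PCM microphone audio over an open WebSocket.
async function startMicCapture(socket: WebSocket): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 16000 });
  const source = ctx.createMediaStreamSource(stream);

  // ScriptProcessorNode is deprecated; production code should use an AudioWorklet.
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    const float32 = event.inputBuffer.getChannelData(0);
    const int16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      // Clamp and scale [-1, 1] floats to signed 16-bit integers.
      const s = Math.max(-1, Math.min(1, float32[i]));
      int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    const base64 = btoa(String.fromCharCode(...new Uint8Array(int16.buffer)));
    socket.send(JSON.stringify({ type: 'audio', data: base64 }));
  };
  source.connect(processor);
  processor.connect(ctx.destination); // required for onaudioprocess to fire
}
```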

What's next for FixMate 🚀

We see FixMate as the foundational software layer for the imminent wave of AR hardware.

  • AR Glasses Integration: The immediate next step is porting the frontend client to run natively on AR glasses (like Meta Ray-Bans or future Google devices), making the experience truly hands-free without needing a smartphone to act as the camera.
  • Spatial Awareness: Improving the visual processing to understand 3D depth and provide specific localized highlights (e.g., projecting an AR arrow onto the exact screw that needs to be removed).
  • Enterprise Integration: Expanding FixMate's knowledge base by training it on proprietary enterprise schematics (aviation, automotive repair) to assist professional technicians in the field.

Built With

gemini, google-cloud-run, node.js, react, terraform, typescript, vite, websockets
