Project Story

💡 Inspiration

Every year, millions of tons of repairable machinery, from vehicles to appliances, are discarded simply because diagnosing and fixing them is too complex for the average person. Professional mechanics are expensive and scarce in many regions. We asked: what if your smartphone could look at a broken engine and not just tell you what's wrong, but guide your hands to fix it?

With Gemini 3's multimodal streaming capabilities, we realized we could build a "Marathon Agent" that doesn't just chat, but sees, reasons, and orchestrates repairs alongside the user.

🤖 What it does

MechGuide combines four capabilities:

  • 👀 Live Diagnostics: Point your camera at any machine. MechGuide analyzes live video to detect vibration patterns, wear, and structural damage that static photos miss.
  • 👓 AR Precision: It projects 3D bounding boxes and tool paths directly onto the real-world video feed, showing you exactly which component to check.
  • 🧠 Long-Horizon Reasoning: Unlike simple chatbots, MechGuide maintains a "Thread of Thought" across hour-long repair sessions, remembering the context of parts removed 40 minutes ago.
  • 🗣️ Voice-Guided Workflow: Hands-free interaction means you never have to drop your tools. It speaks instructions and listens for your confirmation.

⚙️ How we built it

We built MechGuide using a Next.js 16 frontend integrated with the Gemini 3 Multimodal Live API.

  • Vision Pipeline: We feed live webcam frames to Gemini 3, which performs spatial-temporal analysis to identify components and diagnose issues based on visual and audio cues.
  • Agentic Orchestration: We implemented a "Thinking Level" architecture where the agent plans a repair strategy, verifies each step, and self-corrects if the user makes a mistake.
  • AR Interface: Using Three.js and Framer Motion, we overlay Gemini's coordinate predictions onto the live camera feed, creating a seamless augmented reality experience in the browser.
  • Glassmorphism UI: Designed with Tailwind CSS v4 for a futuristic, premium feel.
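
To illustrate the AR Interface step, here is a minimal sketch of how a model-predicted bounding box could be mapped onto the video element before drawing. It assumes boxes arrive as [yMin, xMin, yMax, xMax] on a 0 to 1000 grid; that format is an assumption for this example, as are the names NormalizedBox, PixelBox, and toPixelBox, which are hypothetical and not taken from our codebase.

```typescript
// Assumption: the model reports boxes as [yMin, xMin, yMax, xMax] on a 0-1000 grid.
type NormalizedBox = [yMin: number, xMin: number, yMax: number, xMax: number];

interface PixelBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

const GRID = 1000;

// Keep a coordinate inside the normalized grid so overlays never leave the frame.
function clampToGrid(value: number): number {
  return Math.min(Math.max(value, 0), GRID);
}

// Convert a normalized box into pixel coordinates for a view of the given size.
function toPixelBox(box: NormalizedBox, viewWidth: number, viewHeight: number): PixelBox {
  const yMin = clampToGrid(box[0]);
  const xMin = clampToGrid(box[1]);
  const yMax = clampToGrid(box[2]);
  const xMax = clampToGrid(box[3]);

  // Tolerate swapped corners by always taking the smaller value as the origin.
  const left = Math.min(xMin, xMax);
  const top = Math.min(yMin, yMax);

  return {
    x: (left * viewWidth) / GRID,
    y: (top * viewHeight) / GRID,
    width: (Math.abs(xMax - xMin) * viewWidth) / GRID,
    height: (Math.abs(yMax - yMin) * viewHeight) / GRID,
  };
}

// Example: a box covering the middle of a 1280x720 feed.
const example = toPixelBox([250, 100, 750, 500], 1280, 720);
console.log(example); // { x: 128, y: 180, width: 512, height: 360 }
```

The resulting pixel box can then be handed to whatever renders the overlay, such as a Three.js mesh or an absolutely positioned element on top of the video.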

🚧 Challenges we ran into

  • Latency & Sync: Synchronizing Gemini's reasoning with 30 fps video for stable AR overlays was difficult. We optimized frame sampling and used client-side interpolation to make the overlays feel "sticky" and responsive.
  • Complex Repair Context: Getting the agent to remember context across a long session was tricky. We implemented a state management system that feeds relevant history back into the prompt context for each new step.
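
Both mitigations above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not our production code: smoothBox uses simple exponential smoothing as one possible form of client-side interpolation, and buildPromptContext assumes that "relevant" history means the most recent steps plus any earlier step that touched a part involved in the current step. All names here (OverlayBox, RepairStep, smoothBox, buildPromptContext) are hypothetical.

```typescript
interface OverlayBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Move the drawn box a fraction (alpha) of the way toward the latest prediction.
// Called once per rendered frame, this hides the gap between model updates.
function smoothBox(current: OverlayBox, target: OverlayBox, alpha: number): OverlayBox {
  const a = Math.min(Math.max(alpha, 0), 1);
  const mix = (from: number, to: number): number => from + (to - from) * a;
  return {
    x: mix(current.x, target.x),
    y: mix(current.y, target.y),
    width: mix(current.width, target.width),
    height: mix(current.height, target.height),
  };
}

interface RepairStep {
  index: number;
  summary: string;
  parts: string[];
}

// Assumed relevance rule: keep the last `recentCount` steps, plus any older
// step that mentions a part involved in the current step.
function buildPromptContext(
  steps: RepairStep[],
  currentParts: string[],
  recentCount: number,
): string {
  const recentStart = Math.max(0, steps.length - recentCount);
  const wanted = new Set(currentParts);
  return steps
    .filter((step, i) => i >= recentStart || step.parts.some((part) => wanted.has(part)))
    .map((step) => `Step ${step.index}: ${step.summary}`)
    .join("\n");
}

// Example session: the air filter removed in step 1 matters again in step 5.
const session: RepairStep[] = [
  { index: 1, summary: "Removed air filter housing", parts: ["air filter"] },
  { index: 2, summary: "Disconnected spark plug lead", parts: ["spark plug"] },
  { index: 3, summary: "Drained fuel line", parts: ["fuel line"] },
  { index: 4, summary: "Cleaned carburetor jets", parts: ["carburetor"] },
];
console.log(buildPromptContext(session, ["air filter"], 1));
```

In a browser, smoothBox would typically run inside a requestAnimationFrame loop with a small alpha, while buildPromptContext would run once per new step so the prompt stays short even in hour-long sessions.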

🏆 Accomplishments that we're proud of

  • Building a truly agentic workflow that feels like a human expert is watching over your shoulder.
  • Achieving real-time AR alignment purely through a web browser without native app dependencies.
  • The "Marathon" capability: The agent successfully guided us through a complex simulated engine teardown without losing context.

🎓 What we learned

  • Multimodal > Text: Video and audio provide critical context that text descriptions simply can't capture.
  • Agent Trust: For physical tasks, users need the agent to be confident but also cautious. Adding "safety checks" increased user trust significantly.
  • Gemini 3's Speed: The low latency of Gemini 3 is a game-changer for real-time interaction.

🔮 What's next for MechGuide

  • Commerce Integration: Automatically ordering replacement parts identified during diagnosis.
  • Offline Mode: Edge-based reasoning for remote areas with poor connectivity.
  • Enterprise Fleet Support: Scaling to industrial machinery and fleet maintenance tracking.

Built With

  • Gemini 3 API (Multimodal Live Vision & Reasoning)
  • Next.js 16 (App Router & Server Actions)
  • React 19 (UI Components)
  • Tailwind CSS v4 (Styling & Glassmorphism)
  • Three.js (Augmented Reality Overlays)
  • Framer Motion (Animations)
  • Web Speech API (Voice Input/Output)
  • Vercel (Deployment)
