Inspiration

I've watched field maintenance technicians inspect transit infrastructure (escalators, elevators, rail switches, power systems), and the workflow hasn't changed in decades. A technician spots a hydraulic leak on an elevator, knows exactly what's wrong from 15 years of experience, but then spends the next 15 minutes navigating an EAM system on a tablet: looking up the asset ID, selecting the right Problem Code from a dropdown of 60+ options, mapping it to a Fault Code, choosing an Action Code, setting the priority, and typing a description, all while standing on a ladder or crouched in a tunnel.

That gap between seeing the problem and recording it correctly is where errors creep in, time gets wasted, and critical safety issues get miscategorized.

What if the technician could just point their phone and talk? "Hey Max, Metrotown escalator 3 has a grinding noise coming from the drive unit. Sounds like bearing degradation." And the AI sees the equipment, understands the context, classifies the fault, and files the work order, all in a 30-second hands-free conversation.

That's what Maintenance-Eye is.

What it does

Maintenance-Eye is a real-time AI co-pilot for physical infrastructure maintenance. A field technician opens the PWA on their phone, points the camera at equipment, and has a natural voice conversation with Max, an AI maintenance engineer powered by the Gemini 2.5 Flash Live API.

Max can:

  • See equipment through the live camera feed (2 FPS JPEG streaming)
  • Listen to the technician's spoken observations via real-time audio (PCM 16kHz)
  • Identify faults by combining visual analysis with the technician's verbal description
  • Auto-classify problems using structured EAM codes (Problem Code → Fault Code → Action Code)
  • Search the maintenance database using natural language — "show me open P1 work orders for rolling stock"
  • Create work orders with proper priority, department, asset assignment, and EAM classification
  • Confirm before acting — every critical action goes through a human-in-the-loop confirmation card on screen. Max proposes, the technician approves.
  • Handle interruptions — the technician can barge in mid-sentence, and Max stops immediately and listens (native Live API capability)

The system also includes a 5-page enterprise data explorer dashboard for browsing Work Orders, Assets, Locations, Knowledge Base articles, and EAM Codes, with debounced search, multi-field filtering, and a responsive layout. The dashboard is populated with synthetic data that mimics a subset of a real EAM record system.

How I built it

The architecture is a real-time bidirectional streaming pipeline:

Phone PWA → WebSocket → FastAPI → ADK Runner → Gemini 2.5 Flash Live API
         ← audio/cards ← events ← run_live() ←

Frontend: A lightweight Progressive Web App built with vanilla JavaScript — no framework dependencies or build tooling required. Camera and audio streams are base64-encoded and transmitted over a single WebSocket connection. Agent audio responses are rendered through the Web Audio API. The dashboard and confirmation card UI are built with semantic HTML and CSS, with no external component libraries.
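
To make the single-socket transport concrete, here is a minimal sketch of how the server side might unpack one base64-encoded frame. The envelope shape ({"type": ..., "data": ...}), the MIME strings, and the names ClientMessage, decode_client_message, and encode_client_message are all assumptions for illustration; the write-up does not specify the actual wire format.

```python
import base64
import json
from dataclasses import dataclass

# Hypothetical envelope: {"type": "audio" | "video" | "text", "data": "..."}.
# The field names and MIME strings are assumptions, not the project's real format.
BINARY_TYPES = {"audio": "audio/pcm;rate=16000", "video": "image/jpeg"}


@dataclass
class ClientMessage:
    kind: str
    mime_type: str
    payload: bytes


def decode_client_message(raw: str) -> ClientMessage:
    """Parse one JSON text frame from the WebSocket into typed bytes."""
    msg = json.loads(raw)
    kind = msg["type"]
    if kind in BINARY_TYPES:
        # validate=True rejects characters outside the base64 alphabet
        payload = base64.b64decode(msg["data"], validate=True)
        return ClientMessage(kind, BINARY_TYPES[kind], payload)
    if kind == "text":
        return ClientMessage(kind, "text/plain", msg["data"].encode("utf-8"))
    raise ValueError(f"unsupported message type: {kind!r}")


def encode_client_message(kind: str, payload: bytes) -> str:
    """What the PWA side would send; included only to exercise the decoder."""
    data = base64.b64encode(payload).decode("ascii")
    return json.dumps({"type": kind, "data": data})
```

Base64 grows each payload by roughly one third, which is the price of keeping everything in JSON text frames on one connection instead of mixing text and binary frames.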

Backend: Python 3.12 + FastAPI, fully async. The WebSocket handler manages an InspectionSession that bridges the client to Google's ADK LiveRequestQueue. Two concurrent async tasks handle bidirectional flow — upstream_task routes client audio/video/text/confirmations to the ADK, and downstream_task routes agent events (audio chunks, transcriptions, tool calls) back to the client.
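
The two-task bridge can be sketched with plain asyncio queues. This is an illustration of the pattern only: the queues stand in for the WebSocket and the ADK LiveRequestQueue, and fake_agent and run_session are hypothetical helpers, not part of the project or of ADK.

```python
import asyncio


async def upstream_task(client_in: asyncio.Queue, agent_in: asyncio.Queue) -> None:
    """Forward client messages to the agent until the client disconnects."""
    while True:
        msg = await client_in.get()
        await agent_in.put(msg)
        if msg is None:  # sentinel: client closed the socket
            return


async def downstream_task(agent_out: asyncio.Queue, sent: list) -> None:
    """Forward agent events back to the client until the agent finishes."""
    while True:
        event = await agent_out.get()
        if event is None:  # sentinel: agent stream ended
            return
        sent.append(event)  # a real handler would write to the WebSocket here


async def fake_agent(agent_in: asyncio.Queue, agent_out: asyncio.Queue) -> None:
    """Stand-in for the live agent: acknowledges each message it receives."""
    while True:
        msg = await agent_in.get()
        if msg is None:
            await agent_out.put(None)
            return
        await agent_out.put({"type": "transcript", "text": f"heard {msg['type']}"})


async def run_session(messages: list) -> list:
    client_in, agent_in, agent_out = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    sent: list = []
    for m in messages:
        client_in.put_nowait(m)
    client_in.put_nowait(None)
    # Both directions run concurrently; neither blocks the other.
    await asyncio.gather(
        upstream_task(client_in, agent_in),
        fake_agent(agent_in, agent_out),
        downstream_task(agent_out, sent),
    )
    return sent
```

Running the two directions as separate tasks is what lets audio from the agent keep flowing while new camera frames are still arriving from the phone.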

Agent: Built with Google ADK (Agent Development Kit v1.10.0), using the gemini-2.5-flash-native-audio-latest model. The agent has 9 specialized tools:

  • smart_search — natural language query engine with intent detection, ID normalization, alias mapping, and synonym expansion
  • lookup_asset — exact asset ID lookup
  • manage_work_order — CRUD operations for work orders
  • propose_action — human-in-the-loop confirmation (the agent cannot execute critical actions without technician approval)
  • get_safety_protocol, search_knowledge_base, get_inspection_history, generate_report, check_pending_actions

Data layer: Cloud Firestore for production, with a transparent JsonEAM fallback backed by seed_data.json (66 assets, 150 work orders, 60 EAM codes, 40 inspections, 25 knowledge base entries). The EAMService abstract interface means swapping backends requires zero changes to agent tools or REST routes.
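
A minimal sketch of what such a pluggable interface could look like is shown below. Only the names EAMService, JsonEAM, and lookup_asset come from this write-up; the method names, the seed layout, the synchronous style, and the work order numbering scheme are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Optional


class EAMService(ABC):
    """Backend-agnostic contract that agent tools and REST routes depend on."""

    @abstractmethod
    def get_asset(self, asset_id: str) -> Optional[dict]: ...

    @abstractmethod
    def create_work_order(self, work_order: dict) -> str: ...


class JsonEAM(EAMService):
    """In-memory fallback; a real one would load seed_data.json from disk."""

    def __init__(self, seed: dict):
        self._assets = {a["asset_id"]: a for a in seed.get("assets", [])}
        self._work_orders = list(seed.get("work_orders", []))

    def get_asset(self, asset_id: str) -> Optional[dict]:
        return self._assets.get(asset_id)

    def create_work_order(self, work_order: dict) -> str:
        # Hypothetical ID scheme, loosely modeled on the WO-2025-10234 example.
        wo_id = f"WO-2025-{10000 + len(self._work_orders) + 1}"
        self._work_orders.append({**work_order, "wo_id": wo_id})
        return wo_id


def lookup_asset(service: EAMService, asset_id: str) -> dict:
    """Tool body that only knows the interface, never the concrete backend."""
    asset = service.get_asset(asset_id)
    return asset if asset is not None else {"error": f"asset {asset_id} not found"}
```

Because lookup_asset takes any EAMService, a Firestore-backed class with the same two methods could be passed in without touching the tool.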

Infrastructure: Docker (Python 3.12-slim), Google Cloud Run (0-3 auto-scaling instances), Cloud Storage for session frame artifacts and audit trails, Terraform for IaC.

Challenges I ran into

The hardest problem: natural language to structured database mapping.

When a technician says "the escalator at Main Street station is making a grinding noise," the AI needs to:

  • Recognize "escalator at Main Street" could match ESC-MS-001 or ESC-MS-002
  • Understand "grinding noise" maps to Problem Code MECH-WEAR and Fault Code BEAR-DEG
  • Handle variations: "the moving stairs," "that escalator we looked at yesterday," "unit ESC MS 1"

Users describe the same thing in dozens of different ways. I built a QueryEngine — a pre-query intelligence layer that sits between the agent's natural language output and the structured EAM database. It handles:

  • Intent detection: Is this a work order query, asset lookup, location search, or EAM code query?
  • ID normalization: "wo 10234" → WO-2025-10234, "esc ms 1" → ESC-MS-001
  • Alias mapping: "critical" → P1, "rolling stock" → rolling_stock department
  • Synonym expansion: "vibration" also searches "noise," "shaking," "rattle"
  • Result ranking: Scored results with match types (exact ID, name match, description match, expanded term match)
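
Three of these steps (ID normalization, alias mapping, synonym expansion) can be sketched in a few lines. The tables below contain only the examples quoted above, and the regular expressions, the default year, and the function names normalize_id and expand_query are assumptions; the real QueryEngine is certainly more thorough.

```python
import re
from typing import Optional

DEFAULT_YEAR = "2025"  # assumption: a bare work order number belongs to this year
PRIORITY_ALIASES = {"critical": "P1"}
DEPARTMENT_ALIASES = {"rolling stock": "rolling_stock"}
SYNONYMS = {"vibration": ["noise", "shaking", "rattle"]}


def normalize_id(text: str) -> Optional[str]:
    """Turn loosely spoken IDs into canonical form, or None if unrecognized."""
    t = text.strip().lower()
    # "wo 10234" or "wo-2025-10234" -> WO-2025-10234
    m = re.fullmatch(r"wo[\s-]*(?:(\d{4})[\s-]+)?(\d+)", t)
    if m:
        return f"WO-{m.group(1) or DEFAULT_YEAR}-{m.group(2)}"
    # "esc ms 1" -> ESC-MS-001 (type, station, zero-padded unit number)
    m = re.fullmatch(r"([a-z]{2,4})[\s-]+([a-z]{2,4})[\s-]+(\d{1,3})", t)
    if m:
        return f"{m.group(1).upper()}-{m.group(2).upper()}-{int(m.group(3)):03d}"
    return None


def expand_query(query: str) -> dict:
    """Extract structured filters and widen free-text terms with synonyms."""
    q = query.lower()
    filters = {}
    for phrase, value in PRIORITY_ALIASES.items():
        if phrase in q:
            filters["priority"] = value
    for phrase, value in DEPARTMENT_ALIASES.items():
        if phrase in q:
            filters["department"] = value
    terms = []
    for word in re.findall(r"[a-z0-9]+", q):
        terms.append(word)
        terms.extend(SYNONYMS.get(word, []))
    return {"filters": filters, "terms": terms}
```

Keeping this layer as plain lookup tables and regular expressions, rather than asking the model to emit exact IDs, makes the mapping deterministic and easy to test.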

Bidirectional streaming state management was another major challenge. The WebSocket handler runs two concurrent async tasks (upstream + downstream) plus a side-channel queue for tool results. When the agent calls propose_action, the confirmation payload has to make a full round trip: Gemini → ADK tool → side-channel queue → WebSocket → phone UI → user taps confirm → WebSocket → confirmation manager → execute action → send result back. Making this state machine reliable across all the async boundaries required careful coordination.
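
One way to model that round trip is a pending-action table keyed by ID, with an asyncio future per proposal. This is a sketch under assumptions: ConfirmationManager, its methods, the card payload fields, and the timeout are hypothetical and not taken from the project's code.

```python
import asyncio
import uuid


class ConfirmationManager:
    """Tracks proposals that are waiting for a tap on the confirmation card."""

    def __init__(self) -> None:
        self._pending: dict[str, asyncio.Future] = {}

    def propose(self, action: dict, side_channel: asyncio.Queue) -> asyncio.Future:
        """Called from the tool: queue a card for the UI, return a future."""
        action_id = uuid.uuid4().hex
        future = asyncio.get_running_loop().create_future()
        self._pending[action_id] = future
        side_channel.put_nowait(
            {"type": "confirmation_card", "id": action_id, "action": action}
        )
        return future

    def resolve(self, action_id: str, approved: bool) -> bool:
        """Called from the upstream task when the technician taps a button."""
        future = self._pending.pop(action_id, None)
        if future is None or future.done():
            return False  # unknown, duplicate, or already timed-out proposal
        future.set_result(approved)
        return True


async def demo() -> tuple:
    manager = ConfirmationManager()
    side_channel: asyncio.Queue = asyncio.Queue()
    decision = manager.propose(
        {"kind": "create_work_order", "asset_id": "ESC-MS-001"}, side_channel
    )
    card = await side_channel.get()             # downstream task sends this to the phone
    manager.resolve(card["id"], approved=True)  # simulated tap on "confirm"
    approved = await asyncio.wait_for(decision, timeout=5)
    return card["type"], approved
```

Returning False from resolve for unknown or repeated IDs keeps a double tap or a stale card from executing the same action twice.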

The human-in-the-loop timing problem: The agent must propose an action, then wait for confirmation before proceeding — but in a streaming audio conversation, the model wants to keep talking. I had to engineer the confirmation flow so the agent says a brief "Please confirm on your screen" and genuinely pauses, rather than narrating the full proposal details (which are already displayed on the confirmation card). Getting the prompt engineering right for this behavior took many iterations.

Accomplishments that I'm proud of

  • The conversation feels natural. Technicians can interrupt Max mid-sentence (barge-in), refer back to earlier context ("close that work order we talked about"), and use informal language — and it just works.

  • Human-in-the-loop actually works in real time. The agent proposes, the card appears on screen instantly, the technician taps confirm, and the work order is created, all while the voice conversation continues seamlessly. No page reload, no separate workflow.

  • Smart search bridges the language gap. A technician can say "show me critical open work orders for the power department" and the QueryEngine translates that to the right database query with P1 priority filter, "open" status, and "power" department — no exact syntax required.

  • Zero-build frontend that handles real-time audio + video streaming. No React, no bundler — just vanilla JS managing WebSocket streams, Web Audio API playback, camera capture, and a responsive 5-page dashboard. It works as a PWA on any phone.

  • Fully pluggable data backend. The same agent tools work whether the data lives in Firestore, a JSON file, or, in a future production deployment, Hexagon EAM. The EAMService abstraction means the hackathon demo and a real enterprise deployment can use identical agent code.

  • End-to-end audit trail. Session frames are periodically uploaded to Cloud Storage, confirmed work orders are persisted as JSON artifacts, and every confirmation action is tracked with timestamps and stats.

What I learned

The biggest surprise was how challenging it is to wire real-time AI into a production-grade full-stack application. Gemini's Live API is remarkably capable — processing simultaneous video and audio streams with natural conversation flow and barge-in interruption. But making that capability useful in a real enterprise context required layers of engineering that go far beyond the model itself.

The model can see the equipment and understand what the technician is saying. But turning that understanding into a correct, validated work order in a structured database — with the right asset ID, the right EAM codes, the right priority, and human approval — requires an entire intelligence stack: prompt engineering, tool design, query normalization, fuzzy matching, confirmation flows, and careful async orchestration.

I also learned that prompt engineering for voice agents is fundamentally different from text. In text, you can show the user a detailed proposal. In voice, the agent needs to be concise — "I've proposed creating that work order. Check your screen to confirm." Getting Max to stop narrating tool calls, stop repeating confirmation details that are already visible on the card, and stop being verbose in a hands-free field context took serious iteration.

The Google ADK framework proved to be a powerful foundation. The LiveRequestQueue + run_live() pattern for bidirectional streaming, the tool registration system, and the session management made it possible to build a complex multi-tool agent that operates in real time. The framework handles the hard parts of streaming orchestration, letting me focus on domain-specific intelligence.

What's next for Maintenance-Eye

  • Production EAM integration — Connect to Hexagon EAM, SAP PM, or IBM Maximo via the existing EAMService interface. The pluggable abstraction is already built; it just needs a real adapter.
  • Multi-model visual analysis — Use Gemini's vision capabilities to detect specific failure patterns (corrosion progression, crack propagation, wear patterns) with confidence scores calibrated against historical inspection data.
  • Multi-language support — Gemini's native audio model supports multiple languages; adding multilingual support would make the tool accessible to diverse maintenance teams.
