Inspiration
Field service technicians spend 40% of their time searching through equipment manuals, troubleshooting guides, and service records—time that could be spent actually fixing equipment. We witnessed firsthand how technicians struggle with thick paper manuals in challenging field conditions, often needing to call dispatch for information that should be at their fingertips. This inefficiency costs the HVAC industry billions annually and leaves customers waiting hours for simple repairs.
FieldMind was inspired by the vision of giving every field technician an AI expert assistant that can see what they see, hear what they describe, and instantly provide grounded, cited technical guidance—all through a simple mobile interface that works in real field conditions.
What it does
FieldMind is a multimodal AI assistant that transforms how field technicians diagnose and repair HVAC equipment. Built on the Gemini Live API, it combines live vision and voice with grounded knowledge retrieval and automated escalation:
Vision: The camera continuously analyzes equipment, automatically identifying make, model, and serial numbers from nameplates, and detecting visible fault codes or damage.
Voice: Technicians have natural conversations with FieldMind using hands-free voice interaction. The AI listens, responds with technical guidance, and supports barge-in interruptions—just like talking to a human expert.
Knowledge Grounding: Every response is grounded in actual equipment manuals stored in Vertex AI Vector Search. FieldMind provides citations like "[Manual: Carrier 50XC, Section 4.2, Page 23]" so technicians can trust the guidance and verify if needed.
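The grounding step can be illustrated with a minimal sketch. This is not FieldMind's implementation: it stands in for Vertex AI Vector Search with a tiny in-memory list, uses toy 3-dimensional vectors instead of 768-dimensional text-embedding-004 embeddings, and the chunk contents are invented for illustration. Only the citation format matches the one described above.

```python
import math

# Hypothetical in-memory stand-in for the Vertex AI Vector Search index.
# Real embeddings would be 768-dim vectors from text-embedding-004.
MANUAL_CHUNKS = [
    {"text": "Check the condenser fan relay before replacing the motor.",
     "vec": [0.9, 0.1, 0.0], "manual": "Carrier 50XC", "section": "4.2", "page": 23},
    {"text": "Thermostat wiring diagram for two-stage cooling.",
     "vec": [0.1, 0.8, 0.2], "manual": "Carrier 50XC", "section": "2.1", "page": 7},
]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search_manuals(query_vec, top_k=1):
    """Return the best-matching chunks, each with a verifiable citation."""
    ranked = sorted(MANUAL_CHUNKS,
                    key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [
        {"text": c["text"],
         "citation": f"[Manual: {c['manual']}, Section {c['section']}, Page {c['page']}]"}
        for c in ranked[:top_k]
    ]

hit = search_manuals([1.0, 0.0, 0.0])[0]
print(hit["citation"])  # [Manual: Carrier 50XC, Section 4.2, Page 23]
```

Returning the citation alongside the chunk text is what lets the agent quote its source in every answer.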
Intelligent Escalation: When issues exceed field repair scope, FieldMind automatically creates escalation cases in Firestore, publishes to Pub/Sub, and triggers Cloud Functions to notify dispatch—ensuring complex problems get specialist attention within 30 minutes.
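The escalation record at the heart of that workflow might look like the sketch below. The field names and ID scheme are assumptions, not FieldMind's actual schema; in production the dict would be written to the Firestore /cases collection and its JSON encoding published to the field-escalations Pub/Sub topic.

```python
import json
import uuid
from datetime import datetime, timezone

def build_escalation(equipment_id: str, issue: str, technician: str) -> dict:
    """Build an escalation case record (illustrative schema, not the
    production one). In the real system this dict would be written to
    Firestore /cases, and its JSON encoding published to the
    'field-escalations' Pub/Sub topic, triggering the Gen2 Cloud
    Function that emails dispatch via SendGrid."""
    return {
        "case_id": f"case-{uuid.uuid4().hex[:8]}",
        "equipment_id": equipment_id,
        "issue": issue,
        "technician": technician,
        "status": "OPEN",
        "created_at": datetime.now(timezone.utc).isoformat(),
        "sla_minutes": 30,  # dispatch response target from the workflow above
    }

case = build_escalation("carrier-50xc-0042", "Compressor short-cycling", "tech-17")
payload = json.dumps(case).encode("utf-8")  # Pub/Sub message bodies are bytes
```

Keeping the record construction pure like this makes the Firestore write and the Pub/Sub publish easy to test independently.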
How we built it
Frontend (React):
- Built with React and TypeScript for type safety and maintainability
- Tailwind CSS for responsive, field-optimized UI design
- Custom hooks for WebSocket management, audio streaming (16kHz PCM), and camera capture (1fps)
- Real-time transcript overlay showing both technician and AI responses
- Equipment badge component that appears when equipment is identified
- Escalation alert modal with auto-dismiss functionality
- Deployed on Firebase Hosting for global CDN distribution

Backend (Python FastAPI):
- FastAPI WebSocket server handling persistent connections for real-time communication
- Gemini Live API integration using the new ADK (Agent Development Kit)
- Gemini 2.0 Flash vision for equipment identification from camera frames
- Four custom tools registered with the agent:
  - analyze_equipment(): vision-based equipment identification
  - search_manuals(): RAG-powered manual search with citations
  - get_service_history(): Firestore query for service records
  - escalate_case(): multi-step escalation workflow
- Deployed on Cloud Run with min-instances=1 for zero cold starts during demos
- Environment-based configuration for API keys and GCP project settings

Data & AI Pipeline:
- Manual PDFs chunked into 400-token segments with metadata
- Text embeddings generated using text-embedding-004 (768 dimensions)
- Vertex AI Vector Search index for semantic similarity search
- Firestore collections: /equipment, /service_records, /manual_chunks, /cases
- Pub/Sub topic field-escalations for the event-driven escalation workflow
- Cloud Function (Gen2) triggered by Pub/Sub to send escalation emails via SendGrid

GCP Services Used:
- Cloud Run: scalable backend hosting with WebSocket support
- Gemini Live API (Vertex AI): multimodal conversational AI with tool calling
- Gemini 2.0 Flash: equipment identification from camera frames
- Vertex AI Vector Search: semantic search across equipment manuals
- Firestore: real-time database for equipment, service history, and escalation cases
- Pub/Sub: event-driven messaging for escalation notifications
- Cloud Functions: serverless email notifications via SendGrid
- Firebase Hosting: global CDN for PWA delivery
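The 400-token chunking step in the data pipeline can be sketched as follows. This is an illustration, not the production code: tokens are approximated by whitespace-split words (the real pipeline would use the embedding model's tokenizer), and the overlap value is an assumed parameter.

```python
def chunk_manual(text, manual_id, page, max_tokens=400, overlap=50):
    """Split manual text into ~400-token chunks with metadata.

    Tokens are approximated by whitespace words for illustration.
    Overlapping windows reduce the chance of splitting a single
    troubleshooting step across a chunk boundary.
    """
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        window = words[start:start + max_tokens]
        chunks.append({
            "text": " ".join(window),
            "manual_id": manual_id,
            "page": page,
            "token_count": len(window),
        })
        if start + max_tokens >= len(words):
            break
        start += max_tokens - overlap  # slide window, keeping some overlap
    return chunks
```

Each chunk's metadata (manual_id, page) is what later allows search results to carry citations back to a specific manual page.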
Challenges we ran into
WebSocket Stability with Gemini Live API: The biggest challenge was maintaining stable WebSocket connections between the frontend, our FastAPI backend, and Gemini Live API. We had to implement exponential backoff reconnection logic, ping/pong keepalive messages every 20 seconds, and message queuing for audio chunks sent during brief disconnections. The solution involved creating a custom useWebSocket hook with automatic reconnection and a session manager on the backend to track active connections.
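The reconnect delays in that backoff logic can be sketched like this. The constants (base delay, cap) are illustrative, not FieldMind's exact values; the real client lives in a TypeScript useWebSocket hook, but the schedule itself is language-agnostic.

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0):
    """Exponential backoff with full jitter for reconnect attempts.

    Each retry waits a random amount up to min(cap, base * 2**attempt),
    so many clients disconnected at once don't all reconnect in
    lockstep and hammer the server simultaneously.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # 0.5s, 1s, 2s, ... capped at 30s
        delays.append(random.uniform(0, ceiling))
    return delays
```

Jitter matters here because a brief Gemini Live API outage disconnects every session at the same moment.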
Audio Streaming Format: Gemini Live API requires 16kHz, 16-bit, mono PCM audio, but browser MediaRecorder APIs default to different formats. We had to implement custom audio processing using Web Audio API's AudioContext to resample and convert audio in real-time, then base64-encode chunks for WebSocket transmission. This added latency initially, but we optimized by sending 100ms chunks instead of waiting for full utterances.
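The core of that conversion, going from Web Audio float samples to base64-encoded 16-bit PCM in 100ms chunks, can be sketched in a few lines (shown in Python for readability; the actual implementation runs in the browser with the Web Audio API):

```python
import base64
import struct

SAMPLE_RATE = 16_000          # Gemini Live API expects 16 kHz mono
CHUNK_MS = 100
SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000  # 1600 samples per chunk

def float32_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clamped)

def encode_chunks(samples):
    """Yield base64-encoded 100 ms PCM chunks for WebSocket transmission."""
    for i in range(0, len(samples), SAMPLES_PER_CHUNK):
        pcm = float32_to_pcm16(samples[i:i + SAMPLES_PER_CHUNK])
        yield base64.b64encode(pcm).decode("ascii")
```

At 16 kHz, each 100ms chunk is 1600 samples, or 3200 bytes of PCM before base64 encoding, small enough to keep perceived latency low.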
Vision Processing Without Blocking Audio: Processing camera frames with Gemini Vision API takes 1-2 seconds, which would block the audio pipeline if done synchronously. We solved this by implementing async frame processing with a frame buffer that keeps only the latest frame, processes it in the background, and updates equipment context without interrupting the voice conversation. This required careful state management to avoid race conditions.
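The latest-frame buffer described above can be sketched with asyncio primitives. Class and method names are illustrative, not the actual FieldMind code:

```python
import asyncio

class LatestFrameBuffer:
    """Keep only the newest camera frame so slow vision calls never back up.

    put() overwrites any unprocessed frame; take() waits for the next one.
    This decouples the 1 fps camera from the 1-2 s vision analysis.
    """
    def __init__(self):
        self._frame = None
        self._event = asyncio.Event()

    def put(self, frame):
        self._frame = frame   # drop any stale, unprocessed frame
        self._event.set()

    async def take(self):
        await self._event.wait()
        self._event.clear()
        frame, self._frame = self._frame, None
        return frame

async def demo():
    buf = LatestFrameBuffer()
    buf.put("frame-1")
    buf.put("frame-2")        # overwrites frame-1 before it is processed
    return await buf.take()

print(asyncio.run(demo()))    # frame-2
```

Because put() replaces rather than queues, the vision worker always analyzes the most recent view of the equipment and the audio pipeline never waits on a backlog of stale frames.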
Built With
- artifactregistry
- cloudbuild
- cloudfirestore
- cloudfunctionsgen2
- cloudpub/sub
- cloudstorage
- docker
- fastapi
- firebasehosting
- geminiflash2.0
- getusermediaapi
- googlecloudrun
- googleliveapi
- mediarecorderapi
- python
- tailwindcss
- typescript
- vectoraitextembeddings
- vectoraivectorsearch
- webaudioapi
- websocket