Voice-First Agent for FSM for Small Businesses of All Trades

Voice-First Agent Architecture Diagram

Inspiration - The Problem and The Solution

Every small service business has the same problem — techs in the field need information fast, but their hands are full. Office managers need answers while the customer is on the line, but the data is buried across multiple screens. To solve this, we built Scout — a voice-first AI agent powered by Gemini. We embedded it in a field service management app with operational data to show you what it can do.

What It Does

Scout is a voice-first AI assistant that can be embedded in or integrated with field service management (FSM) apps. For this demonstration, and to give Scout background context and data, we embedded it in a snippet of an FSM app used by a fictional company that services medical equipment.

For this fictional but representative small business, Scout serves two personas:

For field technicians (Anthony):

• "Brief me on my next appointment" -- Scout shares the details of his next appointment, including the equipment that requires servicing, contact and parking details, and the service history of that equipment. It also flags recurring issues and suggests proactive actions. All from a single voice command while driving.
• "Have we seen this before?" -- Scout mines historical service orders to recover 'tribal knowledge' buried in other technicians' notes from months ago. Instead of calling a colleague or guessing blindly, the tech gets the exact fix in seconds. This demonstrates the Inversion Pattern -- Scout asks "do you want the full troubleshooting steps or just the fix?" before dumping data.

For the office manager (Lydia):

• "Scout, I need to schedule a follow-up at Lakeside University for next week. Which Techs are available and which Techs seem to be the best fit for the job and why?"
• "Who's available Thursday afternoon for an urgent situation?" -- Real-time technician availability with visit history context ("Austin is available and has been to this customer twice before -- he knows the facility").

Key design principles:

• Persona-aware: Scout adapts its communication style based on who's asking. Technicians get short, action-first answers (they're driving or working with their hands). Office managers get analytical, pattern-spotting responses (they're at a desk and want insight, not just data).
• Scout reads but NEVER writes. The human is always in control.
• Every answer comes from live Firestore data via function calling -- no fabricated responses.
• Text chat works as a foundation. Voice upgrades the experience. Both use the same tools.

How We Built It

We built the Scout Gemini Live Agent completely AI-assisted, adhering strictly to a spec-driven and test-driven approach.
This meant drafting comprehensive specifications and test scenarios before a single line of code was ever generated.
We started by building the foundational FSM application to serve as our data and contextual source, utilizing Claude Code running inside Visual Studio Code.
We then switched to the Antigravity agent, powered by Claude Opus 4.6, to begin architecting the Gemini Live Agent itself.
However, we quickly encountered severe roadblocks that Claude was unable to resolve, leading to ballooning response times that reached an unusable 10 seconds of latency.
In response, we pivoted and switched the AI engine inside Antigravity from Claude to Gemini 3.1 Pro.
This switch was the turning point; Gemini grasped the architecture, significantly improved upon Claude's initial code, resolved the latency roadblocks, and successfully brought Scout to completion.

Key Architecture and Design Considerations

Scout uses one model family (gemini-2.5-flash) and one shared set of prompts, tools, and persona logic across two different interaction modes:

1. Background Analysis & Text Chat (Web API)

• Triggered when a user opens their home screen or types a message
• Analyzes the full day's schedule, cross-references equipment service history, detects patterns and risks
• Runs via Firebase Cloud Function using standard HTTP request/response with function calling
• No latency impact on voice conversations

2. Real-Time Voice (Live API)

• WebSocket connection from browser directly to the Live API (gemini-2.5-flash-native-audio-latest)
• Native audio output (not text-to-speech) for human-like latency
• Built-in barge-in support -- users interrupt naturally, Scout stops and pivots
• Shares the exact same underlying system prompt, tools, and persona logic as the text mode
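
Since both modes share the same tools, one way to keep them in sync is a single registry with a single dispatch function that the HTTP path and the WebSocket path both call. The sketch below is only an illustration of that idea, not Scout's actual code: the `ToolDef` shape, the `getTechAvailability` tool name, and the canned availability data are all hypothetical.

```typescript
// Hypothetical shared tool registry; names and data are illustrative only.
type ToolArgs = Record<string, unknown>;

interface ToolDef {
  name: string;
  description: string;
  handler: (args: ToolArgs) => unknown; // read-only lookup
}

// Stand-in data. A real implementation would query Firestore here.
const AVAILABILITY = new Map<string, string[]>([
  ["thursday", ["Austin", "Anthony"]],
]);

const TOOLS: ToolDef[] = [
  {
    name: "getTechAvailability", // assumed tool name
    description: "List technicians free on a given day.",
    handler: (args) => {
      const day = String(args.day ?? "").toLowerCase();
      const techs = AVAILABILITY.get(day) ?? [];
      return { count: techs.length, techs };
    },
  },
];

// A model-issued function call, reduced to the fields both transports provide.
interface FunctionCall {
  name: string;
  args: ToolArgs;
}

// Single dispatch point: the text handler and the voice handler both call this,
// so a tool added to TOOLS is available in both modes.
function dispatchToolCall(call: FunctionCall): unknown {
  const tool = TOOLS.find((t) => t.name === call.name);
  if (!tool) {
    return { error: `Unknown tool: ${call.name}` };
  }
  return tool.handler(call.args);
}
```

With this layout, only the transport code differs between text and voice; the tool behavior cannot drift between the two.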

Progressive Disclosure Prompt Architecture (L1-L3)

Scout's system prompt isn't a monolithic wall of text. We strictly apply ADK Agent Skill Design Patterns to progressive disclosure layers:

• L1 (always loaded): Core identity, safety rules, anti-hallucination constraints, tool names. ~150 tokens.
• L2 (Inversion Pattern, per-session): Persona skill (office manager vs field tech) + screen context. The agent flips the script when needed, using the Inversion Pattern to ask clarifying questions before guessing (e.g. asking to review recent service notes before diagnosing an issue).
• L2b (Dynamic Business Rule Injection): We map screen IDs to user intent and inject applicable Standard Operating Procedures (SOP Checklists) dynamically based on the current context, serving as a powerful Reviewer Pattern gatekeeper.
• L3 (Tool Wrapper Pattern, on-demand): Full data records loaded only via function calling. The FSM logic is wrapped in tools loaded only when needed. The tools themselves act as a Generator Pattern database schema proxy, hiding raw database complexity from the LLM.

We also use the Reviewer Pattern for our background analysis instructions, having the model evaluate schedule and equipment data against a specific checklist rather than providing open-ended summaries. This keeps the system prompt under 400 tokens while making Scout genuinely context-aware.
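
The layering can be pictured as a small prompt builder that always includes L1, picks one L2 persona block, and adds an L2b checklist only when the current screen has one. This is a minimal sketch under assumptions: the prompt strings, the `schedule-board` screen ID, and the function name `buildSystemPrompt` are hypothetical and stand in for Scout's real prompt text.

```typescript
// Hypothetical layered prompt assembly; all strings and IDs are illustrative.
type Persona = "fieldTech" | "officeManager";

// L1: always loaded.
const L1_CORE =
  "You are Scout. Only state facts returned by your tools. Never invent data.";

// L2: one persona block per session.
const L2_PERSONA: Record<Persona, string> = {
  fieldTech:
    "Answer in short, action-first sentences. Ask before reading long notes aloud.",
  officeManager: "Give analytical answers and point out patterns.",
};

// L2b: screen ID -> SOP checklist, injected only when that screen is open.
const L2B_SOP = new Map<string, string[]>([
  ["schedule-board", ["Check technician availability", "Check prior visits to the site"]],
]);

function buildSystemPrompt(persona: Persona, screenId: string): string {
  const parts: string[] = [L1_CORE, L2_PERSONA[persona]];
  const sop = L2B_SOP.get(screenId);
  if (sop) {
    const checklist = sop.map((step) => `- ${step}`).join("\n");
    parts.push(`Checklist for this screen:\n${checklist}`);
  }
  // L3 is deliberately absent: full records arrive only through tool calls.
  return parts.join("\n\n");
}
```

Calling `buildSystemPrompt("fieldTech", "schedule-board")` yields the core rules, the field-tech style block, and the checklist; an unmapped screen ID yields only the first two layers.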

The "Go Live" Upgrade

Scout opens as a text chat panel. Users can type questions and get answers from live data. When they click "Go Live," the panel transforms into a voice interface -- the text box literally disappears. This design:

• Proves both interaction modes work with the same agent logic
• Provides a fallback if voice has issues
• Matches real user preferences -- some people prefer text, some prefer voice

Anti-Hallucination Strategy

Scout's defense against fabrication is multi-layered:

• System prompt grounding: "ONLY state facts that come from your function calling tools. Never invent data. When citing data, include specific identifiers (SO numbers, dates, equipment names) so the user can verify."
• Read-only data access: All 6 Firestore tools are read-only queries. Scout cannot create, modify, or delete anything.
• Structured tool responses: Tools return JSON with explicit field names and counts. Empty results return "count": 0 -- Gemini sees there's no data rather than guessing.
• Graceful refusal: When asked about data outside its tools (certifications, pricing, inventory, warranty terms), Scout says "I don't have that information in the system" rather than guessing.
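
To illustrate the structured-response layer, here is a minimal sketch of a read-only tool that always returns an explicit `count`, so an empty result is visible as data rather than as a missing field. The `ServiceOrder` shape, the sample records, and the `getEquipmentHistory` name are assumptions made for this example, not Scout's real schema.

```typescript
// Hypothetical record shape and sample data, for illustration only.
interface ServiceOrder {
  soNumber: string;
  equipment: string;
  date: string;
  notes: string;
}

const SERVICE_ORDERS: readonly ServiceOrder[] = [
  { soNumber: "SO-1001", equipment: "Infusion Pump A", date: "2026-01-12", notes: "Replaced door latch." },
  { soNumber: "SO-1002", equipment: "Infusion Pump A", date: "2026-02-03", notes: "Door latch loose again." },
];

// Every tool response carries the tool name, an explicit count, and the records.
interface ToolResult<T> {
  tool: string;
  count: number;
  records: T[];
}

// Read-only lookup: filters the data and never mutates it.
function getEquipmentHistory(equipment: string): ToolResult<ServiceOrder> {
  const records = SERVICE_ORDERS.filter((so) => so.equipment === equipment);
  return { tool: "getEquipmentHistory", count: records.length, records };
}
```

A query for equipment with no history serializes to a payload containing `"count":0` and an empty `records` array, which gives the model an unambiguous "nothing found" signal to report.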

Challenges We Ran Into

• Model name confusion between Developer API and Vertex AI: The Gemini Live API uses different model identifiers depending on which API platform you use. gemini-live-2.5-flash-native-audio only works with Vertex AI; the Developer API requires gemini-2.5-flash-native-audio-latest. This caused an outage during testing when we switched names.
• Client-Side API Security Limitations: We originally built an ephemeral token generator in a Firebase Cloud Function to proxy our web clients to the Live API securely. We soon discovered that deploying these proxies over standard HTTP(S) endpoints causes the bidiGenerateContent WebSocket streaming protocol to fail entirely. Due to the nascent state of the Web SDK, we had to revert to a raw client-side API key. We also found that we could not lock down the Google Cloud API key using "Website Restrictions": in our testing, the browser WebSocket connection did not carry the Origin and Referer headers those restrictions rely on. We mitigated this by enforcing strict "API Restrictions", limiting the key ONLY to the Generative Language API to protect our Firebase database.
• WebSocket lifecycle management: When the Live API WebSocket closes (timeout, error, or model rejection), the microphone AudioWorklet continues streaming audio chunks into the dead socket, flooding the console with hundreds of errors per second. We had to add explicit stopMic() calls in the onclose and onerror handlers, plus state guards in the mic processor.
• Customer name resolution required parent documents: Firestore subcollection queries can't search parent document fields. Our resolveCustomerId function needs parent customers/{id} documents with a name field, but our seed script only created subcollections, causing all customer lookups to fail with "NOT FOUND."
• Pre-Visit Briefing Latency (The "Mega-Tool" Solution): When Scout gathered a pre-visit briefing, it had to make 5 sequential rounds of tool calls (schedule, site notes, and three equipment histories). Each WebSocket round-trip added a long stretch of silence. We solved this by architecting a backend "Mega-Tool" (getPreVisitBriefing) that consolidates all 5 queries server-side and returns one large JSON object, dropping latency from 10 seconds to near-zero.
• Chain-of-Thought Blocking: The native-audio models still attempt to generate text-based Chain-of-Thought (e.g., Planning Schedule...) before speaking. Because this text is sent over the WebSocket ahead of the audio, it introduced 1-2 seconds of startup latency. We suppressed it with strongly worded negative prompt constraints ("NEVER USE MARKDOWN", "NEVER THINK OUT LOUD").
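
The WebSocket lifecycle fix described above comes down to a state guard in front of every send. The following sketch shows the idea in isolation. It assumes a hypothetical `MicStreamer` class and a minimal `SocketLike` interface in place of the real browser `WebSocket` and AudioWorklet wiring, so it is not Scout's actual mic processor.

```typescript
// Numeric value of WebSocket.OPEN in browsers.
const WS_OPEN = 1;

// Minimal subset of the WebSocket interface this sketch relies on.
interface SocketLike {
  readyState: number;
  send(data: ArrayBuffer): void;
}

// Hypothetical wrapper that forwards mic chunks only while streaming is
// active and the socket is open.
class MicStreamer {
  private active = false;
  private readonly socket: SocketLike;

  constructor(socket: SocketLike) {
    this.socket = socket;
  }

  start(): void {
    this.active = true;
  }

  // Call this from the socket's onclose and onerror handlers.
  stop(): void {
    this.active = false;
  }

  // Called for each audio chunk from the mic processor.
  // Returns true if the chunk was sent, false if it was dropped.
  onChunk(chunk: ArrayBuffer): boolean {
    if (!this.active || this.socket.readyState !== WS_OPEN) {
      return false; // state guard: never write into a closed socket
    }
    this.socket.send(chunk);
    return true;
  }
}
```

In a browser, the equivalent of the stopMic() calls would be assigning `socket.onclose` and `socket.onerror` handlers that call `stop()`. The `readyState` check covers the window between the socket closing and those handlers running.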

What We Learned

We brought in Google's Antigravity agent as our "closer" to rescue the project and solve the hard engineering problems. Together, we engineered 7 Gemini Live Agent Scout Architectural Design Elements to push this app across the finish line:

Real-World Complexity. We embedded Scout in an app spanning multiple trades, proving it can handle the realistic, complex environments of small businesses. Using Claude for standard React generation, and then deploying Google's Antigravity specifically to architect the bleeding-edge Gemini Live WebSocket layer, proved that multi-agent delegation is the future of development.
The Mega-Tool Pattern. If your AI has to make 5 tool calls sequentially, your user waits for 5 network round-trips. For the pre-visit briefing, we replaced the separate micro-tool calls with a single Firebase Cloud Function that runs all the related Firestore queries server-side, reducing 'thinking time' from 10 seconds to near-zero.
The Inversion Pattern. Scout intercepts its own data dumps, proactively asking the user what they want to hear before reading an entire document aloud. Asking "do you want the full steps or just the fix?" before reading 5 paragraphs of tech notes completely changed how usable the system is when driving.
Progressive Disclosure. Our system prompt shields the LLM from database complexity by leveraging a strict Tool Wrapper Pattern. We don't teach Gemini our Firestore schema; we teach the tools to handle the schema and return clean JSON.
The Reviewer Pattern. We map the current screen intent to dynamically inject the right context and instructions straight into Scout's context window. If a user is on a vehicle routing screen, Scout automatically knows what data to prioritize without explicitly being asked.
Prompt Suppression. We used strong negative constraints to stop text generation so that Gemini starts emitting audio right away. You must actively tell the Live API model to stop emitting markdown and stop "thinking out loud"; otherwise it sends text tokens over the WebSocket before the audio, adding 1-2 seconds of startup latency.
Strict API Restrictions. A client-side API key used over a browser WebSocket is exposed, and in our testing the connection did not carry the Origin and Referer headers that "Website Restrictions" rely on. Because we could not secure the key that way, we enforced strict "API Restrictions" (limiting the key strictly to the Generative Language API) to insulate the Firebase backend.
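
As a rough illustration of the Mega-Tool Pattern from the list above, the sketch below gathers the schedule, site notes, and equipment histories inside one function and returns a single object, so the model makes one tool call instead of five. The fetcher types and the name `getPreVisitBriefingSketch` are hypothetical, and running the reads concurrently with `Promise.all` is an assumption of this example. The writeup only states that the queries are consolidated server-side.

```typescript
// Hypothetical fetchers standing in for individual Firestore reads.
type Fetcher<T> = () => Promise<T>;

interface BriefingSources {
  schedule: Fetcher<unknown>;
  siteNotes: Fetcher<unknown>;
  equipmentHistories: Fetcher<unknown>[];
}

interface Briefing {
  schedule: unknown;
  siteNotes: unknown;
  equipment: unknown[];
  equipmentCount: number;
}

// One server-side call replaces several model-driven tool rounds.
// Assumption: the reads are independent, so they can run concurrently.
async function getPreVisitBriefingSketch(src: BriefingSources): Promise<Briefing> {
  const [schedule, siteNotes, equipment] = await Promise.all([
    src.schedule(),
    src.siteNotes(),
    Promise.all(src.equipmentHistories.map((fetch) => fetch())),
  ]);
  return { schedule, siteNotes, equipment, equipmentCount: equipment.length };
}
```

The trade-off is a larger single payload in exchange for fewer round-trips, which suits voice because each extra round-trip is heard as silence.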

Built With

agent-development-kit
antigravity-claude-opus-4.6
antigravity-gemini-3.1
claude-code-vs-studio
cloud-firestore
cloud-functions
firebase
firebase-authentication
firebase-hosting
gemini-2.5-flash
gemini-live-api
google-cloud
google-genai-sdk
multimodal-live-api
node.js
react
typescript

Updates

Erol Eraybar started this project — Mar 16, 2026 01:04 PM EDT
