Inspiration
When tragedy strikes, the first point of contact is a voice on a phone line. Our 911 dispatchers are heroes, but they are forced to operate within a terrifyingly fragile and linear paradigm. They must verbally de-escalate a panicked caller, manually transcribe the chaos into a CAD (Computer-Aided Dispatch) system, triangulate the threat, and organize a response—all while relying entirely on fragmented, panicked testimony. In these moments of extreme cognitive overload, the human bottleneck costs precious minutes. And in emergencies, minutes cost lives. We were inspired by a singular question: what if AI could break out of the text box and act as an unblinking, hyper-rational guardian connecting the digital and physical worlds? For the Gemini Live Agent Challenge, we wanted to build a lifeline. We envisioned a system that doesn't just record emergencies but rather actively intercepts them. An intelligence that processes raw acoustic chaos and human panic, and autonomously triggers police dispatch, medical units, and city infrastructure in milliseconds. Sentinel-911 is our thesis for the future of emergency response—in moments of crisis, AI shouldn't just be conversational. It must be operational.
What it does
Sentinel-911 is an autonomous, multimodal emergency dispatch AI built directly on the Gemini Live API. It completely replaces the manual data entry of traditional dispatch while still keeping human dispatchers in the loop to oversee and approve every decision.
- Live Audio Context: It processes continuous bi-directional audio from the caller's device. This enables the agent to actively listen to the emergency and dynamically assess the situation based on verbal and acoustic cues, rather than relying solely on post-call testimony.
- Live Translation: If a caller speaks a language other than English, Sentinel-911 detects it instantly, responds flawlessly in their native tongue to calm them, and concurrently streams the English translation to the Dispatcher UI via tool calls.
- Autonomous Dispatch: Unlike turn-based chatbots, Sentinel-911 operates asynchronously. As it listens, it parses chaos into a structured "Live Incident Board" and autonomously triggers function calls to dispatch Police/Fire/Medical units mapped onto real-world streets using open-source routing APIs.
- Smart City Control: Currently, responders use systems like CENTRAX to access traffic signals and lock down sectors manually. Sentinel-911 makes this autonomous: based on the threat's escalation level, it proactively triggers infrastructure overrides through function calls to these systems, such as turning traffic lights green for incoming ambulances, locking down sectors, or deploying visual reconnaissance drones.
- Visual Reconnaissance (Simulated): In a production environment, this system hooks into actual municipal reconnaissance feeds. For the scope of this hackathon demo, when an address is locked, the backend generates a photorealistic aerial drone perspective image representing the anticipated scene using Gemini's image generation capabilities, demonstrating how we grant first responders unprecedented situational awareness before they arrive.
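The autonomous dispatch and infrastructure overrides above all ride on Gemini function calling. As a rough illustration, a tool declaration for a dispatch action could look like the sketch below, written in the JSON-schema style the Gemini API uses for tools. The tool name, fields, and enums here are assumptions for illustration, not Sentinel-911's actual declarations.

```python
# Hypothetical function declaration in the Gemini tool-calling schema style.
# All names and fields are illustrative assumptions, not the project's real set.
dispatch_unit_declaration = {
    "name": "dispatch_unit",
    "description": "Dispatch the nearest emergency unit to a confirmed incident location.",
    "parameters": {
        "type": "object",
        "properties": {
            "unit_type": {
                "type": "string",
                "enum": ["police", "fire", "medical"],
                "description": "Which service to dispatch.",
            },
            "latitude": {"type": "number", "description": "Incident latitude."},
            "longitude": {"type": "number", "description": "Incident longitude."},
            "priority": {
                "type": "string",
                "enum": ["routine", "urgent", "critical"],
            },
        },
        "required": ["unit_type", "latitude", "longitude"],
    },
}
```

A live agent registers a set of such declarations with the session; the model then emits structured calls against them instead of free text.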
How we built it
We architected Sentinel-911 as a high-performance React (TypeScript/Vite) frontend communicating with a scalable Google Cloud Run backend built in Python (FastAPI).
- Continuous Audio Streaming: We built a custom `LiveClient` class using the Google GenAI SDK to maintain a raw WebSocket connection and WebRTC audio context. We used `AudioWorkletNode` for low-latency microphone streaming, piping the raw audio directly into the Gemini Live session.
- Tri-Model AI Orchestration: We utilized three Gemini models in concert:
  - Live Stream: `gemini-2.0-flash-exp` handles the continuous audio context and the 9 autonomous function declarations.
  - Background Analytics: `gemini-3-flash` powers asynchronous loops that query the expanding transcript to populate the UI's structured JSON components without interrupting the voice flow.
  - Visual Synthesis: The `/api/recon` endpoint hits the `gemini-2.0-flash-exp` image generation model to dynamically synthesize drone surveillance imagery.
- Spatial Intelligence: To achieve real-time unit dispatching without latency, we integrated an R-Tree spatial index (`RBush`) on the frontend to query an 18,000-node database of US police stations. By defining bounding boxes, we reduced the time complexity of the geospatial distance search to O(log n), avoiding expensive backend roundtrips. Dispatched units are routed using the OSRM (Open Source Routing Machine) API.
- Secure Interlayer (AES Encryption): Because the Live API requires a direct WebSocket connection from the frontend to Google's servers, exposing the raw API key in the React client was a major security risk. To solve this, our Python backend acts as an encryption broker, transmitting the API key as an AES-CBC encrypted payload that the frontend decrypts in memory at runtime to initialize the WebRTC session securely.
- Backend & Cloud Integration: Our containerized Python backend handles requests using secure AES-CBC encrypted payloads. It leverages Google Cloud Firestore to permanently log every autonomous decision for transparent audit trails. Furthermore, we integrated Google Search Grounding into the backend analytics loop so the AI bases its triage and infrastructure commands on real-world facts, drastically reducing hallucinations.
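To make the Spatial Intelligence step concrete: the core idea is a cheap bounding-box prefilter before the exact great-circle distance ranking. The sketch below uses a single box plus a linear scan as a stand-in for the R-tree query that RBush performs in O(log n) on the frontend; the function and field names are ours, not the project's.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_station(stations, lat, lon, box_deg=0.5):
    """Prefilter candidates with a bounding box, then rank by true distance.
    An R-tree (like RBush) answers the box query in O(log n); the linear
    scan over `stations` here merely stands in for that indexed lookup."""
    candidates = [
        s for s in stations
        if abs(s["lat"] - lat) <= box_deg and abs(s["lon"] - lon) <= box_deg
    ] or stations  # fall back to the full set if the box catches nothing
    return min(candidates, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))
```

With a real spatial index, the candidate set shrinks to a handful of rows before any trigonometry runs, which is what keeps the lookup under a millisecond even at 18,000 nodes.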
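The routing leg uses OSRM's public v5 HTTP API, which expects coordinates in `lon,lat` order. A minimal sketch of building such a request follows; the demo-server base URL and the query parameters shown are standard OSRM options, but how Sentinel-911 actually configures them is an assumption.

```python
def osrm_route_url(start, end, base="https://router.project-osrm.org"):
    """Build an OSRM v5 driving-route request URL.

    `start` and `end` are (lat, lon) tuples; OSRM itself wants lon,lat,
    which is a classic source of flipped-coordinate bugs."""
    (slat, slon), (elat, elon) = start, end
    coords = f"{slon},{slat};{elon},{elat}"
    return f"{base}/route/v1/driving/{coords}?overview=full&geometries=geojson"
```

Fetching that URL returns a GeoJSON geometry that can be handed straight to a Leaflet polyline to animate the dispatched unit along real streets.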
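The Background Analytics loop is the piece that decouples slow structured analysis from the live audio stream. A minimal asyncio sketch of that pattern is below; the callback names are ours, and the real system wires these to a `gemini-3-flash` call and the React incident board rather than the stubs shown.

```python
import asyncio

async def analytics_loop(get_transcript, analyze, update_board, interval=2.0):
    """Background loop: poll the growing transcript and refresh the
    incident board without ever blocking the live audio session.

    get_transcript: () -> str          current accumulated transcript
    analyze:        async (str) -> dict structured triage (e.g. a Gemini call)
    update_board:   (dict) -> None     push results into the UI state
    """
    last_len = 0
    while True:
        transcript = get_transcript()
        if len(transcript) > last_len:          # only re-analyze on new speech
            last_len = len(transcript)
            board = await analyze(transcript)   # runs off the audio path
            update_board(board)
        await asyncio.sleep(interval)
```

Because the loop only reads a shared transcript and writes to UI state, the voice session never waits on analysis, which is the "human dispatcher multi-tasking" behavior described below in What we learned.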
Challenges we ran into
- WebRTC Security & API Key Exposure: The Gemini Live API requires establishing a direct WebRTC/WebSocket connection from the browser to Google. This meant the client needed the raw API Key, presenting a massive security vulnerability. We solved this by building a zero-trust architecture: the frontend requests the key from the backend, which encrypts it using AES-CBC encryption, sending a cipher payload that the frontend decrypts and immediately consumes in-memory to establish the connection safely without exposing it in source control or the network panel.
- Live Audio Latency & Duplication: Standard Web Audio implementations like `ScriptProcessorNode` introduced unacceptable latency and caused the AI to interrupt itself or duplicate tool call responses. We resolved this by migrating entirely to low-level `AudioWorklet` processing and explicitly omitting tool call responses back to the server, since echoing them triggered recursive audio loops.
- The "Tool Dump" Problem: When we first integrated our 9 spatial tools, the Gemini model wanted to solve the entire emergency on turn one, dumping all 9 function calls instantly. We spent significant time engineering a "Progressive Response Protocol" inside the system prompt to force the agent into a disciplined workflow: "Step 1: Location Only -> Step 2: Basic Dispatch -> Step 3: Escalation/Lockdown."
- Spatial Search Bottlenecks: A naive linear O(n) search to find the closest responding station from an 18,000-row database locked the UI thread. Re-architecting this with bounding-box mathematics via `RBush` reduced nearest-station lookup times to under 1 ms.
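In our build the "Progressive Response Protocol" lives in the system prompt, but the same discipline can also be sketched as a defensive server-side gate that refuses tool calls from stages the conversation has not reached yet. The stage and tool names below are illustrative assumptions, not the real 9-tool set.

```python
# Hypothetical guard mirroring the Progressive Response Protocol:
# a tool call is honored only if its stage is already unlocked.
STAGES = [
    {"confirm_location"},                    # Step 1: location only
    {"dispatch_unit"},                       # Step 2: basic dispatch
    {"escalate_threat", "lockdown_sector"},  # Step 3: escalation/lockdown
]

class ProgressiveGate:
    def __init__(self):
        self.stage = 0  # highest unlocked stage index

    def allow(self, tool_name):
        """Permit the call if it belongs to the current or an earlier stage;
        firing the current stage's tool unlocks the next stage."""
        for i, tools in enumerate(STAGES[: self.stage + 1]):
            if tool_name in tools:
                if i == self.stage and self.stage < len(STAGES) - 1:
                    self.stage += 1
                return True
        return False
```

A gate like this turns a prompt-level convention into a hard invariant: even if the model "dumps" every call at once, only the stage-appropriate ones take effect.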
Accomplishments that we're proud of
- Simultaneous Native Translation: Successfully implementing a `log_translation` function that allows the AI to maintain a verbal conversation in Spanish or French while simultaneously feeding the structured, exact English translation to the application state without breaking character.
- The Zero-Click Triage: Watching the system autonomously identify a threat from a chaotic transcript, map it, run an R-Tree search, fetch real street routing, execute a Cloud Firestore log, and move a police icon across the Live Map—without a single human interaction—was our "Eureka" moment.
- Automated Cloud Deployment: We built an Infrastructure-as-Code pipeline (`deploy.sh`) to reliably orchestrate the Dockerization and publication of the backend to Google Cloud Run and Artifact Registry, validating the architecture for production scale.
What we learned
- We learned that to build a true "Live Agent", you have to break the sequential request/response paradigm. By decoupling the Live API audio stream from the asynchronous backend loops that poll the transcript, you can create a system that constantly evaluates its environment concurrently, much like a human dispatcher multi-tasking.
- Prompt engineering for multimodal, real-time agents requires drastically different patterns than text models. You have to instruct the AI not just on what to say, but on when to act versus when to speak.
- Grounding is critical. When executing real-world commands like evacuations or lockdowns, giving the `gemini-3-flash` model Google Search capability drastically improved the validity of its escalation logic.
What's next for Sentinel-911
While Sentinel-911 is currently a highly capable proof-of-concept, the next steps involve integrating it with true dispatch ecosystems. We plan to integrate SIP/VoIP trunking so it can answer calls from actual telephone networks rather than web microphones. Furthermore, we aim to expand the Smart City API hooks to test integration with actual physical infrastructure like municipal IoT traffic control systems and automated drone deployment pads, pushing true autonomy to the tactical edge.
Built With
- aes
- audioworklet
- docker
- docker-compose
- fastapi
- gemini
- gemini-2.0-flash-exp
- gemini-3-flash
- gemini-3-pro-image
- gemini-live-api
- geopandas
- google-artifact-registry
- google-cloud-firestore
- google-cloud-run
- google-search-grounding
- leaflet.js
- nominatim-api
- osrm-api
- pydantic
- python
- rbush
- react
- react-leaflet
- typescript
- us-police-stations-db
- uvicorn
- vite
- webrtc
- websockets