Inspiration

Every year, thousands of people die because bystanders freeze in emergencies. They want to help; they just don’t know how. There are resources online, instructional videos, even apps, but in the moment of crisis, when someone collapses in front of you, you can’t read a screen. Your hands aren’t free. You’re panicking. I realized that what people need isn’t more information. They need a voice: a calm, human voice that sees what they see, understands when they’re panicking, and guides them step by step until help arrives. That’s why I built RESPONDO: the voice that becomes your lifeline.

What it does

RESPONDO transforms terrified bystanders into confident first responders using three AI modes.

🎙️ Voice Mode: Your AI Emergency Coach

Open the app and describe what you see. RESPONDO instantly identifies the emergency (cardiac arrest, choking, anaphylaxis, etc.) using Vertex AI Gemini, then starts guiding you step by step using ElevenLabs Conversational AI. The voice adapts in real time to your emotional state.

  • Panicking: "Stop. Breathe with me. You CAN do this."
  • Uncertain: "You're doing it right. Keep going."
  • Succeeding: "Perfect. You're saving their life."

No menus. No tapping. Just speak, and RESPONDO becomes the paramedic in your pocket.

👁️ Video Mode: Real-Time Technique Validation

Unsure if your CPR hand placement is correct? Show RESPONDO your hands. Gemini Vision analyzes your technique through your phone’s camera, and the ElevenLabs voice responds immediately.

"Move your hands down two inches. Perfect. Now lock your elbows."

It is like having a paramedic over your shoulder, correcting your form in real time.

🚨 SOS Mode: AI-Powered 911 Coach

When it is time to call emergency services, RESPONDO generates a complete call script based on:

  • Your exact location (GPS + address)
  • What emergency is happening
  • Everything you have done (with timestamps)
  • Current victim status

The AI reads the script aloud using ElevenLabs Text-to-Speech, so you can repeat it verbatim to 911. When paramedics arrive, you hand them a complete incident summary via QR code.

No guessing. No missed details. Just facts.

How we built it

The RESPONDO iOS app is built in SwiftUI. It captures and streams voice, uses CoreLocation to fetch GPS coordinates, and integrates with the native phone dialer. All requests flow to a Google Cloud Run backend running serverless Python (Flask). Cloud Run auto-scales to handle real-time emergency interactions and exposes REST API endpoints for voice, vision, and SOS workflows.

The backend uses Google Cloud services to power intelligence and state. Vertex AI Gemini (2.5 Flash) performs emergency triage from speech, analyzes images for technique validation, and generates the 911 call script and clinical summaries. Cloud Firestore persists each session, including transcripts, extracted actions, and a timestamped incident timeline. Google Maps Geocoding converts GPS coordinates into a precise street address for emergency calls.

ElevenLabs provides the human voice layer. Conversational AI Agents deliver real-time guidance with emotional intelligence and multi-turn dialogue. Text-to-Speech narrates the generated 911 script and key instructions so the user can keep hands and eyes free.

Voice Mode Implementation

Voice Mode uses the ElevenLabs Conversational AI Agent with RAG medical protocols configured in the ElevenLabs dashboard. The agent is pre-configured with emergency-specific knowledge bases and system prompts for each emergency type (cardiac arrest, choking, bleeding, etc.), enabling context-aware, real-time voice guidance. During conversations, transcripts are streamed to the Google Cloud Run backend, which uses the Vertex AI Gemini Text API to extract structured clinical data (actions, observations, critical flags) from the dialogue. All conversation data and extracted insights are persisted in Cloud Firestore, creating a complete incident record for later analysis and summary generation.
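As a rough illustration of the extraction step, the backend needs to turn whatever text the model returns into a typed record before it is stored. The sketch below is an assumption, not the project's actual code: the `Extraction` class, the `parse_extraction` helper, and the JSON keys `actions`, `observations`, and `critical_flags` are hypothetical names chosen to mirror the three categories described above. It deliberately makes no Gemini or Firestore calls and only shows defensive parsing of a JSON string.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Extraction:
    """Structured clinical data pulled from one transcript chunk (hypothetical shape)."""
    actions: list[str] = field(default_factory=list)
    observations: list[str] = field(default_factory=list)
    critical_flags: list[str] = field(default_factory=list)


def parse_extraction(raw: str) -> Extraction:
    """Parse model output into an Extraction; fall back to an empty record on bad input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Extraction()
    if not isinstance(data, dict):
        return Extraction()

    def strings(key: str) -> list[str]:
        # Keep only non-empty string items; ignore anything else the model emitted.
        value = data.get(key, [])
        if not isinstance(value, list):
            return []
        return [item.strip() for item in value if isinstance(item, str) and item.strip()]

    return Extraction(
        actions=strings("actions"),
        observations=strings("observations"),
        critical_flags=strings("critical_flags"),
    )
```

Returning an empty record instead of raising keeps a malformed model response from interrupting live guidance; the raw transcript would still be available for a later retry.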

System prompt engineering creates a voice that dynamically switches between three tones (calm, reassuring, instructional) based on real-time detection of panic, uncertainty, or confidence in the user's speech. This is not static text-to-speech. It is multi-turn guidance designed to feel like a human lifeline.

Video Mode Implementation

Video Mode uses Vertex AI Gemini API through the Google Cloud Run backend for real-time image analysis and CPR technique validation. When users capture images during emergency procedures, the iOS app sends frames to Cloud Run via a /v1/analyze-position endpoint, which processes them through Gemini Vision to detect hand placement, compression technique, and procedural accuracy. The backend uses Gemini’s structured JSON output to provide immediate corrective feedback, with ElevenLabs TTS generating audio responses. All video analysis sessions are tracked in Cloud Firestore, linking visual checks to the broader incident timeline.
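One way to picture the last step, turning a structured vision result into a spoken correction, is a small mapping function. This is a minimal sketch under stated assumptions: the `PositionCheck` fields, the sign convention for `offset_inches`, the `tolerance` default, and the exact wording are all hypothetical, since the real response schema is not documented here. The returned string is the kind of text that would then be handed to text-to-speech.

```python
from dataclasses import dataclass


@dataclass
class PositionCheck:
    """Hypothetical structured result for one analyzed frame."""
    offset_inches: float  # assumed convention: positive means hands are too high
    elbows_locked: bool


def feedback_line(check: PositionCheck, tolerance: float = 0.5) -> str:
    """Pick one short correction, fixing hand placement before anything else."""
    if abs(check.offset_inches) > tolerance:
        direction = "down" if check.offset_inches > 0 else "up"
        inches = max(1, round(abs(check.offset_inches)))
        unit = "inch" if inches == 1 else "inches"
        return f"Move your hands {direction} {inches} {unit}."
    if not check.elbows_locked:
        return "Hands look good. Now lock your elbows."
    return "Perfect. Keep going."
```

Emitting a single correction per frame, in priority order, keeps the audio short enough to follow while doing compressions.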

SOS Mode Implementation

SOS Mode orchestrates emergency response through Google Cloud Run. It aggregates incident data from Cloud Firestore, including transcripts, actions, and observations from Voice Mode, then uses Gemini to generate a comprehensive clinical summary. When the SOS flow is triggered via a 3-second hold gesture, Cloud Run fetches the complete session history from Firestore, runs synthesis through Gemini, and returns a structured summary payload. The system also generates shareable QR codes that link to web-accessible incident summaries, enabling a fast handoff to emergency responders. All session metadata, location data, and generated summaries are stored in Cloud Firestore for persistent record-keeping and audit trails.
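To make the shape of the call script concrete, here is a simplified sketch that assembles one from the four inputs listed earlier (location, emergency, actions with timestamps, victim status). In RESPONDO the script is generated by Gemini; the fixed template below is a stand-in for illustration only, and `TimedAction`, `build_call_script`, and the sentence wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TimedAction:
    """One thing the bystander did, with a clock time already formatted by the caller."""
    clock_time: str
    description: str


def build_call_script(
    address: str,
    emergency: str,
    actions: list[TimedAction],
    victim_status: str,
) -> str:
    """Assemble a short script the caller can read aloud, one fact per line."""
    lines = [
        f"I am at {address}.",
        f"I am reporting {emergency}.",
    ]
    if actions:
        lines.append("So far:")
        lines.extend(f"At {a.clock_time}, {a.description}." for a in actions)
    else:
        lines.append("No first aid has been given yet.")
    lines.append(f"Current status: {victim_status}.")
    return "\n".join(lines)
```

For example, calling it with a placeholder address such as "12 Example Street", the emergency "a suspected cardiac arrest", one action at "14:02", and the status "not breathing" yields a five-line script that leads with the location.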

Challenges we ran into

  1. Making AI sound human, not robotic
    The hardest part was building a voice that feels like a real paramedic, not a chatbot. We iterated through 20+ ElevenLabs system prompts and improved results by switching from one-way instructions to dialogue. That meant quick checkpoint questions, emotion-aware tone shifts, and plain language like "Are they breathing?" instead of clinical jargon.

  2. Real-time video analysis latency
    Early Gemini Vision responses took 2 to 3 seconds, which felt too slow during CPR. We reduced latency by:

    • Dropping image resolution to 720p
    • Using structured JSON output
    • Adding client-side caching for repeat checks

  3. Extracting actions from conversational transcripts
    Transcripts are messy, and turning speech into structured events was difficult. We built a Gemini extraction pipeline to detect:

    • Actions (compressions started, EpiPen administered)
    • Body locations (left thigh, center of chest)
    • Observations (not breathing, blue lips)
    • Timestamps
    This produces a machine-readable incident timeline from natural dialogue.
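The timeline described in the third challenge can be pictured as a list of small event records sorted by time. The sketch below assumes events have already been extracted; the `Event` fields, the `T+MM:SS` offset format, and the function names are hypothetical and only illustrate how actions, body locations, observations, and timestamps might fit together.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """One extracted event (hypothetical shape)."""
    seconds: int  # seconds since the session started
    kind: str     # "action" or "observation"
    text: str
    body_location: Optional[str] = None


def format_offset(seconds: int) -> str:
    """Render elapsed time as T+MM:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"T+{minutes:02d}:{secs:02d}"


def build_timeline(events: list[Event]) -> list[str]:
    """Sort events chronologically and render one readable line per event."""
    lines = []
    for event in sorted(events, key=lambda e: e.seconds):
        where = f" ({event.body_location})" if event.body_location else ""
        lines.append(f"{format_offset(event.seconds)} {event.kind}: {event.text}{where}")
    return lines
```

Sorting at render time means events can be appended in whatever order the extraction returns them, which matters when speech is transcribed and processed in overlapping chunks.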

Accomplishments that we're proud of

Most emergency apps are static checklists. RESPONDO listens, adapts, and responds like a real human coach. When you panic, it grounds you. When you're uncertain, it builds confidence. It is not just answering questions. It creates human connection in the most inhuman moments.

  • Built a true multimodal system by combining ElevenLabs Conversational AI with Gemini Vision for visual plus voice coaching
    You can ask "Is this right?" while showing your hands and get immediate technique feedback.
  • Turned live conversation into a complete emergency call script and dialer-ready flow
    This saves critical seconds and reduces missed details during a 911 call.
  • Created a timestamped incident audit trail automatically
    Each session stores actions, observations, and outcomes, then generates a QR code handoff for paramedics with full context.
  • Made the experience multilingual by default
    Hindi, English, Spanish. RESPONDO adapts automatically, expanding access for non-English speakers who might otherwise hesitate to use an emergency app.

What we learned

Prompt engineering is half the battle. We spent around 40% of development time refining the ElevenLabs agent prompt, and tiny tone shifts like "Stop. Breathe." versus "Please calm down" completely changed how helpful the experience felt. The best prompts sound like natural speech, not formal instructions.

Gemini Vision is powerful for real-world use because it understands context such as CPR hand placement, not just objects. Structured output mode is essential in production to avoid brittle parsing.

Serverless works for emergencies. Cloud Run auto-scaling handled our demo load smoothly, and cold starts under 1 second are acceptable when the first instruction is to call 911.

Most importantly, multimodal AI is ready now. Combining voice, vision, text, and location creates an experience that is far more capable than any single modality, and it is clear that modern AI apps should be multimodal by default.

What's next for RESPONDO

Practice Mode with AI Feedback

  • Safe training scenarios with immediate correction
  • Gamification with confidence badges for technique mastery
  • Built for schools and workplaces to run first aid training at scale

Wearable Integration

  • Apple Watch app for hands-free SOS activation
  • Haptic feedback to keep CPR compression rhythm
  • Heart rate monitoring to detect responder stress

Offline Mode

  • Works with no signal using on-device emergency checklists and audio guidance
  • One-tap access to critical steps while the phone tries to reconnect
  • Auto-syncs the incident timeline when connectivity returns

For the first time, anyone with a smartphone has instant access to AI-powered emergency coaching that adapts to emotional state, validates technique in real time, and guides the entire incident from panic to action to EMS handoff. RESPONDO does not replace emergency services. It amplifies bystanders in the critical minutes before help arrives.

Built with

  • Google Cloud: Vertex AI Gemini 2.5 Flash, Cloud Run, Firestore, Maps API
  • ElevenLabs: Conversational AI Agents, Text-to-Speech
  • iOS: Swift, SwiftUI
