NO AUDIO IN THE VIDEO DUE TO A TECHNICAL ISSUE. Please read below to understand what the project is about. Depending on the urgency level, we receive a call and an email for the alerts indicated at the bottom of the report. We showed that to the judges in person :)

🧠 VitalSight

Transforming surveillance into situational awareness


🚀 Inspiration

VitalSight was born from a simple but unsettling realization: in a world blanketed by cameras, emergencies still go unseen.
Hospitals, elder-care facilities, and factories may be under constant CCTV watch, yet when someone collapses at 2 AM or a worker shows signs of distress, those critical first moments still depend on a human noticing. Surveillance today is passive; awareness still requires attention.

We wanted to change that.

Our vision was to build a system that doesn't just see, but understands: one that transforms ordinary cameras into intelligent sentinels capable of recognizing and articulating emergencies as they happen. By fusing real-time computer vision with the interpretive reasoning of large language models, VitalSight removes human latency from the detection phase entirely and delivers contextual, actionable intelligence, not just a red light on a dashboard.

What began as a fall-detection prototype evolved into something larger: a context-aware environment, a digital guardian that perceives discomfort, interprets context, and initiates help on its own.


๐Ÿฅ What it does

VitalSight is an AI-driven emergency detection and response system that converts passive CCTV infrastructure into an active safety layer for humans.
It continuously analyzes video feeds (live cameras, uploaded footage, or batch recordings) to identify and act upon five categories of emergencies:

  1. 🟢 Fall Detection (Low Priority)
    YOLO 11 + MediaPipe Pose track orientation, velocity, and body aspect ratios to distinguish genuine falls from benign movements such as sitting or kneeling (a minimal sketch of this heuristic follows the list).

  2. 🔴 Fire Detection (Critical Priority)
    HSV-based flame and smoke segmentation with temporal filtering differentiates actual combustion from reflections or glare.

  3. 🟠 Respiratory / Medical Distress (High Priority)
    Tracks hand-to-chest proximity and abnormal posture dynamics to recognize early indicators of cardiac or respiratory crises.

  4. 🟡 Violence / Panic / Crowd Disorder (Medium Priority)
    Multi-person tracking identifies aggressive postures, raised arms, and chaotic motion within groups.

  5. 🔴 Severe Injury / Immobility (Critical Priority)
    Detects person-object collisions, prone postures, and extended immobility following impact.
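
To make the fall logic concrete, here is a minimal sketch of the kind of bounding-box heuristic described in item 1. The class and threshold names (FallHeuristic, WIDE_RATIO, DROP_SPEED) are illustrative placeholders, not our tuned production values:

```python
from collections import deque

# Illustrative thresholds; the real detector tunes these per camera.
WIDE_RATIO = 1.2    # width/height above this suggests a horizontal body
DROP_SPEED = 0.15   # normalized drop of the box center that suggests a fall
HISTORY = 10        # frames of temporal smoothing


class FallHeuristic:
    """Flag a fall when a person's box flips from tall to wide while dropping fast."""

    def __init__(self):
        self.centers = deque(maxlen=HISTORY)

    def update(self, box, frame_height):
        x1, y1, x2, y2 = box                               # YOLO xyxy box for one person
        w, h = x2 - x1, y2 - y1
        self.centers.append((y1 + y2) / 2 / frame_height)  # normalized center-y
        if len(self.centers) < HISTORY:
            return False                                   # not enough history yet
        drop = self.centers[-1] - self.centers[0]          # downward motion over the window
        return (w / max(h, 1)) > WIDE_RATIO and drop > DROP_SPEED
```

The real pipeline layers MediaPipe keypoints and temporal debouncing on top of a check like this to reject sitting and kneeling.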


🧠 The Intelligence Layer

When an anomaly is detected, VitalSight doesn't just raise an alarm; it explains what happened.
Each incident frame is analyzed through our AI reasoning stack, powered by Gemini 2.0 Flash, with narration via Eleven Labs TTS.


🔷 Gemini 2.0 Flash (API Integration)

Gemini 2.0 Flash serves as the situational reasoning core of VitalSight.
It transforms raw detection metadata into human-readable, structured reports through prompt-engineered contextual templates:

  • Immediate Situation: concise natural-language summary of the visual event
  • Observable Details: objects, people, positions, and potential hazards
  • Assessment: likely cause, escalation risk, or misdetection confidence
  • Recommended Action: clear responder guidance and equipment hints

How it works:

  • Inference requests are generated asynchronously from the detection thread.
  • Each report runs through Gemini's fast multimodal reasoning endpoint for sub-second contextual understanding.
  • Output is formatted in Markdown / JSON for dashboard rendering and Twilio notifications.
  • It also serves as a fallback layer, invoked every few hundred frames, to catch situations our detectors have never encountered or were not trained for (see the sketch below).
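
As a rough illustration of that asynchronous pattern, here is a minimal sketch using the google-generativeai SDK; the prompt template, key handling, and callback are simplified stand-ins for what gemini_reporter.py actually does:

```python
import threading

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")   # illustrative; loaded from config in practice
model = genai.GenerativeModel("gemini-2.0-flash")

REPORT_PROMPT = (
    "You are an emergency triage assistant. Given this detection metadata:\n"
    "{metadata}\n"
    "Write a Markdown report with sections: Immediate Situation, "
    "Observable Details, Assessment, Recommended Action."
)


def report_async(metadata: dict, on_done):
    """Run Gemini inference on a worker thread so the detection loop never blocks."""
    def worker():
        # The real pipeline also attaches the incident frame for multimodal reasoning.
        response = model.generate_content(REPORT_PROMPT.format(metadata=metadata))
        on_done(response.text)   # Markdown report for the dashboard / Twilio layer

    threading.Thread(target=worker, daemon=True).start()
```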

🔊 Eleven Labs API (Voice Generation Layer)

To make alerts audible and accessible in control-room environments, we integrated Eleven Labs' Speech Synthesis API.
Whenever Gemini produces a textual report, Eleven Labs converts the summary into a natural, human-like voice notification, as sketched after this list:

  • Dynamic tone control: calmer narration for minor events, urgent tone for critical alerts
  • Multi-language readiness: easily localizable for multilingual facilities
  • Edge deployment: lightweight MP3s streamed directly to browser or IoT speakers
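
As a minimal sketch of this layer, the Eleven Labs text-to-speech REST endpoint can be called directly with requests; the voice settings shown are illustrative knobs, and varying them by severity is one way to realize the dynamic tone control above:

```python
import requests

ELEVEN_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def speak(summary: str, voice_id: str, api_key: str, out_path: str = "alert.mp3") -> str:
    """Convert a Gemini summary into an MP3 voice alert (sketch)."""
    resp = requests.post(
        ELEVEN_URL.format(voice_id=voice_id),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        json={
            "text": summary,
            "model_id": "eleven_multilingual_v2",  # multilingual-ready model
            # Illustrative settings: lower stability reads as more urgent.
            "voice_settings": {"stability": 0.4, "similarity_boost": 0.8},
        },
        timeout=30,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)   # MP3 bytes, ready to stream to the dashboard or Twilio
    return out_path
```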

This pairing of Gemini's cognitive reasoning with Eleven Labs' expressive narration creates a truly multimodal understanding + response system, one that sees, reasons, and speaks.


🔔 Automated Response Orchestration

Integrated with Twilio, VitalSight can instantly:

  • Call designated responders for CRITICAL alerts
  • Text or email security teams with the Gemini-generated report and Eleven Labs audio clip
  • Escalate unresolved alerts by re-contacting backups or triggering IoT alarms

Every response follows a configurable escalation matrix based on priority.
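
Conceptually, that orchestration is a priority-to-channels lookup over the Twilio SDK. A minimal sketch, in which the ESCALATION table, numbers, and URLs are illustrative placeholders:

```python
from twilio.rest import Client

client = Client("TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN")  # illustrative credentials

# Illustrative escalation matrix: priority -> channels to fire.
ESCALATION = {
    "CRITICAL": ["call", "sms"],
    "HIGH": ["sms"],
    "MEDIUM": ["sms"],
    "LOW": [],
}


def dispatch(priority, report, responder, from_number, twiml_url):
    """Fan an incident out to every channel configured for its priority."""
    for channel in ESCALATION.get(priority, []):
        if channel == "call":
            # Twilio fetches TwiML from twiml_url; ours plays the Eleven Labs clip.
            client.calls.create(to=responder, from_=from_number, url=twiml_url)
        elif channel == "sms":
            client.messages.create(to=responder, from_=from_number, body=report[:1600])
```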


💻 The User Experience

A modern, web-based dashboard built with Flask + Tailwind provides:

  • A 3 × 4 live grid of processed or streaming videos
  • Real-time MJPEG previews with bounding boxes and pose overlays (streaming pattern sketched after this list)
  • Clickable incident tiles showing Gemini reports and Eleven Labs audio playback
  • Progress tracking with live frame counters and completion bars
  • Session-secured access for authenticated users
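
The live previews use standard MJPEG-over-HTTP. A minimal Flask sketch of the streaming pattern (route naming and camera indexing are simplified; the production app draws detector overlays before encoding):

```python
import cv2
from flask import Flask, Response

app = Flask(__name__)


def mjpeg_frames(camera_index=0):
    """Yield JPEG-encoded frames as a multipart MJPEG stream."""
    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            ok, jpeg = cv2.imencode(".jpg", frame)   # overlays would be drawn before this
            if not ok:
                continue
            yield (b"--frame\r\n"
                   b"Content-Type: image/jpeg\r\n\r\n" + jpeg.tobytes() + b"\r\n")
    finally:
        cap.release()


@app.route("/stream/<int:cam>")
def stream(cam):
    return Response(mjpeg_frames(cam),
                    mimetype="multipart/x-mixed-replace; boundary=frame")
```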

🧠 How we built it

Architecture Overview

  1. Detection Engine (detector_v2.py) – YOLO 11 (Object + Pose) via PyTorch 2.0

    • Multi-person tracking, keypoint extraction, temporal smoothing
    • Custom scoring per emergency type
    • Real-time frame callback for web streaming
  2. Pose Analysis (pose.py) – MediaPipe Pose

    • Torso angle, hand-chest distance, limb geometry (sketched after this list)
    • Temporal debouncing
  3. AI Reasoning (gemini_reporter.py) – Gemini 2.0 Flash

    • Context-aware prompt templates
    • Background threading for non-blocking inference
    • Structured markdown output
  4. Voice Alerts (tts_notifier.py) – Eleven Labs TTS API

    • Converts text summaries into audio notifications
    • Streams MP3 files to dashboard and Twilio call endpoints
  5. Web Application (webapp.py) – Flask 3 + Tailwind CSS

    • MJPEG streaming, REST APIs, authentication, glass-morphic UI
  6. Batch Processing (batch_process.py)

    • Headless directory inference + report bundling
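
To illustrate the geometry inside pose.py, here is a minimal sketch of two of the features named above, computed from MediaPipe landmarks (i.e., results.pose_landmarks.landmark); the exact formulas and thresholds in the real module differ:

```python
import math

import mediapipe as mp

L = mp.solutions.pose.PoseLandmark


def torso_angle(landmarks):
    """Tilt of the shoulder-hip axis from vertical: ~0 deg upright, ~90 deg prone."""
    shoulder, hip = landmarks[L.LEFT_SHOULDER], landmarks[L.LEFT_HIP]
    dx, dy = hip.x - shoulder.x, hip.y - shoulder.y   # normalized image coords, y grows downward
    return abs(math.degrees(math.atan2(dx, dy)))


def hand_to_chest(landmarks):
    """Normalized wrist-to-chest distance, a cue for respiratory distress."""
    cx = (landmarks[L.LEFT_SHOULDER].x + landmarks[L.RIGHT_SHOULDER].x) / 2
    cy = (landmarks[L.LEFT_SHOULDER].y + landmarks[L.RIGHT_SHOULDER].y) / 2
    wrist = landmarks[L.RIGHT_WRIST]
    return math.hypot(wrist.x - cx, wrist.y - cy)
```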

Tech Stack

  • Vision & ML: PyTorch 2.0+, YOLO 11, MediaPipe Pose
  • AI services: Gemini 2.0 Flash (LLM reasoning), Eleven Labs TTS API
  • Web: Flask 3.0, Tailwind CSS
  • Comms: Twilio (SMS / voice / email)
  • Async: Python threading
  • Config: YAML runtime profiles

โš™๏ธ Challenges we ran into

  • Finding a good dataset was difficult: clear, unblurred footage that fit our scenarios was scarce
  • Real-time streaming without duplication
  • Browser codec compatibility and autoplay restrictions
  • Asynchronous Gemini threads and Eleven Labs audio queuing
  • Unified emoji-based severity parsing
  • Live progress counters and authentication caching

๐Ÿ† Accomplishments that weโ€™re proud of

  • Built a fully operational multi-modal emergency AI system in under 36 hours
  • Achieved real-time inference + LLM reasoning + voice generation
  • Deployed an end-to-end alert loop: detection → reasoning → voice + text alerts
  • Created a context-aware environment: cameras that understand and speak
  • Engineered a scalable modular pipeline ready for on-prem or cloud

📚 What we learned

  • Hybrid reasoning (fast CV + slow LLM) yields richer awareness
  • Real-time systems depend equally on codec, network, and UX tuning
  • Automation means little without communication; voice alerts closed the loop

🔮 What's next for VitalSight

  • Multi-camera scaling on Jetson / L4 GPU clusters
  • Hybrid cloud architecture with local detection + cloud verification
  • Motion transformers for predictive behavior modeling
  • Vector-based incident memory for analytics
  • Integration with 911 dispatch and hospital EHR systems

VitalSight represents a shift from monitoring to understanding: from cameras that see to environments that care.
