NO AUDIO IN THE VIDEO DUE TO A TECHNICAL ISSUE. Please read below to understand what the project is about. Depending on the urgency level, we receive a call and an email for the alerts indicated at the bottom of the report. We showed that to the judges in person :)

🧠 VitalSight

Transforming surveillance into situational awareness


🚀 Inspiration

VitalSight was born from a simple but unsettling realization: in a world blanketed by cameras, emergencies still go unseen.
Hospitals, elder-care facilities, and factories may be under constant CCTV watch, yet when someone collapses at 2 AM or a worker shows signs of distress, those critical first moments still depend on a human noticing. Surveillance today is passive; awareness still requires attention.

We wanted to change that.

Our vision was to build a system that doesn't just see, but understands: one that transforms ordinary cameras into intelligent sentinels capable of recognizing and articulating emergencies as they happen. By fusing real-time computer vision with the interpretive reasoning of large language models, VitalSight removes human latency from the detection phase entirely and delivers contextual, actionable intelligence, not just a red light on a dashboard.

What began as a fall-detection prototype evolved into something larger: a context-aware environment, a digital guardian that perceives discomfort, interprets context, and initiates help on its own.


๐Ÿฅ What it does

VitalSight is an AI-driven emergency detection and response system that converts passive CCTV infrastructure into an active safety layer for humans.
It continuously analyzes video feeds (live cameras, uploaded footage, or batch recordings) to identify and act upon five categories of emergencies:

  1. 🟢 Fall Detection (Low Priority)
    YOLO 11 + MediaPipe Pose track orientation, velocity, and body aspect ratios to distinguish genuine falls from benign movements such as sitting or kneeling (a minimal sketch of this heuristic follows the list).

  2. 🔴 Fire Detection (Critical Priority)
    HSV-based flame and smoke segmentation with temporal filtering differentiates actual combustion from reflections or glare.

  3. 🟠 Respiratory / Medical Distress (High Priority)
    Tracks hand-to-chest proximity and abnormal posture dynamics to recognize early indicators of cardiac or respiratory crises.

  4. 🟡 Violence / Panic / Crowd Disorder (Medium Priority)
    Multi-person tracking identifies aggressive postures, raised arms, and chaotic motion within groups.

  5. 🔴 Severe Injury / Immobility (Critical Priority)
    Detects person-object collisions, prone postures, and extended immobility following impact.
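
To make the fall logic concrete, here is a minimal sketch of the kind of bounding-box heuristic described in item 1. The class and threshold names (FallHeuristic, WIDE_RATIO, DROP_SPEED) are illustrative placeholders, not our tuned production values:

```python
from collections import deque

# Illustrative thresholds; the real detector tunes these per camera.
WIDE_RATIO = 1.2    # width/height above this suggests a horizontal body
DROP_SPEED = 0.15   # normalized drop of the box center that suggests a fall
HISTORY = 10        # frames of temporal smoothing


class FallHeuristic:
    """Flag a fall when a person's box flips from tall to wide while dropping fast."""

    def __init__(self):
        self.centers = deque(maxlen=HISTORY)

    def update(self, box, frame_height):
        x1, y1, x2, y2 = box                               # YOLO xyxy box for one person
        w, h = x2 - x1, y2 - y1
        self.centers.append((y1 + y2) / 2 / frame_height)  # normalized center-y
        if len(self.centers) < HISTORY:
            return False                                   # not enough history yet
        drop = self.centers[-1] - self.centers[0]          # downward motion over the window
        return (w / max(h, 1)) > WIDE_RATIO and drop > DROP_SPEED
```

The real pipeline layers MediaPipe keypoints and temporal debouncing on top of a check like this to reject sitting and kneeling.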


🧠 The Intelligence Layer

When an anomaly is detected, VitalSight doesn't just raise an alarm; it explains what happened.
Each incident frame is analyzed through our AI reasoning stack, powered by Gemini 2.0 Flash, with narration via Eleven Labs TTS.


🔷 Gemini 2.0 Flash (API Integration)

Gemini 2.0 Flash serves as the situational reasoning core of VitalSight.
It transforms raw detection metadata into human-readable, structured reports through prompt-engineered contextual templates:

  • Immediate Situation: concise natural-language summary of the visual event
  • Observable Details: objects, people, positions, and potential hazards
  • Assessment: likely cause, escalation risk, or misdetection confidence
  • Recommended Action: clear responder guidance and equipment hints

How it works:

  • Inference requests are generated asynchronously from the detection thread.
  • Each report runs through Gemini's fast multimodal reasoning endpoint for sub-second contextual understanding.
  • Output is formatted in Markdown / JSON for dashboard rendering and Twilio notifications.
  • It also serves as a fallback layer, invoked every few hundred frames, to catch situations our detectors have never encountered or were not trained for (see the sketch below).
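
As a rough illustration of that asynchronous pattern, here is a minimal sketch using the google-generativeai SDK; the prompt template, key handling, and callback are simplified stand-ins for what gemini_reporter.py actually does:

```python
import threading

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")   # illustrative; loaded from config in practice
model = genai.GenerativeModel("gemini-2.0-flash")

REPORT_PROMPT = (
    "You are an emergency triage assistant. Given this detection metadata:\n"
    "{metadata}\n"
    "Write a Markdown report with sections: Immediate Situation, "
    "Observable Details, Assessment, Recommended Action."
)


def report_async(metadata: dict, on_done):
    """Run Gemini inference on a worker thread so the detection loop never blocks."""
    def worker():
        # The real pipeline also attaches the incident frame for multimodal reasoning.
        response = model.generate_content(REPORT_PROMPT.format(metadata=metadata))
        on_done(response.text)   # Markdown report for the dashboard / Twilio layer

    threading.Thread(target=worker, daemon=True).start()
```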

🔊 Eleven Labs API (Voice Generation Layer)

To make alerts audible and accessible in control-room environments, we integrated Eleven Labs' Speech Synthesis API.
Whenever Gemini produces a textual report, Eleven Labs converts the summary into a natural, human-like voice notification, as sketched after this list:

  • Dynamic tone control: calmer narration for minor events, urgent tone for critical alerts
  • Multi-language readiness: easily localizable for multilingual facilities
  • Edge deployment: lightweight MP3s streamed directly to browser or IoT speakers
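
As a minimal sketch of this layer, the Eleven Labs text-to-speech REST endpoint can be called directly with requests; the voice settings shown are illustrative knobs, and varying them by severity is one way to realize the dynamic tone control above:

```python
import requests

ELEVEN_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def speak(summary: str, voice_id: str, api_key: str, out_path: str = "alert.mp3") -> str:
    """Convert a Gemini summary into an MP3 voice alert (sketch)."""
    resp = requests.post(
        ELEVEN_URL.format(voice_id=voice_id),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        json={
            "text": summary,
            "model_id": "eleven_multilingual_v2",  # multilingual-ready model
            # Illustrative settings: lower stability reads as more urgent.
            "voice_settings": {"stability": 0.4, "similarity_boost": 0.8},
        },
        timeout=30,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)   # MP3 bytes, ready to stream to the dashboard or Twilio
    return out_path
```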

This pairing of Gemini's cognitive reasoning with Eleven Labs' expressive narration creates a truly multimodal understanding + response system, one that sees, reasons, and speaks.


🔔 Automated Response Orchestration

Integrated with Twilio, VitalSight can instantly:

  • Call designated responders for CRITICAL alerts
  • Text or email security teams with the Gemini-generated report and Eleven Labs audio clip
  • Escalate unresolved alerts by re-contacting backups or triggering IoT alarms

Every response follows a configurable escalation matrix based on priority.
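
Conceptually, that orchestration is a priority-to-channels lookup over the Twilio SDK. A minimal sketch, in which the ESCALATION table, numbers, and URLs are illustrative placeholders:

```python
from twilio.rest import Client

client = Client("TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN")  # illustrative credentials

# Illustrative escalation matrix: priority -> channels to fire.
ESCALATION = {
    "CRITICAL": ["call", "sms"],
    "HIGH": ["sms"],
    "MEDIUM": ["sms"],
    "LOW": [],
}


def dispatch(priority, report, responder, from_number, twiml_url):
    """Fan an incident out to every channel configured for its priority."""
    for channel in ESCALATION.get(priority, []):
        if channel == "call":
            # Twilio fetches TwiML from twiml_url; ours plays the Eleven Labs clip.
            client.calls.create(to=responder, from_=from_number, url=twiml_url)
        elif channel == "sms":
            client.messages.create(to=responder, from_=from_number, body=report[:1600])
```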


💻 The User Experience

A modern, web-based dashboard built with Flask + Tailwind provides:

  • A 3 × 4 live grid of processed or streaming videos
  • Real-time MJPEG previews with bounding boxes and pose overlays (streaming pattern sketched after this list)
  • Clickable incident tiles showing Gemini reports and Eleven Labs audio playback
  • Progress tracking with live frame counters and completion bars
  • Session-secured access for authenticated users
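
The live previews use standard MJPEG-over-HTTP. A minimal Flask sketch of the streaming pattern (route naming and camera indexing are simplified; the production app draws detector overlays before encoding):

```python
import cv2
from flask import Flask, Response

app = Flask(__name__)


def mjpeg_frames(camera_index=0):
    """Yield JPEG-encoded frames as a multipart MJPEG stream."""
    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            ok, jpeg = cv2.imencode(".jpg", frame)   # overlays would be drawn before this
            if not ok:
                continue
            yield (b"--frame\r\n"
                   b"Content-Type: image/jpeg\r\n\r\n" + jpeg.tobytes() + b"\r\n")
    finally:
        cap.release()


@app.route("/stream/<int:cam>")
def stream(cam):
    return Response(mjpeg_frames(cam),
                    mimetype="multipart/x-mixed-replace; boundary=frame")
```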

🧠 How we built it

Architecture Overview

  1. Detection Engine (detector_v2.py) – YOLO 11 (Object + Pose) via PyTorch 2.0

    • Multi-person tracking, keypoint extraction, temporal smoothing
    • Custom scoring per emergency type
    • Real-time frame callback for web streaming
  2. Pose Analysis (pose.py) – MediaPipe Pose

    • Torso angle, hand-chest distance, limb geometry (sketched after this list)
    • Temporal debouncing
  3. AI Reasoning (gemini_reporter.py) – Gemini 2.0 Flash

    • Context-aware prompt templates
    • Background threading for non-blocking inference
    • Structured markdown output
  4. Voice Alerts (tts_notifier.py) – Eleven Labs TTS API

    • Converts text summaries into audio notifications
    • Streams MP3 files to dashboard and Twilio call endpoints
  5. Web Application (webapp.py) – Flask 3 + Tailwind CSS

    • MJPEG streaming, REST APIs, authentication, glass-morphic UI
  6. Batch Processing (batch_process.py)

    • Headless directory inference + report bundling
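
To illustrate the geometry inside pose.py, here is a minimal sketch of two of the features named above, computed from MediaPipe landmarks (i.e., results.pose_landmarks.landmark); the exact formulas and thresholds in the real module differ:

```python
import math

import mediapipe as mp

L = mp.solutions.pose.PoseLandmark


def torso_angle(landmarks):
    """Tilt of the shoulder-hip axis from vertical: ~0 deg upright, ~90 deg prone."""
    shoulder, hip = landmarks[L.LEFT_SHOULDER], landmarks[L.LEFT_HIP]
    dx, dy = hip.x - shoulder.x, hip.y - shoulder.y   # normalized image coords, y grows downward
    return abs(math.degrees(math.atan2(dx, dy)))


def hand_to_chest(landmarks):
    """Normalized wrist-to-chest distance, a cue for respiratory distress."""
    cx = (landmarks[L.LEFT_SHOULDER].x + landmarks[L.RIGHT_SHOULDER].x) / 2
    cy = (landmarks[L.LEFT_SHOULDER].y + landmarks[L.RIGHT_SHOULDER].y) / 2
    wrist = landmarks[L.RIGHT_WRIST]
    return math.hypot(wrist.x - cx, wrist.y - cy)
```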

Tech Stack

  • Vision & ML: PyTorch 2.0+, YOLO 11, MediaPipe Pose
  • AI services: Gemini 2.0 Flash (LLM reasoning), Eleven Labs TTS API
  • Web: Flask 3.0, Tailwind CSS
  • Comms: Twilio (SMS / voice / email)
  • Async: Python threading
  • Config: YAML runtime profiles

โš™๏ธ Challenges we ran into

  • Finding a good dataset was difficult: clear, unblurred footage that fit our scenarios was scarce
  • Real-time streaming without duplication
  • Browser codec compatibility and autoplay restrictions
  • Asynchronous Gemini threads and Eleven Labs audio queuing
  • Unified emoji-based severity parsing
  • Live progress counters and authentication caching

๐Ÿ† Accomplishments that weโ€™re proud of

  • Built a fully operational multi-modal emergency AI system in under 36 hours
  • Achieved real-time inference + LLM reasoning + voice generation
  • Deployed an end-to-end alert loop: detection → reasoning → voice + text alerts
  • Created a context-aware environment: cameras that understand and speak
  • Engineered a scalable modular pipeline ready for on-prem or cloud

📚 What we learned

  • Hybrid reasoning (fast CV + slow LLM) yields richer awareness
  • Real-time systems depend equally on codec, network, and UX tuning
  • Automation means little without communication; voice alerts closed the loop

🔮 What's next for VitalSight

  • Multi-camera scaling on Jetson / L4 GPU clusters
  • Hybrid cloud architecture with local detection + cloud verification
  • Motion transformers for predictive behavior modeling
  • Vector-based incident memory for analytics
  • Integration with 911 dispatch and hospital EHR systems

VitalSight represents a shift from monitoring to understanding: from cameras that see to environments that care.
