Inspiration

Personal safety rarely fails because danger is invisible — it fails because
signals are fragmented, subtle, and easy to miss.

In real emergencies, distress may surface as a short message, a change in voice tone, an unexpected sound, or a fleeting visual cue captured at the wrong moment. Often, the person at risk cannot safely speak, type, or interact with an app at all.

Through earlier explorations of safety-focused AI systems, we identified a recurring design gap:

The moments of highest risk are often the moments when users are least able to interact.

Silence does not indicate safety. In many real-world scenarios, silence itself is a critical signal.

This reality became impossible to ignore after we encountered a powerful investigation by The New York Times that examined Uber’s sexual-assault problem:
https://www.nytimes.com/video/business/100000010323329/ubers-sexual-assault-problem.html

The investigation revealed how warning signs often appeared before incidents escalated—missed signals, fragmented context, and moments where intervention could have mattered, but no system was capable of listening across all channels at once.

That investigation opened our eyes to the scale of this problem and exposed a critical limitation of traditional, interaction-dependent safety systems.

Gemini AllSenses

Gemini AllSenses was created to address this gap by enabling AI systems to reason across multiple modalities — text, audio, and visual context — and support human safety precisely when traditional mechanisms break down.

It is designed for the moments when users cannot press a button, speak a word, or ask for help—yet are still sending signals that matter.


What it does

Gemini AllSenses is a real-time, multimodal AI guardian for human safety.

The system detects potential distress by correlating text, audio, and visual signals, instead of relying on a single trigger. It evaluates cross-modal alignment to infer situational risk using:

  • Text signals — emergency keywords and abnormal message patterns
  • Audio signals — non-verbal sounds and environmental anomalies
  • Visual context — short, privacy-preserving video frames

Overall risk is inferred as a fusion of modality-specific evidence:

$$ \text{Risk}_{total} = f(\text{Audio}, \text{Text}, \text{Vision}) $$
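
As a rough illustration of what the fusion function f might look like, the sketch below combines per-modality scores in [0, 1] with a weighted average and adds a small bonus when two or more modalities agree. The weights, the agreement cutoff, the bonus, and the name `fuse_risk` are all hypothetical values chosen for the example; the actual system delegates this reasoning to Gemini and may work quite differently.

```python
from typing import Mapping

# Hypothetical weights; the project does not publish its actual values.
WEIGHTS = {"audio": 0.4, "text": 0.35, "vision": 0.25}
AGREEMENT_CUTOFF = 0.5   # a modality "fires" at or above this score (assumed)
AGREEMENT_BONUS = 0.15   # added when two or more modalities fire together (assumed)


def fuse_risk(scores: Mapping[str, float]) -> float:
    """Combine per-modality scores in [0, 1] into one risk value in [0, 1]."""
    for name, value in scores.items():
        if name not in WEIGHTS:
            raise ValueError(f"unknown modality: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1]")
    # Missing modalities contribute zero evidence.
    base = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    # Reward cross-modal alignment rather than any single trigger.
    firing = sum(1 for value in scores.values() if value >= AGREEMENT_CUTOFF)
    bonus = AGREEMENT_BONUS if firing >= 2 else 0.0
    return min(1.0, base + bonus)
```

With these assumed numbers, a strong audio signal alone stays well below the score produced when audio and text agree, which is the behavior the "alignment, not isolation" idea calls for.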

Event-Scoped Video Intelligence

Video is not always on.

The visual component activates only after an emergency risk threshold is detected. When triggered, the system captures 1–3 short video frames to provide minimal contextual evidence.

These frames are analyzed exclusively for environmental risk indicators, such as:

  • Low visibility or darkness
  • Isolation or confinement
  • Abrupt movement or instability
  • Disruptive environmental conditions

No biometric identification is performed. Individuals are not recognized or tracked.

This approach maximizes situational awareness while minimizing data collection and privacy risk.
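
The gating described above can be sketched as a small function that refuses to touch the camera until the risk threshold is crossed and never returns more than three frames. The threshold value, the function name `capture_event_frames`, and the `grab_frame` callback are assumptions made for this example, not the project's real interface.

```python
from typing import Callable, List

RISK_THRESHOLD = 0.6   # hypothetical trigger level
MAX_FRAMES = 3         # the write-up states 1-3 frames per event


def capture_event_frames(
    risk: float,
    grab_frame: Callable[[], bytes],
    requested: int = 3,
) -> List[bytes]:
    """Return 1 to MAX_FRAMES frames, but only when risk crosses the threshold."""
    if risk < RISK_THRESHOLD:
        return []  # camera is never invoked below the threshold
    count = max(1, min(requested, MAX_FRAMES))
    return [grab_frame() for _ in range(count)]
```

Passing the camera in as a callback keeps the capture decision testable without any real video hardware.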


How we built it

Gemini AllSenses was rebuilt from the ground up using a Gemini-first, multimodal architecture, optimized for event-driven safety reasoning rather than passive monitoring.

Key design principles:

  • Multimodal fusion over single triggers
    Each modality contributes partial evidence; risk is inferred from alignment, not isolation.

  • Explainability by design
    Each safety assessment includes structured reasoning, confidence levels, and modality-specific findings.

  • Minimal and proportional data capture
    Video frames are captured only when risk is detected and only in the smallest form needed.

  • Human-in-the-loop orientation
    Outputs are designed to assist trusted contacts or responders — not to automate irreversible actions.

The system leverages Gemini 3 for multimodal reasoning and produces structured outputs suitable for downstream alerts and notifications.
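
One plausible shape for such a structured output is sketched below as plain Python dataclasses serialized to JSON. The field names, the risk-level labels, and the `requires_human_review` flag are hypothetical; the schema the project actually uses is not shown in this write-up.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ModalityFinding:
    modality: str      # "text", "audio", or "vision"
    score: float       # evidence strength in [0, 1]
    observation: str   # short human-readable finding


@dataclass
class SafetyAssessment:
    risk_level: str                       # e.g. "low", "elevated", "high" (assumed labels)
    confidence: float                     # in [0, 1]
    reasoning: str                        # explanation shown to the human reviewer
    findings: List[ModalityFinding] = field(default_factory=list)
    requires_human_review: bool = True    # human-in-the-loop by default

    def to_json(self) -> str:
        """Serialize the assessment for a downstream alert or notification."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the reasoning and per-modality findings in the payload means a trusted contact or responder can see why an alert fired, not only that it fired.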


Challenges we ran into

  • Balancing safety with privacy
    We had to design a video system that provides meaningful context without continuous monitoring or identity recognition.

  • Multimodal alignment complexity
    Combining partial, asynchronous signals from text, audio, and vision required careful orchestration to avoid false positives.

  • Latency and reliability
    Emergency systems must respond quickly and consistently, even when some signals are missing or degraded.

  • Explainability requirements
    Safety decisions must be transparent and understandable, not opaque model outputs.
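
To make the alignment and missing-signal challenges concrete, here is a minimal sketch that groups asynchronous signals into a recent time window and scores only the modalities that actually reported. The `Signal` type, the 10-second window, and the plain average are assumptions for illustration and are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Signal:
    modality: str      # "text", "audio", or "vision"
    score: float       # evidence strength in [0, 1]
    timestamp: float   # seconds on a shared clock


def align_signals(
    signals: Iterable[Signal], now: float, window_s: float = 10.0
) -> Dict[str, float]:
    """Keep the strongest recent score per modality; drop stale or future signals."""
    latest: Dict[str, float] = {}
    for s in signals:
        if 0.0 <= now - s.timestamp <= window_s:
            latest[s.modality] = max(latest.get(s.modality, 0.0), s.score)
    return latest


def fused_score(aligned: Dict[str, float]) -> Optional[float]:
    """Average over the modalities that reported; None if nothing arrived."""
    if not aligned:
        return None
    return sum(aligned.values()) / len(aligned)
```

Returning `None` when no signal arrived, rather than a score of zero, lets the caller distinguish "no evidence of risk" from "no data at all", which matters when silence itself may be a signal.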


Accomplishments that we're proud of

  • Built a fully multimodal safety system using Gemini 3
  • Designed event-scoped video intelligence instead of always-on surveillance
  • Achieved explainable, confidence-aware safety assessments
  • Maintained a privacy-first architecture while increasing situational clarity
  • Delivered a working, end-to-end prototype suitable for real-world safety scenarios

What we learned

  • Silence is often the strongest safety signal
  • Multimodal reasoning dramatically reduces false assumptions compared to single-signal systems
  • Video is most valuable when used sparingly and intentionally
  • Explainability is essential for trust in safety-critical AI
  • AI safety systems must support humans, not replace judgment

What's next for Gemini AllSenses — multimodal AI guardian for human safety

Next steps include:

  • Expanding multimodal confidence calibration
  • Improving robustness across diverse environments
  • Enhancing responsible alerting workflows for trusted contacts
  • Further refining privacy guarantees and data minimization
  • Exploring additional real-world safety use cases where user interaction is limited

Gemini AllSenses demonstrates how multimodal AI can support human safety when people cannot speak for themselves.

Built With

  • ai-safety
  • amazon-dynamodb
  • amazon-sns
  • api-gemini
  • artificial-intelligence
  • aws-cloudformation
  • aws-lambda
  • cloud-computing
  • event-driven-architecture
  • explainable-ai
  • google-gemini
  • multimodal-api
  • python
  • rest-api

Updates

From Recording to Real-Time Action

Many people reflect on past tragedies where video evidence existed but intervention did not occur in time.

The missing element was not the camera. The missing element was active interpretation.

Traditional systems record. They store. They document.

But they do not evaluate the situation while it is happening.

Gemini AllSenses changes that model.

Instead of waiting for someone to manually review footage, the system actively analyzes video, audio, and contextual signals in real time. If visual indicators of distress, violence, or danger are detected, emergency contacts are automatically alerted during the event — not hours later.

The difference is simple but profound:

Cameras capture.

AI interprets.

Action is triggered immediately.

In situations where past video evidence only became relevant after the fact, an active multimodal AI system could potentially have identified escalating risk patterns and initiated alerts while intervention was still possible.

That is the relationship between recorded evidence and real-time intervention.

AllSenses is not about reviewing what happened. It is about recognizing what is happening.

As AI technologists, our responsibility is to build systems that do more than observe. We must design systems that reason, detect, and respond when seconds matter.

Moving from passive evidence to proactive protection — that is the value AI can bring to human safety.
