Inspiration
Most AI assistants today can see the world, but they don’t understand urgency. For visually impaired users, the real challenge isn’t knowing what exists, but knowing what matters right now.
A laptop nearby is harmless. A person approaching is not. A loud warning sound could be critical.
We realized that existing tools focus on detection, not decision-making.
That’s where SightSense was born: an assistant that doesn’t just see the world, but interprets it for safety.
What it does
SightSense is a real-time AI safety assistant that provides situational awareness, not just object detection.
It:
- Detects objects using on-device AI
- Understands position and distance
- Tracks motion over time (e.g., “approaching”)
- Detects warning sounds
- Combines everything into prioritized alerts

Instead of overwhelming users, it filters:
- Laptop nearby: LOW (ignored)
- Person approaching: HIGH (urgent)
- Loud sound: HIGH (override)

It communicates through:
- Voice (TTS)
- Haptics
- Minimal, accessible UI

And it runs fully on-device, ensuring:
- Low latency
- Privacy
- Offline reliability
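The filtering described above can be sketched as a simple priority mapping. This is a minimal illustration, not the shipped code: the `Priority` enum, the object lists, and `prioritize` are hypothetical names we chose for the sketch.

```kotlin
// Alert priority levels, from ignorable to urgent.
enum class Priority { LOW, MEDIUM, HIGH }

// Hypothetical label sets; the real system's lists may differ.
val harmless = setOf("laptop", "bottle", "chair")
val dangerous = setOf("person", "car", "bicycle")

// Map a detection plus context into an alert priority.
// A loud warning sound overrides everything else.
fun prioritize(label: String, approaching: Boolean, loudSound: Boolean): Priority = when {
    loudSound -> Priority.HIGH                          // audio override
    label in dangerous && approaching -> Priority.HIGH  // e.g., person approaching
    label in dangerous -> Priority.MEDIUM               // present but not closing in
    label in harmless -> Priority.LOW                   // ignored
    else -> Priority.MEDIUM                             // unknown objects: be cautious
}
```

With this mapping, a nearby laptop yields LOW (and is suppressed), while an approaching person or a loud sound yields HIGH.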
How we built it
We designed SightSense as a multimodal edge AI pipeline:

Vision Layer
- LiteRT / TensorFlow Lite model (.tflite)
- Runs on NPU/GPU/CPU via delegate fallback
- Extracts:
  - object label
  - position (left/center/right)
  - distance (near/medium/far)

Audio Layer
- Lightweight signal-based detection (RMS + amplitude)
- Identifies warning-level sounds in real time

Fusion Engine (Core Innovation)
- Deterministic reasoning system
- Combines:
  - object type
  - spatial context
  - motion trends
  - audio signals
- Outputs LOW / MEDIUM / HIGH priority alerts

Motion Intelligence
- Tracks distance changes across frames: far -> medium -> near = approaching

Accessibility Layer
- Voice-first interaction (TTS)
- Haptic alerts
- Minimal UI

Optimization
- LiteRT delegate fallback (NPU -> GPU -> CPU)
- Cooldown system to prevent alert spam
- Confidence filtering for stability
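The delegate fallback can be sketched generically. This is a simplified stand-in: `selectBackend` and the `BackendInit` alias are our names, and in the real app each candidate would construct a LiteRT interpreter with the corresponding NPU/GPU delegate before falling back to plain CPU.

```kotlin
// Hypothetical backend initializer: returns a handle/name on success,
// throws if the accelerator is unavailable on this device.
typealias BackendInit = () -> String

// Try each backend in priority order (e.g., NPU -> GPU -> CPU) and
// return the first that initializes; CPU should never fail.
fun selectBackend(candidates: List<Pair<String, BackendInit>>): String {
    for ((name, init) in candidates) {
        try {
            return init()  // first working backend wins
        } catch (e: Exception) {
            println("$name unavailable, falling back: ${e.message}")
        }
    }
    error("no backend available")
}
```

Keeping the fallback order in one place makes the accelerator choice explicit and easy to log, which helps when debugging per-device performance differences.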
Challenges we ran into
Avoiding alert overload (UX challenge)

Initial versions produced too many alerts (“bottle detected”, “laptop detected”). Users got overwhelmed, important alerts were buried, and the system became unusable.

Solution: we redesigned the system into a priority-based alert engine:
- LOW -> harmless objects
- MEDIUM -> obstacles
- HIGH -> immediate danger
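One way to implement the cooldown that tames alert spam is to suppress repeats per object label within a time window. This is a sketch under our own naming, not the actual implementation; the 3-second default is an assumption.

```kotlin
// Suppresses repeated alerts for the same key (e.g., an object label)
// within a cooldown window, so users are not spammed.
class AlertCooldown(private val cooldownMs: Long = 3000) {
    private val lastFired = mutableMapOf<String, Long>()

    // Returns true if the alert for this key should fire now.
    fun shouldFire(key: String, nowMs: Long): Boolean {
        val last = lastFired[key]
        if (last != null && nowMs - last < cooldownMs) return false  // still cooling down
        lastFired[key] = nowMs
        return true
    }
}
```

Because the cooldown is keyed per label, a HIGH-priority “person approaching” alert is not blocked just because a LOW-priority “laptop” alert fired recently.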
From detection -> decision-making (core difficulty)

Most AI models only provide object labels and bounding boxes, but we needed context, urgency, and motion understanding. Detectors have no built-in notion of “danger” and no understanding of motion like “approaching”.

Solution: we built a custom fusion engine that:
- derives position (left/center/right)
- estimates distance (near/medium/far)
- tracks temporal changes across frames
- detects motion patterns like: far -> medium -> near = approaching
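The fusion engine's spatial reasoning can be sketched from a normalized bounding box. The thresholds below (0.33/0.66 for position, 0.08/0.25 for area) are illustrative assumptions, not the shipped cut-offs.

```kotlin
// Normalized bounding box (0..1 coordinates) from the detector.
data class Box(val left: Float, val top: Float, val right: Float, val bottom: Float)

// Horizontal position from the box center.
fun position(b: Box): String = when {
    (b.left + b.right) / 2 < 0.33f -> "left"
    (b.left + b.right) / 2 > 0.66f -> "right"
    else -> "center"
}

// Distance bucket from box area: a bigger box roughly means a closer object.
fun distance(b: Box): String {
    val area = (b.right - b.left) * (b.bottom - b.top)
    return when {
        area > 0.25f -> "near"
        area > 0.08f -> "medium"
        else -> "far"
    }
}

// Motion trend across frames: a monotone far -> medium -> near
// history means the object is approaching.
fun isApproaching(history: List<String>): Boolean {
    val rank = mapOf("far" to 0, "medium" to 1, "near" to 2)
    return history.size >= 2 &&
        history.zipWithNext().all { (a, b) -> rank[b]!! >= rank[a]!! } &&
        rank[history.last()]!! > rank[history.first()]!!
}
```

Keeping these rules deterministic is what makes the alerts explainable: every HIGH alert can be traced back to a concrete position, distance, and trend.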
Accomplishments that we're proud of
- Built a fully working real-time system, not just a prototype
- Achieved multimodal fusion (vision + audio + motion)
- Implemented priority-based intelligence, not just detection
- Designed for actual visually impaired usability, not demo-only
- Created a system that is fast, stable, and explainable

Most importantly: we moved from AI that sees to AI that understands.
What we learned
What's next for SightSense
We see SightSense evolving into a full situational intelligence platform:
Near-term
- YAMNet-based sound classification (sirens, horns)
- Multi-object tracking with IDs
- Auto Danger Mode (context-aware sensitivity)

Mid-term
- Personalized risk models
- Navigation assistance
- Environment classification (indoor/outdoor)

Long-term
- Integration with wearables (glasses, earbuds)
- Fully context-aware AI assistant
- Large-scale deployment for accessibility
GitHub release: https://github.com/sravan1023/Qualcomm-Google-Hackathon/releases/tag/v0.1.0
Built With
- android
- camerax
- kotlin
- litert
- tensorflowlite
- texttospeech