Small camera. Big vision. Aura turns this tiny device into your eyes, anytime, anywhere.
Target acquired. Aura detects exactly what you are looking for, highlighted, guided by voice.
Aura turns any camera into eyes that never blink, mounted on a cap, ready to assist.

Aura - Intelligent Vision Assistant for the Visually Impaired

Inspiration

Close your eyes for a moment. Now try to find your water bottle. Try to understand if someone is sitting in front of you. Try to read the label on a medicine bottle.

For over 8 million visually impaired individuals in India, this is not a thought experiment—it's everyday reality.

We realized that most assistive solutions are:

Extremely expensive (₹50,000 to ₹5,00,000)
Dependent on specialized hardware
Limited to a single function

A smart cane detects obstacles but can't read text. A text-to-speech device reads but can't locate objects.

Meanwhile, powerful AI models like YOLO and BLIP exist, but we could not find a single, voice-controlled assistant that unified them.

We asked:

What if a ₹2,000 smartphone camera + a laptop could become intelligent eyes?

Aura was born from this belief:

Independence should not be a luxury. Technology should restore dignity.

What It Does

Aura is a real-time, voice-controlled AI assistant that transforms any camera into an intelligent guide, completely hands-free.

Scene Understanding Mode

Automatically analyzes surroundings every 10 seconds
Uses BLIP image captioning
Provides natural voice descriptions:
- “I see a person sitting at a desk with a laptop”
- “I see a busy street with cars and pedestrians”
No screen needed, pure voice interaction
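
The 10-second cadence can be sketched with a small timer that the main loop polls. This is a minimal sketch under our own assumptions, not Aura's actual code: `CaptionTimer`, `to_speech`, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time


class CaptionTimer:
    """Tells the main loop when the next scene description is due."""

    def __init__(self, interval_s=10.0, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock      # injectable so the timer can be tested
        self._last = None       # time of the last description, if any

    def due(self):
        """Return True on the first call, then at most once per interval."""
        now = self.clock()
        if self._last is None or now - self._last >= self.interval_s:
            self._last = now
            return True
        return False


def to_speech(caption):
    """Turn a raw caption into the spoken form, e.g. 'I see a busy street'."""
    return f"I see {caption.strip().rstrip('.')}"
```

Polling a timer, rather than sleeping for 10 seconds, keeps the loop free to react to the ENTER key between descriptions.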

Object Search Mode

Activated by pressing ENTER or by a voice command
User says what they want (e.g., “water bottle”)
YOLOv8 detects the object in real time
Provides directional guidance:
- “Move left”
- “Move right”
- “Straight ahead”
Confirms when object is centered

Turns any room into a navigable space
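
One detail worth illustrating: a spoken request such as "water bottle" has to be mapped to a label the detector knows. Assuming COCO-pretrained YOLOv8 weights (our assumption; the class list there includes "bottle" but not "water bottle"), a word-level match is one simple way to do it. `match_label` is a hypothetical helper, not part of any library.

```python
def match_label(spoken, labels):
    """Map a spoken phrase to one of the detector's class labels, or None."""
    words = spoken.lower().split()
    # Try multi-word labels first so "cell phone" is matched as a whole.
    for label in sorted(labels, key=lambda s: len(s.split()), reverse=True):
        parts = label.lower().split()
        n = len(parts)
        # The label's words must appear as a contiguous run in the phrase.
        if any(words[i:i + n] == parts for i in range(len(words) - n + 1)):
            return label
    return None
```

Matching on whole words rather than substrings avoids false hits such as "cup" inside "cupboard".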

Hardware

Laptop webcam OR smartphone via DroidCam
No specialized devices
Fully portable and affordable

How We Built It

Aura is built using a modular Python architecture integrating multiple AI systems:

Computer Vision

YOLOv8 (Ultralytics) for object detection
Optimized inference: 416×416 resolution
Sub-2-second response time
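
A minimal sketch of a single detection pass at 416×416 with the Ultralytics API. The weights file `yolov8n.pt`, the 0.5 confidence cutoff, and the helper names are our assumptions for illustration; the writeup does not state which were used.

```python
def load_model(weights="yolov8n.pt"):
    """Load YOLOv8 weights (assumed file name)."""
    from ultralytics import YOLO  # lazy import; needs `pip install ultralytics`
    return YOLO(weights)


def detect(model, frame, imgsz=416):
    """Run one pass; return (label, confidence, (x1, y1, x2, y2)) tuples."""
    result = model.predict(frame, imgsz=imgsz, verbose=False)[0]
    boxes = result.boxes
    return [
        (result.names[int(cls)], conf, tuple(xyxy))
        for xyxy, conf, cls in zip(
            boxes.xyxy.tolist(), boxes.conf.tolist(), boxes.cls.tolist()
        )
    ]


def best_match(detections, target, min_conf=0.5):
    """Pick the most confident detection of `target`, or None."""
    hits = [d for d in detections if d[0] == target and d[1] >= min_conf]
    return max(hits, key=lambda d: d[1]) if hits else None
```

Typical use would be `best_match(detect(model, frame), "bottle")`, with the returned box passed on to the guidance step.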

Scene Understanding

Salesforce BLIP via Hugging Face
Caption length: max 40 tokens (quick & clear)
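
A sketch of captioning with the Hugging Face BLIP classes. The checkpoint name `Salesforce/blip-image-captioning-base` is our assumption, and we interpret "max 40 tokens" as `max_new_tokens=40`; the original may have used a different checkpoint or `max_length` instead.

```python
def load_blip(name="Salesforce/blip-image-captioning-base"):
    """Load the processor and model (assumed checkpoint name)."""
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(name)
    model = BlipForConditionalGeneration.from_pretrained(name)
    return processor, model


def caption_image(processor, model, image, max_new_tokens=40):
    """Caption one RGB image (for example a PIL.Image) and return plain text."""
    inputs = processor(images=image, return_tensors="pt")
    # Cap the generated length so descriptions stay short enough to speak.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

OpenCV frames are BGR, so a frame would need converting to RGB (for example with `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)`) before it is passed in.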

Voice Interaction

SpeechRecognition + Google Speech API
Windows SpeechSynthesizer for audio output
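
For the audio output, one common way to reach the Windows SpeechSynthesizer from Python is to call PowerShell and the .NET `System.Speech` assembly. This is a sketch of that route; we are assuming it, since the writeup does not say how the synthesizer was invoked, and `build_speak_command` and `speak` are hypothetical helper names.

```python
import subprocess


def build_speak_command(text):
    """Build a PowerShell command that speaks `text` via System.Speech."""
    safe = text.replace("'", "''")  # inside '...' PowerShell escapes ' by doubling
    script = (
        "Add-Type -AssemblyName System.Speech; "
        "$s = New-Object System.Speech.Synthesis.SpeechSynthesizer; "
        f"$s.Speak('{safe}')"
    )
    return ["powershell", "-NoProfile", "-Command", script]


def speak(text):
    """Speak synchronously (Windows only); returns when speech has finished."""
    subprocess.run(build_speak_command(text), check=False)
```

Passing the command as a list avoids shell quoting issues, and escaping single quotes keeps phrases like "It's ahead" from breaking the script.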

System Flow

Normal Mode → Scene descriptions every 10 seconds
ENTER → Object Search Mode
Voice input captured
YOLO detects object
Direction calculated from the object's horizontal offset (80-pixel threshold)
Audio guidance provided
Return to Normal Mode
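
The direction step can be sketched as a comparison between the bounding-box center and the frame center. We are assuming the 80-pixel threshold is measured horizontally from the frame center and that the camera image is not mirrored, so an object on the left of the frame means "Move left". `direction` is a hypothetical helper.

```python
def direction(box, frame_width, threshold_px=80):
    """Map a detection box (x1, y1, x2, y2) to a spoken direction."""
    x1, _, x2, _ = box
    offset = (x1 + x2) / 2 - frame_width / 2  # negative: object is left of center
    if offset < -threshold_px:
        return "Move left"
    if offset > threshold_px:
        return "Move right"
    return "Straight ahead"  # within the threshold: treated as centered
```

On a 640-pixel-wide frame this gives a 160-pixel "centered" band, which keeps the guidance from flipping between left and right when the object is nearly in front of the user.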

Tech Stack

Python 3.10
OpenCV (camera + frame processing)
PyTorch (deep learning backend)
DroidCam (mobile camera streaming)

Challenges We Faced

Latency Issues

Running YOLO and BLIP together caused delays
Solution:
- Reduced inference resolution to 416×416 (about 40% faster)
- Resized camera frames to 640×480
- Buffer optimization
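
"Buffer optimization" is not spelled out above; one common technique, which we assume here, is to skip stale buffered frames so the models always see the newest image. The sketch works with any object exposing `grab()` and `retrieve()` as `cv2.VideoCapture` does; `read_latest` and its `grabs` parameter are hypothetical.

```python
def read_latest(cap, grabs=4):
    """Grab up to `grabs` frames without decoding, then decode only the last."""
    for _ in range(grabs):
        if not cap.grab():  # grab() advances the stream without decoding
            break
    ok, frame = cap.retrieve()  # decode the most recently grabbed frame
    return frame if ok else None
```

The returned frame could then be downsized, for example with `cv2.resize(frame, (640, 480))`, before it goes to YOLO or BLIP.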

Speech Recognition Noise

Background noise and varied accents reduced recognition accuracy
Solution:
- Ambient noise adjustment
- 5-second phrase limit
- Confidence thresholds
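
The three measures above could look roughly like this with the SpeechRecognition library. This is a sketch under assumptions: the 0.6 confidence cutoff and the helper names are ours, and the response shape handled by `best_transcript` is our understanding of what `recognize_google(..., show_all=True)` returns.

```python
def best_transcript(response, min_confidence=0.6):
    """Pick the top alternative from a show_all=True response, or None."""
    if not isinstance(response, dict):
        return None  # an empty list means nothing was recognized
    alternatives = response.get("alternative", [])
    if not alternatives:
        return None
    top = alternatives[0]
    # Confidence may be absent; this sketch accepts the result in that case.
    if top.get("confidence", 1.0) < min_confidence:
        return None
    return top["transcript"].lower().strip()


def listen_once(phrase_limit_s=5):
    """Record one phrase from the microphone and return its text, or None."""
    import speech_recognition as sr  # lazy import; needs SpeechRecognition + PyAudio

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=1)
        try:
            audio = recognizer.listen(
                source, timeout=5, phrase_time_limit=phrase_limit_s
            )
        except sr.WaitTimeoutError:
            return None  # nobody spoke
    try:
        return best_transcript(recognizer.recognize_google(audio, show_all=True))
    except sr.RequestError:
        return None  # the speech API was unreachable
```

Returning `None` for every failure lets the caller respond with a single spoken prompt such as "Please repeat" instead of handling each error separately.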

Resource Constraints

Multiple models caused memory issues
Solution:
- Sequential loading
- Graceful fallback handling
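
Sequential loading with a graceful fallback can be sketched as follows. `load_sequentially` is a hypothetical helper, and the idea that a mode is simply disabled when its model fails to load is our reading of "graceful fallback handling".

```python
def load_sequentially(loaders):
    """Load models one at a time; a failed loader yields None, not a crash."""
    models = {}
    for name, loader in loaders.items():
        try:
            models[name] = loader()
        except Exception as exc:  # e.g. out of memory or missing weights
            print(f"{name} unavailable: {exc}")
            models[name] = None
    return models
```

A caller might pass `{"yolo": load_yolo, "blip": load_blip}` (hypothetical loader functions) and skip scene descriptions when `models["blip"]` is `None`, while object search keeps working.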

Direction Accuracy

Needed precise guidance
Solution:
- 80-pixel threshold calibration
- Timeout for failed searches
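
The search timeout can be sketched as a loop bounded by a deadline. The 15-second default is an assumption, as the writeup gives no value, and `search_with_timeout`, `find_once`, and the injectable `clock` are hypothetical names.

```python
import time


def search_with_timeout(find_once, timeout_s=15.0, clock=time.monotonic):
    """Call `find_once` until it returns a hit or the timeout expires."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        hit = find_once()  # e.g. grab a frame, run detection, return a box
        if hit is not None:
            return hit
    return None  # caller can announce that the object was not found
```

A monotonic clock is used so the deadline is unaffected by system clock adjustments during a search.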

Hardware Compatibility

Different camera behaviors
Solution:
- Automatic fallback (DroidCam → Webcam)
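
The DroidCam-to-webcam fallback can be sketched by trying camera sources in order. `open_first_camera` is a hypothetical helper, and the DroidCam stream URL in the usage note is an assumed placeholder, since the address depends on the phone and network.

```python
def open_first_camera(sources, opener=None):
    """Return (source, capture) for the first source that opens."""
    if opener is None:
        import cv2  # lazy import; needs opencv-python

        opener = cv2.VideoCapture
    for source in sources:
        cap = opener(source)
        if cap.isOpened():
            return source, cap
        cap.release()  # free the failed handle before trying the next source
    return None, None  # nothing opened; caller should report it by voice
```

For example, `open_first_camera(["http://192.168.1.5:4747/video", 0])` would try an assumed DroidCam address first and then fall back to the laptop webcam at index 0.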

Accomplishments

Built a multi-mode AI assistant combining scene understanding + object search
Achieved real-time detection (<2 seconds) on standard hardware
Created a 100% voice-first interface (no screen required)
Reduced cost by ~95% compared to existing solutions
Designed an error-resilient system (built not to crash mid-use)