Aura - Intelligent Vision Assistant for the Visually Impaired
Inspiration
Close your eyes for a moment. Now try to find your water bottle. Try to understand if someone is sitting in front of you. Try to read the label on a medicine bottle.
For over 8 million visually impaired individuals in India, this is not a thought experiment—it's everyday reality.
We realized that most assistive solutions are:
- Extremely expensive (₹50,000 to ₹5,00,000)
- Dependent on specialized hardware
- Limited to a single function
A smart cane detects obstacles but can't read text. A text-to-speech device reads but can't locate objects.
Meanwhile, powerful AI models like YOLO and BLIP exist, but we could not find a solution that unified them into a single, voice-controlled assistant.
We asked:
What if a ₹2,000 smartphone camera + a laptop could become intelligent eyes?
Aura was born from this belief:
Independence should not be a luxury. Technology should restore dignity.
What It Does
Aura is a real-time, voice-controlled AI assistant that transforms any camera into an intelligent guide, completely hands-free.
Scene Understanding Mode
- Automatically analyzes surroundings every 10 seconds
- Uses BLIP image captioning
- Provides natural voice descriptions:
  - “I see a person sitting at a desk with a laptop”
  - “I see a busy street with cars and pedestrians”
- No screen needed, pure voice interaction
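A minimal sketch of how the 10-second description cycle could be timed. The class name `SceneTicker` and the injectable `clock` parameter are illustrative assumptions, not part of Aura's actual code; only the 10-second interval comes from the description above.

```python
import time


class SceneTicker:
    """Decides when the next automatic scene description is due (hypothetical helper)."""

    def __init__(self, interval_s=10.0, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock          # injectable so the timing can be tested without sleeping
        self.last_spoken = None     # no description has been given yet

    def due(self):
        """Return True, and restart the timer, when a new description is due."""
        now = self.clock()
        if self.last_spoken is None or now - self.last_spoken >= self.interval_s:
            self.last_spoken = now
            return True
        return False
```

In a camera loop, each frame would call `ticker.due()` and only run the captioning model when it returns `True`, so the slower model is not invoked on every frame.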
Object Search Mode
- Activated by pressing ENTER or by voice command
- User says what they want (e.g., “water bottle”)
- YOLOv8 detects the object in real time
- Provides directional guidance:
  - “Move left”
  - “Move right”
  - “Straight ahead”
- Confirms when object is centered
Turns any room into a navigable space
Hardware
- Laptop webcam OR smartphone via DroidCam
- No specialized devices
- Fully portable and affordable
How We Built It
Aura is built using a modular Python architecture integrating multiple AI systems:
Computer Vision
- YOLOv8 (Ultralytics) for object detection
- Optimized inference: 416×416 resolution
- Sub-2-second response time
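The detection step might look roughly like the sketch below. `model.predict(..., imgsz=416)` is the Ultralytics inference call; the helper names `detect_objects` and `best_match`, the dictionary layout, and the exact-label matching rule are assumptions made for illustration.

```python
def detect_objects(frame, model, imgsz=416):
    """Run one YOLOv8 inference pass and return detections as plain dicts.

    `model` is expected to be an ultralytics.YOLO instance.
    """
    result = model.predict(frame, imgsz=imgsz, verbose=False)[0]
    detections = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        detections.append({
            "label": result.names[int(box.cls[0])],
            "confidence": float(box.conf[0]),
            "center_x": (x1 + x2) / 2.0,   # horizontal center of the box, in pixels
        })
    return detections


def best_match(detections, target):
    """Pick the most confident detection whose label equals the spoken target."""
    wanted = target.strip().lower()
    matches = [d for d in detections if d["label"].lower() == wanted]
    return max(matches, key=lambda d: d["confidence"], default=None)
```

A model could be created with `from ultralytics import YOLO` and `YOLO("yolov8n.pt")`; the choice of the nano weights is an assumption here, since the write-up does not say which YOLOv8 variant was used.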
Scene Understanding
- Salesforce BLIP via Hugging Face
- Caption length: max 40 tokens (quick & clear)
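As a rough illustration of the captioning step, the sketch below uses the Hugging Face `transformers` BLIP classes. The checkpoint name, the use of `max_new_tokens` to express the 40-token cap, and the helper names are assumptions; the write-up only states that BLIP is used with a 40-token limit.

```python
def load_captioner(checkpoint="Salesforce/blip-image-captioning-base"):
    """Load a BLIP processor and model (the checkpoint name is an assumption)."""
    # Imported lazily so the lightweight helper below works without these packages.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)
    model.eval()
    return processor, model


def caption_image(image, processor, model, max_new_tokens=40):
    """Caption a PIL image, capping the generated caption at 40 new tokens."""
    import torch

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)


def to_spoken(caption):
    """Turn a raw caption into the 'I see ...' phrasing used for speech."""
    return "I see " + caption.strip().rstrip(".")
```

Keeping the cap short bounds generation time, which matters because each description is read aloud in full.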
Voice Interaction
- SpeechRecognition + Google Speech API
- Windows SpeechSynthesizer for audio output
System Flow
- Normal Mode → Scene descriptions every 10 seconds
- ENTER → Object Search Mode
- Voice input captured
- YOLO detects object
- Direction calculated (80-pixel threshold)
- Audio guidance provided
- Return to Normal Mode
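The direction step in this flow can be sketched as a comparison between the object's horizontal center and the frame center, using the 80-pixel threshold. The function name and the convention that an object left of center means "Move left" are assumptions; a mirrored camera feed would need the two branches swapped.

```python
def direction_hint(center_x, frame_width, threshold_px=80):
    """Map an object's horizontal position to a spoken direction.

    center_x     -- x coordinate of the detected box center, in pixels
    frame_width  -- width of the processed frame, in pixels
    threshold_px -- dead zone around the frame center treated as "centered"
    """
    offset = center_x - frame_width / 2.0
    if offset < -threshold_px:
        return "Move left"
    if offset > threshold_px:
        return "Move right"
    return "Straight ahead"
```

With a 640-pixel-wide frame, anything between x = 240 and x = 400 counts as centered, which keeps the guidance from flipping between left and right on small detection jitter.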
Tech Stack
- Python 3.10
- OpenCV (camera + frame processing)
- PyTorch (deep learning backend)
- DroidCam (mobile camera streaming)
Challenges We Faced
Latency Issues
- YOLO + BLIP caused delays
- Solution:
  - Reduced YOLO inference resolution to 416×416 (about 40% faster)
  - Resized camera frames to 640×480
  - Buffer optimization
Speech Recognition Noise
- Background noise + accents
- Solution:
  - Ambient noise adjustment
  - 5-second phrase limit
  - Confidence thresholds
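These three measures could be combined roughly as follows with the SpeechRecognition package. The 0.6 confidence cutoff, the one-second calibration duration, and the helper names are assumptions; `pick_transcript` assumes the dictionary shape that `recognize_google(..., show_all=True)` returns, with an `alternative` list whose entries may carry a `confidence` value.

```python
def pick_transcript(response, min_confidence=0.6):
    """Return the most confident transcript above the cutoff, or None."""
    if not isinstance(response, dict):
        return None  # an empty result means nothing was recognized
    best = None
    for alt in response.get("alternative", []):
        conf = alt.get("confidence")
        if conf is None or conf < min_confidence:
            continue  # skip alternatives without a usable confidence score
        if best is None or conf > best["confidence"]:
            best = alt
    return best["transcript"].strip().lower() if best else None


def listen_for_target(min_confidence=0.6):
    """Capture one spoken phrase from the microphone and return it as text, or None."""
    import speech_recognition as sr  # imported lazily; needs a working microphone

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=1)  # calibrate to room noise
        audio = recognizer.listen(source, phrase_time_limit=5)   # 5-second phrase limit
    try:
        response = recognizer.recognize_google(audio, show_all=True)
    except sr.RequestError:
        return None  # the speech API was unreachable; let the caller ask again
    return pick_transcript(response, min_confidence)
```

Returning `None` rather than a low-confidence guess lets the assistant ask the user to repeat the request instead of searching for the wrong object.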
Resource Constraints
- Multiple models caused memory issues
- Solution:
  - Sequential loading
  - Graceful fallback handling
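One way to picture sequential loading with graceful fallback is a loop that loads one model at a time and records failures instead of aborting. The function name and the `(name, loader)` pair format are hypothetical; this is a sketch of the idea, not Aura's actual loader.

```python
def load_sequentially(loaders):
    """Load models one after another, collecting failures instead of raising.

    loaders -- list of (name, zero-argument callable) pairs
    Returns (loaded, failed): loaded maps name -> model, failed maps name -> error text.
    """
    loaded, failed = {}, {}
    for name, loader in loaders:
        try:
            loaded[name] = loader()
        except Exception as exc:  # broad on purpose: any load failure should degrade, not crash
            failed[name] = str(exc)
    return loaded, failed
```

If the captioning model ends up in `failed`, the assistant could still offer Object Search Mode and announce that scene descriptions are unavailable.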
Direction Accuracy
- Needed precise guidance
- Solution:
  - 80-pixel threshold calibration
  - Timeout for failed searches
Hardware Compatibility
- Different camera behaviors
- Solution:
  - Automatic fallback (DroidCam → Webcam)
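The fallback could be sketched as trying each camera source in order and keeping the first one that actually delivers a frame. The function name and the injectable `opener` argument are assumptions for illustration; with OpenCV, `opener` would be `cv2.VideoCapture`.

```python
def open_first_camera(sources, opener):
    """Return (source, capture) for the first source that yields a frame.

    sources -- camera sources in priority order, e.g. a DroidCam stream URL, then 0
    opener  -- callable that turns a source into a capture object
    Returns (None, None) if no source works.
    """
    for source in sources:
        cap = opener(source)
        if cap is not None and cap.isOpened():
            ok, _frame = cap.read()   # opening can succeed even when no frames arrive
            if ok:
                return source, cap
        if cap is not None:
            cap.release()             # free the device before trying the next source
    return None, None
```

A call such as `open_first_camera([droidcam_url, 0], cv2.VideoCapture)` would prefer the phone stream and fall back to the built-in webcam at index 0; `droidcam_url` is a placeholder for whatever stream address the phone reports.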
Accomplishments
- Built a multi-mode AI assistant combining scene understanding + object search
- Achieved real-time detection (<2 seconds) on standard hardware
- Created a 100% voice-first interface (no screen required)
- Reduced cost by ~95% compared to existing solutions
- Designed an error-resilient system that falls back gracefully instead of crashing mid-use
What We Learned
- Accessibility must be core design, not an afterthought
- Speed matters → Slow AI = unusable AI
- Reliability > Accuracy in real-world use
- Voice systems must handle real-world speech diversity
- Shipping a working product beats chasing perfection
What's Next
- Distance estimation (e.g., “2 meters ahead”)
- Multi-level obstacle detection (knee, waist, head)
- Android app using TensorFlow Lite
- Offline speech recognition (Vosk)
- Multilingual support
- Hardware-agnostic platform
Ultimate Goal
Independence shouldn't be a luxury; it should be a right.