Aura: Vision Through Voice
Inspiration
Close your eyes for a moment. Now try to find your water bottle. Try to tell whether someone is sitting in front of you. Try to read the label on a medicine bottle. For the estimated 285 million visually impaired people worldwide, this is not a thought experiment; it's everyday reality.
Of those 285 million people, an estimated 39 million are completely blind. Yet most assistive solutions remain painfully expensive (often thousands of dollars), tied to specialized hardware, or limited to a single function. A smart cane detects obstacles but can't read text. A text-to-speech device reads but can't find lost objects. Meanwhile, powerful AI models like YOLO and BLIP are freely available, but no one had connected them into a unified, voice-controlled assistant for the blind.
We asked ourselves: What if we could turn a $20 smartphone camera and a laptop into a pair of intelligent eyes? What if AI could not just see, but speak, guiding users through voice alone? Aura was born from this belief: Independence should not be a luxury. Technology at its best restores dignity.
This is not just a hackathon project. This is a mission to make AI-powered accessibility available to everyone who needs it, regardless of their economic circumstances or where they live in the world.
What it Does
Aura is a real-time, voice-controlled AI assistant that transforms any camera into an intelligent guide. It operates in two powerful modes, completely hands-free:
1. Scene Understanding Mode
Aura continuously analyzes the environment and provides natural voice descriptions every 10 seconds.
- Technology: Uses Salesforce BLIP image captioning (see the sketch after this list).
- User Experience: Provides an instant mental picture (e.g., "A person sitting at a desk with a laptop").
- Benefit: No screen reading required. Just pure voice.
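To make the flow concrete, here is a minimal sketch of the captioning loop using the Hugging Face Transformers BLIP API and pyttsx3. The specific checkpoint name and loop structure are our illustration, not necessarily the exact implementation; the 40-token cap and 10-second interval come from the description in this writeup:

```python
import time
import cv2
import pyttsx3
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load BLIP once at startup (base captioning checkpoint assumed here).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
engine = pyttsx3.init()

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # OpenCV frames are BGR; BLIP expects RGB input.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    inputs = processor(images=rgb, return_tensors="pt")
    out = model.generate(**inputs, max_length=40)  # cap captions at 40 tokens
    caption = processor.decode(out[0], skip_special_tokens=True)
    engine.say(caption)
    engine.runAndWait()
    time.sleep(10)  # describe the scene every 10 seconds
```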
2. Object Search Mode
Users can activate search mode by pressing ENTER or speaking a command.
- Technology: Powered by YOLOv8 real-time object detection.
- Interaction: Aura asks, "What would you like to find?" The user speaks the object name (e.g., "keys," "water bottle").
- Guidance: Aura calculates the object's position relative to the camera center and provides directional audio guidance: "Move left," "Move right," or "Straight ahead." The logic is sketched below.
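The direction logic itself is simple geometry. A minimal sketch, assuming a 640-pixel-wide frame and the 80-pixel threshold we describe under Challenges; the function name is illustrative:

```python
def direction_from_box(x1: float, x2: float,
                       frame_width: int = 640, dead_zone: int = 80) -> str:
    """Map a detected bounding box to a spoken direction."""
    box_center = (x1 + x2) / 2
    offset = box_center - frame_width / 2  # negative = left of center
    if offset < -dead_zone:
        return "Move left"
    if offset > dead_zone:
        return "Move right"
    return "Straight ahead"
```

The dead zone keeps the guidance from flip-flopping between "left" and "right" when the object sits near the center of the frame.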
Key Features
- Fully Voice-Based Interaction: Every response is spoken aloud. Designed from the ground up for users who cannot use a display.
- Standard Hardware: Runs on a laptop webcam or mobile phone via DroidCam, making it affordable and portable.
How We Built It
Aura is built on a modular Python architecture integrating multiple state-of-the-art AI models.
Technical Implementation
- Computer Vision: We used Ultralytics YOLOv8 for object detection, reducing the inference resolution to 416×416 pixels to achieve sub-2-second response times on standard hardware.
- Scene Understanding: Integrated Salesforce BLIP via Hugging Face Transformers, capping caption generation at 40 tokens to prevent information overload.
- Voice Processing: Used the SpeechRecognition library with the Google Speech API, plus ambient noise adjustment to improve accuracy in real-world conditions.
- Audio Output: Used pyttsx3 and the Windows SpeechSynthesizer for clear text-to-speech output.
- Video Processing: OpenCV handles frame resizing (640×480) and buffer management to prevent lag. A capture-and-detect sketch follows this list.
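Putting the capture and detection pieces together, a stripped-down version of the pipeline might look like this. The yolov8n.pt weights are an assumption (any YOLOv8 checkpoint fits), and CAP_PROP_BUFFERSIZE is honored only by some OpenCV backends, so treat this as a sketch rather than the exact implementation:

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # assumed checkpoint; nano weights for speed

cap = cv2.VideoCapture(0)  # laptop webcam or DroidCam feed
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)  # drop stale frames to prevent lag

ok, frame = cap.read()
if ok:
    # Run detection at the reduced 416x416 inference resolution.
    results = model(frame, imgsz=416, verbose=False)
    for box in results[0].boxes:
        name = model.names[int(box.cls[0])]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(name, (x1, y1, x2, y2))
cap.release()
```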
The Tech Stack
Python 3.10 | OpenCV | Ultralytics YOLOv8 | Hugging Face Transformers | Salesforce BLIP | PyTorch | SpeechRecognition | PyAudio | pyttsx3
Challenges We Ran Into
- Real-time Latency: Running YOLO and BLIP sequentially caused delays. We solved this by reducing the YOLO inference size by 40%, which made the system feel far more responsive.
- Speech Accuracy: Background noise and accents caused recognition errors. We added ambient noise calibration and a 5-second phrase limit so the system never "hangs" waiting for input (sketched after this list).
- Resource Constraints: Loading multiple heavy models at once caused memory issues. We implemented sequential loading with progress indicators and graceful degradation so the system never crashes completely.
- Calibration: Finding the "sweet spot" for directional guidance took tuning; an 80-pixel dead zone around the frame center prevents over-sensitive commands.
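On the speech side, the calibration and phrase limit map directly onto the SpeechRecognition API. A minimal sketch, assuming the Google recognizer as described above; the helper function and its error handling are illustrative:

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

def listen_for_command() -> str | None:
    with sr.Microphone() as source:
        # Sample ambient noise briefly to set the energy threshold.
        recognizer.adjust_for_ambient_noise(source, duration=1)
        # phrase_time_limit stops listening after 5 seconds so we never hang.
        audio = recognizer.listen(source, phrase_time_limit=5)
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return None  # speech was unintelligible; caller can re-prompt
    except sr.RequestError:
        return None  # API unreachable; degrade gracefully
```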
Accomplishments That We're Proud Of
- Full Integration: Successfully combined YOLOv8, BLIP, and Voice Processing into one seamless, unified pipeline.
- Democratized Access: Proved that complex AI can run on standard consumer hardware, drastically reducing the cost compared to $1,000+ specialized devices.
- Voice-First Design: Created a system that requires zero screen interaction, built specifically for the needs of the visually impaired.
- Reliability: Developed an error-resilient design, because in accessibility, reliability is a requirement, not a feature.
What We Learned
- User-First AI: Accessibility must drive design decisions. We learned the importance of confirmation feedback and audio pacing.
- Speed vs. Accuracy: A slightly less accurate model that responds instantly is far more useful than a perfect model that is slow.
- Real-World Conditions: Our initial assumption of quiet environments was wrong; we had to build for noise, accents, and hesitation.
- Open Source Power: Using tools like Hugging Face and OpenCV allowed us to build in days what would have previously taken months.
What's Next for Aura
Immediate (30 Days)
- Distance Estimation: Adding monocular depth estimation so Aura can say, "The chair is 2 meters ahead."
- Obstacle Detection: Detecting hazards at knee, waist, and head level.
Short-Term (3 Months)
- Mobile Porting: Bringing Aura to Android via TensorFlow Lite to remove the laptop requirement.
- Offline Mode: Using Vosk or Whisper.cpp for speech recognition without internet.
- Multilingual Support: Adding Spanish, Mandarin, Hindi, and Arabic.
Long-Term Vision
- Smart Glasses: Integration for discreet, hands-free assistance.
- Global Partnerships: Working with NGOs to distribute Aura to the 285 million people who need it most.
Aura: Vision through voice.