Aura

Small camera. Big vision. Aura turns this tiny device into your eyes, anywhere, anytime.
Target acquired. Aura detects exactly what you are looking for, highlighted, guided by voice.
Aura turns any camera into eyes that never blink, mounted on a cap, ready to assist.

Inspiration

"Close your eyes and try to navigate the room you're sitting in."
That simple thought experiment became our starting point. When you close your eyes, the world doesn't disappear; it becomes sound, touch, and memory. For over 8 million visually impaired individuals in India alone, this is an everyday reality.

We asked ourselves a question:
In an era of cutting-edge AI, why do assistive solutions remain so limited?

The devices available today are either prohibitively expensive, tied to specialized hardware, or restricted to a single function.

A smart cane costs ₹50,000 but only detects obstacles
A text-to-speech device costs ₹15,000 but only reads
A separate object identifier costs even more

What if we could build one unified solution using just a smartphone and the power of AI?
What if technology could turn voice into vision?

Aura was born from this idea—a belief that independence shouldn't be a luxury.

By combining computer vision, natural language processing, and voice AI, we've created a companion that truly listens and responds. One that empowers through conversation, not passive output.

Because technology at its best doesn't just solve problems—it restores dignity.

What It Does

Aura is a real-time, voice-controlled AI assistant that transforms any smartphone camera into a pair of intelligent eyes. Through simple conversation, users can access three life-changing capabilities—completely hands-free.

Detect

When a user says:

"Aura, find my water bottle"
"Detect a chair"

Aura:

Identifies the object using real-time object detection
Calculates its position relative to the user
Provides intuitive directional audio guidance:
- “Move left”
- “Move right”
- “Straight ahead”

Guidance continues until the object is found.
A connected display can optionally show visual bounding boxes for sighted assistants.

Describe

When a user says:

"Aura, describe my surroundings"

Aura captures the scene and generates rich natural language descriptions, such as:

“A living room with a brown couch and a wooden coffee table”
“A busy street with people walking and cars passing by”

Users receive an instant mental picture of their environment—without needing to see it.

Read

When a user says:

"Aura, read this text"

Aura:

Extracts printed text from the camera
Summarizes long documents when needed
Reads everything aloud clearly

Works for:

Medicine labels
Restaurant menus
Street signs
Documents
Any printed text sighted people take for granted

Voice-First Interaction

Activated using a simple wake word:
- “Aura”, “Hey Aura”, or accent-tolerant variations
Continuously listens but only responds when addressed
Ensures privacy, natural interaction, and zero friction

No touch. No complex setup. No expensive hardware.
Just a smartphone, your voice, and AI working together.

How We Built It

Aura is built on a modular Python architecture that integrates multiple AI models into a seamless real-time pipeline. Python was chosen for its extensive AI ecosystem and rapid prototyping capabilities.

Speech Recognition

Implemented via the Google Speech Recognition API
Custom wake-word matching handles:
- Indian English accents
- Mispronunciations
- Background noise
Recognizes variations like:
- aura, ara, ora, aurora
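
A minimal sketch of how wake-word matching along these lines could work. The variant list comes from the write-up above; the word-level matching strategy and the name `strip_wake_word` are assumptions made for illustration, not necessarily Aura's actual code.

```python
import re

# Variants listed above; the matching strategy itself is an assumption.
WAKE_WORDS = {"aura", "ara", "ora", "aurora"}

def strip_wake_word(transcript):
    """Return the command that follows a wake word, or None if not addressed."""
    # Lowercase and drop punctuation so "Aura," and "aura" compare equal.
    words = re.findall(r"[a-z']+", transcript.lower())
    for i, word in enumerate(words):
        if word in WAKE_WORDS:
            return " ".join(words[i + 1:])
    return None
```

Returning None for transcripts without a wake word lets the main loop keep listening without responding, which matches the "only responds when addressed" behaviour.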

Computer Vision

Real-time object detection using YOLOv8
Optimized inference size: 416 × 416
Achieves a response time under 2 seconds
Calculates object position relative to frame center
Uses an 80-pixel threshold for left / right / center detection
Color-coded bounding boxes:
- Green → target object
- Blue → other objects
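
The left / right / center decision can be sketched as below. This assumes bounding boxes arrive as `(x1, y1, x2, y2)` pixel coordinates and that the camera image is not mirrored; the function name `direction_for` is hypothetical.

```python
def direction_for(box, frame_width, threshold=80):
    """Map a bounding box (x1, y1, x2, y2) to a spoken direction."""
    x1, _, x2, _ = box
    # Horizontal distance of the box center from the frame center, in pixels.
    offset = (x1 + x2) / 2 - frame_width / 2
    if offset < -threshold:
        return "Move left"
    if offset > threshold:
        return "Move right"
    # Within the 80-pixel band around the center.
    return "Straight ahead"
```

For a 640-pixel-wide frame, a box centered at x = 100 yields "Move left", while one centered at x = 330 falls inside the band and yields "Straight ahead".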

Scene Description

Uses a vision-language model (BLIP) for image captioning
Generates concise, natural descriptions of the surroundings
Optimized for accessibility-focused understanding rather than visual detail overload
Max length: 40 tokens, for fast responses without cognitive overload

Text Extraction & Summarization

OCR using Tesseract with grayscale preprocessing
Long text summarized using a lightweight summarizer
Graceful fallback:
- If summarization fails → read original text

Voice Output

Offline text-to-speech using pyttsx3
Engine recreated per speech event to avoid threading issues
Fully Windows compatible

Hardware Compatibility

Works with:
- Smartphone camera via DroidCam
- Built-in webcam fallback
Supports:
- Cap-mounted cameras
- Handheld phones
- Pocket cameras

System Architecture

Continuous loop:

Camera capture
Wake word detection
Intent classification
Mode execution (Detect / Describe / Read)
Voice feedback

Comprehensive error handling ensures graceful degradation instead of crashes.
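
The intent classification step could be as simple as keyword matching on the recognized command. The sketch below assumes exactly that; the keyword lists, filler words, and function names are hypothetical and chosen to cover the example phrases in this write-up.

```python
import re

# Hypothetical keyword lists, one per mode.
INTENT_KEYWORDS = {
    "detect": ("find", "detect", "locate"),
    "describe": ("describe", "surroundings"),
    "read": ("read",),
}
# Words dropped when extracting the object to search for.
FILLER_WORDS = {"find", "detect", "locate", "my", "a", "an", "the"}

def _words(command):
    return re.findall(r"[a-z]+", command.lower())

def classify_intent(command):
    """Return "detect", "describe", "read", or None for an unknown command."""
    words = _words(command)
    for mode, keywords in INTENT_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return mode
    return None

def detect_target(command):
    """Extract the object name from a detect command."""
    return " ".join(w for w in _words(command) if w not in FILLER_WORDS)
```

With these lists, "find my water bottle" maps to the detect mode with the target "water bottle", and an unrecognized command returns None so the loop can ask the user to repeat.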

Challenges We Ran Into

Latency

Running multiple heavy models sequentially caused delays.

Solutions:

Reduced inference size (640 → 416)
Achieved ~40% faster processing
Added 2.5-second voice cooldowns
Optimized frame capture to 20 FPS
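
One way to implement a voice cooldown is a small gate that is checked before each spoken prompt. This is a sketch under assumptions: the class name `VoiceCooldown` and the injectable `clock` parameter are illustrative, and only the 2.5-second interval comes from the list above.

```python
import time

class VoiceCooldown:
    """Allow a spoken prompt at most once every `seconds` seconds."""

    def __init__(self, seconds=2.5, clock=time.monotonic):
        self.seconds = seconds
        self.clock = clock  # injectable so the gate can be tested without sleeping
        self._last = None

    def ready(self):
        """Return True and restart the cooldown if enough time has passed."""
        now = self.clock()
        if self._last is None or now - self._last >= self.seconds:
            self._last = now
            return True
        return False
```

In a frame loop running at 20 FPS, calling `ready()` before each prompt keeps guidance from being repeated on every frame.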

Speech Recognition Accuracy

Indian English accents and background noise reduced recognition accuracy.

Solutions:

Flexible wake-word variations
Ambient noise calibration
5-second phrase limits
Confidence thresholds

Memory Constraints

Loading multiple transformer models caused crashes.

Solutions:

Sequential model loading
Graceful fallbacks
TTS engine recreation
Extensive exception handling

Directional Guidance

Intuitive navigation required calibration.

Solutions:

80-pixel center threshold
60-second search timeout
Audio cooldowns
Color-coded visual feedback
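
The search timeout can be sketched as a loop with a deadline. Everything here is an assumption for illustration: `detect` is a hypothetical callable returning a direction string or None when the target is not visible, `speak` is a hypothetical output callable, and treating "Straight ahead" as "found" is one possible stopping rule.

```python
import time

def search(target, detect, speak, timeout=60.0, clock=time.monotonic):
    """Guide the user toward `target` until it is centered or `timeout` elapses."""
    deadline = clock() + timeout
    while clock() < deadline:
        direction = detect(target)  # None when the target is not in view
        if direction == "Straight ahead":
            speak(f"{target} is straight ahead")
            return True
        if direction is not None:
            speak(direction)
    speak(f"Could not find {target}")
    return False
```

A real loop would also rate-limit the spoken directions; that is left out here to keep the timeout logic visible.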

Accomplishments We’re Proud Of

Built a fully integrated, multi-mode assistive AI system
Achieved object detection with a response time under 2 seconds
Created a completely hands-free, voice-first experience
Designed for error resilience and reliability
Reduced cost by up to ~95% compared to existing assistive devices

Cost Comparison

Aura setup: ~₹22,000
Traditional assistive devices: ₹2–5 lakhs
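
As a quick check of the saving these two figures imply, the arithmetic can be worked through as below. The helper name `percent_saving` is hypothetical; the prices are the ones quoted above, with 1 lakh = 100,000.

```python
LAKH = 100_000
AURA_COST = 22_000  # approximate Aura setup cost in rupees, from the figures above

def percent_saving(cost, reference):
    """Percentage saved by paying `cost` instead of `reference`."""
    return round(100 * (1 - cost / reference), 1)

low = percent_saving(AURA_COST, 2 * LAKH)   # against a ₹2 lakh device
high = percent_saving(AURA_COST, 5 * LAKH)  # against a ₹5 lakh device
print(f"Saving: {low}% to {high}%")
```

The saving works out to about 89% against the lower end of the range and about 96% against the upper end.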

Potential to help millions who currently have no access to affordable assistive technology.

What We Learned

Accessibility must drive design—not be an afterthought
Speed matters as much as accuracy in real-world systems
Reliability matters more than raw model performance
Error handling is mission-critical in accessibility tech
Voice AI must adapt culturally and linguistically
Shipping a helpful product beats chasing perfection

What’s Next for Aura

Distance estimation using monocular depth
“The chair is 2 meters ahead” guidance
Obstacle detection at:
- Knee level
- Waist level
- Head level
Offline speech recognition (no internet dependency)
Multilingual support
Field testing with 50+ visually impaired users
Android app prototyping

Because independence shouldn't be a luxury. It should be a right.

Built With

blip
distilbart-cnn
opencv
pillow
python
pyttsx3
speechrecognition
tesseract
transformers
yolov8

Updates

Jessy Kiruba started this project — Feb 25, 2026 07:15 AM EST
