Inspiration

"Close your eyes and try to navigate the room you're sitting in."
That simple thought experiment became our starting point. When you close your eyes, the world doesn't disappear—it becomes sound, touch, and memory. For over 8 million visually impaired individuals in India alone, this is everyday reality.

We asked ourselves a question:
In an era of cutting-edge AI, why do assistive solutions remain so limited?

Devices available today tend to be prohibitively expensive, tied to specialized hardware, or restricted to a single function:

  • A smart cane costs ₹50,000 but only detects obstacles
  • A text-to-speech device costs ₹15,000 but only reads
  • A separate object identifier costs even more

What if we could build one unified solution using just a smartphone and the power of AI?
What if technology could turn voice into vision?

Aura was born from this idea—a belief that independence shouldn't be a luxury.

By combining computer vision, natural language processing, and voice AI, we've created a companion that truly listens and responds. One that empowers through conversation, not passive output.

Because technology at its best doesn't just solve problems—it restores dignity.


What It Does

Aura is a real-time, voice-controlled AI assistant that transforms any smartphone camera into a pair of intelligent eyes. Through simple conversation, users can access three life-changing capabilities—completely hands-free.

Detect

When a user says:

  • "Aura, find my water bottle"
  • "Detect a chair"

Aura:

  • Identifies the object using real-time object detection
  • Calculates its position relative to the user
  • Provides intuitive directional audio guidance:
    • “Move left”
    • “Move right”
    • “Straight ahead”

Guidance continues until the object is found.
A connected display can optionally show visual bounding boxes for sighted assistants.


Describe

When a user says:

  • "Aura, describe my surroundings"

Aura captures the scene and generates rich natural language descriptions, such as:

  • “A living room with a brown couch and a wooden coffee table”
  • “A busy street with people walking and cars passing by”

Users receive an instant mental picture of their environment—without needing to see it.


Read

When a user says:

  • "Aura, read this text"

Aura:

  • Extracts printed text from the camera
  • Summarizes long documents when needed
  • Reads everything aloud clearly

Works for:

  • Medicine labels
  • Restaurant menus
  • Street signs
  • Documents
  • Any printed text sighted people take for granted

Voice-First Interaction

  • Activated using a simple wake word:
    • “Aura”, “Hey Aura”, or accent-tolerant variations
  • Continuously listens but only responds when addressed
  • Ensures privacy, natural interaction, and zero friction

No touch. No complex setup. No expensive hardware.
Just a smartphone, your voice, and AI working together.


How We Built It

Aura is built on a modular Python architecture that integrates multiple AI models into a seamless real-time pipeline. Python was chosen for its extensive AI ecosystem and rapid prototyping capabilities.

Speech Recognition

  • Implemented via Google Speech Recognition API
  • Custom wake-word matching handles:
    • Indian English accents
    • Mispronunciations
    • Background noise
  • Recognizes variations like:
    • aura, ara, ora, aurora
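
The wake-word matching described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the variant list comes from the write-up, while the regex approach and the function name `extract_command` are assumptions made for the example.

```python
import re

# Variants listed in the write-up. "aurora" comes first so the longer
# word is tried before the shorter ones.
WAKE_WORDS = ("aurora", "aura", "ara", "ora")

# Optional "hey", then one variant as a whole word, then trailing
# spaces or commas. The matching strategy itself is an assumption.
_WAKE_RE = re.compile(
    r"\b(?:hey\s+)?(?:" + "|".join(WAKE_WORDS) + r")\b[\s,]*",
    re.IGNORECASE,
)


def extract_command(transcript):
    """Return the text after the wake word, or None if none was heard."""
    match = _WAKE_RE.search(transcript)
    if match is None:
        return None
    return transcript[match.end():].strip().lower()
```

Matching on whole words keeps ordinary words that merely contain "ara" or "ora" from triggering the assistant. A transcript that contains only the wake word yields an empty command string.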

Computer Vision

  • Real-time object detection using YOLOv8
  • Optimized inference size: 416 × 416
  • Achieves under 2-second response time
  • Calculates object position relative to frame center
  • Uses an 80-pixel threshold for left / right / center detection
  • Color-coded bounding boxes:
    • Green → target object
    • Blue → other objects
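
As a rough sketch of the left / right / center decision, the function below compares the horizontal center of a bounding box with the center of the frame, using the 80-pixel threshold from the list above. The function name, its parameters, and the assumption that the camera image is not mirrored are illustrative and not taken from the project.

```python
def direction_for_box(x_min, x_max, frame_width, threshold=80):
    """Map a bounding box's horizontal position to a spoken hint.

    Assumes an unmirrored camera facing the same way as the user, so an
    object on the left of the frame is on the user's left.
    """
    box_center = (x_min + x_max) / 2
    offset = box_center - frame_width / 2  # negative means left of center

    if offset < -threshold:
        return "Move left"
    if offset > threshold:
        return "Move right"
    return "Straight ahead"
```

For a 640-pixel-wide frame, a box spanning x = 50 to 150 yields "Move left", while a box spanning 300 to 360 falls inside the center band and yields "Straight ahead".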

Scene Description

  • Uses a vision-language model (BLIP) for image captioning
  • Generates concise, natural descriptions of the surroundings
  • Optimized for accessibility-focused understanding rather than visual detail overload
  • Max length: 40 tokens (keeps responses fast and avoids cognitive overload)
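
One way the captioning step might look with the Hugging Face `transformers` BLIP classes is sketched below. The checkpoint name, the `clean_caption` helper, and the fallback sentence are assumptions for illustration; the write-up only states that BLIP is used with a 40-token limit.

```python
def clean_caption(raw_caption):
    """Tidy a raw caption for speech: trim, capitalize, end with a period."""
    caption = " ".join(raw_caption.split())
    if not caption:
        return "I could not describe the scene."  # hypothetical fallback
    caption = caption[0].upper() + caption[1:]
    return caption if caption.endswith((".", "!", "?")) else caption + "."


def describe_scene(image_path, max_length=40):
    """Caption an image with BLIP (sketch; the checkpoint is an assumption)."""
    # Heavy imports are deferred so importing this module stays cheap.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    checkpoint = "Salesforce/blip-image-captioning-base"  # assumed checkpoint
    # A real pipeline would load these once and reuse them across calls.
    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=max_length)
    caption = processor.decode(output_ids[0], skip_special_tokens=True)
    return clean_caption(caption)
```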

Text Extraction & Summarization

  • OCR using Tesseract with grayscale preprocessing
  • Long text summarized using a lightweight summarizer
  • Graceful fallback:
    • If summarization fails → read original text
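
The fallback behaviour above can be illustrated with a small function that decides what to read aloud. It is a sketch under assumptions: the OCR text would come from Tesseract (for example via `pytesseract.image_to_string`), `summarize` stands in for whatever summarizer is used, and the 60-word cut-off and the "no text" message are made up for the example.

```python
def text_to_read(ocr_text, summarize, max_words=60):
    """Choose what to speak: the original text, or a summary if it is long.

    `summarize` is any callable taking and returning a string. If it
    fails or returns nothing, the original text is read instead.
    """
    # Collapse OCR line breaks and repeated spaces into single spaces.
    text = " ".join(ocr_text.split())
    if not text:
        return "No readable text found."  # hypothetical message

    if len(text.split()) <= max_words:
        return text  # short enough to read in full

    try:
        summary = summarize(text)
    except Exception:
        return text  # graceful fallback: read the original text

    summary = (summary or "").strip()
    return summary if summary else text
```

Passing the summarizer in as a parameter keeps the fallback logic independent of any particular model.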

Voice Output

  • Offline text-to-speech using pyttsx3
  • Engine recreated per speech event to avoid threading issues
  • Fully Windows compatible
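
A minimal sketch of the "new engine per speech event" pattern is shown below. The `engine_factory` parameter is an assumption added so the function can be exercised without audio hardware; by default it falls back to `pyttsx3.init`.

```python
def speak(text, engine_factory=None):
    """Speak `text` with a freshly created TTS engine, then release it."""
    if engine_factory is None:
        import pyttsx3  # deferred so the module imports without pyttsx3

        engine_factory = pyttsx3.init

    engine = engine_factory()  # a new engine for every speech event
    try:
        engine.say(text)
        engine.runAndWait()  # blocks until the utterance has finished
    finally:
        engine.stop()
```

Creating and discarding the engine on every call trades a little start-up time for not sharing one engine object across calls.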

Hardware Compatibility

  • Works with:
    • Smartphone camera via DroidCam
    • Built-in webcam fallback
  • Supports:
    • Cap-mounted cameras
    • Handheld phones
    • Pocket cameras

System Architecture

Continuous loop:

  1. Camera capture
  2. Wake word detection
  3. Intent classification
  4. Mode execution (Detect / Describe / Read)
  5. Voice feedback

Comprehensive error handling ensures graceful degradation instead of crashes.
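
Steps 3 to 5 of the loop can be sketched as a keyword-based intent classifier plus a dispatcher that never lets a failing mode crash the loop. The keyword lists, function names, and spoken messages are assumptions for illustration; the write-up does not say how intents are actually classified.

```python
import re

# Hypothetical keyword lists for each mode; checked in this order.
INTENT_KEYWORDS = {
    "detect": ("find", "detect", "where"),
    "describe": ("describe", "surroundings"),
    "read": ("read", "text"),
}


def classify_intent(command):
    """Return 'detect', 'describe', 'read', or None for an unknown command."""
    words = re.findall(r"[a-z']+", command.lower())
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return intent
    return None


def handle_command(command, handlers, speak):
    """Run the handler for the command's intent and report failures aloud."""
    intent = classify_intent(command)
    if intent is None:
        speak("Sorry, I did not understand that.")
        return None
    try:
        handlers[intent](command)
    except Exception:
        # Degrade gracefully: tell the user instead of crashing the loop.
        speak("Something went wrong. Please try again.")
    return intent
```

Here `handlers` is a dictionary mapping each intent name to a callable for that mode, and `speak` is whatever function produces voice feedback.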


Challenges We Ran Into

Latency

Running multiple heavy models sequentially caused delays.

Solutions:

  • Reduced inference size (640 → 416)
  • Achieved ~40% faster processing
  • Added 2.5-second voice cooldowns
  • Optimized frame capture to 20 FPS

Speech Recognition Accuracy

Indian English accents and background noise reduced recognition accuracy.

Solutions:

  • Flexible wake-word variations
  • Ambient noise calibration
  • 5-second phrase limits
  • Confidence thresholds
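
The calibration, phrase limit, and confidence check listed above might be combined roughly as follows with the `SpeechRecognition` package. This is a sketch under assumptions: the 0.6 threshold and the one-second calibration are illustrative values, and the shape of the `show_all=True` result (a dictionary with an "alternative" list whose entries may carry a "confidence" field) reflects the library's usual output rather than anything stated in the write-up.

```python
def pick_transcript(result, min_confidence=0.6):
    """Return the most confident transcript, or None if none is good enough.

    `result` is assumed to look like
    {"alternative": [{"transcript": "...", "confidence": 0.9}, ...]}.
    Alternatives without a confidence value are ignored.
    """
    if not isinstance(result, dict):
        return None
    best_text, best_confidence = None, -1.0
    for alternative in result.get("alternative", []):
        confidence = alternative.get("confidence")
        if confidence is not None and confidence > best_confidence:
            best_text, best_confidence = alternative.get("transcript"), confidence
    if best_text is None or best_confidence < min_confidence:
        return None
    return best_text


def listen_once(min_confidence=0.6):
    """Capture one phrase from the microphone and return its transcript."""
    import speech_recognition as sr  # deferred: needs a microphone backend

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Sample background noise briefly to set the energy threshold.
        recognizer.adjust_for_ambient_noise(source, duration=1)
        # Stop recording after 5 seconds of speech at most.
        audio = recognizer.listen(source, phrase_time_limit=5)
    try:
        result = recognizer.recognize_google(audio, show_all=True)
    except (sr.UnknownValueError, sr.RequestError):
        return None  # unintelligible audio or network failure
    return pick_transcript(result, min_confidence)
```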

Memory Constraints

Loading multiple transformer models caused crashes.

Solutions:

  • Sequential model loading
  • Graceful fallbacks
  • TTS engine recreation
  • Extensive exception handling

Directional Guidance

Making left / right / center guidance feel intuitive required calibrating thresholds and timing.

Solutions:

  • 80-pixel center threshold
  • 60-second search timeout
  • Audio cooldowns
  • Color-coded visual feedback
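
The timeout and cooldown can be illustrated with a small search loop. It is a sketch, not the project's implementation: `next_hint` stands for any function returning the current direction hint (or None when the target is not visible), the search is assumed to end once the target is centered, and the final "not found" message is hypothetical.

```python
import time


def guided_search(next_hint, speak, timeout=60.0, cooldown=2.5,
                  clock=time.monotonic, sleep=time.sleep):
    """Speak direction hints until the target is centered or time runs out.

    Returns True if "Straight ahead" was announced before the timeout.
    `clock` and `sleep` are injectable so the loop can be tested quickly.
    """
    start = clock()
    last_spoken = None
    while clock() - start < timeout:
        hint = next_hint()
        now = clock()
        cooled_down = last_spoken is None or now - last_spoken >= cooldown
        if hint is not None and cooled_down:
            speak(hint)
            last_spoken = now
            if hint == "Straight ahead":  # assumed meaning of "found"
                return True
        sleep(0.05)  # brief pause between frames
    speak("I could not find it.")  # hypothetical timeout message
    return False
```

Spacing hints at least 2.5 seconds apart keeps a new instruction from starting before the previous one has been heard, and the timeout ensures the user is told when a search gives up.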

Accomplishments We’re Proud Of

  • Built a fully integrated, multi-mode assistive AI system
  • Achieved object detection responses in under 2 seconds
  • Created a completely hands-free, voice-first experience
  • Designed for error resilience and reliability
  • Reduced cost by roughly 89 to 96% compared to existing assistive devices

Cost Comparison

  • Aura setup: ~₹22,000
  • Traditional assistive devices: ₹2–5 lakhs

Aura has the potential to help millions of people who currently have no access to affordable assistive technology.


What We Learned

  • Accessibility must drive design—not be an afterthought
  • Speed matters as much as accuracy in real-world systems
  • Reliability matters more than raw model performance
  • Error handling is mission-critical in accessibility tech
  • Voice AI must adapt culturally and linguistically
  • Shipping a helpful product beats chasing perfection

What’s Next for Aura

  • Distance estimation using monocular depth
  • “The chair is 2 meters ahead” guidance
  • Obstacle detection at:
    • Knee level
    • Waist level
    • Head level
  • Offline speech recognition (no internet dependency)
  • Multilingual support
  • Field testing with 50+ visually impaired users
  • Android app prototyping

Because independence shouldn't be a luxury. It should be a right.
