Aura - Intelligent Vision Assistant for the Visually Impaired

Inspiration

Close your eyes for a moment. Now try to find your water bottle. Try to understand if someone is sitting in front of you. Try to read the label on a medicine bottle.

For over 8 million visually impaired individuals in India, this is not a thought experiment—it's everyday reality.

We realized that most assistive solutions are:

  • Extremely expensive (₹50,000 to ₹5,00,000)
  • Dependent on specialized hardware
  • Limited to a single function

A smart cane detects obstacles but can't read text. A text-to-speech device reads but can't locate objects.

Meanwhile, powerful AI models like YOLO and BLIP exist, but we could not find a tool that unified them into a single, voice-controlled assistant.

We asked:

What if a ₹2,000 smartphone camera + a laptop could become intelligent eyes?

Aura was born from this belief:

Independence should not be a luxury. Technology should restore dignity.


What It Does

Aura is a real-time, voice-controlled AI assistant that transforms any camera into an intelligent guide, completely hands-free.

Scene Understanding Mode

  • Automatically analyzes surroundings every 10 seconds
  • Uses BLIP image captioning
  • Provides natural voice descriptions:
    • “I see a person sitting at a desk with a laptop”
    • “I see a busy street with cars and pedestrians”
  • No screen needed, pure voice interaction

Object Search Mode

  • Activated by pressing ENTER or by voice command
  • User says what they want (e.g., “water bottle”)
  • YOLOv8 detects the object in real time
  • Provides directional guidance:
    • “Move left”
    • “Move right”
    • “Straight ahead”
  • Confirms when the object is centered

Turns any room into a navigable space


Hardware

  • Laptop webcam OR smartphone via DroidCam
  • No specialized devices
  • Fully portable and affordable

How We Built It

Aura is built using a modular Python architecture integrating multiple AI systems:

Computer Vision

  • YOLOv8 (Ultralytics) for object detection
  • Optimized inference: 416×416 resolution
  • Sub-2-second response time

Scene Understanding

  • Salesforce BLIP via Hugging Face
  • Caption length: max 40 tokens (quick & clear)
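
As a rough illustration of the captioning step, the sketch below assumes the `transformers` library and the `Salesforce/blip-image-captioning-base` checkpoint. The write-up only names "Salesforce BLIP via Hugging Face", so the exact checkpoint, the helper names (`load_blip`, `to_spoken`, `describe_scene`), and the "I see" phrasing rule are assumptions. The 40-token cap mirrors the caption limit listed above.

```python
MAX_CAPTION_TOKENS = 40  # caption cap from the list above


def load_blip(checkpoint="Salesforce/blip-image-captioning-base"):
    """Load the BLIP processor and model (checkpoint name is an assumption)."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model


def to_spoken(caption):
    """Turn a raw caption into a sentence suitable for speech output."""
    caption = caption.strip().rstrip(".")
    return f"I see {caption}" if caption else "I could not describe the scene"


def describe_scene(image, processor, model, max_tokens=MAX_CAPTION_TOKENS):
    """Caption one RGB image and phrase the result for the voice output.

    OpenCV frames are BGR, so convert them to RGB before calling this.
    """
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_tokens)
    caption = processor.decode(output_ids[0], skip_special_tokens=True)
    return to_spoken(caption)
```

Passing the processor and model in as arguments keeps the captioning function easy to exercise with stand-ins, without downloading weights.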

Voice Interaction

  • SpeechRecognition + Google Speech API
  • Windows SpeechSynthesizer for audio output

System Flow

  1. Normal Mode → Scene descriptions every 10 seconds
  2. ENTER → Object Search Mode
  3. Voice input captured
  4. YOLO detects object
  5. Direction calculated (80-pixel threshold)
  6. Audio guidance provided
  7. Return to Normal Mode
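
The flow above can be read as a two-state loop. The sketch below is one possible way to structure it, not the project's actual code: the `Mode` enum, the `AuraLoop` class, and the injected `describe`, `search`, and `clock` callables are all hypothetical names chosen for illustration.

```python
import time
from enum import Enum

SCENE_INTERVAL_S = 10.0  # scene description period from the flow above


class Mode(Enum):
    NORMAL = "normal"
    SEARCH = "search"


class AuraLoop:
    """Hypothetical two-mode controller; callables are injected for testing."""

    def __init__(self, describe, search, clock=time.monotonic):
        self.describe = describe  # speaks one scene description
        self.search = search      # runs one complete object search
        self.clock = clock
        self.mode = Mode.NORMAL
        self.last_scene = None    # time of the last scene description

    def request_search(self):
        """Call this when ENTER (or a voice command) is detected."""
        self.mode = Mode.SEARCH

    def tick(self):
        """Run one loop iteration and return the mode that was handled."""
        if self.mode is Mode.SEARCH:
            self.search()
            self.mode = Mode.NORMAL         # step 7: back to Normal Mode
            self.last_scene = self.clock()  # restart the 10-second timer
            return Mode.SEARCH
        now = self.clock()
        if self.last_scene is None or now - self.last_scene >= SCENE_INTERVAL_S:
            self.describe()
            self.last_scene = now
        return Mode.NORMAL
```

Restarting the timer after a search is a design choice made here so that a scene description does not interrupt the user immediately after guidance ends.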

Tech Stack

  • Python 3.10
  • OpenCV (camera + frame processing)
  • PyTorch (deep learning backend)
  • DroidCam (mobile camera streaming)

Challenges We Faced

Latency Issues

  • YOLO + BLIP caused delays
  • Solution:
    • Reduced YOLO inference resolution to 416×416 (about 40% faster)
    • Resized captured frames to 640×480
    • Buffer optimization

Speech Recognition Noise

  • Background noise + accents
  • Solution:
    • Ambient noise adjustment
    • 5-second phrase limit
    • Confidence thresholds
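
The three mitigations above might fit together roughly as follows, using the SpeechRecognition library named in the stack. The `MIN_CONFIDENCE` value, the helper names, and the rule of rejecting results that carry no confidence score are assumptions; the write-up does not state the threshold that was used.

```python
MIN_CONFIDENCE = 0.6  # assumed value; the write-up gives no number
PHRASE_LIMIT_S = 5    # 5-second phrase limit from the list above


def pick_transcript(result, min_confidence=MIN_CONFIDENCE):
    """Return the top transcript from a show_all=True response, or None."""
    if not isinstance(result, dict):
        return None  # an empty response means nothing was recognized
    alternatives = result.get("alternative") or []
    if not alternatives:
        return None
    best = alternatives[0]
    # A missing confidence is treated as 0, so such results are rejected.
    if best.get("confidence", 0.0) < min_confidence:
        return None
    return best.get("transcript", "").strip().lower() or None


def listen_once():
    """Capture one phrase from the microphone and return text or None."""
    import speech_recognition as sr  # lazy import; needs a working microphone

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=1)
        audio = recognizer.listen(source, phrase_time_limit=PHRASE_LIMIT_S)
    try:
        result = recognizer.recognize_google(audio, show_all=True)
    except sr.RequestError:
        return None  # API unreachable; the caller can ask the user to repeat
    return pick_transcript(result)
```

Returning `None` for every failure case lets the caller respond with a single spoken prompt such as "please repeat" instead of handling each error separately.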

Resource Constraints

  • Multiple models caused memory issues
  • Solution:
    • Sequential loading
    • Graceful fallback handling
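
A minimal sketch of sequential loading with graceful fallback, assuming each model is wrapped in a zero-argument loader function. The names `load_models` and `available_modes`, and the `"yolo"` and `"blip"` keys, are hypothetical.

```python
def load_models(loaders):
    """Load models one at a time and return (loaded, failed) dictionaries."""
    loaded, failed = {}, {}
    for name, loader in loaders.items():
        try:
            loaded[name] = loader()
        except Exception as exc:  # e.g. MemoryError or a missing weights file
            failed[name] = str(exc)
    return loaded, failed


def available_modes(loaded):
    """Report which modes can run with the models that did load."""
    modes = []
    if "blip" in loaded:
        modes.append("scene")
    if "yolo" in loaded:
        modes.append("search")
    return modes
```

With this shape, a failure to load one model disables only the mode that depends on it, and the assistant can announce which modes remain available.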

Direction Accuracy

  • Needed precise guidance
  • Solution:
    • 80-pixel threshold calibration
    • Timeout for failed searches
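
One way the 80-pixel threshold could turn a detection into guidance is sketched below. It assumes the threshold is a dead zone around the horizontal center of the frame and that "Move left" means the object sits left of center; the function name and the box format `(x1, y1, x2, y2)` are illustrative choices.

```python
CENTER_THRESHOLD_PX = 80  # dead-zone half-width from the write-up


def direction_for(box, frame_width, threshold=CENTER_THRESHOLD_PX):
    """Map a bounding box (x1, y1, x2, y2) to a spoken direction."""
    x1, _, x2, _ = box
    # Signed horizontal distance from the frame center to the box center.
    offset = (x1 + x2) / 2 - frame_width / 2
    if offset < -threshold:
        return "Move left"
    if offset > threshold:
        return "Move right"
    return "Straight ahead"
```

For a 640-pixel-wide frame, this treats any box whose center falls between x = 240 and x = 400 as centered. A wider dead zone gives steadier guidance at the cost of precision.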

Hardware Compatibility

  • Different camera behaviors
  • Solution:
    • Automatic fallback (DroidCam → Webcam)
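
The fallback can be sketched as "try each source in order and keep the first that opens". The helper below is an assumption about how this might be done; it takes the opener as a parameter so that `cv2.VideoCapture` can be plugged in without the sketch depending on OpenCV.

```python
def open_first_camera(sources, opener):
    """Return (source, capture) for the first source that opens.

    `opener` is expected to behave like cv2.VideoCapture: it takes a source
    and returns an object with isOpened() and release() methods.
    Returns (None, None) when no source can be opened.
    """
    for source in sources:
        try:
            capture = opener(source)
        except Exception:
            continue  # treat an opener error like an unavailable camera
        if capture is not None and capture.isOpened():
            return source, capture
        if capture is not None:
            capture.release()  # free the failed handle before moving on
    return None, None
```

With OpenCV this might be called as `open_first_camera([droidcam_url, 0], cv2.VideoCapture)`, where `droidcam_url` is a placeholder for the stream address shown in the DroidCam app and `0` is the default webcam index.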

Accomplishments

  • Built a multi-mode AI assistant combining scene understanding + object search
  • Achieved real-time detection (<2 seconds) on standard hardware
  • Created a 100% voice-first interface (no screen required)
  • Reduced cost by ~95% compared to existing solutions
  • Designed an error-resilient system that falls back gracefully instead of crashing mid-use

What We Learned

  • Accessibility must be core design, not an afterthought
  • Speed matters → Slow AI = unusable AI
  • Reliability > Accuracy in real-world use
  • Voice systems must handle real-world speech diversity
  • Shipping a working product beats chasing perfection

What's Next

  • Distance estimation (e.g., “2 meters ahead”)
  • Multi-level obstacle detection (knee, waist, head)
  • Android app using TensorFlow Lite
  • Offline speech recognition (Vosk)
  • Multilingual support
  • Hardware-agnostic platform

Ultimate Goal

Independence shouldn't be a luxury; it should be a right.
