Inspiration

"Close your eyes and look up at the night sky, and build what you see."

That prompt hit differently for us, because when you close your eyes, the world doesn't disappear; it becomes sound, touch, and memory. For millions of visually impaired individuals, this is everyday reality.

We asked ourselves: What if technology could turn voice into vision?

Independence is often limited not by ability, but by accessibility. Visually impaired individuals face daily challenges: finding misplaced objects such as keys, a phone, a water bottle, or medication; understanding unfamiliar surroundings such as new rooms, streets, and public spaces; and reading printed text such as medicine labels, restaurant menus, signs, and documents.

While AI has advanced rapidly, most assistive solutions remain expensive, with specialized devices costing between ₹50,000 and ₹5,00,000. They are hardware-specific, require proprietary equipment, and are typically limited to a single function.

We wanted to build something different. Something that truly listens and responds. Something that empowers through conversation, not passive output.

Aura was born from a simple belief: *AI should be a companion, not just a tool*.

What it does

Aura is a real-time, voice-controlled AI assistant that performs three core functions, all hands-free:

1. Detect

  • User: "Detect a chair" or "Find my bottle."
  • Aura:
    • Identifies the object in real time using computer vision.
    • Calculates its position relative to the user.
    • Provides clear, directional audio guidance:
      • "Move left"
      • "Move right"
      • "Straight ahead"
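
The guidance logic can be thought of as comparing the detected bounding box's horizontal center with the center of the camera frame. Below is a minimal sketch of that idea; the 15% dead zone and the helper name are illustrative assumptions, not Aura's exact values.

```python
# Minimal sketch: map a detected bounding box to a spoken directional cue.
# The dead-zone threshold and function name are illustrative assumptions.

def direction_cue(box, frame_width, dead_zone=0.15):
    """box = (x1, y1, x2, y2) in pixels from the object detector."""
    x1, _, x2, _ = box
    object_center = (x1 + x2) / 2
    offset = (object_center - frame_width / 2) / frame_width  # roughly -0.5 .. 0.5

    if offset < -dead_zone:
        return "Move left"
    if offset > dead_zone:
        return "Move right"
    return "Straight ahead"

# Example: a box centered at x = 520 in a 640-pixel-wide frame
print(direction_cue((480, 100, 560, 300), frame_width=640))  # "Move right"
```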

2. Describe

  • User: "Describe my surroundings."
  • Aura: Analyzes the live camera feed and generates a natural-language explanation of the scene, creating a mental image for the user.

3. Read

  • User: "Read this text."
  • Aura:
    • Extracts printed text from the camera feed using Optical Character Recognition (OCR).
    • Converts it into digital text.
    • Reads it aloud clearly and naturally.

How we built it

Aura is built using a modular Python architecture that integrates multiple AI models into a real-time assistive pipeline:

Core Technologies & Models

1. Speech Recognition & Synthesis

  • Google Speech Recognition API via speech_recognition library for converting voice commands to text
  • Custom wake word detection ("Aura", "hey Aura", "ara", etc.) with flexible pattern matching for accessibility
  • pyttsx3 for offline text-to-speech with optimized rate (165) and volume settings
  • Ambient noise adjustment for improved accuracy in real-world environments
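
As a rough illustration, here is a minimal sketch of this voice front end using the speech_recognition and pyttsx3 packages. The speech rate of 165 and the wake-word variants come from the description above; the helper names, calibration duration, and phrase limit are assumptions.

```python
import speech_recognition as sr
import pyttsx3

WAKE_WORDS = ("aura", "hey aura", "ara", "ora")

def speak(text, rate=165):
    # A fresh engine per call keeps pyttsx3 stable across repeated use.
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)
    engine.setProperty("volume", 1.0)
    engine.say(text)
    engine.runAndWait()

def heard_wake_word(recognizer, mic):
    with mic as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)  # calibrate for background noise
        audio = recognizer.listen(source, phrase_time_limit=5)
    try:
        heard = recognizer.recognize_google(audio).lower()  # Google Speech Recognition API
    except (sr.UnknownValueError, sr.RequestError):
        return False
    return any(word in heard for word in WAKE_WORDS)

recognizer, mic = sr.Recognizer(), sr.Microphone()
if heard_wake_word(recognizer, mic):
    speak("Yes, I'm listening.")
```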

2. Computer Vision Models

  • YOLOv8 (Ultralytics) for real-time object detection with directional guidance
    • Optimized with imgsz=416 for faster inference
    • Confidence threshold of 0.4 for reliable detection
    • Custom bounding box visualization and target tracking
  • BLIP (Bootstrapping Language-Image Pre-training) from Salesforce for scene description
    • Generates natural language captions (max_length=40) of surroundings
    • Converts camera frames to PIL images for processing
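
For the description mode, a caption can be generated from a single camera frame roughly as follows. The PIL conversion and max_length=40 mirror the points above; the exact BLIP checkpoint name and function name are assumptions.

```python
import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative checkpoint; the project may use a different BLIP variant.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def describe_frame(frame_bgr):
    # OpenCV delivers BGR frames; BLIP expects an RGB PIL image.
    image = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=40)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```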

3. Text Recognition & Processing

  • Tesseract OCR for extracting printed text from images
    • Grayscale conversion for improved text detection
    • PIL integration for image preprocessing
  • DistilBART-CNN summarization model (sshleifer/distilbart-cnn-12-6)
    • Fallback handling if model fails to load
    • Generates concise summaries (max_length=80) of extracted text
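
A minimal sketch of the reading pipeline under these assumptions, using pytesseract for OCR and the transformers summarization pipeline for DistilBART. The grayscale step, model ID, fallback, and max_length=80 mirror the description above; the word-count cutoff and helper name are illustrative.

```python
import cv2
import pytesseract
from PIL import Image
from transformers import pipeline

try:
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
except Exception:
    summarizer = None  # fall back to reading the raw text if the model fails to load

def read_frame(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)          # grayscale improves OCR
    text = pytesseract.image_to_string(Image.fromarray(gray)).strip()
    if not text:
        return "I couldn't find any readable text."
    if summarizer and len(text.split()) > 80:                    # only summarize long passages
        return summarizer(text, max_length=80, min_length=20, do_sample=False)[0]["summary_text"]
    return text
```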

System Architecture

The application follows a continuous loop architecture:

  1. Wake Word Detection: Constantly listens for "Aura" activation
  2. Intent Classification: Parses user commands for three core modes
  3. Mode Execution: Switches between detection, description, or reading
  4. Real-time Feedback: Provides voice and visual feedback simultaneously
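
As a rough illustration of steps 2 and 3, here is a small, self-contained sketch of keyword-based intent classification and mode dispatch. The keywords and placeholder handlers are assumptions standing in for Aura's actual pipelines.

```python
def classify_intent(command):
    """Simple keyword-based intent classification for the three core modes."""
    command = command.lower()
    if "detect" in command or "find" in command:
        return "detect"
    if "describe" in command:
        return "describe"
    if "read" in command:
        return "read"
    return None

# Placeholder handlers so the sketch runs on its own.
def run_detection(cmd):  print(f"[detect] {cmd}")
def run_description():   print("[describe] current scene")
def run_reading():       print("[read] text in view")

def dispatch(command):
    mode = classify_intent(command)
    if mode == "detect":
        run_detection(command)
    elif mode == "describe":
        run_description()
    elif mode == "read":
        run_reading()
    else:
        print("Sorry, I didn't catch that.")

# Example commands flowing through the loop after the wake word fires:
for cmd in ["Detect a chair", "Describe my surroundings", "Read this text"]:
    dispatch(cmd)
```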

Challenges we ran into

1. Real-Time Processing Latency

Running multiple heavy AI models simultaneously created noticeable delays. To keep the loop responsive, we implemented:

  • A reduced YOLO inference size of 416×416 for faster detection
  • 2.5-second cooldowns between voice instructions
  • Frame resizing (640×480) for consistency
  • Buffer management to prevent frame accumulation
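
A sketch of how these mitigations fit together in the capture loop, assuming OpenCV and the Ultralytics YOLO API. The 2.5-second cooldown, frame size, inference size, and confidence threshold come from the points above; the checkpoint name and loop structure are illustrative.

```python
import time
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                     # small YOLOv8 checkpoint (illustrative)
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)            # small buffer so stale frames don't accumulate

last_spoken = 0.0
COOLDOWN = 2.5                                 # seconds between voice instructions

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (640, 480))                          # consistent frame size
    results = model(frame, imgsz=416, conf=0.4, verbose=False)     # smaller inference size for speed

    if len(results[0].boxes) > 0 and time.time() - last_spoken > COOLDOWN:
        # speak(direction_cue(...)) would go here, throttled by the cooldown
        last_spoken = time.time()

cap.release()
```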

2. Speech Recognition Accuracy

Background noise affected command detection. We mitigated this with:

  • Flexible wake word matching ("aura", "ara", "ora")
  • Ambient noise adjustment before listening
  • 5-second phrase time limits
  • Confidence thresholds for validation

3. Model Resource Constraints

Loading multiple transformer models caused memory pressure and stability issues. We handled this with:

  • Graceful fallbacks when models fail
  • Sequential loading with progress indicators
  • Recreated TTS engine per call to prevent threading issues
  • Comprehensive exception handling
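
A minimal sketch of the graceful-fallback pattern, using the summarizer as an example: each heavy model is loaded sequentially with a progress message, and a failure disables that feature rather than crashing the assistant. The helper name and messages are illustrative assumptions.

```python
from transformers import pipeline

def load_optional(name, loader):
    """Load a model with a progress message; degrade gracefully on failure."""
    print(f"Loading {name}...")
    try:
        model = loader()
        print(f"{name} ready.")
        return model
    except Exception as exc:
        print(f"Could not load {name} ({exc}); that feature will be unavailable.")
        return None

summarizer = load_optional(
    "text summarizer",
    lambda: pipeline("summarization", model="sshleifer/distilbart-cnn-12-6"),
)
```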

Accomplishments that we're proud of

  • Built a fully functional, multi-mode AI assistive system from the ground up.
  • Achieved real-time object detection with integrated, intuitive directional guidance.
  • Successfully unified detection, description, and reading capabilities into a single, seamless voice-driven assistant.
  • Designed a truly hands-free experience, putting accessibility first.
  • Demonstrated strong potential for creating a real-world social impact.

Aura is not just a prototype — it is a scalable framework for assistive intelligence.

What we learned

  • Accessibility must drive AI design decisions, not be an afterthought.
  • Real-time systems demand careful performance optimization; a slow solution is not a helpful solution.
  • Model accuracy alone is not enough — usability, clarity, and reliability determine the real-world impact.
  • Responsible AI must prioritize empowerment, giving users tools to interact with the world on their own terms.

Building Aura reinforced our belief that AI can meaningfully improve independence when it is designed with a clear purpose and human need.

What's next for Aura

We are committed to evolving Aura into a comprehensive, life-changing tool. Our roadmap includes:

  • Enhanced Spatial Awareness: Adding distance estimation and obstacle detection for safer navigation.
  • Mobile Optimization: Porting the system to mobile devices for true on-the-go assistance.
  • Offline Functionality: Enabling offline speech recognition and core AI models to work anywhere, without an internet connection.
  • Global Reach: Introducing multilingual support to serve diverse communities worldwide.
  • User-Centered Iteration: Conducting structured field testing with visually impaired individuals to refine the experience based on real-world feedback.

Our long-term goal is to scale Aura into an affordable, widely accessible AI companion that enhances independence for people around the globe.
