Inspiration

According to the World Health Organization, approximately 2.2 billion people worldwide live with some form of vision impairment. While several assistive technologies exist today, many are expensive, limited in functionality, or fail to fully leverage recent advancements in Artificial Intelligence.

We were inspired by the opportunity to create a solution that combines multiple AI capabilities into a single affordable assistant. Even if our solution reaches a small percentage of the visually impaired community, the potential social impact is significant.

What it does

AI Assistant for Visually Impaired People is a real-time multimodal assistant that helps users better understand and interact with their surroundings through voice.

The system can detect and classify objects, read text from documents and signs, describe scenes, estimate depth, answer questions through a conversational AI interface, and retrieve live information from the internet. Users interact with the assistant completely hands-free using voice commands and receive responses through speech output.

How we built it

The system was built using a modular pipeline architecture that combines Computer Vision, Speech Processing, Natural Language Processing, and Large Language Models.

Camera frames are processed using OpenCV, object detection models, EasyOCR, and MiDaS for environmental understanding. Audio input is handled through Faster-Whisper and a wake-word listener. spaCy is used for intent classification, while Llama 4 Scout and the Groq API power scene understanding and conversational capabilities. Piper TTS converts responses into natural speech, creating a seamless voice-based experience for users.

Challenges we ran into

Building AI systems in controlled environments is relatively straightforward, but deploying them in real-world scenarios introduces significant challenges. Lighting conditions, background noise, camera movement, and unfamiliar objects can all affect system performance.

Since the project is intended to assist visually impaired individuals, reliability and safety were critical concerns. We had to carefully balance real-time performance with accuracy while ensuring the system avoids providing misleading information when confidence levels are low.

Another challenge was coordinating multiple AI models simultaneously while maintaining low latency. Achieving near real-time performance required optimization, parallel processing, and efficient task routing.

Accomplishments that we're proud of

  • Successfully integrating object detection, OCR, scene understanding, depth estimation, speech recognition, and conversational AI into a single assistant.
  • Building a cost-effective solution using mostly open-source technologies.
  • Achieving near real-time performance through a modular and parallelized architecture.
  • Developing a practical system that can provide meaningful assistance in everyday scenarios.
  • Being shortlisted by the National Incubation Center (NIC) Karachi, validating both the innovation and potential impact of our solution.

What we learned

This project taught us much more than individual technologies. We learned how to design end-to-end AI pipelines, integrate multiple models into a unified system, and optimize them for real-world deployment.

We also gained hands-on experience with Computer Vision, OCR, Speech-to-Text, Text-to-Speech, Intent Classification, Large Language Models, and concurrent processing. Most importantly, we learned how to transform an idea into a working product capable of solving a real-world problem.

What's next for AI Assistant For Visually Impaired People

We plan to continue improving the assistant in several ways:

  • Multilingual Support: Enable the assistant to communicate in multiple languages, making it accessible to a much wider audience.
  • Voice Authentication: Introduce speaker recognition based on voice characteristics, allowing the system to respond only to authorized users while reducing the impact of environmental conversations and background noise.
  • Person Recognition: Extend the vision system to recognize known individuals. Instead of simply announcing "person on the left" or "person ahead," the assistant could identify familiar people by name (e.g., "Ali is in front of you") and label unknown individuals as strangers.
  • Wearable Deployment: Transition from desktop-based hardware to lightweight wearable devices for greater mobility and convenience.

Built With

Share this project:

Updates