Vision for the Blind

Inspiration

I always knew that blind people had it tough, but it wasn't until I spent a month volunteering with a non-profit teaching blind kids that I truly understood their difficulties. Every day, I watched them navigate a world that wasn't designed for them. They faced challenges that I had never even considered. Simple tasks, like walking down a hallway or finding their seat in the classroom, were monumental obstacles.

One moment in particular stands out. I taught a young boy named Mohan who loved to ask questions about everything around him. His curiosity was boundless, but his inability to see often left him frustrated. I remember him asking me to describe the colors of the leaves on a tree, the pattern on a butterfly's wings, and the shape of the clouds. It broke my heart that something as basic as visual information was inaccessible to him. I wanted to do more than just describe the world to him; I wanted to help him experience it. At the same time, I was taking a course on the different types of AI models on HuggingFace and how to use them. Those models connected the dots for me, and my project "Vision for the Blind" was born!

What It Does

Vision for the Blind is an AI-powered application designed to assist visually impaired individuals by providing audio descriptions of their surroundings. Users take a picture of their surroundings, upload it, and then ask questions about it. The application uses AI models to interpret both the question and the image, and it responds with an informative spoken answer. The goal is to help visually impaired people become more independent and share in the joy of experiencing the beautiful world around us.

How I Built It

I built Vision for the Blind by integrating several advanced AI models:

Automatic Speech Recognition: I used an open-source model from HuggingFace to convert audio input into text.
Computer Vision: I called Google's Gemini Flash model through LangChain to answer questions about the uploaded picture using its computer vision capabilities.
Text-to-Audio Conversion: I used a model from ElevenLabs for speech synthesis, which quickly converts the text answer into natural, human-like speech.
User Interface: I used Streamlit to build a user-friendly GUI and deploy the app as a website.
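
The flow described above (speech to text, then image question answering, then text to speech) can be sketched roughly as follows. This is not the app's actual code: the class name `VisionPipeline` and the three stage names are hypothetical, and each stage is passed in as a plain function so the sketch makes no claims about the real HuggingFace, LangChain, or ElevenLabs calls.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is a plain callable so the sketch stays independent of any
# particular model or SDK. All names here are hypothetical.
Transcriber = Callable[[bytes], str]          # question audio -> question text
VisionAnswerer = Callable[[bytes, str], str]  # (image, question text) -> answer text
Synthesizer = Callable[[str], bytes]          # answer text -> answer audio


@dataclass
class VisionPipeline:
    transcribe: Transcriber
    answer: VisionAnswerer
    synthesize: Synthesizer

    def run(self, image: bytes, question_audio: bytes) -> bytes:
        question = self.transcribe(question_audio).strip()
        if not question:
            # Nothing intelligible was recorded, so ask the user to try again.
            return self.synthesize("Sorry, I did not catch the question. Please try again.")
        reply = self.answer(image, question)
        return self.synthesize(reply)


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without model downloads or API keys.
    demo = VisionPipeline(
        transcribe=lambda audio: "What is in front of me?",
        answer=lambda image, q: f"(answer to: {q})",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(demo.run(b"fake-image-bytes", b"fake-audio-bytes"))
```

Keeping the stages as swappable functions is one way to make a change like replacing the text-to-speech model a local edit rather than a rewrite.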

Challenges I Ran Into

I faced several challenges while coding the app:

Software Compatibility: Due to software compatibility issues on the Streamlit server, I had to switch from my original TTS model from HuggingFace to ElevenLabs, which ultimately improved performance.
Streamlit's Re-running Code: Streamlit re-runs the entire script on each user interaction, which caused many logic errors that required hours of debugging.
Automatic Speech Recognition Model: I debugged many errors related to the specific format the audio needed to be in and the additional software dependencies the model required.
Model Integration: Figuring out how to connect the three models so that each one's output fed the next was complex.

Despite these obstacles, I persevered, learning valuable problem-solving and debugging skills along the way.
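
One common way to cope with the re-running behavior described above is to store each expensive result under a key derived from its input, so a rerun with the same input reuses the stored value instead of calling the model again. The sketch below is an illustration, not the app's code: `cached_step` is a hypothetical helper, and a plain dictionary stands in for Streamlit's `st.session_state`, which supports the same dictionary-style access.

```python
import hashlib
from typing import Callable, MutableMapping


def cached_step(state: MutableMapping, name: str, payload: bytes,
                compute: Callable[[bytes], object]) -> object:
    """Run compute(payload) only if this exact payload is new for this step."""
    key = f"{name}:{hashlib.sha256(payload).hexdigest()}"
    if key not in state:
        state[key] = compute(payload)
    return state[key]


if __name__ == "__main__":
    state = {}  # in a Streamlit app this would be st.session_state
    calls = []

    def slow_transcribe(audio: bytes) -> str:
        # Stand-in for a real speech recognition call.
        calls.append(audio)
        return "what color are the leaves?"

    # Simulate three reruns of the script with the same recording.
    for _ in range(3):
        text = cached_step(state, "asr", b"same-recording", slow_transcribe)
    print(text, "| model calls:", len(calls))
```

Streamlit also provides the caching decorators `st.cache_data` and `st.cache_resource`, which may be a better fit for loading models once per server process.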

Accomplishments That I'm Proud Of

I successfully built and deployed this app, making it accessible to anyone. This is my greatest accomplishment as I fulfilled my purpose of helping students like Mohan see in some way.

What I Learned

This experience taught me a lot:

Troubleshooting: I encountered numerous errors while coding my app, but ChatGPT helped me resolve most of them, teaching me to use it efficiently for troubleshooting.
AI Models and Libraries: I learned about many complex AI models, Python libraries, and programming concepts.
Deployment: It was my first time deploying code on the Streamlit server and on GitHub to make my app publicly accessible.
Problem-Solving: Creating the app greatly tested and improved my logical and problem-solving skills.
Community Engagement: When I couldn't resolve a deployment error on the Streamlit server, I posted about it on the Streamlit community platform and received helpful responses. This taught me the importance of communicating on coding forums, a crucial skill for a full-time programmer.

What's Next for Vision for the Blind

Response Time: I want to reduce the total time it takes to generate an audio response to the user's question, so that answers feel closer to instantaneous.
Real-Time Camera Integration: Instead of requiring image uploads, I want to extend the app to connect to a camera in real time.
Voice Recognition: I want to integrate voice recognition into the app to automatically start and stop recording when the user asks a question, making it more usable for visually impaired people.
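
As a rough illustration of the automatic start and stop idea, one simple approach is energy-based endpointing: recording begins at the first frame that is louder than a threshold and ends after a short run of quiet frames. This is only a sketch under assumed values. The function names, the threshold of 0.05, and the limit of three quiet frames are all hypothetical, and a real app would likely need a more robust voice activity detector.

```python
import math
from typing import Iterable, Optional, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in the range [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_utterance(frames: Iterable[Sequence[float]],
                   threshold: float = 0.05,
                   max_silent_frames: int = 3) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first utterance, end exclusive.

    Recording "starts" at the first frame louder than threshold and "stops"
    once max_silent_frames quiet frames in a row follow it.
    """
    start = None
    last_loud = None
    silent_run = 0
    for i, frame in enumerate(frames):
        loud = rms(frame) > threshold
        if start is None:
            if loud:
                start = last_loud = i
            continue
        if loud:
            last_loud = i
            silent_run = 0
        else:
            silent_run += 1
            if silent_run >= max_silent_frames:
                break
    if start is None:
        return None
    return start, last_loud + 1


if __name__ == "__main__":
    quiet = [0.0] * 160
    loud = [0.5] * 160
    frames = [quiet] * 2 + [loud] * 4 + [quiet] * 5
    print(find_utterance(frames))
```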

Thank You!

Thank you for spending your time reading about my project. I hope you enjoyed it!

Built With

asr
computer-vision
elevenlabs
huggingface
langchain
python
streamlit
torch
transformers
tts

Updates

Adnan Barwaniwala started this project — Aug 02, 2024 04:43 PM EDT
