Inspiration In India, there are over 40 million visually impaired individuals, many of whom struggle with daily independence. While navigating the digital world has become easier with screen readers, the physical world remains a black box. Simple tasks like reading a medicine label, finding a dropped key, or knowing if a street is safe to cross still require human assistance.
We asked ourselves: With the power of Multimodal GenAI (GPT-4o), why should anyone have to wait for a human volunteer to "see" for them?
We wanted to build a "Digital Eye"—an always-available, intelligent companion that doesn't just describe images but understands context, safety, and urgency for the user.
What it does Drishti-AI is a multimodal accessibility assistant that converts visual data into spoken insights in real-time. It features three distinct modes tailored for the daily lives of the visually impaired:
Scene Understanding (Safety Mode): Instantly analyzes the user's surroundings and prioritizes safety hazards (e.g., "There is a staircase 5 steps ahead" or "A dog is approaching from the left").
Intelligent OCR (Reading Mode): Reads vernacular text from anywhere—newspapers, medicine bottles, or street signs—and speaks it out loud. It doesn't just read; it explains. (e.g., "This is a Dolo-650 tablet, used for fever.")
Object Finder: A "Find My X" feature for the real world. The user asks, "Where are my keys?", and the AI scans the video feed to guide them: "Your keys are on the table, to the immediate right of the laptop."
How we built it We built Drishti-AI as a Zero-Latency MVP using:
Frontend: Streamlit for a high-contrast, accessible mobile UI with massive touch targets.
Vision Engine: OpenAI GPT-4o, which allows us to process images with near-human understanding of context and depth.
Voice Interface: OpenAI TTS (Text-to-Speech) to ensure the output is natural, empathetic, and clear, rather than robotic.
Backend Logic: Python, handling the orchestration between image capture, compression (for speed), and API calls.
Challenges we ran into Latency vs. Accuracy: GPT-4o is powerful but can be slow on mobile networks. We had to implement aggressive image compression algorithms to reduce upload times without losing the details needed for reading small text.
Hallucinations: Early versions would "invent" objects that weren't there. We solved this by using strict system prompting (e.g., "You are a safety assistant. Do not guess.") to force the model to be conservative and accurate.
Audio Feedback Loops: Since the app listens and speaks, the microphone would sometimes pick up the AI's own voice. We implemented a "stop-listen" logic to ensure clean interactions.
Accomplishments that we're proud of The "Medicine" Test: We successfully used the app to read a crumpled medicine strip and identify the drug name and expiry date correctly—something standard OCR tools failed at.
3-Second Response Time: We optimized the pipeline to get the "Time to Speak" down to under 3 seconds, which is crucial for a blind user waiting for instructions.
Accessible UI: We didn't just build an app; we built an accessible app. The buttons are 80% of the screen width, and the color contrast adheres to WCAG AAA standards.
What we learned Context is King: A blind user doesn't want a poetic description of a sunset; they want to know if there is a bench to sit on. We learned to prompt-engineer for utility over verbosity.
Audio First Design: Building for eyes-free users requires a completely different mindset. Visual loading spinners are useless; audio cues (beeps, haptics) are essential.
What's next for Drishti-AI Offline Mode: Integrating edge-based models (like YOLO) for basic obstacle detection when the internet is down.
Vernacular Voice Support: Adding support for Hindi, Telugu, and Tamil so the app can speak to users in their mother tongue.
Smart Glasses Integration: Moving the software from a phone to a wearable camera (like Raspberry Pi + Camera Module) for a truly hands-free experience.
Built With
- openai
- openai-gpt-4o-(vision)
- python
- streamlit
- voice-to-text)
- whisper
Log in or sign up for Devpost to join the conversation.