💡 Inspiration

Traditional screen readers often fail to convey the "soul" of an interface. They read lists of labels but miss the layout, the imagery, and the hierarchy that sighted users take for granted. We built Visionary to bridge this "semantic gap." Our goal was an assistant that doesn't just read text but understands the context of the entire desktop, helping visually impaired users navigate complex sites like Amazon or YouTube by describing them as a human would.

👁️ What it does

Visionary is a system-wide AI assistant for the visually impaired. It performs three core functions:

- Multimodal Vision: It "sees" the screen using Amazon Nova Pro, identifying buttons, images, and complex layouts.
- Conversational Querying: Users can ask voice questions like "What's the price of the first item?" or "Summarize this email."
- Neural Narration: It responds using Amazon Polly's human-like neural voices, providing a natural, fluid interaction rather than a robotic one.

🛠️ How we built it

We engineered a high-performance sequential speech-to-speech (STS) loop:

- The Vision Engine: Built with PyAutoGUI and Pillow to capture the entire Fedora desktop environment.
- The Brain: Powered by Amazon Bedrock, specifically the amazon.nova-pro-v1 model, chosen for its strong multimodal reasoning.
- The Ears: A local OpenAI Whisper instance transcribes user intent with very low latency.
- The Mouth: Amazon Polly (Neural Engine) generates the final audio output.
- Live Dashboard: We used Streamlit to create a real-time monitor showing the AI's visual capture and narration logs side by side.

🚧 Challenges we ran into

- Synchronizing the Loop: We had to carefully manage the app's conversational cadence to prevent the AI from talking over the user or capturing its own voice as input.
- Linux Audio Drivers: Configuring PyAudio and FFmpeg on Fedora 41 required significant low-level troubleshooting of ALSA and PulseAudio.
- Cloud Latency: Sending high-resolution images to the cloud can be slow.
We solved this by optimizing image compression and implementing a sequential processing flow that prioritizes the user's voice commands.

🏆 Accomplishments that we're proud of

- System-Wide Integration: Unlike browser extensions, Visionary works across any application, from the terminal to VS Code.
- High-Detail Reasoning: We successfully tuned the AI to ignore "visual noise" and focus on actionable elements, making the narration genuinely useful.
- Production-Ready Structure: We moved from a simple hackathon script to a modular architecture with a dedicated real-time monitoring dashboard.

🧠 What we learned

We gained deep insights into multimodal prompt engineering and the intricacies of Amazon Bedrock. We learned how to handle asynchronous audio streams and the importance of "visual diffing" in accessibility tools. Most importantly, we learned that empathy-driven design, building for a user who cannot see, completely changes how you structure software logic.

🚀 What's next for Visual-Semantic-Screen Reader

The roadmap for Visionary is focused on privacy and efficiency:

- Local PII Redaction: Implement a local computer-vision model to blur sensitive information (such as passwords) before it ever leaves the machine.
- Delta Capturing: Optimize AWS costs by analyzing the screen only when a significant visual change (Δ) is detected, determined by comparing the pixel variance between consecutive frames against a defined threshold.
- Specialized App Profiles: Create custom AI "lenses" for specific tools like Gmail, Excel, or IDEs to provide even more granular assistance.
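The Vision Engine and the compression fix described above can be sketched roughly as follows. This is an illustrative helper, not the project's actual code: the function name, target width, and JPEG quality are our assumptions, and the PyAutoGUI import is deferred so the compression path also works without a display server.

```python
import io

from PIL import Image


def compress_screenshot(img=None, max_width=1280, quality=60):
    """Downscale and JPEG-compress a screenshot before cloud upload.

    If no image is supplied, grab the full desktop with PyAutoGUI
    (imported lazily, since it requires a running display server).
    """
    if img is None:
        import pyautogui  # needs an X11/Wayland session on Fedora
        img = pyautogui.screenshot()  # returns a Pillow Image
    if img.width > max_width:
        ratio = max_width / img.width
        img = img.resize((max_width, int(img.height * ratio)))
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```

Shrinking a 1920×1080 desktop grab to 1280 pixels wide before upload cuts both payload size and round-trip time to the model.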
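The call into "The Brain" can be sketched with the Bedrock Converse API, which accepts image and text content blocks in a single user message. The function names and prompt are ours, and the exact model identifier varies by account and region (shown here as `amazon.nova-pro-v1:0`); the boto3 import is deferred so the payload builder runs without AWS access.

```python
MODEL_ID = "amazon.nova-pro-v1:0"  # region-dependent Nova Pro identifier


def build_vision_message(jpeg_bytes, question):
    """Pair a compressed screenshot with the user's spoken question."""
    return {
        "role": "user",
        "content": [
            {"image": {"format": "jpeg", "source": {"bytes": jpeg_bytes}}},
            {"text": question},
        ],
    }


def describe_screen(jpeg_bytes, question):
    """Send one screen + question turn to Nova Pro and return its narration."""
    import boto3  # lazy: only needed when actually calling AWS

    client = boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId=MODEL_ID,
        messages=[build_vision_message(jpeg_bytes, question)],
    )
    return resp["output"]["message"]["content"][0]["text"]
```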
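"The Mouth" maps onto Polly's `SynthesizeSpeech` operation with `Engine="neural"`. A minimal sketch; the voice choice and helper names are illustrative, not the project's configuration:

```python
def polly_request(text, voice_id="Joanna"):
    """Parameters for Polly SynthesizeSpeech using the neural engine."""
    return {
        "Text": text,
        "VoiceId": voice_id,    # illustrative voice choice
        "OutputFormat": "mp3",
        "Engine": "neural",     # neural voices sound far less robotic
    }


def narrate(text):
    """Synthesize narration audio; returns raw MP3 bytes."""
    import boto3  # lazy: only needed when actually calling AWS

    polly = boto3.client("polly")
    resp = polly.synthesize_speech(**polly_request(text))
    return resp["AudioStream"].read()
```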
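The "synchronizing the loop" challenge comes down to making each turn strictly sequential: the microphone is only open while listening, and speech only plays after the model reply is complete, so the assistant can never transcribe its own narration. A toy sketch of one such turn, with the four stages injected as callables (the structure is ours, not the project's exact code):

```python
def sts_turn(listen, capture, think, speak, log=None):
    """Run one sequential speech-to-speech turn.

    Each stage finishes before the next begins: listen -> capture ->
    think -> speak. Because `speak` never overlaps `listen`, the
    assistant cannot capture its own voice as input.
    """
    question = listen()              # mic open only here
    frame = capture()                # screenshot after the user stops talking
    answer = think(frame, question)  # multimodal model call
    speak(answer)                    # audio out; mic stays closed
    if log is not None:
        log.extend(["listen", "capture", "think", "speak"])
    return answer
```

In the real app the stages would wrap Whisper, the screen grabber, Bedrock, and Polly respectively; the strict ordering is what the write-up calls the loop's cadence.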
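The planned Delta Capturing could be prototyped with Pillow alone. This sketch uses the mean per-channel pixel difference between consecutive frames as a simple proxy for the variance check described in the roadmap; the threshold value is illustrative.

```python
from PIL import Image, ImageChops, ImageStat


def significant_change(prev, curr, threshold=5.0):
    """Return True when two frames differ enough to warrant a new analysis.

    Computes the mean absolute per-pixel difference (0-255 scale) and
    compares it against `threshold`; only frames above it get sent to
    the cloud, which caps Bedrock spend on a static screen.
    """
    diff = ImageChops.difference(prev.convert("RGB"), curr.convert("RGB"))
    mean_diff = sum(ImageStat.Stat(diff).mean) / 3  # average over R, G, B
    return mean_diff > threshold
```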
Built With
- python 3.13
- amazon-web-services
- api gateway
- aws-bedrock-(nova-multimodal act)
- fedora
- lambda
- localhost.run
- nova
- nova-2-sonic
- playwright
- pyaudio
- pyautogui
- sounddevice
- streamlit