About the Project
I often consume videos without looking at the screen, whether I'm walking, multitasking, or doing something else. But audio alone leaves out crucial visual information. That made me realize that visually impaired users face an even greater challenge: traditional screen readers can handle on-screen text and UI elements, but they cannot describe what is happening in a video. I wanted to change that.
This project is an AI-powered assistant that narrates video visuals in real time, providing synchronized descriptions of what's happening on screen. By pairing AI-generated narration with the original audio, it lets users fully experience video content without watching. For those short on time, the AI can instead generate a full summary of the video.
To build this, I used:
- Gemini's Multimodal Live API to extract scene descriptions in real time (see the narration sketch below).
- Together AI via LlamaIndex to generate a full summary of the video (see the summary sketch below).
- The browser's Screen Capture API to capture video frames, using Element Capture to restrict the stream to the video element's DOM subtree (see the capture sketch below).
- A web frontend and Python backend to handle the data flow and playback.
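The capture sketch below shows roughly how the frontend side can work. It is a minimal sketch, not the project's actual code: it assumes Chromium's Element Capture API (`RestrictionTarget` / `restrictTo`, which are not yet in TypeScript's DOM typings, hence the ambient declarations), and the WebSocket endpoint and 1 fps sampling rate are placeholder choices for illustration.

```typescript
// Element Capture is Chromium-only and not yet in TypeScript's DOM typings,
// so declare the pieces we use.
declare class RestrictionTarget {
  static fromElement(element: Element): Promise<RestrictionTarget>;
}
interface MediaStreamTrack {
  restrictTo(target: RestrictionTarget | null): Promise<void>;
}

// Share the current tab, then restrict the track to the target element's
// subtree so only the video player appears in the stream.
async function captureElement(target: Element): Promise<MediaStream> {
  const stream = await navigator.mediaDevices.getDisplayMedia({
    // Chromium hint that nudges the picker toward the current tab.
    preferCurrentTab: true,
  } as DisplayMediaStreamOptions);
  const [track] = stream.getVideoTracks();
  const restriction = await RestrictionTarget.fromElement(target);
  await track.restrictTo(restriction);
  return stream;
}

// Sample JPEG frames from the stream and forward them to the backend.
// ws://localhost:8000/frames is a placeholder endpoint, and 1 fps is an
// arbitrary sampling rate chosen for illustration.
function streamFrames(stream: MediaStream, fps = 1): void {
  const socket = new WebSocket("ws://localhost:8000/frames");
  const video = document.createElement("video");
  video.srcObject = stream;
  void video.play();

  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d")!;

  setInterval(() => {
    if (video.videoWidth === 0) return; // no frame decoded yet
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    ctx.drawImage(video, 0, 0);
    canvas.toBlob((blob) => blob && socket.send(blob), "image/jpeg", 0.7);
  }, 1000 / fps);
}
```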
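On the backend, the narration sketch follows the google-genai SDK's documented Live API pattern. The model name, the one-turn-per-frame flow, and the exact `send()` signature are assumptions (the SDK's live interface has changed across versions), not a record of the project's actual code.

```python
import base64

from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
CONFIG = {"response_modalities": ["TEXT"]}  # ask the model to reply with text


async def narrate(frames):
    """Send JPEG frames into a live session and stream back descriptions.

    `frames` is any async iterable of raw JPEG bytes (e.g. fed from the
    frontend's WebSocket) -- a stand-in for the real data flow.
    """
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=CONFIG
    ) as session:
        async for jpeg_bytes in frames:
            # One turn per sampled frame: send the image, then read the
            # model's streamed description of it.
            await session.send(
                input={
                    "mime_type": "image/jpeg",
                    "data": base64.b64encode(jpeg_bytes).decode(),
                },
                end_of_turn=True,
            )
            async for response in session.receive():
                if response.text:
                    print(response.text, end="", flush=True)
```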
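The summary sketch is simpler. It assumes LlamaIndex's Together integration (`pip install llama-index-llms-together`); the model choice and the idea of summarizing from accumulated scene descriptions are illustrative assumptions rather than the project's exact setup.

```python
from llama_index.llms.together import TogetherLLM

llm = TogetherLLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    api_key="YOUR_TOGETHER_API_KEY",
)


def summarize(scene_descriptions: list[str]) -> str:
    """Condense per-scene descriptions into a single video summary."""
    transcript = "\n".join(scene_descriptions)
    response = llm.complete(
        "Summarize this video based on the following scene-by-scene "
        f"descriptions:\n\n{transcript}"
    )
    return response.text
```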
What I Learned
- First-time use of Gemini's Multimodal Live API: it proved powerful and handled real-time scene analysis well.
- Element Capture in the Screen Capture API: I discovered that it can capture a specific DOM element instead of the entire screen, which makes it flexible for use cases like this one.
Next Steps
I wanted to make the system more interactive: letting users ask follow-up questions, request details about specific objects, or control video playback with voice commands. Due to time constraints, I couldn't implement these features, but they are an exciting area for future improvement.
This project showcases how multimodal AI can bridge the accessibility gap, making video content more inclusive for visually impaired users and more convenient for multitaskers.