Inspiration
Watching movies with friends who are blind made us realize how much visual detail never makes it into the usual audio-description tracks. We wanted to build something that turns any video—whether it’s a blockbuster, a vlog, or a Saturday-morning cartoon—into a richly narrated audiobook so that blind and low-vision audiences can enjoy the full story without waiting for an official audio-described release.
What it does
- Takes a YouTube URL or an uploaded video file.
- Extracts the core story – Gemini’s API with video parsing summarizes plot points, scene changes, character actions, and dialogue.
- Generates natural-language narration that links scenes together smoothly.
- Adds emotion and vocal variety – We pass the text through Hume.ai and ElevenLabs to produce a ready-to-listen audiobook track.
- Outputs a single MP3 (or WAV) file you can play on any device or splice back into the original video as an alternate audio track.
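The steps above can be sketched as a chain of stages. This is a minimal illustration, not our actual code: `Pipeline`, `is_youtube_url`, and the four stage names are hypothetical, and each stage is passed in as a plain function so that real Gemini, Hume, or ElevenLabs clients could be plugged in without changing the flow.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.parse import urlparse


def is_youtube_url(source: str) -> bool:
    """Return True if `source` looks like a YouTube link rather than a file path."""
    host = (urlparse(source).hostname or "").lower()
    return host in {"youtu.be", "youtube.com", "www.youtube.com", "m.youtube.com"}


@dataclass
class Pipeline:
    # Hypothetical stage signatures; real API clients would be wrapped to match them.
    fetch_video: Callable[[str], str]      # YouTube URL -> local video path
    extract_story: Callable[[str], str]    # video path -> story summary
    write_narration: Callable[[str], str]  # summary -> narration script
    synthesize: Callable[[str], bytes]     # script -> encoded audio bytes

    def run(self, source: str) -> bytes:
        # Only download when the input is a link; uploaded files are used as-is.
        video_path = self.fetch_video(source) if is_youtube_url(source) else source
        summary = self.extract_story(video_path)
        script = self.write_narration(summary)
        return self.synthesize(script)
```

Keeping the stages injectable also makes it easy to swap one TTS provider for another when comparing output quality.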
How we built it
- Backend: Python + Django handles video and audio files.
- Video parsing: Gemini Vision extracts frame-level captions and scene metadata.
- Narration engine and Text-To-Speech (TTS): Hume.ai and ElevenLabs convert the tagged script to high-quality speech.
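As a rough sketch of how scene metadata might become a tagged script for the TTS step: the `Scene` fields and the `[emotion]` tag format below are assumptions made for illustration, not the actual shape of Gemini's output or of the script we send to Hume.ai and ElevenLabs.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    start_s: float             # scene start time in seconds
    caption: str               # what happens in the scene
    emotion: str = "neutral"   # hypothetical emotion label for the narrator


def build_tagged_script(scenes: list[Scene]) -> str:
    """Order scenes by start time and emit one tagged narration line per scene."""
    lines = []
    for scene in sorted(scenes, key=lambda s: s.start_s):
        lines.append(f"[{scene.emotion}] {scene.caption.strip()}")
    return "\n".join(lines)
```

Sorting by start time keeps the narration in story order even if scene metadata arrives out of sequence.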
Challenges we ran into
- Keeping the story engaging and detailed – We tweaked the LLM prompts to make the story as interesting as possible.
- API rate limits and cost – We implemented the best-performing text-to-speech model, but it is too expensive to use.
- Emotion markup standards – Hume and ElevenLabs use different tags, so we built a small mapping layer.
- Implementing different APIs for TTS – We implemented Google, Hume, and ElevenLabs to test which performs best. We found that Hume and ElevenLabs perform the best.
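A mapping layer like the one mentioned above for emotion markup could look roughly like this. The provider names, tag names, and the `[tag]` syntax are placeholders chosen for illustration; they are not the real Hume or ElevenLabs vocabularies.

```python
import re

# Hypothetical internal tag -> provider-specific tag (placeholder values only).
TAG_MAP = {
    "provider_a": {"happy": "joy", "sad": "sadness"},
    "provider_b": {"happy": "cheerful", "sad": "sorrowful"},
}

# Matches tags written as [word] in the narration script.
TAG_PATTERN = re.compile(r"\[(\w+)\]")


def translate_tags(script: str, provider: str) -> str:
    """Rewrite every [tag] in `script` using the given provider's vocabulary.

    Tags with no mapping are left unchanged so nothing is silently dropped.
    """
    mapping = TAG_MAP[provider]
    return TAG_PATTERN.sub(
        lambda m: f"[{mapping.get(m.group(1), m.group(1))}]", script
    )
```

For example, `translate_tags("[happy] Hello.", "provider_a")` yields `"[joy] Hello."`, so the narration script only has to be written once in the internal vocabulary.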
Accomplishments that we're proud of
- Turned a 5-minute cartoon into a highly engaging audiobook.
- Built an end-to-end pipeline (upload → MP3).
What we learned
- Good narration is about context, not just describing every frame.
- Voice synthesis APIs are powerful, but emotion cues make or break the final experience.
- Accessibility tools also benefit people who would rather listen to an audiobook because they don't have time to watch a long movie.
What's next for BlindTube
- Multi-voice casting (different speakers for characters and narrator).
- Mobile app with AirPods-friendly playback controls.
- Lower API costs by training and fine-tuning our own model.
Built With
- elevenlabs
- gemini
- html5
- hume.ai
- javascript
- python
