Inspiration

Watching movies with friends who are blind made us realize how much visual detail never makes it into the usual audio-description tracks. We wanted to build something that turns any video—whether it’s a blockbuster, a vlog, or a Saturday-morning cartoon—into a richly narrated audiobook so that blind and low-vision audiences can enjoy the full story without waiting for an official audio-described release.

What it does

  1. Takes a YouTube URL or an uploaded video file.
  2. Extracts the core story – Gemini’s API with video parsing summarizes plot points, scene changes, character actions, and dialogue.
  3. Generates natural-language narration that links scenes together smoothly.
  4. Adds emotion and vocal variety – We pass the text through Hume.ai and ElevenLabs to produce a ready-to-listen audiobook track.
  5. Outputs a single MP3 (or WAV) file you can play on any device or splice back into the original video as an alternate audio track.
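
The five steps above can be read as a chain of stages. The sketch below is only an illustration of that shape, not the project's actual code: `Scene`, `run_pipeline`, and the three stage parameters are hypothetical names, and the real Gemini, Hume, and ElevenLabs calls are left as functions the caller supplies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scene:
    """One summarized scene; the fields are illustrative, not the project's schema."""
    index: int
    summary: str


def run_pipeline(
    source: str,
    summarize: Callable[[str], List[Scene]],
    narrate: Callable[[List[Scene]], str],
    synthesize: Callable[[str], bytes],
    out_path: str,
) -> str:
    """Chain the stages: source video -> scenes -> narration script -> audio file."""
    scenes = summarize(source)       # step 2: video understanding (assumed to wrap Gemini)
    script = narrate(scenes)         # step 3: narration that links the scenes together
    audio = synthesize(script)       # step 4: expressive text-to-speech
    with open(out_path, "wb") as f:  # step 5: write a single audio file
        f.write(audio)
    return out_path
```

Passing the stages in as plain functions makes it possible to swap one TTS provider for another, or to run the whole chain with stand-in stages during testing, without touching the pipeline itself.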

How we built it

  • Backend: Python + Django handles video and audio files.
  • Video parsing: Gemini Vision extracts frame-level captions and scene metadata.
  • Narration engine and Text-To-Speech (TTS): Hume.ai and ElevenLabs convert the tagged script to high-quality speech.

Challenges we ran into

  • Keeping the story engaging and detailed – We tweaked the LLM prompts to make the story as interesting as possible.
  • API rate limits – We implemented the best-performing text-to-speech model, but it is too expensive to use.
  • Emotion markup standards – Hume and ElevenLabs use different tags, so we built a small mapping layer.
  • Implementing different APIs for TTS – We implemented Google, Hume, and ElevenLabs to test which performs best, and found that Hume and ElevenLabs perform best.
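
A mapping layer like the one mentioned above could look roughly like this. It is a minimal sketch under stated assumptions: the script marks emotions with a made-up `{emotion}` prefix, one provider is assumed to accept bracketed inline tags, and the other is assumed to take a free-text delivery hint per utterance. The tag strings, the dictionary keys, and all function names are hypothetical and should be checked against each provider's documentation.

```python
import re

# Neutral emotion labels used in our own script, e.g. "{excited} The rocket lifts off!"
# The per-provider renderings below are illustrative assumptions, not documented syntax.
INLINE_TAGS = {"excited": "[excited]", "sad": "[sad]", "calm": "[calm]"}
DESCRIPTIONS = {"excited": "excited, fast-paced", "sad": "sad, subdued", "calm": "calm, even"}

# Matches "{emotion}" followed by the text up to the next "{".
SEGMENT = re.compile(r"\{(\w+)\}\s*([^{]+)")


def parse_script(script):
    """Return a list of (emotion, text) pairs from a '{emotion} text' script."""
    return [(emotion, text.strip()) for emotion, text in SEGMENT.findall(script)]


def to_inline_tags(segments):
    """Render segments as one string with a bracketed tag before each line."""
    parts = []
    for emotion, text in segments:
        tag = INLINE_TAGS.get(emotion, "")  # unknown emotions fall back to no tag
        parts.append(f"{tag} {text}".strip())
    return " ".join(parts)


def to_described_utterances(segments):
    """Render segments as dicts pairing each line with a free-text delivery hint."""
    return [
        {"text": text, "description": DESCRIPTIONS.get(emotion, "neutral")}
        for emotion, text in segments
    ]
```

Keeping one neutral label set in the script and translating at the last step means the narration prompt does not need to know which TTS provider will read the text.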

Accomplishments that we're proud of

  • Turned a 5-minute cartoon into a highly engaging audiobook.
  • End-to-end pipeline (upload → MP3).

What we learned

  • Good narration is about context, not just describing every frame.
  • Voice synthesis APIs are powerful, but emotion cues make or break the final experience.
  • Accessibility tools also benefit people who would rather listen to an audiobook because they don't have time to watch a long movie.

What's next for BlindTube

  • Multi-voice casting (different speakers for characters and narrator).
  • Mobile app with AirPods-friendly playback controls.
  • Lower API costs by training and fine-tuning our own model.
