BlindTube

Our Home Page with Video Upload, Link Upload, and Audio Generation

Inspiration

Watching movies with friends who are blind made us realize how much visual detail never makes it into the usual audio-description tracks. We wanted to build something that turns any video—whether it’s a blockbuster, a vlog, or a Saturday-morning cartoon—into a richly narrated audiobook so that blind and low-vision audiences can enjoy the full story without waiting for an official audio-described release.

What it does

Takes a YouTube URL or an uploaded video file.
Extracts the core story – Gemini’s API with video parsing summarizes plot points, scene changes, character actions, and dialogue.
Generates natural-language narration that links scenes together smoothly.
Adds emotion and vocal variety – We pass the text through Hume.ai and ElevenLabs to produce a ready-to-listen audiobook track.
Outputs a single MP3 (or WAV) file you can play on any device or splice back into the original video as an alternate audio track.
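
The steps above can be sketched as a three-stage pipeline. This is a minimal illustration, not the project's actual code: the `Pipeline` class and the `fake_*` functions are hypothetical stand-ins for the Gemini summarization step, the narration step, and the Hume.ai / ElevenLabs synthesis step, and no real API is called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Chains the three stages; each stage is injected as a plain callable."""
    summarize: Callable[[str], str]     # video source -> story summary
    narrate: Callable[[str], str]       # summary -> narration script
    synthesize: Callable[[str], bytes]  # script -> encoded audio bytes

    def run(self, source: str) -> bytes:
        summary = self.summarize(source)
        script = self.narrate(summary)
        return self.synthesize(script)


# Hypothetical stand-ins so the sketch runs without any network access.
def fake_summarize(source: str) -> str:
    return f"summary of {source}"


def fake_narrate(summary: str) -> str:
    return f"Narrator: {summary}."


def fake_synthesize(script: str) -> bytes:
    # A real stage would return MP3 or WAV bytes from a TTS service.
    return script.encode("utf-8")


demo = Pipeline(fake_summarize, fake_narrate, fake_synthesize)
audio = demo.run("cartoon.mp4")
```

Keeping each stage behind a plain callable makes it easy to swap one TTS provider for another without touching the rest of the flow.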

How we built it

Backend: Python + Django handles video and audio files.
Video parsing: Gemini Vision extracts frame-level captions and scene metadata.
Narration engine and Text-To-Speech (TTS): Hume.ai and ElevenLabs convert the tagged script to high-quality speech.
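
As a rough sketch of how scene metadata might become a tagged script, the snippet below orders scenes by timestamp and prefixes each caption with an emotion tag. The `Scene` fields, the `build_tagged_script` helper, and the square-bracket tag syntax are assumptions made for illustration; they are not the actual Gemini output schema or either TTS vendor's markup.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    start_seconds: float  # where the scene begins in the video
    caption: str          # description of what happens in the scene
    emotion: str          # hypothetical emotion label, e.g. "calm"


def build_tagged_script(scenes: List[Scene]) -> str:
    """Order scenes chronologically and emit one tagged line per scene."""
    ordered = sorted(scenes, key=lambda s: s.start_seconds)
    return "\n".join(f"[{s.emotion}] {s.caption}" for s in ordered)


scenes = [
    Scene(12.0, "The fox runs away.", "tense"),
    Scene(0.0, "A fox naps in the sun.", "calm"),
]
script = build_tagged_script(scenes)
```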

Challenges we ran into

Keeping the story engaging and detailed – We tweaked the prompts so the LLM would make the story as interesting as possible.
API rate limits – We implemented the best-performing text-to-speech model, but it is too expensive to use.
Emotion markup standards – Hume and ElevenLabs use different tags, so we built a small mapping layer.
Implementation of different APIs for TTS – We implemented Google, Hume, and ElevenLabs to test which performs best. We found that Hume and ElevenLabs perform the best.
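
A mapping layer like the one mentioned above could look roughly like this. The tag names and the `vendor_a` / `vendor_b` keys are placeholders invented for the example, not the real Hume or ElevenLabs markup; the point is only the lookup-and-rewrite pattern.

```python
import re

# Hypothetical tables: canonical emotion tag -> vendor-specific tag.
CANONICAL_TO_VENDOR = {
    "vendor_a": {"calm": "soft", "tense": "urgent"},
    "vendor_b": {"calm": "relaxed", "tense": "nervous"},
}

# Matches tags written as [word]; this bracket syntax is an assumption.
TAG_PATTERN = re.compile(r"\[(\w+)\]")


def translate_tags(script: str, vendor: str) -> str:
    """Rewrite canonical tags for one vendor; unknown tags pass through."""
    table = CANONICAL_TO_VENDOR[vendor]

    def swap(match: re.Match) -> str:
        tag = match.group(1)
        return f"[{table.get(tag, tag)}]"

    return TAG_PATTERN.sub(swap, script)


translated = translate_tags("[calm] A fox naps. [tense] It runs.", "vendor_a")
```

Leaving unknown tags untouched keeps the script usable even when a vendor table is incomplete.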

Accomplishments that we're proud of

Turned a 5-minute cartoon into a highly engaging audiobook.
End-to-end pipeline (upload → MP3).

What we learned

Good narration is about context, not just describing every frame.
Voice synthesis APIs are powerful, but emotion cues make or break the final experience.
Accessibility tools also benefit people who would rather listen to an audiobook because they don't have time to watch a long movie.