VidBuddy — AI Audio Descriptions for Blind & Low-Vision Viewers
The Problem
An estimated 300 million people worldwide are blind or have low vision, yet most online video is built for sighted viewers.
When a teacher points at a whiteboard, a presenter switches slides, or a character reacts silently — that visual moment is invisible to anyone relying on audio. Existing captions only transcribe speech. They do nothing for the visual context that sighted viewers take for granted.
Audio Description fills that gap. A narrator quietly describes what is happening on screen during natural pauses in dialogue:
🎙 "Sarah just walked in. She looks a bit tired, but she's smiling at you from the left corner."
Creating AD tracks manually is slow and expensive. VidBuddy automates the entire workflow — from raw MP4 to a fully described, accessible video — without a backend server.
What is Audio Description?
Audio Description (AD) is spoken narration describing what is happening on screen, for audience members who are blind or have low vision. It usually takes the form of a second audio track and is available on TV, on streaming services, and in movie theaters.
The narration is timed to fit within silent parts of the video, so it does not overlap the dialogue and does not increase the length of the program, unlike pausing the video to read out a description, which would break the viewing experience.
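To make the timing constraint concrete, consider the arithmetic: at a narration pace of roughly 2.5 words per second, a 4-second silent gap holds about 10 words. A minimal sketch of that fit check (the pace constant and function names are illustrative assumptions, not VidBuddy internals):

```ts
// Illustrative numbers only: narration pace varies by voice and language.
const WORDS_PER_SECOND = 2.5; // assumed average pace for the example

// Estimate how long a description takes to narrate, in seconds.
function narrationSeconds(description: string): number {
  const words = description.trim().split(/\s+/).length;
  return words / WORDS_PER_SECOND;
}

// Does the description fit inside a silent gap of gapSeconds?
function fitsInGap(description: string, gapSeconds: number): boolean {
  return narrationSeconds(description) <= gapSeconds;
}

fitsInGap("Sarah walks in, tired but smiling.", 4); // true: 6 words ≈ 2.4 s
```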
Providing audio description tracks is a legal requirement in several jurisdictions, including the USA, the EU, Canada, and Australia, and demand for AD-compliant content is accelerating as regulations tighten.
How VidBuddy Uses AI to Help
VidBuddy leverages AI to assist the AD authoring process end to end:
- Scene analysis — Azure Content Understanding generates a visual description for each shot and transcribes all dialogue
- Silent gap detection — VidBuddy identifies windows where narration can be inserted without overlapping speech (see the gap-detection sketch after this list)
- Description rewriting — Azure OpenAI (or Gemini/Kimi) rewrites raw descriptions to fit precisely within each silent window (see the rewrite sketch below)
- Human review — the AI-generated script is presented to an AD editor as a draft to review, correct, and approve before any audio is synthesized
- Voice synthesis + export — Azure Neural TTS renders the final narration, and WebAssembly FFmpeg mixes it into a downloadable described MP4 (see the mixing sketch below)
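To illustrate the gap-detection step, here is a minimal sketch. It assumes the transcription step yields speech segments with start/end timestamps in seconds; the type names and the 1.5-second minimum are assumptions for the example, not VidBuddy's actual thresholds:

```ts
interface SpeechSegment { start: number; end: number } // seconds
interface SilentGap { start: number; end: number; duration: number }

// Find windows with no dialogue that are long enough to hold narration.
function findSilentGaps(
  segments: SpeechSegment[],
  videoDuration: number,
  minGap = 1.5,
): SilentGap[] {
  const sorted = [...segments].sort((a, b) => a.start - b.start);
  const gaps: SilentGap[] = [];
  let cursor = 0; // end of the last speech seen so far

  for (const seg of sorted) {
    if (seg.start - cursor >= minGap) {
      gaps.push({ start: cursor, end: seg.start, duration: seg.start - cursor });
    }
    cursor = Math.max(cursor, seg.end); // handles overlapping segments
  }
  // Trailing silence after the final line of dialogue.
  if (videoDuration - cursor >= minGap) {
    gaps.push({ start: cursor, end: videoDuration, duration: videoDuration - cursor });
  }
  return gaps;
}
```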
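The rewriting step can be pictured as a length-budgeted prompt. The sketch below targets Azure OpenAI's chat-completions REST endpoint; the resource and deployment placeholders, the api-version string, and the prompt wording are illustrative assumptions rather than VidBuddy's actual prompt:

```ts
// Hypothetical helper: rewrite a raw scene description so its narration
// fits a silent gap. The endpoint shape follows Azure OpenAI's REST API, but
// <resource>, <deployment>, and the prompt are placeholders for the example.
async function rewriteToFit(
  raw: string,
  gapSeconds: number,
  apiKey: string,
): Promise<string> {
  const wordBudget = Math.floor(gapSeconds * 2.5); // same pace assumption as above
  const res = await fetch(
    "https://<resource>.openai.azure.com/openai/deployments/<deployment>" +
      "/chat/completions?api-version=2024-06-01",
    {
      method: "POST",
      headers: { "api-key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify({
        messages: [
          { role: "system", content: "You write concise audio descriptions for blind and low-vision viewers." },
          { role: "user", content: `Rewrite in at most ${wordBudget} words, present tense:\n${raw}` },
        ],
      }),
    },
  );
  const data = await res.json();
  return data.choices[0].message.content;
}
```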
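Finally, the export runs FFmpeg in the browser, which is what lets the whole pipeline work without a backend server. A sketch assuming the @ffmpeg/ffmpeg 0.12+ WebAssembly API; the amix filter graph is one standard way to blend narration into the original soundtrack, not necessarily the exact graph VidBuddy uses:

```ts
import { FFmpeg } from "@ffmpeg/ffmpeg";

// Mix a synthesized narration track into the original video, entirely client-side.
async function mixNarration(video: Uint8Array, narration: Uint8Array): Promise<Uint8Array> {
  const ffmpeg = new FFmpeg();
  await ffmpeg.load(); // downloads the wasm core on first use

  await ffmpeg.writeFile("input.mp4", video);
  await ffmpeg.writeFile("narration.mp3", narration);

  await ffmpeg.exec([
    "-i", "input.mp4",
    "-i", "narration.mp3",
    // Blend original audio with narration; keep the video stream untouched.
    "-filter_complex", "[0:a][1:a]amix=inputs=2:duration=first[aout]",
    "-map", "0:v", "-map", "[aout]",
    "-c:v", "copy",
    "output.mp4",
  ]);

  return (await ffmpeg.readFile("output.mp4")) as Uint8Array;
}
```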
We believe that making AD authoring faster, and therefore less expensive, will lead to more inclusive content being created, benefiting the 300 million people worldwide who are blind or have low vision.
Who This Is For
| Role | How they use VidBuddy |
|---|---|
| The viewer | A blind or low-vision person who receives the exported described video and can now follow it with full context |
| The creator | A teacher, journalist, media team, nonprofit, or content owner who runs the VidBuddy studio to generate, review, and export the described version |
| The evaluator | A hackathon judge or accessibility reviewer exploring the automated → human-review → export pipeline |
The viewer never needs to open the studio; the creator uses VidBuddy to produce the described version of the video for them.
