Transcribe Video to Text

Inspiration

I kept hitting the same wall: I'd record an interview, a lecture, or a long video call and then lose hours scrubbing through it to find the one quote I needed. Existing transcription tools were either expensive, locked behind desktop installs, or quietly used my uploads to train their models. I wanted something I'd actually trust and want to use — fast, accurate, browser-based, and private by default. So I built it.

What it does

Transcribe Video turns any video or audio file into accurate, timestamped text right in the browser — no install required. You upload a file and get back a transcript with:

Up to 99% accuracy with automatic speaker identification
Word-level timestamps with click-to-play (jump to any moment in the media)
100+ languages with automatic detection
Export to TXT, PDF, SRT, and VTT
Support for 20+ formats (MP4, MOV, WebM, MP3, WAV, M4A, and more)
A privacy guarantee: your files are never used for model training

It's built for podcasters, journalists, researchers, students, educators, and video creators who need clean text out of spoken content fast.

How I built it

The frontend is a Next.js app that handles file upload and the interactive transcript editor, where word-level timestamps are mapped to a media player for click-to-play. Uploaded media is processed by a speech-to-text model that returns word-level timing and speaker diarization, which I normalize into a single transcript format and render into TXT, PDF, SRT, or VTT on export. Files are processed to produce the transcript and are never used for model training.
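To make the "single transcript format" idea concrete, here is a minimal sketch of how word-level output could be normalized and rendered to SRT and VTT. It is an illustration under assumptions, not the project's actual code: the Word and Cue shapes and the function names formatTimestamp, wordsToCue, toSrt, and toVtt are all hypothetical.

```typescript
// Hypothetical normalized shapes; field names are illustrative assumptions.
interface Word {
  text: string;
  start: number; // seconds from the start of the media
  end: number; // seconds from the start of the media
  speaker?: string; // diarization label, if the model provides one
}

interface Cue {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to at least `width` digits.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// SRT writes HH:MM:SS,mmm and WebVTT writes HH:MM:SS.mmm.
// Rounding to whole milliseconds first avoids values like "59.9995" seconds.
function formatTimestamp(seconds: number, msSeparator: "," | "."): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${msSeparator}${pad(ms, 3)}`;
}

// Collapse a run of words into one cue spanning first start to last end.
function wordsToCue(words: Word[]): Cue {
  if (words.length === 0) {
    throw new Error("cannot build a cue from zero words");
  }
  return {
    start: words[0].start,
    end: words[words.length - 1].end,
    text: words.map((w) => w.text).join(" "),
  };
}

// SRT: numbered cues separated by a blank line.
function toSrt(cues: Cue[]): string {
  return cues
    .map(
      (c, i) =>
        `${i + 1}\n${formatTimestamp(c.start, ",")} --> ${formatTimestamp(c.end, ",")}\n${c.text}\n`
    )
    .join("\n");
}

// WebVTT: "WEBVTT" header, a blank line, then cues separated by blank lines.
function toVtt(cues: Cue[]): string {
  const body = cues
    .map(
      (c) =>
        `${formatTimestamp(c.start, ".")} --> ${formatTimestamp(c.end, ".")}\n${c.text}\n`
    )
    .join("\n");
  return `WEBVTT\n\n${body}`;
}
```

Keeping times as plain seconds in the normalized format and converting to strings only at export means the same data can feed the editor, the player, and every export target.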

Challenges I ran into

Word-level timestamp alignment. Getting timestamps accurate enough that click-to-play lands on the exact word, not just "close to the right spot," took real tuning, especially across languages with different word boundaries.
Speaker identification. Separating speakers on noisy, overlapping audio is genuinely hard, and balancing accuracy against speed was a constant trade-off.
Format sprawl. Reliably handling 20+ input formats and exporting clean SRT/VTT (strict timestamp syntax, plus line-length conventions that keep captions readable) meant a lot of edge cases.
Privacy without compromise. I had to design the pipeline so files are never used for training while keeping processing fast.
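As an example of the kind of edge case caption export involves, the sketch below groups word-level output into cues that respect a line-length limit. The defaults of 42 characters per line and two lines per cue are common subtitling conventions assumed here, not values taken from this project, and groupWordsIntoCues is a hypothetical name.

```typescript
// Hypothetical shapes for illustration only.
interface TimedWord {
  text: string;
  start: number; // seconds
  end: number; // seconds
}

interface CaptionCue {
  start: number;
  end: number;
  lines: string[];
}

// Greedily pack words into lines of at most `maxCharsPerLine` characters and
// cues of at most `maxLines` lines. A single word longer than the limit still
// gets a line of its own rather than being split.
function groupWordsIntoCues(
  words: TimedWord[],
  maxCharsPerLine: number = 42,
  maxLines: number = 2
): CaptionCue[] {
  const cues: CaptionCue[] = [];
  let current: CaptionCue | null = null;

  for (const word of words) {
    if (current === null) {
      current = { start: word.start, end: word.end, lines: [word.text] };
      continue;
    }
    const lastIndex = current.lines.length - 1;
    const candidate = `${current.lines[lastIndex]} ${word.text}`;
    if (candidate.length <= maxCharsPerLine) {
      // The word fits on the current line.
      current.lines[lastIndex] = candidate;
      current.end = word.end;
    } else if (current.lines.length < maxLines) {
      // Start a new line inside the same cue.
      current.lines.push(word.text);
      current.end = word.end;
    } else {
      // The cue is full: close it and open a new one at this word.
      cues.push(current);
      current = { start: word.start, end: word.end, lines: [word.text] };
    }
  }
  if (current !== null) cues.push(current);
  return cues;
}
```

A production version would likely also break on sentence boundaries, speaker changes, and long pauses. This sketch handles only the length constraint.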

Accomplishments that I'm proud of

Reaching up to 99% accuracy while keeping the whole thing browser-based and install-free.
Click-to-play word-level timestamps that make a transcript feel alive instead of like a wall of text.
Genuine privacy as a default, not an upsell.
100+ language support with automatic detection.
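The click-to-play behavior listed above comes down to two small operations on the word list: seeking the player to a word's start time, and finding which word is active at the current playback time. Here is a minimal sketch, assuming words are sorted by start time. The names SeekablePlayer, seekToWord, and findWordIndexAt are hypothetical. The only real API relied on is the currentTime property (in seconds) that HTML audio and video elements expose.

```typescript
// Hypothetical word shape for illustration.
interface PlayableWord {
  text: string;
  start: number; // seconds
  end: number; // seconds
}

// Structural stand-in for an HTMLMediaElement, which has a writable
// `currentTime` property measured in seconds.
interface SeekablePlayer {
  currentTime: number;
}

// Click-to-play: jump the player to the moment the clicked word begins.
function seekToWord(player: SeekablePlayer, word: PlayableWord): void {
  player.currentTime = word.start;
}

// Binary search for the last word whose start is at or before `time`.
// Returns -1 when `time` is before the first word. During a pause between
// words, the previous word stays selected.
function findWordIndexAt(words: PlayableWord[], time: number): number {
  let lo = 0;
  let hi = words.length - 1;
  let result = -1;
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (words[mid].start <= time) {
      result = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return result;
}
```

In a browser, findWordIndexAt could be called from a timeupdate listener to highlight the current word. The binary search keeps that cheap even for transcripts with tens of thousands of words.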

What I learned

The hard part of transcription isn't getting "good enough" text. It's the last mile: accurate timing, clean speaker labels, correct export formatting, and a UI that lets people use the transcript, not just read it. I also learned how much trust matters; "we don't train on your files" turns out to be a feature people care about as much as accuracy.