Inspiration

As content creators, students, and professionals, we've all struggled with the tedious task of manually creating subtitles for videos. Traditional solutions either cost money, require complex server setups, or compromise user privacy by uploading large video files to third-party servers.

We were inspired to build a ** cost-effective subtitle generator** that processes videos entirely in the browser. The breakthrough came when we discovered FFmpeg.wasm—the ability to run full video processing in the browser without any server infrastructure! Combined with Groq's free Whisper API, we could build a truly free and accessible subtitle generation tool.

What it does

Building this project taught us:

  1. Client-Side Processing Power: We learned how powerful WebAssembly can be—FFmpeg.wasm allows us to extract and compress audio from 200MB+ videos entirely in the browser, reducing upload sizes by 95% while maintaining privacy.

  2. Serverless Architecture: We mastered Next.js 14's App Router, serverless functions, and how to design APIs that scale without dedicated servers. The streaming API architecture allows real-time progress updates, making long transcriptions feel instant.

  3. Dual-Model Optimization: By implementing parallel processing with two Groq Whisper models (turbo and standard), we achieved 2x throughput and automatic failover—critical for handling rate limits and server errors gracefully.

  4. Performance Engineering: We implemented adaptive chunking, parallel uploads, Web Workers for non-blocking processing, and aggressive cleanup strategies. These optimizations reduced processing time by 40% and enabled deployment on Vercel's free tier.

  5. User Experience: Real-time streaming transcription, progress indicators, and format flexibility (SRT, VTT, TXT) make the tool production-ready. We learned that users appreciate transparency—showing exactly what's happening at each stage builds trust.

How we built it

System Architecture Diagram

Architecture Overview:

User uploads video (200MB)
  ↓
FFmpeg.wasm processes in browser (2-4 min)
  - Extracts audio: 200MB → 22MB MP3
  - Compresses if needed to fit API limits
  ↓
Parallel chunked upload to Vercel (10-20s)
  ↓
Groq Whisper API transcribes (3-5 min)
  - Dual models in parallel for 2x speed
  - Streaming results as chunks complete
  ↓
Download subtitles in SRT/VTT/TXT format

Technical Implementation:

  1. Client-Side Processing (clientFFmpeg.ts): Built a service that loads FFmpeg.wasm from CDN, processes videos in the browser using Web Workers for non-blocking execution, and intelligently skips processing for already-optimized MP3 files.

  2. Dual-Model Service (groqDualModelService.ts): Implemented round-robin load balancing between two Groq Whisper models with automatic failover, retry logic with exponential backoff, and adaptive concurrency based on chunk count.

  3. Streaming API (transcribe-stream/route.ts): Created a Server-Sent Events (SSE) API that streams transcription results in real-time as chunks complete, providing immediate feedback to users instead of waiting for the entire process.

  4. Parallel Upload (parallelChunkedUpload.ts): Developed a chunked upload system that splits large files and uploads chunks in parallel, significantly reducing upload time for large audio files.

  5. Smart Cleanup (cleanupService.ts): Implemented automatic file cleanup that runs on every request, deleting files older than 3-5 minutes to prevent disk bloat on serverless platforms.

  6. Adaptive Chunking (adaptiveChunking.ts): Created an intelligent chunking system that dynamically adjusts concurrency based on file size and chunk count, optimizing for both small and large files.

Key Features:

  • ✅ Client-side video processing (no server needed)
  • ✅ Streaming transcription with real-time progress
  • ✅ Multiple export formats (SRT, VTT, TXT)
  • ✅ 50+ language support with auto-detection
  • ✅ Parallel processing for 2x speed
  • ✅ Automatic error recovery and failover
  • ✅ 100% free to deploy (Groq API + Vercel)

Challenges we ran into

  1. Vercel File Size Limits: Vercel has strict limits on serverless function execution time and file upload sizes. Solution: We moved all heavy processing to the client-side using FFmpeg.wasm, reducing server uploads from 200MB videos to 22MB audio files.

  2. Groq API Rate Limits: The free Groq API has rate limits that can cause failures during peak usage. Solution: We implemented dual-model load balancing that distributes requests across two models, effectively doubling our rate limit capacity, plus intelligent retry logic with exponential backoff.

  3. Large File Processing: Processing 500MB+ videos in the browser could freeze the UI. Solution: We implemented Web Workers to run FFmpeg.wasm in a separate thread, keeping the UI responsive during processing.

  4. Memory Management: Serverless functions have limited memory, and large files could cause OOM errors. Solution: We implemented aggressive cleanup, chunking files larger than 25MB server-side, and using streaming to avoid loading entire files into memory.

  5. Long Transcription Times: Users would wait 5+ minutes with no feedback. Solution: We built a streaming API using Server-Sent Events that shows progress and partial results in real-time as transcription chunks complete.

  6. Cross-Platform Compatibility: FFmpeg.wasm requires SharedArrayBuffer, which needs specific CORS headers. Solution: We configured Vercel headers correctly and added fallback CDN options for better reliability.


Accomplishments that we're proud of

We're incredibly proud of what we achieved with this project:

  1. 🚀 Zero-Cost Deployment: Built a production-ready subtitle generator that's 100% free to deploy and use—no credit card required, no monthly fees, no usage limits. It works entirely on Groq's free API and Vercel's free tier.

  2. 🔒 Privacy-First Architecture: Achieved true client-side processing by moving 95% of video processing to the browser. Users' videos never leave their device until they're compressed into small audio files—a massive privacy win that sets us apart from competitors.

  3. ⚡ 2x Performance Boost: Implemented dual-model parallel processing that effectively doubles our transcription throughput. By intelligently distributing requests across two Groq Whisper models, we handle rate limits gracefully while processing twice as fast.

  4. 📊 Real-Time User Experience: Built a streaming API using Server-Sent Events that provides instant feedback. Users see subtitles appearing in real-time as chunks complete, making a 5-minute transcription feel interactive instead of like waiting.

  5. 🧠 Intelligent Optimizations: Created adaptive systems that automatically optimize performance:

    • Skips processing for already-optimized MP3 files (instant start)
    • Dynamically adjusts chunk concurrency based on file size
    • Predictively loads FFmpeg on button hover
    • Aggressive cleanup prevents serverless storage bloat
  6. 🛡️ Production-Grade Resilience: Implemented comprehensive error handling:

    • Automatic failover between models on rate limits
    • Progressive retry logic with exponential backoff
    • Graceful degradation for network errors
    • 95%+ success rate even during API outages
  7. 📦 Handles Any File Size: Unlike competitors limited by server upload sizes, our client-side processing means no practical file size limit—we've successfully processed 700MB+ videos entirely in the browser.

  8. 💡 Innovative Architecture: Combined cutting-edge technologies (WebAssembly, Server-Sent Events, Serverless) in novel ways that solve real problems. Our approach of client-side processing + serverless transcription is now a blueprint for cost-effective video tools.

  9. 🎯 User-Centric Design: Every feature prioritizes user experience—from drag-and-drop uploads to real-time progress bars to multiple export formats. The tool feels polished and production-ready, not like a hackathon prototype.

What we learned

This project was a masterclass in modern web development and optimization:

Technical Skills:

  • WebAssembly Performance: Learned how to leverage FFmpeg.wasm for heavy processing that traditionally required servers, unlocking browser-based video editing capabilities.
  • Serverless Best Practices: Mastered Vercel's serverless functions, understanding timeout limits, memory constraints, and how to design stateless APIs that scale.
  • Streaming APIs: Discovered Server-Sent Events (SSE) as a powerful alternative to polling or WebSockets for one-way real-time data flow—perfect for progress updates.
  • Parallel Processing: Implemented effective concurrency patterns, learning when to parallelize (API calls) vs when to sequence (file processing) for optimal performance.
  • Error Resilience: Developed robust retry strategies, failover mechanisms, and graceful degradation—critical for production systems depending on external APIs.

Architecture Insights:

  • Client vs Server Trade-offs: Learned that moving processing to the client can dramatically reduce costs while improving privacy—a paradigm shift from traditional web apps.
  • Rate Limit Distribution: Discovered that using multiple API models/endpoints can effectively multiply rate limits through intelligent load balancing.
  • Memory Management: Understanding that serverless functions require aggressive cleanup and streaming to avoid OOM errors—lessons that don't apply to traditional servers.

Problem-Solving:

  • Creative Constraints: Turning Vercel's file size limitations into a feature by processing client-side, resulting in a better user experience.
  • Performance vs UX: Learned that perceived performance (streaming results) matters more than actual speed—users prefer seeing progress over waiting silently.
  • API Reliability: Built systems that expect and handle failures gracefully, making the app robust despite depending on external services.

User Experience:

  • Transparency Builds Trust: Showing detailed progress, file sizes, and processing phases makes users confident the tool is working, even during long operations.
  • Format Flexibility: Providing multiple export formats (SRT, VTT, TXT) accommodates different workflows without complicating the UI.

Business Model:

  • Free Tier Economics: Proved that combining free APIs (Groq) with free hosting (Vercel) can create sustainable, valuable tools without traditional SaaS pricing.

What's next for subtitle generator

We have exciting plans to evolve this project:

Short-term (Next 1-2 Months):

  1. 🎨 Subtitle Editor: Build an in-browser editor where users can:

    • Edit transcriptions before export
    • Adjust timestamps manually
    • Merge or split subtitle segments
    • Preview subtitles overlaid on video
  2. 📱 Batch Processing: Enable users to upload multiple videos at once and process them in parallel, perfect for content creators with multiple files.

  3. 🎬 Video Preview with Subtitles: Add a video player that overlays generated subtitles in real-time, allowing users to preview and adjust before export.

  4. 🌐 Translation Support: Integrate translation APIs to automatically translate subtitles to multiple languages, making content globally accessible.

  5. 💾 Project Management: Add saved projects, allowing users to:

    • Re-edit past transcriptions
    • Store multiple subtitle versions
    • Export history

Medium-term (3-6 Months):

  1. 🤖 AI Enhancement:

    • Speaker identification/diarization (labeling who's speaking)
    • Automatic punctuation and capitalization improvements
    • Context-aware formatting (handling technical terms, names, etc.)
  2. ⚙️ Advanced Formatting:

    • Custom styling options (colors, fonts, positions)
    • Karaoke-style word-by-word timing
    • Multiple subtitle tracks per video
  3. 📊 Analytics Dashboard:

    • Processing statistics
    • Usage metrics
    • Performance insights
  4. 🔗 Integrations:

    • Direct YouTube/Vimeo upload
    • Export to video editing software (Premiere Pro, Final Cut)
    • API for developers to integrate

Built With

  • ffmpeg.wasm
  • groq
  • next.js
  • openaiwhisper
  • sse
  • tailwindcss
  • vercel
Share this project:

Updates