Inspiration

Content creators spend hours editing rough recordings into polished tutorials. Professional voiceover requires expensive equipment, voice talent, or cloud subscription services. We asked: what if your phone could transform any video into a professionally narrated piece using only on-device AI?

The emergence of Apple's MLX framework—purpose-built for Apple Silicon's unified memory architecture—made this vision achievable. MLX enables running billion-parameter models on mobile devices with performance rivaling cloud APIs, all while keeping user data completely private.

What it does

ScriptCraft processes videos through a complete AI pipeline:

  1. Transcribe - Extracts speech from video using on-device speech recognition with automatic chunking for long-form content

  2. Enhance - Transforms rough transcripts into polished narration scripts using Qwen LLM (4-bit quantized). Removes filler words, improves sentence structure, and optimizes pacing for voiceover delivery

  3. Narrate - Generates natural-sounding speech using Kokoro neural TTS (82M parameters). Produces broadcast-quality audio at 24kHz with multiple voice options

  4. Export - Composites the AI-generated narration with the original video, creating a new video file with professional voiceover. Supports saving to Photos and sharing.

Transcription, enhancement, and export all run on the phone; narration currently runs on a Mac on the local network, with fully on-device TTS on the roadmap (see "What's next"). Nothing is sent to the cloud, and no API keys are required.

How we built it

Architecture:

┌─────────────────────────────────────────────────┐
│              iOS App (Swift/SwiftUI)            │
├─────────────────────────────────────────────────┤
│  Transcribe: SFSpeechRecognizer (on-device)     │
│  Enhance:    MLX Swift LLM (Qwen 0.5B 4-bit)    │
│  Narrate:    Kokoro TTS via mlx-audio           │
│  Export:     AVMutableComposition               │
└─────────────────────────────────────────────────┘

Tech Stack:

  • MLX Swift (ml-explore/mlx-swift) - Native tensor operations optimized for Apple Silicon
  • MLX Swift LM (ml-explore/mlx-swift-lm) - LLM inference with 4-bit quantization
  • mlx-audio - Kokoro 82M neural TTS model
  • AVFoundation - Non-destructive video/audio composition
  • SwiftUI - Native iOS interface

Key Implementation Details:

  • LLM Enhancement - Uses ModelConfiguration for flexible model selection, callback-based token generation with early stopping, and automatic model download via the Hugging Face Hub, plus conditional compilation for simulator compatibility (see the enhancement sketch after this list).

  • Video Export - AVMutableComposition enables non-destructive editing: it is a "recipe" of track references, so the entire video is never loaded into RAM. The export step selectively copies the video track, mutes the original audio, and stitches the AI-generated narration in as a new audio track (see the composition sketch after this list).

  • TTS Integration - A FastAPI wrapper handles long-text segmentation (Kokoro splits at ~500 tokens) and concatenates the resulting WAV segments into one seamless audio file.
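
Here is a condensed sketch of the Enhance step. It assumes the ModelConfiguration / LLMModelFactory / generate API from the MLX Swift LM package and uses an illustrative mlx-community model id, so treat it as the shape of the code rather than a drop-in implementation; exact names and signatures vary between releases.

    import MLXLLM
    import MLXLMCommon

    // Enhance step: load a 4-bit Qwen model once, then rewrite a rough transcript.
    // API names follow the MLX Swift LM examples and may differ between releases.
    func enhanceTranscript(_ transcript: String) async throws -> String {
        #if targetEnvironment(simulator)
        // MLX needs a real Apple Silicon GPU, so the simulator build passes the text through.
        return transcript
        #else
        // Illustrative model id; weights are fetched from the Hugging Face Hub on first use.
        let configuration = ModelConfiguration(id: "mlx-community/Qwen2.5-0.5B-Instruct-4bit")
        let container = try await LLMModelFactory.shared.loadContainer(configuration: configuration)

        let prompt = "Rewrite this transcript as a polished voiceover script:\n\n\(transcript)"
        let result = try await container.perform { context in
            let input = try await context.processor.prepare(input: UserInput(prompt: prompt))
            // Callback-based generation with early stopping once enough tokens are produced.
            return try MLXLMCommon.generate(
                input: input,
                parameters: GenerateParameters(temperature: 0.6),
                context: context
            ) { tokens in
                tokens.count >= 1024 ? .stop : .more
            }
        }
        return result.output
        #endif
    }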
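
And a sketch of the Export step using standard AVFoundation calls (function name and error handling simplified): the video track is copied by reference, the original audio is simply not copied, and the narration goes in as the only audio track. The finished composition can then be handed to AVAssetExportSession for rendering.

    import AVFoundation

    // Export step: build a non-destructive "recipe" that pairs the original video
    // track with the AI-generated narration, leaving the source file untouched.
    func makeNarratedComposition(videoURL: URL, narrationURL: URL) async throws -> AVMutableComposition {
        let videoAsset = AVURLAsset(url: videoURL)
        let narrationAsset = AVURLAsset(url: narrationURL)
        let composition = AVMutableComposition()

        // Copy the video track by reference; nothing is decoded into RAM here.
        guard let srcVideo = try await videoAsset.loadTracks(withMediaType: .video).first,
              let dstVideo = composition.addMutableTrack(withMediaType: .video,
                                                         preferredTrackID: kCMPersistentTrackID_Invalid)
        else { throw NSError(domain: "Export", code: 1) }
        let videoDuration = try await videoAsset.load(.duration)
        try dstVideo.insertTimeRange(CMTimeRange(start: .zero, duration: videoDuration),
                                     of: srcVideo, at: .zero)
        dstVideo.preferredTransform = try await srcVideo.load(.preferredTransform)

        // The original audio track is not copied, which mutes it.
        // Insert the narration as the composition's only audio track.
        if let srcNarration = try await narrationAsset.loadTracks(withMediaType: .audio).first,
           let dstAudio = composition.addMutableTrack(withMediaType: .audio,
                                                      preferredTrackID: kCMPersistentTrackID_Invalid) {
            let narrationDuration = try await narrationAsset.load(.duration)
            try dstAudio.insertTimeRange(CMTimeRange(start: .zero,
                                                     duration: min(narrationDuration, videoDuration)),
                                         of: srcNarration, at: .zero)
        }
        return composition
    }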
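
For completeness, the Transcribe step reduces to a file-based, on-device SFSpeechRecognizer request. This sketch assumes speech-recognition authorization has already been granted and that the video's audio has been extracted to a file; the automatic chunking for long recordings is omitted.

    import Speech

    // Transcribe step: on-device recognition over an audio file extracted from the video.
    func transcribe(audioURL: URL) async throws -> String {
        guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
              recognizer.supportsOnDeviceRecognition else {
            throw NSError(domain: "Transcribe", code: 1)
        }
        let request = SFSpeechURLRecognitionRequest(url: audioURL)
        request.requiresOnDeviceRecognition = true     // never send audio off the device
        request.shouldReportPartialResults = false     // we only want the final transcript

        return try await withCheckedThrowingContinuation { continuation in
            recognizer.recognitionTask(with: request) { result, error in
                if let error { continuation.resume(throwing: error); return }
                guard let result, result.isFinal else { return }
                continuation.resume(returning: result.bestTranscription.formattedString)
            }
        }
    }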

Challenges we ran into

  1. MLX Simulator Incompatibility - MLX requires a real Apple Silicon GPU, so it can't run in the iOS Simulator. Solution: Conditional compilation (#if targetEnvironment(simulator)) with a passthrough fallback for UI testing, as in the enhancement sketch above.

  2. Long-Text TTS Segmentation - Kokoro splits long text into multiple audio files. Solution: Detect the segment files and concatenate their WAV data, which share matching audio parameters (see the merge sketch after this list).

  3. Memory Management - Running multiple AI models requires careful memory handling. Solution: Lazy model loading, 4-bit quantization, and efficient buffer reuse.

  4. Physical Device Testing - The TTS server runs on a Mac, but the iPhone can't reach the Mac's localhost. Solution: Bind the server to 0.0.0.0 and point the iOS app at the Mac's local IP address.

  5. LLM Response Cleanup - The model sometimes adds preambles like "Sure, here's the rewritten transcript...". Solution: Pattern matching to strip common preamble phrases from responses (see the cleanup sketch after this list).
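
The segment merge can be sketched with AVAudioFile instead of raw WAV byte concatenation; this is an equivalent approach shown for illustration, and it assumes every Kokoro segment shares the same sample rate and channel count (24 kHz mono).

    import AVFoundation

    // Merge Kokoro's per-segment WAV files into one continuous narration file.
    func mergeNarrationSegments(_ segmentURLs: [URL], into outputURL: URL) throws {
        guard let firstURL = segmentURLs.first else { return }
        let firstFile = try AVAudioFile(forReading: firstURL)
        // Reuse the first segment's file format (sample rate, channels, bit depth) for the output.
        let output = try AVAudioFile(forWriting: outputURL, settings: firstFile.fileFormat.settings)

        for url in segmentURLs {
            let input = try AVAudioFile(forReading: url)
            guard let buffer = AVAudioPCMBuffer(pcmFormat: input.processingFormat,
                                                frameCapacity: AVAudioFrameCount(input.length)) else { continue }
            try input.read(into: buffer)    // read the whole segment
            try output.write(from: buffer)  // append it to the merged file
        }
    }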
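
And the response cleanup is a small regex pass over the model output; the phrase patterns below are illustrative rather than the exact list we use.

    // Strip chatty preambles ("Sure, here's the rewritten transcript...") from LLM output.
    func stripPreamble(from response: String) -> String {
        let preamblePatterns = [
            #"^\s*(sure|certainly|of course)[,!.]?\s*"#,
            #"^\s*here('s| is) (the|your|a) (rewritten|polished|enhanced) (transcript|script|version)\s*[:.\-]*\s*"#
        ]
        var cleaned = response
        for pattern in preamblePatterns {
            cleaned = cleaned.replacingOccurrences(of: pattern, with: "",
                                                   options: [.regularExpression, .caseInsensitive])
        }
        return cleaned.trimmingCharacters(in: .whitespacesAndNewlines)
    }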

Accomplishments that we're proud of

  • On-Device LLM Pipeline - Native MLX Swift integration running Qwen 0.5B with 4-bit quantization on iPhone
  • End-to-End Solution - Complete workflow from video input to exported narrated video
  • Production-Quality TTS - Kokoro produces natural speech with real-time factor of 0.1-0.25x
  • Efficient Memory Footprint - ~3.2GB peak memory suitable for mobile devices with 4GB+ RAM
  • Real-Time Performance - Full pipeline completes in under 30 seconds for 1-minute videos
  • Privacy-First Design - User content never leaves the user's own hardware

What we learned

  • MLX's unified memory model eliminates CPU-GPU transfer overhead, crucial for mobile AI
  • 4-bit quantization maintains quality while dramatically reducing memory requirements
  • Apple's SFSpeechRecognizer rivals Whisper for English transcription without model downloads
  • Neural TTS has reached the quality threshold where on-device generation is viable
  • AVMutableComposition provides memory-efficient video editing by working with references rather than loading entire videos

What's next for ScriptCraft

  • Fully On-Device TTS - Integrate CoreML-converted Kokoro model to eliminate Mac server dependency
  • Multi-Language Support - Leverage Kokoro's multilingual capabilities for global users
  • Voice Cloning - Fine-tune TTS on user's voice samples for personalized narration
  • Real-Time Processing - Stream transcription and TTS for live scenarios
  • Video Understanding - Add vision models to describe visual content in narration
  • Larger LLMs - Support Qwen 1.8B and Ministral 3B for improved enhancement quality
