Inspiration
Content creators spend hours editing rough recordings into polished tutorials. Professional voiceover requires expensive equipment, voice talent, or cloud subscription services. We asked: what if your phone could transform any video into a professionally narrated piece using only on-device AI?
The emergence of Apple's MLX framework—purpose-built for Apple Silicon's unified memory architecture—made this vision achievable. MLX enables running billion-parameter models on mobile devices with performance rivaling cloud APIs, all while keeping user data completely private.
What it does
ScriptCraft processes videos through a complete AI pipeline:
Transcribe - Extracts speech from video using on-device speech recognition, with automatic chunking for long-form content (a sketch of this step follows the list)
Enhance - Transforms rough transcripts into polished narration scripts using Qwen LLM (4-bit quantized). Removes filler words, improves sentence structure, and optimizes pacing for voiceover delivery
Narrate - Generates natural-sounding speech using Kokoro neural TTS (82M parameters). Produces broadcast-quality audio at 24kHz with multiple voice options
Export - Composites the AI-generated narration with the original video, creating a new video file with professional voiceover. Supports saving to Photos and sharing.
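For a flavor of the first step, here is a minimal sketch of on-device transcription, assuming the audio track has already been pulled out of the video and speech-recognition permission granted (the chunking logic for long recordings is omitted):

```swift
import Speech

// Minimal on-device transcription of an audio file (chunking omitted).
// Assumes speech-recognition authorization has already been granted.
func transcribe(audioURL: URL) async throws -> String {
    let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    let request = SFSpeechURLRecognitionRequest(url: audioURL)
    request.requiresOnDeviceRecognition = true  // audio never leaves the phone

    var resumed = false
    return try await withCheckedThrowingContinuation { continuation in
        _ = recognizer.recognitionTask(with: request) { result, error in
            guard !resumed else { return }  // the handler also fires for partial results
            if let error {
                resumed = true
                continuation.resume(throwing: error)
            } else if let result, result.isFinal {
                resumed = true
                continuation.resume(returning: result.bestTranscription.formattedString)
            }
        }
    }
}
```

Setting requiresOnDeviceRecognition forces Apple's local recognizer, so the audio is never sent to Apple's servers.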
Transcription and script enhancement run entirely on-device; narration currently runs on a local companion server on your Mac (fully on-device TTS is on the roadmap below). Nothing touches the cloud, and no API keys are required.
How we built it
Architecture:
┌─────────────────────────────────────────────────┐
│ iOS App (Swift/SwiftUI) │
├─────────────────────────────────────────────────┤
│ Transcribe: SFSpeechRecognizer (on-device) │
│ Enhance: MLX Swift LLM (Qwen 0.5B 4-bit) │
│ Narrate: Kokoro TTS via mlx-audio │
│ Export: AVMutableComposition │
└─────────────────────────────────────────────────┘
Tech Stack:
- MLX Swift (ml-explore/mlx-swift) - Native tensor operations optimized for Apple Silicon
- MLX Swift LM (ml-explore/mlx-swift-lm) - LLM inference with 4-bit quantization
- mlx-audio - Kokoro 82M neural TTS model
- AVFoundation - Non-destructive video/audio composition
- SwiftUI - Native iOS interface
Key Implementation Details:
LLM Enhancement - Uses ModelConfiguration for flexible model selection, with callback-based token generation and early stopping. Models download automatically from the Hugging Face Hub, and conditional compilation keeps the app buildable for the simulator (see the sketch below).
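A condensed sketch of that flow, assuming the MLXLMCommon/MLXLLM APIs from ml-explore's Swift packages (exact signatures vary by release); the model id and prompt are placeholders:

```swift
import MLXLLM
import MLXLMCommon

// Sketch: load a 4-bit Qwen model, stream tokens via callback, stop early at a cap.
// Falls back to a passthrough in the simulator, where MLX has no GPU.
func enhance(transcript: String) async throws -> String {
    #if targetEnvironment(simulator)
    return transcript
    #else
    let config = ModelConfiguration(id: "mlx-community/Qwen2.5-0.5B-Instruct-4bit")
    let container = try await LLMModelFactory.shared.loadContainer(configuration: config)
    return try await container.perform { context in
        let input = try await context.processor.prepare(
            input: UserInput(prompt: "Rewrite this rough transcript as polished narration:\n\(transcript)"))
        var text = ""
        _ = try MLXLMCommon.generate(
            input: input, parameters: GenerateParameters(temperature: 0.6), context: context
        ) { tokens in
            text = context.tokenizer.decode(tokens: tokens)  // callback-based streaming
            return tokens.count < 1_024 ? .more : .stop      // early stopping
        }
        return text
    }
    #endif
}
```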
Video Export - AVMutableComposition enables non-destructive editing: the composition is a "recipe" of track references, so the entire video is never loaded into RAM. We selectively extract the video track, mute the original audio, and stitch the AI-generated narration in as a new audio track (sketched below).
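Here is the compose-and-export step in rough form; file URLs are placeholders and error handling is trimmed:

```swift
import AVFoundation

// Sketch: reference the original video track and the generated narration,
// never decoding frames into memory; the original audio is simply not added.
func exportNarrated(videoURL: URL, narrationURL: URL, outputURL: URL) async throws {
    let video = AVURLAsset(url: videoURL)
    let narration = AVURLAsset(url: narrationURL)
    let composition = AVMutableComposition()

    let videoTrack = composition.addMutableTrack(
        withMediaType: .video, preferredTrackID: kCMPersistentTrackID_Invalid)!
    let audioTrack = composition.addMutableTrack(
        withMediaType: .audio, preferredTrackID: kCMPersistentTrackID_Invalid)!

    let videoDuration = try await video.load(.duration)
    let narrationDuration = try await narration.load(.duration)
    let sourceVideo = try await video.loadTracks(withMediaType: .video)[0]
    let sourceAudio = try await narration.loadTracks(withMediaType: .audio)[0]

    // Insert references only; this is the "recipe", not a pixel copy.
    try videoTrack.insertTimeRange(
        CMTimeRange(start: .zero, duration: videoDuration), of: sourceVideo, at: .zero)
    try audioTrack.insertTimeRange(
        CMTimeRange(start: .zero, duration: min(videoDuration, narrationDuration)),
        of: sourceAudio, at: .zero)

    let export = AVAssetExportSession(
        asset: composition, presetName: AVAssetExportPresetHighestQuality)!
    export.outputURL = outputURL
    export.outputFileType = .mp4
    await withCheckedContinuation { (done: CheckedContinuation<Void, Never>) in
        export.exportAsynchronously { done.resume() }
    }
    if export.status != .completed { throw export.error ?? CocoaError(.fileWriteUnknown) }
}
```

Because insertTimeRange records references into the source assets, memory stays flat regardless of the video's length.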
TTS Integration - A FastAPI wrapper handles long-text segmentation (Kokoro splits at roughly 500 tokens) and concatenates the resulting WAV segments into one seamless audio file (see the sketch below).
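Stitching the segments back together is mostly header bookkeeping. The app does this in the Python wrapper; here is the same idea sketched in Swift, assuming every segment is a canonical 44-byte-header PCM WAV with identical sample rate and channel count:

```swift
import Foundation

// Sketch: concatenate per-segment WAVs by keeping the first header,
// appending each segment's raw PCM, then patching the two size fields.
func concatenateWAVs(_ segmentURLs: [URL]) throws -> Data {
    let headerSize = 44  // canonical PCM WAV header
    var pcm = Data()
    for url in segmentURLs {
        pcm.append(try Data(contentsOf: url).dropFirst(headerSize))  // strip each header
    }
    var header = Data(try Data(contentsOf: segmentURLs[0]).prefix(headerSize))

    func patch(_ value: UInt32, at offset: Int) {
        withUnsafeBytes(of: value.littleEndian) {
            header.replaceSubrange(offset..<offset + 4, with: $0)
        }
    }
    patch(UInt32(36 + pcm.count), at: 4)   // RIFF chunk size
    patch(UInt32(pcm.count), at: 40)       // data chunk size
    return header + pcm
}
```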
Challenges we ran into
MLX Simulator Incompatibility - MLX requires real Apple Silicon GPU access. Solution: Conditional compilation (#if targetEnvironment(simulator)) with a passthrough fallback for UI testing.
Long-Text TTS Segmentation - Kokoro splits long text into multiple audio files. Solution: Detect the segment files and concatenate their WAV data, checking that the audio parameters match.
Memory Management - Running multiple AI models requires careful memory handling. Solution: Lazy model loading (sketched below), 4-bit quantization, and efficient buffer reuse.
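To illustrate the lazy-loading half, a small hypothetical helper that defers a model load until first use and caches the result:

```swift
// Hypothetical helper: load an expensive model once, on first use.
actor LazyModel<Model> {
    private var cached: Model?
    private let load: () async throws -> Model

    init(_ load: @escaping () async throws -> Model) { self.load = load }

    func get() async throws -> Model {
        if let cached { return cached }   // reuse the already-loaded model
        let model = try await load()      // first caller pays the load cost
        cached = model
        return model
    }
}
```

A production version would also coalesce concurrent first calls, since actor reentrancy means two simultaneous callers could both trigger a load.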
Physical Device Testing - The TTS server runs on the Mac, but the iPhone can't reach localhost. Solution: Bind the server to 0.0.0.0 and configure the iOS app with the Mac's local IP address.
LLM Response Cleanup - The model sometimes adds preambles like "Sure, here's the rewritten transcript...". Solution: Pattern matching to strip common preamble phrases from responses (a sketch follows).
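That cleanup is a few lines of string handling. A sketch, with an illustrative phrase list rather than the app's actual pattern set:

```swift
import Foundation

// Strip common LLM preambles ("Sure, here's...") from the start of a response.
func stripPreamble(_ response: String) -> String {
    let preambles = [
        "sure, here's the rewritten transcript:",
        "here is the polished script:",
        "certainly!"
    ]
    var cleaned = response.trimmingCharacters(in: .whitespacesAndNewlines)
    for phrase in preambles where cleaned.lowercased().hasPrefix(phrase) {
        cleaned = String(cleaned.dropFirst(phrase.count))
            .trimmingCharacters(in: .whitespacesAndNewlines)
    }
    return cleaned
}
```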
Accomplishments that we're proud of
- On-Device LLM Pipeline - Native MLX Swift integration running Qwen 0.5B with 4-bit quantization on iPhone
- End-to-End Solution - Complete workflow from video input to exported narrated video
- Production-Quality TTS - Kokoro produces natural speech with real-time factor of 0.1-0.25x
- Efficient Memory Footprint - ~3.2GB peak memory suitable for mobile devices with 4GB+ RAM
- Faster-Than-Real-Time Pipeline - The full pipeline completes in under 30 seconds for a 1-minute video
- Privacy-First Design - User content never touches the cloud
What we learned
- MLX's unified memory model eliminates CPU-GPU transfer overhead, crucial for mobile AI
- 4-bit quantization maintains quality while dramatically reducing memory requirements
- Apple's SFSpeechRecognizer rivals Whisper for English transcription without model downloads
- Neural TTS has reached the quality threshold where on-device generation is viable
- AVMutableComposition provides memory-efficient video editing by working with references rather than loading entire videos
What's next for ScriptCraft
- Fully On-Device TTS - Integrate CoreML-converted Kokoro model to eliminate Mac server dependency
- Multi-Language Support - Leverage Kokoro's multilingual capabilities for global users
- Voice Cloning - Fine-tune TTS on user's voice samples for personalized narration
- Real-Time Processing - Stream transcription and TTS for live scenarios
- Video Understanding - Add vision models to describe visual content in narration
- Larger LLMs - Support Qwen 1.8B and Ministral 3B for improved enhancement quality
Links
- GitHub: https://github.com/ooiyeefei/scriptcraft
- Setup Instructions: See README.md
Built With
- avfoundation
- fastapi
- kokoro-tts
- mlx
- mlx-audio
- mlx-swift
- mlx-swift-lm
- python
- qwen-llm
- sfspeechrecognizer
- swift
- swiftui