Inspiration

Content creators spend hours editing rough recordings into polished tutorials. Professional voiceover requires expensive equipment, voice talent, or cloud subscription services. We asked: what if your phone could transform any video into a professionally narrated piece using only on-device AI?

The emergence of Apple's MLX framework—purpose-built for Apple Silicon's unified memory architecture—made this vision achievable. MLX enables running billion-parameter models on mobile devices with performance rivaling cloud APIs, all while keeping user data completely private.

What it does

ScriptCraft processes videos through a complete AI pipeline:

  1. Transcribe - Extracts speech from video using on-device speech recognition with automatic chunking for long-form content

  2. Enhance - Transforms rough transcripts into polished narration scripts using Qwen LLM (4-bit quantized). Removes filler words, improves sentence structure, and optimizes pacing for voiceover delivery

  3. Narrate - Generates natural-sounding speech using Kokoro neural TTS (82M parameters). Produces broadcast-quality audio at 24kHz with multiple voice options

  4. Export - Composites the AI-generated narration with the original video, creating a new video file with professional voiceover. Supports saving to Photos and sharing.

Transcription, enhancement, and export all run on the phone; narration currently runs on a Mac on the local network, with fully on-device TTS on the roadmap (see "What's next"). Nothing is sent to the cloud, and no API keys are required.

How we built it

Architecture:

┌─────────────────────────────────────────────────┐
│              iOS App (Swift/SwiftUI)            │
├─────────────────────────────────────────────────┤
│  Transcribe: SFSpeechRecognizer (on-device)     │
│  Enhance:    MLX Swift LLM (Qwen 0.5B 4-bit)    │
│  Narrate:    Kokoro TTS via mlx-audio           │
│  Export:     AVMutableComposition               │
└─────────────────────────────────────────────────┘

Tech Stack:

  • MLX Swift (ml-explore/mlx-swift) - Native tensor operations optimized for Apple Silicon
  • MLX Swift LM (ml-explore/mlx-swift-lm) - LLM inference with 4-bit quantization
  • mlx-audio - Kokoro 82M neural TTS model
  • AVFoundation - Non-destructive video/audio composition
  • SwiftUI - Native iOS interface

Key Implementation Details:

  • LLM Enhancement - Uses ModelConfiguration for flexible model selection, callback-based token generation with early stopping, and automatic model download via the Hugging Face Hub, plus conditional compilation for simulator compatibility (see the enhancement sketch after this list).

  • Video Export - AVMutableComposition enables non-destructive editing: it is a "recipe" of track references, so the entire video is never loaded into RAM. The export step selectively copies the video track, mutes the original audio, and stitches the AI-generated narration in as a new audio track (see the composition sketch after this list).

  • TTS Integration - A FastAPI wrapper handles long-text segmentation (Kokoro splits at ~500 tokens) and concatenates the resulting WAV segments into one seamless audio file.
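
Here is a condensed sketch of the Enhance step. It assumes the ModelConfiguration / LLMModelFactory / generate API from the MLX Swift LM package and uses an illustrative mlx-community model id, so treat it as the shape of the code rather than a drop-in implementation; exact names and signatures vary between releases.

    import MLXLLM
    import MLXLMCommon

    // Enhance step: load a 4-bit Qwen model once, then rewrite a rough transcript.
    // API names follow the MLX Swift LM examples and may differ between releases.
    func enhanceTranscript(_ transcript: String) async throws -> String {
        #if targetEnvironment(simulator)
        // MLX needs a real Apple Silicon GPU, so the simulator build passes the text through.
        return transcript
        #else
        // Illustrative model id; weights are fetched from the Hugging Face Hub on first use.
        let configuration = ModelConfiguration(id: "mlx-community/Qwen2.5-0.5B-Instruct-4bit")
        let container = try await LLMModelFactory.shared.loadContainer(configuration: configuration)

        let prompt = "Rewrite this transcript as a polished voiceover script:\n\n\(transcript)"
        let result = try await container.perform { context in
            let input = try await context.processor.prepare(input: UserInput(prompt: prompt))
            // Callback-based generation with early stopping once enough tokens are produced.
            return try MLXLMCommon.generate(
                input: input,
                parameters: GenerateParameters(temperature: 0.6),
                context: context
            ) { tokens in
                tokens.count >= 1024 ? .stop : .more
            }
        }
        return result.output
        #endif
    }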
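
And a sketch of the Export step using standard AVFoundation calls (function name and error handling simplified): the video track is copied by reference, the original audio is simply not copied, and the narration goes in as the only audio track. The finished composition can then be handed to AVAssetExportSession for rendering.

    import AVFoundation

    // Export step: build a non-destructive "recipe" that pairs the original video
    // track with the AI-generated narration, leaving the source file untouched.
    func makeNarratedComposition(videoURL: URL, narrationURL: URL) async throws -> AVMutableComposition {
        let videoAsset = AVURLAsset(url: videoURL)
        let narrationAsset = AVURLAsset(url: narrationURL)
        let composition = AVMutableComposition()

        // Copy the video track by reference; nothing is decoded into RAM here.
        guard let srcVideo = try await videoAsset.loadTracks(withMediaType: .video).first,
              let dstVideo = composition.addMutableTrack(withMediaType: .video,
                                                         preferredTrackID: kCMPersistentTrackID_Invalid)
        else { throw NSError(domain: "Export", code: 1) }
        let videoDuration = try await videoAsset.load(.duration)
        try dstVideo.insertTimeRange(CMTimeRange(start: .zero, duration: videoDuration),
                                     of: srcVideo, at: .zero)
        dstVideo.preferredTransform = try await srcVideo.load(.preferredTransform)

        // The original audio track is not copied, which mutes it.
        // Insert the narration as the composition's only audio track.
        if let srcNarration = try await narrationAsset.loadTracks(withMediaType: .audio).first,
           let dstAudio = composition.addMutableTrack(withMediaType: .audio,
                                                      preferredTrackID: kCMPersistentTrackID_Invalid) {
            let narrationDuration = try await narrationAsset.load(.duration)
            try dstAudio.insertTimeRange(CMTimeRange(start: .zero,
                                                     duration: min(narrationDuration, videoDuration)),
                                         of: srcNarration, at: .zero)
        }
        return composition
    }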
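
For completeness, the Transcribe step reduces to a file-based, on-device SFSpeechRecognizer request. This sketch assumes speech-recognition authorization has already been granted and that the video's audio has been extracted to a file; the automatic chunking for long recordings is omitted.

    import Speech

    // Transcribe step: on-device recognition over an audio file extracted from the video.
    func transcribe(audioURL: URL) async throws -> String {
        guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
              recognizer.supportsOnDeviceRecognition else {
            throw NSError(domain: "Transcribe", code: 1)
        }
        let request = SFSpeechURLRecognitionRequest(url: audioURL)
        request.requiresOnDeviceRecognition = true     // never send audio off the device
        request.shouldReportPartialResults = false     // we only want the final transcript

        return try await withCheckedThrowingContinuation { continuation in
            recognizer.recognitionTask(with: request) { result, error in
                if let error { continuation.resume(throwing: error); return }
                guard let result, result.isFinal else { return }
                continuation.resume(returning: result.bestTranscription.formattedString)
            }
        }
    }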

Challenges we ran into

  1. MLX Simulator Incompatibility - MLX requires a real Apple Silicon GPU, so it can't run in the iOS Simulator. Solution: Conditional compilation (#if targetEnvironment(simulator)) with a passthrough fallback for UI testing, as in the enhancement sketch above.

  2. Long-Text TTS Segmentation - Kokoro splits long text into multiple audio files. Solution: Detect the segment files and concatenate their WAV data, which share matching audio parameters (see the merge sketch after this list).

  3. Memory Management - Running multiple AI models requires careful memory handling. Solution: Lazy model loading, 4-bit quantization, and efficient buffer reuse.

  4. Physical Device Testing - The TTS server runs on a Mac, but the iPhone can't reach the Mac's localhost. Solution: Bind the server to 0.0.0.0 and point the iOS app at the Mac's local IP address.

  5. LLM Response Cleanup - The model sometimes adds preambles like "Sure, here's the rewritten transcript...". Solution: Pattern matching to strip common preamble phrases from responses (see the cleanup sketch after this list).
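
The segment merge can be sketched with AVAudioFile instead of raw WAV byte concatenation; this is an equivalent approach shown for illustration, and it assumes every Kokoro segment shares the same sample rate and channel count (24 kHz mono).

    import AVFoundation

    // Merge Kokoro's per-segment WAV files into one continuous narration file.
    func mergeNarrationSegments(_ segmentURLs: [URL], into outputURL: URL) throws {
        guard let firstURL = segmentURLs.first else { return }
        let firstFile = try AVAudioFile(forReading: firstURL)
        // Reuse the first segment's file format (sample rate, channels, bit depth) for the output.
        let output = try AVAudioFile(forWriting: outputURL, settings: firstFile.fileFormat.settings)

        for url in segmentURLs {
            let input = try AVAudioFile(forReading: url)
            guard let buffer = AVAudioPCMBuffer(pcmFormat: input.processingFormat,
                                                frameCapacity: AVAudioFrameCount(input.length)) else { continue }
            try input.read(into: buffer)    // read the whole segment
            try output.write(from: buffer)  // append it to the merged file
        }
    }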
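
And the response cleanup is a small regex pass over the model output; the phrase patterns below are illustrative rather than the exact list we use.

    // Strip chatty preambles ("Sure, here's the rewritten transcript...") from LLM output.
    func stripPreamble(from response: String) -> String {
        let preamblePatterns = [
            #"^\s*(sure|certainly|of course)[,!.]?\s*"#,
            #"^\s*here('s| is) (the|your|a) (rewritten|polished|enhanced) (transcript|script|version)\s*[:.\-]*\s*"#
        ]
        var cleaned = response
        for pattern in preamblePatterns {
            cleaned = cleaned.replacingOccurrences(of: pattern, with: "",
                                                   options: [.regularExpression, .caseInsensitive])
        }
        return cleaned.trimmingCharacters(in: .whitespacesAndNewlines)
    }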

Accomplishments that we're proud of

  • On-Device LLM Pipeline - Native MLX Swift integration running Qwen 0.5B with 4-bit quantization on iPhone
  • End-to-End Solution - Complete workflow from video input to exported narrated video
  • Production-Quality TTS - Kokoro produces natural speech with real-time factor of 0.1-0.25x
  • Efficient Memory Footprint - ~3.2GB peak memory suitable for mobile devices with 4GB+ RAM
  • Real-Time Performance - Full pipeline completes in under 30 seconds for 1-minute videos
  • Privacy-First Design - User content never leaves the user's own hardware

What we learned

  • MLX's unified memory model eliminates CPU-GPU transfer overhead, crucial for mobile AI
  • 4-bit quantization maintains quality while dramatically reducing memory requirements
  • Apple's SFSpeechRecognizer rivals Whisper for English transcription without model downloads
  • Neural TTS has reached the quality threshold where on-device generation is viable
  • AVMutableComposition provides memory-efficient video editing by working with references rather than loading entire videos

What's next for ScriptCraft

  • Fully On-Device TTS - Integrate CoreML-converted Kokoro model to eliminate Mac server dependency
  • Multi-Language Support - Leverage Kokoro's multilingual capabilities for global users
  • Voice Cloning - Fine-tune TTS on user's voice samples for personalized narration
  • Real-Time Processing - Stream transcription and TTS for live scenarios
  • Video Understanding - Add vision models to describe visual content in narration
  • Larger LLMs - Support Qwen 1.8B and Ministral 3B for improved enhancement quality
