LipSync Studio

Inspiration

Video content dominates the internet — 82% of all traffic — yet producing even a simple promotional music video costs $500–$5,000 and takes weeks. Independent artists and small labels often skip video entirely because the cost is prohibitive, even though videos drive 3–5x more streams than audio-only releases. We wanted to make music video creation as easy as taking a selfie.

What it does

LipSync Studio generates AI-powered lip-synced music videos from just two inputs: a selfie photo and an audio file. Upload a portrait, pick your song, and in under 3 minutes you get a 1080x1920 video with natural lip sync, expressive facial movements, and subtle head motion — ready to share on TikTok, Instagram Reels, or YouTube Shorts. No filming, no editing, no VFX expertise required.

How we built it

  • Frontend: React Native Web via Expo SDK 54, giving us a single codebase that works on iOS Safari, Android Chrome, and desktop browsers
  • Backend: Three Vercel serverless functions that proxy all calls to the LTX Video API — keeping the API key server-side only and solving browser CORS restrictions
  • Video Generation: LTX audio-to-video endpoint with the ltx-2-3-pro model, which takes a face image and audio track and produces synchronized lip movement
  • Audio Handling: Client-side codec conversion using the Web Audio API and MediaRecorder — automatically converts unsupported formats (WAV/PCM) to Opus before upload
  • Deployment: Static export to Vercel with npm run deploy injecting the git commit hash into the footer for version tracking
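A minimal sketch of the proxy pattern described above, assuming a Vercel-style Node handler. The route name, upstream URL, and `LTX_API_KEY` variable are illustrative, not the real LTX endpoints:

```javascript
// api/generate.js sketch — a hypothetical Vercel serverless function
// (route and upstream URL are placeholders, not the actual LTX API).

// Pure helper so the outgoing request shape is easy to unit-test.
function buildUpstreamRequest(apiKey, body) {
  return {
    url: "https://api.ltx.example/v1/generate", // placeholder upstream URL
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // The key is read from process.env on the server, so it never
        // ships in the client bundle.
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify(body),
    },
  };
}

// In a real project this would be the module's default export.
async function handler(req, res) {
  const { url, options } = buildUpstreamRequest(process.env.LTX_API_KEY, req.body);
  const upstream = await fetch(url, options); // global fetch on Node 18+
  // Replying from our own origin is what sidesteps CORS: the browser
  // only ever talks to /api/*, never to the LTX API or the GCS URL.
  res.status(upstream.status).json(await upstream.json());
}
```

Because the browser only sees same-origin routes, no CORS headers from the upstream services are needed at all.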

Challenges we ran into

  • CORS everywhere: The LTX API doesn't set CORS headers, and neither does the Google Cloud Storage pre-signed upload URL it returns. We had to proxy every external call through serverless functions — first with a local Express proxy, then properly via Vercel functions
  • Audio codec rejection: Our first test upload was a WAV file. LTX returned "codec pcm_s16le not supported." We built a real-time browser-based audio transcoding pipeline using OfflineAudioContext and MediaRecorder to convert to Opus on the fly
  • Safari file picker: iOS Safari grayed out .m4a files when using type: "audio/*". We had to add explicit file extensions (.m4a, .mp3, etc.) alongside MIME types for the document picker to recognize them
  • Expo SDK 54 breaking changes: expo-file-system completely changed its API from the legacy readAsStringAsync/documentDirectory to a new File class with arrayBuffer() and writableStream(). No Stack Overflow answers existed yet — we read the type definitions directly
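The codec workaround above can be sketched roughly as follows. This is an outline of the approach rather than the app's exact code, since it relies on browser-only APIs; note that MediaRecorder captures in real time, so a clip takes about its own duration to transcode:

```javascript
// Rough sketch of the client-side WAV/PCM → Opus path (browser APIs only).

// Pure helper: pick the first container/codec the browser can record.
function pickSupportedMime(candidates, isSupported) {
  return candidates.find(isSupported) || null;
}

async function transcodeToOpus(file) {
  // 1. Decode whatever the file contains (WAV/PCM included) to raw samples.
  const offline = new OfflineAudioContext(2, 1, 48000);
  const decoded = await offline.decodeAudioData(await file.arrayBuffer());

  // 2. Play the decoded buffer into a MediaStream and record it as Opus.
  const live = new AudioContext();
  const dest = live.createMediaStreamDestination();
  const src = live.createBufferSource();
  src.buffer = decoded;
  src.connect(dest);

  const mime = pickSupportedMime(
    ["audio/webm;codecs=opus", "audio/ogg;codecs=opus"],
    (t) => MediaRecorder.isTypeSupported(t)
  );
  const recorder = new MediaRecorder(dest.stream, { mimeType: mime });
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  return new Promise((resolve) => {
    recorder.onstop = () => resolve(new Blob(chunks, { type: mime }));
    recorder.start();
    src.onended = () => recorder.stop();
    src.start();
  });
}
```

Probing `MediaRecorder.isTypeSupported` first matters because Safari and Chrome disagree on which Opus container they can record.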

Accomplishments that we're proud of

  • End-to-end in one session: From create-expo-app to a deployed, working production app on Vercel — including API integration, CORS solutions, codec conversion, and cross-platform compatibility
  • Zero API key exposure: The LTX key never touches the client bundle. All sensitive calls go through serverless functions
  • It actually works on a phone: You can open the Vercel URL on an iPhone, pick a selfie from your camera roll, choose a song, and get a lip-synced video back — no app install required

What we learned

  • Video AI APIs are surprisingly accessible — the hard part isn't the AI, it's the plumbing (CORS, codecs, file formats, upload flows)
  • Browser APIs are more powerful than we expected — converting audio codecs entirely client-side with Web Audio API and MediaRecorder was eye-opening
  • Expo's new file system API is clean but undocumented — reading .d.ts files is sometimes the only way forward with bleeding-edge SDKs

What's next for LipSync Studio

  • Style control: Let users choose video styles — cinematic, anime, vintage — via prompt customization
  • Longer videos: Chain multiple LTX extend API calls to support full-length songs beyond the current 20-second limit
  • Batch generation: Upload one photo and generate videos for an entire album, with different moods and camera motions per track
  • TwelveLabs integration: Use Marengo to analyze generated videos for quality scoring and semantic consistency, automatically re-generating segments that don't match the audio energy
  • Direct social publishing: One-tap posting to TikTok, Instagram, and YouTube directly from the results screen
