Inspiration

Short-form video is the most effective retention format ever invented - creators on TikTok and YouTube Shorts have engineered pacing, hooks, and beat structures that hold viewers for 60 straight seconds. Meanwhile, educators and anyone trying to explain a complex idea online are still losing people in the first three seconds.

We kept asking the same question: what if you could borrow the exact retention structure of a video that already works, and use it as a template to explain something hard? Not "make a generic explainer" - literally reverse-engineer the hooks, beat drops, scene rhythm, and voice cadence of a Short you love, and have AI generate a brand-new video on that scaffold about any topic you choose.

That's Reelize.

What it does

You drop in any TikTok, Reel, or YouTube Short whose pacing works on you, and you pick a topic you want to teach. Reelize:

  1. Analyzes the reference - separates audio stems, transcribes dialogue, identifies background music, maps the beat grid and energy envelope, and detects every scene change with keyframe extraction (the beat-grid step is sketched at the end of this section).
  2. Extracts the retention blueprint - hook timing, pacing curve, visual rhythm, voice cadence.
  3. Generates a full video on that blueprint - AI-written script aligned to the original's pacing, ElevenLabs voice synthesis, matched music and SFX, and visuals composed to the beat grid.
  4. Renders end-to-end - the final video is produced in Remotion and delivered to your phone, ready to publish.

The output isn't a script or a slideshow. It's a fully rendered short-form video engineered to hold attention the same way the reference did.
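
The beat grid and energy envelope in step 1 come from classic signal processing, not an LLM. Here's a minimal sketch of that sub-step, assuming the reference audio has already been extracted to a WAV file (the filename and variable names are illustrative, not our exact code):

```python
import librosa

# Load the reference audio at its native sample rate
y, sr = librosa.load("reference_short.wav", sr=None, mono=True)

# Beat grid: global tempo estimate plus beat positions in seconds
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

# Energy envelope: frame-level RMS, used to locate drops and lulls
rms = librosa.feature.rms(y=y)[0]
rms_times = librosa.times_like(rms, sr=sr)

print(f"~{float(tempo):.0f} BPM, {len(beat_times)} beats on the grid")
```

Everything downstream consumes beat_times and the RMS curve rather than raw audio - that's what makes the blueprint reusable across topics.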

How we built it

Two of us - Eliyahu Mizrahi and Isaac Sasson - built Reelize end-to-end in 48 hours for Hack Brooklyn. The scope was honestly insane for a two-person weekend: a mobile app, a GPU-backed analysis pipeline, a multi-model generation stack, and a deterministic video renderer. We split the work across the stack and barely slept.

  • Client - Expo / React Native app (iOS, Android, web) for upload, preview, and download.
  • Backend - FastAPI orchestrating a serialized job queue so GPU-heavy analysis and render stages don't contend for memory (a minimal sketch follows this list).
  • Audio analysis - Demucs for stem separation, Whisper for transcription, Shazam for music ID, librosa for beat grid and energy envelope, pyannote for speaker diarization.
  • Video analysis - Gemini for scene detection with a multi-pass keyframe refinement loop.
  • Generation - Gemini produces the script and shot timeline from the combined audio/video manifest; ElevenLabs synthesizes the voiceover.
  • Render - Remotion (React-based video framework) composes the final video with layered voice, music, SFX, and footage timed to the original's beat grid.
  • Infra - Supabase Postgres for job state, Supabase Storage + Cloudflare R2 for media assets.
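
To make the serialized queue concrete: a minimal sketch of the shape of the idea, assuming an in-process asyncio queue; run_pipeline is a hypothetical stand-in for the real stage runner.

```python
import asyncio
from fastapi import FastAPI

app = FastAPI()
jobs: asyncio.Queue[str] = asyncio.Queue()

async def run_pipeline(job_id: str) -> None:
    ...  # Demucs -> Whisper -> pyannote -> Gemini -> ElevenLabs -> Remotion

async def worker() -> None:
    # One job end-to-end at a time: GPU-heavy stages never run concurrently
    while True:
        job_id = await jobs.get()
        try:
            await run_pipeline(job_id)
        finally:
            jobs.task_done()

@app.on_event("startup")
async def start_worker() -> None:
    asyncio.create_task(worker())

@app.post("/jobs/{job_id}")
async def submit(job_id: str):
    await jobs.put(job_id)  # enqueue only; the request returns immediately
    return {"status": "queued"}
```

The real version persists job state to Supabase Postgres so a restart doesn't lose the queue; the in-memory queue above is just the structure.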

Challenges we ran into

  • Building the pipeline itself was the hardest part. Reelize is seven specialized services - Demucs, Whisper, Shazam, librosa, pyannote, Gemini, ElevenLabs, Remotion - and getting them to hand off cleanly was a design problem, not just a coding one. Each stage produces data in its own shape and timing space, so we had to design a single shared manifest that every stage could read from and write into. Half of the 48 hours was spent on the glue: defining that manifest, wiring retries and partial failures, and making sure a crash in the Whisper stage didn't corrupt the beat-grid data already on disk. The AI models were the easy part; the orchestration was the project. (A sketch of the manifest idea follows this list.)
  • GPU contention. Running Demucs, Whisper, pyannote, and Gemini concurrently on one machine melted memory. We moved to a serialized job queue with an explicit worker that processes one job end-to-end rather than parallelizing stages across requests.
  • Aligning a generated script to a foreign beat grid. Teaching a model to produce narration that lands its emphasis on the reference video's existing beat drops required feeding the beat grid and energy envelope directly into the prompt - not just the transcript.
  • Scene detection accuracy on fast-cut Shorts. Single-pass Gemini detection missed half-second cuts. We added a keyframe refinement pass that re-checks boundaries against pixel-level diffs (also sketched after this list).
  • Remotion render determinism. Making voice, music, and footage line up to the millisecond inside a React-based renderer meant normalizing every timing source to the same clock before composition.
  • Mobile upload flow for long videos. Expo's file handling on iOS vs Android vs web needed three separate paths; signed-URL uploads straight to R2 replaced direct multipart uploads to the API.
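
Here's a hedged sketch of the shared manifest from the first bullet, with illustrative field names (not our exact schema). The two load-bearing decisions were a single timing unit - milliseconds on one clock - and atomic writes, so a crashed stage can never leave a half-written file:

```python
import json
import os
import tempfile

# Illustrative manifest shape: every stage reads the whole thing and
# writes back only its own fields, all times in milliseconds on one clock.
EMPTY_MANIFEST = {
    "beat_grid_ms": [],      # librosa
    "energy_envelope": [],   # RMS per frame
    "transcript_words": [],  # Whisper word timings
    "speaker_turns": [],     # pyannote
    "scene_cuts_ms": [],     # Gemini + keyframe refinement
    "script_lines": [],      # generated narration, aligned to beat_grid_ms
}

def write_stage(path: str, stage_fields: dict) -> None:
    """Merge one stage's output and replace the file atomically, so a crash
    mid-write can't corrupt fields that earlier stages already committed."""
    if os.path.exists(path):
        with open(path) as f:
            manifest = json.load(f)
    else:
        manifest = dict(EMPTY_MANIFEST)
    manifest.update(stage_fields)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f)
    os.replace(tmp, path)  # atomic rename on POSIX
```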
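
And the keyframe refinement pass from the scene-detection bullet is plain pixel arithmetic. A sketch under assumptions - OpenCV frame decoding, a ±0.25 s search window, and a function name that's ours for illustration - snapping each coarse Gemini boundary to the frame with the sharpest local pixel change:

```python
import cv2
import numpy as np

def refine_boundary(video_path: str, t: float, window: float = 0.25) -> float:
    """Snap a coarse scene-cut time t (seconds) to the largest pixel change
    within +/- window seconds, using downscaled grayscale frame diffs."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, max(0.0, t - window) * 1000.0)
    prev, diffs, stamps = None, [], []
    while True:
        pos = cap.get(cv2.CAP_PROP_POS_MSEC) / 1000.0
        ok, frame = cap.read()
        if not ok or pos > t + window:
            break
        # Downscale + grayscale so the diff is cheap and noise-tolerant
        gray = cv2.cvtColor(cv2.resize(frame, (160, 90)), cv2.COLOR_BGR2GRAY)
        if prev is not None:
            diffs.append(float(np.mean(cv2.absdiff(gray, prev))))
            stamps.append(pos)
        prev = gray
    cap.release()
    return stamps[int(np.argmax(diffs))] if diffs else t
```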

Accomplishments we're proud of

  • Shipped a working end-to-end pipeline - upload in, finished rendered video out - inside the 48-hour window.
  • Got the script generator to actually respect the reference video's beat grid, which was the feature we almost cut twice.
  • Mobile, web, and backend all talking to each other with signed-URL flows by the deadline.

What we learned

  • Short-form retention is structural, not stylistic. Once you have the beat grid and scene rhythm, the "vibe" follows.
  • Orchestrating a pipeline of specialized models is harder than any individual model - 80% of the work was in the manifest, the glue, and the failure recovery. The models are commodities; the pipeline is the product.
  • Remotion was the right call. Being able to describe the final composition in React (instead of shelling out to ffmpeg scripts) made the timing logic debuggable at 3am.
  • Scoping matters more than effort. The only reason we finished was that we cut features on Saturday morning that we both really wanted to build.

What's next for Reelize

  • Template library - save a reference video's blueprint and reuse it across many topics.
  • Creator profiles - fine-tune voice and visual style per user.
  • Classroom mode - teachers upload a lesson topic, students get a 60-second video they'll actually watch.

Built With

Cloudflare R2, Demucs, ElevenLabs, Expo, FastAPI, Gemini, librosa, pyannote, React Native, Remotion, Shazam, Supabase, Whisper