Type a topic. Get a fully structured educational video plan (script, scenes, voiceover, and animation JSON) generated by Gemini AI with RAG grounding, in seconds.

═══════════════════════════════════════════════
WHAT INSPIRED ME?
═══════════════════════════════════════════════

Two things came together to inspire this project.

First, a practical problem: teachers and educators with great ideas can't easily turn them into animated explainer videos. The production pipeline (scripting, scene planning, voiceover, animation) is too complex and expensive. I wanted to collapse all of that into a single prompt.

Second, a research paper: Google DeepMind's DART (Denoising Autoregressive Transformer) showed me where multimodal AI generation is heading. DART introduced a hybrid diffusion + autoregressive approach that unifies image understanding and generation in a single model, a framework that naturally extends toward video synthesis. Reading that paper made it clear that the architectural pieces for AI-generated video already exist; what's missing is a structured pipeline that chains them together for real educational use cases.

Prompt-to-Video is my attempt to build that pipeline today, using Gemini as the reasoning core, while the rendering layer catches up.

═══════════════════════════════════════════════
WHAT IT DOES
═══════════════════════════════════════════════

Prompt-to-Video takes a natural language topic and runs it through a full AI pipeline to produce a structured, renderable educational video plan:

RAG (Retrieval-Augmented Generation): ChromaDB vector search retrieves relevant context using sentence-transformers before any generation happens. This grounds the entire output in real knowledge and dramatically reduces hallucination compared to pure model generation.

Script Generation: Gemini AI writes a clear, accurate, educational narration script using the retrieved context as its factual foundation.
Scene Decomposition: the script is broken into timed scenes with descriptions, each matched to a specific visual concept.

Animation Planning: each scene is converted into structured JSON with object types (diagrams, text overlays) and animations (draw, fade-in, slide), ready to pass directly to a rendering engine.

Voice Narration Script: a separate voiceover script is generated per scene, synchronized to scene timing.

The final output is a complete, structured video plan for a 45-second animated educational video, deployable via FastAPI on Google Cloud Run.

═══════════════════════════════════════════════
HOW I BUILT IT
═══════════════════════════════════════════════

The backend is a FastAPI application deployed on Google Cloud Run, built around a modular async pipeline:

Gemini AI (gemini-2.5-flash) via the Google GenAI SDK handles all generation: script writing, scene decomposition, and animation JSON planning.

ChromaDB powers the RAG layer with sentence-transformers (all-MiniLM-L6-v2) for vector search. Context is retrieved before generation, not after, so Gemini reasons from grounded facts rather than pure parametric memory.

The pipeline chains sequentially: prompt → vector retrieval → script → scene breakdown → animation JSON → voiceover script.

The frontend is React + Vite + TypeScript + Tailwind CSS + Framer Motion, served via Nginx.

Express.js handles middleware and routing between frontend and backend.

The whole stack is deployed end-to-end on Google Cloud Run with Docker, exposing a clean REST API (POST /generate-video).

The RAG architecture was a deliberate design choice inspired by the DART paper's insight that structured, grounded generation outperforms unconstrained generation for factual content. Just as DART uses a structured diffusion process to constrain image generation, the vector retrieval step constrains Gemini's text generation to factually grounded territory.
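
The sequential chain described above (prompt → vector retrieval → script → scene breakdown → animation JSON → voiceover script) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `VideoPlan`, `run_pipeline`, the prompt strings, and the two stand-in stages are all hypothetical names. In the real system the retriever would query ChromaDB and the generator would call Gemini. Passing the scenes (rather than the animation JSON) to the voiceover stage is also an assumption.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List

# Hypothetical stage signatures. In the real project, retrieval would query
# ChromaDB and generation would call Gemini; here both are injected callables.
Retriever = Callable[[str], Awaitable[List[str]]]
Generator = Callable[[str], Awaitable[str]]


@dataclass
class VideoPlan:
    """Illustrative container for each stage's output (field names assumed)."""
    topic: str
    context: List[str] = field(default_factory=list)
    script: str = ""
    scenes: str = ""
    animation_json: str = ""
    voiceover: str = ""


async def run_pipeline(topic: str, retrieve: Retriever, generate: Generator) -> VideoPlan:
    plan = VideoPlan(topic=topic)

    # 1. Retrieval runs before any generation, so the script prompt
    #    already carries the retrieved context.
    plan.context = await retrieve(topic)
    grounding = "\n".join(plan.context)

    # 2-5. Each stage awaits the output of the stage before it.
    plan.script = await generate(f"SCRIPT\nContext:\n{grounding}\n\nTopic: {topic}")
    plan.scenes = await generate(f"SCENES\n{plan.script}")
    plan.animation_json = await generate(f"ANIMATION\n{plan.scenes}")
    plan.voiceover = await generate(f"VOICEOVER\n{plan.scenes}")
    return plan


# Stand-in stages so the sketch runs without ChromaDB or Gemini.
async def fake_retrieve(query: str) -> List[str]:
    return [f"retrieved fact about {query}"]


async def fake_generate(prompt: str) -> str:
    stage = prompt.splitlines()[0]
    return f"[{stage} output, prompt was {len(prompt)} chars]"


if __name__ == "__main__":
    result = asyncio.run(run_pipeline("photosynthesis", fake_retrieve, fake_generate))
    print(result.script)
    print(result.voiceover)
```

Injecting the retriever and generator as callables keeps the stage ordering testable without network access, and either stage can be swapped for a real client later.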
═══════════════════════════════════════════════
CHALLENGES I RAN INTO
═══════════════════════════════════════════════

Orchestrating a multi-step async pipeline where each stage feeds the next (script generation feeds scene decomposition, which feeds animation planning) while keeping latency low enough to feel responsive was the hardest engineering challenge. Streaming Gemini responses helped significantly.

Getting ChromaDB vector search to return genuinely relevant context (not just semantically similar noise) required careful prompt engineering around the retrieval query. Poor retrieval directly causes hallucination downstream, so this layer needed the most iteration.

Deploying a multi-service architecture (React frontend + FastAPI backend + ChromaDB vector store) on Cloud Run with correct environment variable handling and Docker networking took considerable debugging.

Keeping the animation JSON output structurally consistent across wildly different input topics required very precise system prompting: Gemini tends to be creative with schema unless you constrain it tightly.

═══════════════════════════════════════════════
ACCOMPLISHMENTS I'M PROUD OF
═══════════════════════════════════════════════

Building a working end-to-end pipeline, from a raw text prompt to a fully structured, renderable animation plan, as a solo developer.

The RAG layer genuinely reduces hallucination: outputs about scientific topics are grounded in retrieved facts, not invented ones.

The structured JSON animation output is designed to feed a rendering layer built on Remotion or FFmpeg without any post-processing: the schema is clean and consistent.

Deploying a multi-service AI application on Google Cloud Run with a modular architecture that's genuinely extensible toward real video rendering. The system already handles the hardest part: reasoning about what to show, when, and how.
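
The schema-consistency problem described above is the kind of thing a small validation step after generation can catch. The sketch below assumes a schema, since the write-up does not publish one: the `scenes`, `duration_s`, `objects`, `type`, and `animation` field names and the exact string spellings are hypothetical. Only the object types (diagrams, text overlays) and animations (draw, fade-in, slide) come from the description.

```python
import json
from typing import Any, List

# Assumed vocabulary: the write-up names these object types and animations,
# but the exact spellings and all field names below are hypothetical.
ALLOWED_OBJECT_TYPES = {"diagram", "text_overlay"}
ALLOWED_ANIMATIONS = {"draw", "fade-in", "slide"}


def validate_animation_plan(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the plan passed."""
    try:
        plan: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    scenes = plan.get("scenes") if isinstance(plan, dict) else None
    if not isinstance(scenes, list) or not scenes:
        return ["top-level 'scenes' must be a non-empty list"]

    errors: List[str] = []
    for i, scene in enumerate(scenes):
        if not isinstance(scene, dict):
            errors.append(f"scene {i}: must be an object")
            continue

        # bool is a subclass of int in Python, so reject it explicitly.
        duration = scene.get("duration_s")
        if (isinstance(duration, bool)
                or not isinstance(duration, (int, float))
                or duration <= 0):
            errors.append(f"scene {i}: 'duration_s' must be a positive number")

        objects = scene.get("objects")
        if not isinstance(objects, list):
            errors.append(f"scene {i}: 'objects' must be a list")
            continue

        for j, obj in enumerate(objects):
            if not isinstance(obj, dict):
                errors.append(f"scene {i} object {j}: must be an object")
                continue
            obj_type = obj.get("type")
            if not isinstance(obj_type, str) or obj_type not in ALLOWED_OBJECT_TYPES:
                errors.append(f"scene {i} object {j}: unknown type {obj_type!r}")
            animation = obj.get("animation")
            if not isinstance(animation, str) or animation not in ALLOWED_ANIMATIONS:
                errors.append(f"scene {i} object {j}: unknown animation {animation!r}")
    return errors
```

One possible use is to feed the returned error list back into a retry prompt when validation fails, so the model corrects the specific fields and the plan does not have to be regenerated from scratch.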
═══════════════════════════════════════════════
WHAT I LEARNED
═══════════════════════════════════════════════

RAG is not optional for factual educational content: it's essential. Without retrieval grounding, Gemini confidently generates plausible-sounding but subtly wrong explanations, especially for scientific and mathematical topics. Adding ChromaDB vector search before generation was the single biggest quality improvement in the pipeline.

Gemini's multi-step chained reasoning is robust: each pipeline stage maintains coherent context from the previous step without re-explaining the original prompt. But structured JSON output requires very precise prompting to stay schema-consistent across diverse inputs.

Google DeepMind's DART paper was a conceptual anchor throughout: the idea that structured, constrained generation (whether via diffusion masks or RAG retrieval) produces better outputs than unconstrained generation applies equally to text pipelines. Architecture decisions in research translate directly to production systems.

═══════════════════════════════════════════════
WHAT'S NEXT
═══════════════════════════════════════════════

Connecting the animation JSON output to a real rendering engine (Remotion for React-based animations or FFmpeg for frame assembly) to produce actual playable MP4 video files.

Adding multimodal Gemini capabilities to generate scene-specific diagrams and images inline, moving from text-planned visuals to AI-generated ones, the direction DART's unified generation architecture points toward.

Building a teacher dashboard where educators can review, edit, and export videos for classroom use with one click.

Long-term: an SDK that ed-tech platforms can embed to auto-generate explainer videos from any curriculum content, turning every textbook into an animated video library.

═══════════════════════════════════════════════
