Inspiration

I've always believed that the best way to learn math and science is through animated visual explanations — watching abstract concepts come alive through motion and color. But what if you didn't have to search for the right video? What if you could just ask, and a custom animated explainer would be generated for you in real time?

That's what inspired Lucent: a personal AI tutor that not only talks to you naturally via voice but also generates animated educational videos on the fly, using ManimGL — a powerful mathematical animation engine.

The Gemini Live API made this possible — real-time, interruptible voice conversation combined with the creative power of generative AI.

What it does

  1. Talk naturally — Lucent uses the Gemini Live API for real-time voice conversation. Ask it to teach you anything about math, physics, or computer science.
  2. Generates animated videos — A multi-agent pipeline plans scenes, writes ManimGL animation code, renders on Cloud Run, generates narration via Gemini TTS, and stitches everything into a polished video — all automatically.
  3. Interactive learning — Pause the video at any point to ask a follow-up question. Lucent sees the current frame (via screenshots sent to Gemini) and answers in context, then resumes (a rough sketch of this follows the list).
  4. Multiple formats — YouTube landscape (16:9), YouTube Shorts/TikTok/Reels (9:16), Instagram square (1:1), or quick 15-second doubt-clearers.
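
To make the follow-up flow concrete, here is a minimal sketch of sending a paused frame to Gemini, assuming the google-genai SDK. The model name and the answer_about_frame helper are illustrative, and the actual wiring in Lucent (inside the live voice session versus a separate request) may differ.

```python
# Minimal sketch (not Lucent's exact code): answer a follow-up question about
# the frame the learner paused on, using the google-genai SDK.
from google import genai
from google.genai import types

client = genai.Client()  # assumes API key / Vertex AI credentials are configured

def answer_about_frame(frame_png: bytes, question: str) -> str:
    # The paused frame goes in as an inline image part alongside the question,
    # so the model can answer in the context of what is currently on screen.
    response = client.models.generate_content(
        model="gemini-3-flash-preview",  # model named later in this write-up
        contents=[
            types.Part.from_bytes(data=frame_png, mime_type="image/png"),
            question,
        ],
    )
    return response.text
```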

How I built it

The system is a multi-agent pipeline running entirely on Google Cloud:

  • Gemini Live API (gemini-live-2.5-flash-native-audio) powers the real-time voice interface, proxied through a WebSocket endpoint embedded in the FastAPI backend (see the voice-proxy sketch after this list).
  • Scene Planner uses gemini-3-flash-preview via Vertex AI to break a topic into animated scenes with narration scripts, timing, and animation direction (see the scene-planning sketch below).
  • Code Generator uses gemini-3-flash-preview to write ManimGL Python code for each scene, guided by a comprehensive API reference built from the ManimGL source and 200+ curated golden examples.
  • ManimGL Renderer runs on a separate Cloud Run service with headless OpenGL (EGL), rendering animations at configurable resolutions.
  • TTS uses gemini-2.5-flash-tts to generate natural narration audio.
  • FFmpeg handles audio/video synchronization (freezing last frame if narration is longer, padding silence if animation is longer) and final stitching.
  • The entire per-scene pipeline (codegen → render → error recovery) runs in parallel using ThreadPoolExecutor for speed (see the thread-pool sketch below).
  • Final artifacts are persisted to Google Cloud Storage so they survive Cloud Run container recycling.
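
To illustrate the first bullet, here is a rough sketch of the voice proxy, assuming the google-genai SDK's async live client; the endpoint path, model config, and audio format are illustrative rather than Lucent's exact code.

```python
# Hypothetical sketch of the voice bridge: the FastAPI backend relays audio
# between the browser WebSocket and a Gemini Live session.
import asyncio
from fastapi import FastAPI, WebSocket
from google import genai
from google.genai import types

app = FastAPI()
client = genai.Client()  # assumes API key / Vertex AI credentials are configured

LIVE_MODEL = "gemini-live-2.5-flash-native-audio"  # model named in this write-up
LIVE_CONFIG = {"response_modalities": ["AUDIO"]}

@app.websocket("/ws/voice")
async def voice_proxy(ws: WebSocket):
    await ws.accept()
    async with client.aio.live.connect(model=LIVE_MODEL, config=LIVE_CONFIG) as session:

        async def upstream():
            # Microphone chunks from the browser (16 kHz PCM) -> Gemini Live
            while True:
                chunk = await ws.receive_bytes()
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        async def downstream():
            # Gemini Live audio -> browser for playback
            async for message in session.receive():
                if message.data:
                    await ws.send_bytes(message.data)

        await asyncio.gather(upstream(), downstream())
```

A real bridge also has to handle session setup, interruptions, and reconnects, which are omitted here for brevity.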
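The Scene Planner step can be approximated with Gemini's structured output support, where a Pydantic schema constrains the response to a list of scenes; the schema fields and prompt below are stand-ins, not Lucent's actual ones.

```python
# Hypothetical sketch of the Scene Planner call; field names and the prompt are
# illustrative, not Lucent's actual schema.
from pydantic import BaseModel
from google import genai
from google.genai import types

class Scene(BaseModel):
    title: str
    narration: str            # narration script for the TTS step
    duration_seconds: float   # target timing for the animation
    animation_notes: str      # direction passed on to the code generator

class ScenePlan(BaseModel):
    scenes: list[Scene]

client = genai.Client(vertexai=True)  # assumes project/location set via environment

def plan_scenes(topic: str) -> ScenePlan:
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=f"Break the topic '{topic}' into short animated scenes "
                 "with narration, timing, and animation direction.",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=ScenePlan,  # structured output keeps downstream parsing trivial
        ),
    )
    return response.parsed
```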
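Finally, the per-scene parallelism amounts to mapping each scene over a thread pool; generate_manim_code, render_on_cloud_run, and fix_manim_code are hypothetical helper names standing in for the real codegen, renderer, and error-recovery calls.

```python
# Illustrative sketch of the parallel per-scene pipeline; the helpers below are
# hypothetical stand-ins for the real Gemini codegen, Cloud Run renderer, and
# error-recovery calls.
from concurrent.futures import ThreadPoolExecutor

MAX_ATTEMPTS = 3  # illustrative retry budget per scene

def build_scene(scene):
    code = generate_manim_code(scene)                # hypothetical: Gemini writes ManimGL code
    for _ in range(MAX_ATTEMPTS):
        try:
            return render_on_cloud_run(code, scene)  # hypothetical: call the renderer service
        except Exception as err:
            # Feed the renderer's traceback back to the model for a corrected script.
            code = fix_manim_code(code, str(err))    # hypothetical error-recovery call
    raise RuntimeError(f"scene {scene!r} failed after {MAX_ATTEMPTS} attempts")

def build_all_scenes(scenes):
    # Scenes are independent, so codegen -> render -> recovery runs concurrently.
    with ThreadPoolExecutor(max_workers=min(8, len(scenes))) as pool:
        return list(pool.map(build_scene, scenes))
```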

The frontend is a React 19 + Vite + Tailwind CSS SPA with a voice visualizer, cumulative progress bar, and integrated video player.

Challenges I faced

  • ManimGL API surface: ManimGL has many undocumented behaviors. I had to build a comprehensive API reference from the source code and create extensive anti-pattern rules in the prompts to ensure the generated code actually renders correctly.
  • Headless rendering on Cloud Run: Getting ManimGL to render with EGL/OpenGL in a containerized environment without a display required careful Docker configuration with Xvfb and EGL libraries.
  • Audio/video synchronization: Narration and animation durations don't always match. I implemented bidirectional sync — FFmpeg freezes the last video frame when audio is longer, and pads silence when animation is longer (sketched after this list).
  • Cloud Run container recycling: Cloud Run can restart containers at any time, losing all local files. I added GCS persistence for final artifacts and frontend retry logic to handle transient 404s.
  • Aspect ratio for portrait/square videos: ManimGL's config parser uses literal_eval on YAML values, so resolution had to be passed as a quoted Python tuple string — a subtle bug that took time to diagnose.
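
For the bidirectional sync above, the fix reduces to two FFmpeg filters: tpad clones the last video frame when the narration runs long, and apad appends silence when the animation runs long. The snippet below is a simplified sketch and the exact flags in Lucent may differ.

```python
# Simplified sketch of the audio/video reconciliation step; real flags may differ.
import subprocess

def mux_scene(video_path: str, audio_path: str,
              video_dur: float, audio_dur: float, out_path: str) -> None:
    if audio_dur > video_dur:
        # Narration is longer: clone (freeze) the last frame until the audio ends.
        filt = ["-vf", f"tpad=stop_mode=clone:stop_duration={audio_dur - video_dur:.3f}"]
        codecs = ["-c:a", "aac"]                  # video is re-encoded because of the filter
    else:
        # Animation is longer: pad the narration with trailing silence.
        filt = ["-af", f"apad=pad_dur={video_dur - audio_dur:.3f}"]
        codecs = ["-c:v", "copy", "-c:a", "aac"]  # video stream can be copied untouched
    cmd = ["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
           "-map", "0:v:0", "-map", "1:a:0", *filt, *codecs, out_path]
    subprocess.run(cmd, check=True)
```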

What I learned

  • The Gemini Live API is incredibly powerful for building natural conversational interfaces — the ability to be interrupted mid-sentence makes the interaction feel genuinely human.
  • Prompt engineering for code generation requires far more than just "write code" — the quality of the system prompt (API reference, anti-patterns, golden examples, style hints) determines whether the output actually renders.
  • Parallelizing per-scene pipelines (codegen + render + retry) dramatically reduces end-to-end latency, especially when Cloud Run auto-scales renderer instances.
