Inspiration
Creating high-quality educational content, such as the animations from channels like 3Blue1Brown, is incredibly difficult and time-consuming. It demands a rare combination of deep subject matter expertise, pedagogical skill, and advanced programming knowledge in animation libraries like Manim.
Our inspiration was to democratize this process. We wanted to build an autonomous AI agent that could empower educators, students, and creators to produce captivating, clear, and accurate educational videos from a simple text prompt. We envisioned a system that could handle the entire creative and technical pipeline, turning a single idea into a finished, narrated animation, making high-quality educational content accessible to everyone.
What it does
ManimGen is a fully autonomous, self-improving AI agent that generates educational videos from a single prompt. It takes a topic and an optional description, and automatically handles the entire video production workflow:
- AI-Powered Video Planning: It designs a pedagogical outline, breaking the topic into logical scenes, each with a clear learning objective.
- Autonomous Code Generation: For each scene, it generates the Python code required to create a Manim animation, complete with narration and timing cues.
- Self-Improving Error Correction: This is where ManimGen truly shines. When its generated code fails, it doesn't just give up. It enters a multi-layered debugging loop:
  - Agent Memory: It searches its memory (powered by Mem0.ai) for similar past errors and their successful fixes.
  - Web-Informed Debugging: It uses Tavily AI to perform targeted web searches on official documentation and community forums to find solutions for novel errors.
  - Learning from Mistakes: Successful fixes are stored in its long-term memory, ensuring it becomes progressively more robust and efficient over time.
- Distributed Cloud Rendering: It offloads the resource-intensive video rendering process to a distributed fleet of free GitHub Actions runners, using a custom-built Docker image to ensure a consistent and optimized environment.
- Final Video Assembly: Once all scenes are rendered, it combines them, synchronizes a high-quality, AI-generated voiceover using ElevenLabs, and produces a final, ready-to-watch MP4 video.
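The self-correcting loop described above can be sketched in a few lines. This is a minimal illustration, not the production code: `recall_fix`, `web_search_fix`, and `store_fix` are hypothetical stand-ins for the Mem0 and Tavily integrations, and the in-memory dict stands in for persistent agent memory.

```python
# Minimal sketch of the self-correcting render loop.
# recall_fix / web_search_fix / store_fix are hypothetical stand-ins
# for the Mem0 and Tavily integrations; MEMORY stands in for the
# agent's persistent long-term memory.

MEMORY = {}  # error signature -> known-good fix


def recall_fix(error: str):
    """Layer 1: check agent memory (Mem0 in the real system)."""
    return MEMORY.get(error)


def web_search_fix(error: str):
    """Layer 2: search the web for a fix (Tavily in the real system)."""
    return f"patch derived from web results for: {error}"


def store_fix(error: str, fix: str):
    """Persist the successful fix so future runs find it in memory."""
    MEMORY[error] = fix


def render_with_retries(code: str, run, max_attempts: int = 3):
    """Try to render; on failure, consult memory, then the web, then retry."""
    for _ in range(max_attempts):
        ok, error = run(code)
        if ok:
            return code
        fix = recall_fix(error) or web_search_fix(error)
        code = code + "\n# applied fix: " + fix
        store_fix(error, fix)
    raise RuntimeError("could not repair scene code")
```

Each successful repair ends up in `MEMORY`, so the same error class is fixed instantly on the next video.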
How we built it
ManimGen is a complex system built with a modern, cloud-native architecture that combines multiple cutting-edge technologies:
- Core Agent Logic: Python serves as the backbone, orchestrating the entire workflow.
- AI Models & Orchestration: We use Large Language Models, with a primary focus on the powerful and cost-effective `gemini-2.5-flash-lite-preview-06-17`.
- Animation Engine: The core visuals are powered by the Manim Community Edition, the same library used by 3Blue1Brown.
- Backend & State Management: Appwrite is the central nervous system. We use its Databases to manage the state of every video and scene, Storage for the final video assets, and Appwrite Functions for asynchronous task handling. Appwrite Realtime pushes live progress updates to the frontend.
- Self-Healing & Web Search:
- Tavily AI: Integrated for intelligent, error-driven web searches to find real-world solutions to coding problems.
- Agent Memory: Implemented using Mem0 to create a persistent, learning agent that remembers past failures and successes.
- Text-to-Speech: ElevenLabs is integrated to provide high-quality, dynamic voiceovers for the generated narration scripts, bringing the educational content to life.
- Distributed Rendering Farm: We built a novel, cost-effective rendering solution using GitHub Actions and Docker. A pre-built Docker image containing all dependencies is pushed to GitHub Container Registry (GHCR), which allows GitHub's free runners to pick up and render scenes in parallel.
- Frontend Example: A Next.js and TypeScript application demonstrates the full user experience, including real-time progress bars and a history of generated videos.
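To give a feel for how the error-driven web search is wired in, here is one plausible way a raised exception could be turned into a focused search query before handing it to Tavily. The query format is our illustration, not the production implementation:

```python
import traceback


def error_to_query(exc: BaseException) -> str:
    """Turn a raised exception into a focused web-search query.

    A sketch of how an error-driven search (Tavily in our stack) might
    be fed; the exact query shape is illustrative, not the real code.
    """
    frames = traceback.extract_tb(exc.__traceback__) if exc.__traceback__ else []
    parts = ["Manim Community", type(exc).__name__, str(exc)]
    if frames:
        # Include the innermost function name to narrow the search.
        parts.append(f"in {frames[-1].name}")
    return " ".join(parts)
```

Scoping the query to "Manim Community" plus the exception class keeps the search results on-topic for the animation library rather than generic Python errors.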
Challenges we ran into
LLM Hallucinations & Unreliable Code: Initially, the AI would frequently generate plausible but incorrect Manim code, leading to constant rendering failures.
- Solution: We built a multi-layered, self-correcting loop. The agent now validates its code, and on failure, it first checks its own memory, then uses Tavily to search the web for solutions. This has made the system incredibly resilient.
Cost-Effective Video Rendering: Video rendering is extremely CPU-intensive. Running it on a single server is slow and expensive.
- Solution: We engineered a distributed rendering system using GitHub Actions as a free compute farm. By containerizing our rendering environment with Docker and dispatching jobs via the GitHub API, we can render multiple scenes in parallel without any server costs.
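Dispatching a render job comes down to one call against GitHub's standard `workflow_dispatch` endpoint. The sketch below builds (but does not send) such a request; the `scene_id` input name and the repo/workflow names are our conventions, not part of GitHub's API:

```python
import json
import urllib.request


def build_dispatch_request(owner: str, repo: str, workflow: str,
                           token: str, scene_id: str) -> urllib.request.Request:
    """Build (but don't send) a workflow_dispatch request asking a
    GitHub Actions runner to render one scene.

    The endpoint is GitHub's standard workflow-dispatch API; the
    `scene_id` input is our own workflow convention.
    """
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow}/dispatches")
    body = json.dumps({"ref": "main", "inputs": {"scene_id": scene_id}}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Firing one of these requests per scene is what lets independent runners render all scenes in parallel at zero server cost.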
Dependency Hell in CI/CD: Manim has a complex set of system-level dependencies (LaTeX, FFmpeg, Cairo) that are difficult to manage in standard CI/CD environments.
- Solution: We created a highly optimized, multi-stage `Dockerfile` that pre-installs and bakes in all dependencies. This reduced our GitHub Actions setup time from over 10 minutes to under 60 seconds and eliminated environment-related failures.
Managing Long-Running Asynchronous Tasks: A video can take 20+ minutes to generate. We couldn't leave a user waiting on a spinning loader.
- Solution: We architected the system around Appwrite. The frontend makes a single API call which immediately returns a `video_id`. The backend then updates the status of the video in the Appwrite database, and the frontend subscribes to real-time updates for that `video_id`, providing a seamless, non-blocking user experience.
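Under the hood, this non-blocking flow is a small status state machine. Here is a stdlib-only sketch: the status names and transitions are illustrative, and the dict stands in for the Appwrite document that the frontend's Realtime subscription watches:

```python
import uuid

# Allowed status transitions for a video document. In production each
# transition is written to Appwrite (databases.update_document) so the
# frontend's Realtime subscription can react; the names are illustrative.
TRANSITIONS = {
    "queued": {"planning"},
    "planning": {"rendering"},
    "rendering": {"assembling", "failed"},
    "assembling": {"complete", "failed"},
}


def start_generation(store: dict) -> str:
    """Return a video_id immediately; the real work runs asynchronously."""
    video_id = str(uuid.uuid4())
    store[video_id] = "queued"
    return video_id


def advance(store: dict, video_id: str, new_status: str) -> None:
    """Move a video to its next status, rejecting illegal jumps."""
    current = store[video_id]
    if new_status not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new_status}")
    store[video_id] = new_status
```

Because the first API call only creates the `queued` record, the user never waits on a request; every later stage just mutates the status the UI is already subscribed to.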
Accomplishments that we're proud of
- A Truly Self-Improving Agent: The agent's ability to learn from its mistakes is our biggest accomplishment. It's not just a script; it's a learning system that gets more reliable with every video it creates. Watching it fail, search the web, and fix its own code is magical.
- The GitHub Actions Rendering Farm: We're incredibly proud of this novel approach to distributed computing. Using free, parallel CI/CD runners as a video rendering farm is a powerful and cost-effective solution to a very expensive problem.
- The Multi-Layered Debugging System: The combination of internal memory and web search (Tavily) makes our agent one of the most robust autonomous coders out there. It can solve problems it has never seen before.
- End-to-End Automation: We successfully automated the entire creative pipeline. A user can go from a one-sentence idea to a fully narrated, animated video without writing a single line of code or opening an editing tool.
What we learned
- LLMs are Powerful, but Need Guardrails: An LLM alone is not a reliable programmer. We learned that true autonomy comes from building a robust system around the LLM with feedback loops, verification steps, and access to external tools and memory.
- Infrastructure is as Important as the Agent: The agentic logic is only half the battle. Building a scalable, asynchronous, and stateful infrastructure with tools like Appwrite and Docker was critical to making the agent's long-running tasks feasible and user-friendly.
- Don't Be Afraid of Unconventional Solutions: Using GitHub Actions as a rendering farm is unorthodox, but it solved a major technical and financial hurdle, proving that creative engineering can be just as important as the AI itself.
- The Future is Stateful Agents: A stateless, one-shot prompt-and-response model is limited. By giving our agent memory, we unlocked a new level of capability where the system learns and grows, just like a human developer would.
What's next for ManimGen
We are just scratching the surface of what's possible with autonomous educational content creation. Our roadmap is focused on making ManimGen even more intelligent, interactive, and powerful:
- Collaborative Agent Memory: Allow multiple instances of the agent (or even different users' agents) to share their learnings, creating a collective intelligence that benefits everyone.
- Interactive Planning & Editing: Enhance the frontend to allow users to review and edit the AI's video plan, storyboard, and even the generated code before rendering, creating a human-in-the-loop collaborative experience.
- Expanding the Tool-Belt: Integrate more tools for the agent to use, such as a WolframAlpha integration for complex calculations, real-time data APIs for dynamic content, and advanced image generation models for non-procedural assets.
- Deeper Visual Understanding: Move beyond just fixing code errors. We plan to use advanced Vision-Language Models (VLMs) to have the agent critique its own work, analyzing the final video for aesthetic quality, pedagogical clarity, and visual pacing, and then iterating to improve it.
- Hyper-Personalized Learning: Allow the agent to generate different versions of a video tailored to different learning styles or knowledge levels, based on user profiles.
Built With
- appwrite
- css
- cursor
- docker
- elevenlabs
- fastapi
- gemini
- github
- javascript
- manimframework
- mem0.ai
- next.js
- openai
- python
- react
- tailwind
- tavily
- typescript
- yaml