Inspiration

Recent advances in generative AI have made it possible to create videos from text, but when we looked at existing solutions for educational content, we found a fundamental limitation.

Most AI-generated “videos” today fall into one of two categories.
Some are essentially animated slide decks—static layouts with minimal motion and weak visual reasoning. Others rely on image- or video-generation models that produce visually appealing results, but suffer from blurry text, unstable geometry, and a lack of structural consistency. These outputs may look impressive, but they are not reliable enough for serious teaching or technical explanation.

Educational videos require a much higher standard. Mathematical formulas must stay sharp. Diagrams must remain consistent across frames. Motion should reflect underlying logic, not just visual style. For this reason, we realized that directly generating pixels is not sufficient for building trustworthy instructional videos.

This insight led us to a different approach: instead of generating videos as images, we generate them as code. By using programmatic animation, we can enforce clarity, precision, and reproducibility—qualities that are essential for learning but missing from most AI-generated video pipelines.

GenTutor was born from this idea: to combine modern AI reasoning with code-driven visualization, creating educational videos that are not only generated automatically, but also reliable enough to be used as real teaching material.

What it does

GenTutor transforms technical documents such as PDFs or Markdown files into high-quality, explainer-style educational videos.

Given a document, the system automatically analyzes its structure, generates a teaching-oriented script, and produces code-driven animations that visualize key concepts using precise, continuous motion. The resulting video can be played in a web interface where learners can pause at any moment, point to a specific visual element, and ask questions through text or voice.

Instead of passively watching a video, users interact with it—asking questions exactly where confusion occurs and receiving explanations grounded in the current visual and narrative context.

How we built it

To turn code-driven video generation into a system that is actually usable for learning, we focused on three core engineering challenges: visual reliability, audio–animation alignment, and accurate in-video interaction.

Code-Driven Layouts for Reliable Visual Structure

When animations are generated programmatically, unconstrained object placement can easily lead to visual clutter and overlapping elements. This is especially problematic for educational content, where diagrams, formulas, and annotations must remain readable at all times.

To address this, we built a Manim layout template library that encapsulates common educational visual patterns, such as side-by-side comparisons and multi-panel scenes, into reusable layout functions.

Instead of letting the model position visual elements freely, Gemini selects and calls these predefined templates. This ensures spatial consistency across scenes, prevents overlap, and allows complex animations to be assembled quickly while maintaining a clean visual structure.
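As a concrete illustration, a side-by-side template might look like the sketch below. It is simplified and the names and parameters are illustrative, not our actual library code; the key idea is that the template, not the model, owns all positioning.

```python
from manim import DOWN, RIGHT, UP, Mobject, Text, VGroup

def side_by_side(left: Mobject, right: Mobject, title: str,
                 gap: float = 1.0, max_width: float = 5.5) -> VGroup:
    """Arrange two panels in fixed columns under a shared title."""
    # Cap each panel's width so it can never spill into the other column.
    for panel in (left, right):
        if panel.width > max_width:
            panel.scale_to_fit_width(max_width)
    heading = Text(title).scale(0.6).to_edge(UP)
    row = VGroup(left, right).arrange(RIGHT, buff=gap)
    row.next_to(heading, DOWN, buff=0.8)
    return VGroup(heading, row)

# Usage inside a Scene:
#   layout = side_by_side(MathTex("E = mc^2"), Circle(), "Formula vs. geometry")
#   self.play(FadeIn(layout))
```

Because the template scales and places every element itself, generated scenes stay readable no matter what content the model puts inside each panel.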

Timestamp-Free Audio–Animation Alignment

Another major challenge was synchronizing animation with narration. Generated narration audio carries no timing information by default, so aligning the two by hand would be tedious and brittle.

We solved this by transcribing the narration using Google Cloud Speech-to-Text, extracting sentence-level timestamps and assigning each line a stable line_id. These timestamps are stored in an alignment module that can be reused across renders.
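The core of this step looks roughly like the condensed sketch below; the LINEAR16/16 kHz encoding settings and the line_id format are our illustrative assumptions, not a prescription.

```python
from google.cloud import speech

def extract_line_timestamps(audio_path: str, language: str = "en-US") -> list[dict]:
    """Return [{line_id, text, start, end}, ...] for each narration segment."""
    client = speech.SpeechClient()  # assumes GOOGLE_APPLICATION_CREDENTIALS is set
    with open(audio_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code=language,
        enable_word_time_offsets=True,  # per-word timing, aggregated per segment below
    )
    # For narration longer than ~1 minute, long_running_recognize is used instead.
    response = client.recognize(config=config, audio=audio)

    lines = []
    for i, result in enumerate(response.results):
        alt = result.alternatives[0]
        lines.append({
            "line_id": f"line_{i:03d}",
            "text": alt.transcript,
            "start": alt.words[0].start_time.total_seconds(),
            "end": alt.words[-1].end_time.total_seconds(),
        })
    return lines
```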

As a result, the animation code never hardcodes timing values. Instead, animations reference narration by line_id, and timing is resolved automatically at runtime. This guarantees that visual emphasis and transitions remain aligned with spoken explanations, even when narration changes.
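On the animation side, a small lookup layer is enough to keep timing out of the generated code. A minimal sketch, assuming the JSON produced by the extractor above:

```python
import json

class NarrationTimeline:
    """Resolves narration timing by line_id, keeping timestamps out of animation code."""

    def __init__(self, alignment_path: str):
        with open(alignment_path) as f:
            self._lines = {line["line_id"]: line for line in json.load(f)}

    def start(self, line_id: str) -> float:
        return self._lines[line_id]["start"]

    def duration(self, line_id: str) -> float:
        line = self._lines[line_id]
        return line["end"] - line["start"]

# Inside generated Manim code, an emphasis animation can then be tied to a
# narration line without any hardcoded seconds:
#   timeline = NarrationTimeline("alignment.json")
#   self.play(Indicate(formula), run_time=timeline.duration("line_004"))
```

If the narration is regenerated, only the alignment file changes; every animation that references a line_id picks up the new timing automatically.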

Timeline-Aware Multimodal Interaction on the Frontend

For interactive Q&A, we first experimented with screenshot-based questioning. However, isolated frames often lacked sufficient context and led to ambiguous interpretations.

We therefore designed a structured timeline system on the frontend. This timeline exposes artifacts produced during the backend pipeline, including scene boundaries, narration segments, and concept descriptions.

When a user pauses the video and asks a question, the system combines the current video frame with the corresponding script segment retrieved from the timeline. By grounding each question in both visual context and structured textual context, GenTutor significantly improves the accuracy and relevance of in-video explanations.
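Conceptually, grounding a paused-video question looks like the following sketch; the model name, prompt wording, and helper signature are illustrative rather than our exact implementation.

```python
import google.generativeai as genai
from PIL import Image

def answer_question(question: str, frame_path: str, paused_at: float,
                    lines: list[dict]) -> str:
    """Ground a paused-video question in the frame plus the active script segment."""
    # Retrieve the narration segment covering the paused timestamp.
    segment = next(
        (l for l in lines if l["start"] <= paused_at <= l["end"]),
        lines[-1],
    )
    prompt = (
        "You are a tutor for the video this frame comes from.\n"
        f'The narration at this moment says: "{segment["text"]}"\n'
        f"The learner asks: {question}\n"
        "Answer using both the frame and the narration context."
    )
    model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model choice
    # Assumes genai.configure(api_key=...) was called at startup.
    response = model.generate_content([prompt, Image.open(frame_path)])
    return response.text
```

Pairing the frame with the narration segment resolves most of the ambiguity we saw with screenshot-only questioning.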

Together, these design choices allow GenTutor to generate educational videos that are not only visually precise, but also temporally aligned and interactively explainable—bridging the gap between automated generation and real teaching quality.

Challenges we ran into

Many of the challenges we encountered emerged naturally from our decision to generate educational videos as code rather than images.

Once animations became programmatic, we had to explicitly manage spatial structure. Without constraints, even correct animations could become unreadable due to overlapping equations, plots, or annotations. This forced us to rethink layout as a first-class system component rather than an afterthought.

Temporal alignment introduced another challenge. Narration audio is generated independently and contains no timing information by default, yet educational animations rely heavily on precise synchronization. Solving this required building an alignment mechanism that could adapt automatically as narration changed, without manually tuning timestamps.

Finally, enabling interactive Q&A exposed the limitations of naive context selection. Screenshot-only questioning often lacked sufficient grounding. Making interactions reliable required connecting frontend questions to the structured artifacts produced during video generation, such as scene boundaries and script segments.

These challenges reinforced the idea that high-quality educational video generation is fundamentally a systems problem, not just a modeling problem.

Accomplishments that we're proud of

  • Building a fully automated pipeline that generates code-driven educational videos, not static slides or blurry images
  • Designing a reusable layout system that prevents visual overlap in AI-generated animations
  • Achieving audio–animation alignment without manual timing or hardcoded delays
  • Enabling true in-video interaction through timeline-aware, multimodal Q&A

What we learned

One of our biggest learnings was how effective Gemini is as an API-first model for building complex, agent-driven systems. Its long-context reasoning, code generation, and multimodal capabilities were straightforward to integrate and flexible enough to support rapid iteration.

We also learned how to collaborate with AI as a system component rather than a black box. By designing clear intermediate representations—such as layout templates, line-level narration IDs, and structured timelines—we were able to guide the model toward reliable outcomes instead of relying on unconstrained generation.

This project taught us that successful AI-assisted development depends as much on system design as on model capability. When AI is treated as a collaborator operating within well-defined structures, it becomes significantly more powerful and dependable.

What's next for GenTutor

Our long-term goal is to lower the barrier to deep understanding, while preserving correctness and rigor.

We envision GenTutor as a system that allows anyone to experience the beauty of knowledge—not just by consuming information, but by truly understanding it. By combining visual explanation, interaction, and personalization, learners can explore concepts at their own pace, ask questions where confusion arises, and build intuition rather than memorizing results.

Going forward, we plan to expand GenTutor’s interactivity and personalization capabilities, enabling more adaptive explanations tailored to individual learners. Ultimately, we hope to make high-quality, interactive learning accessible to everyone, helping more people learn thoroughly, confidently, and with curiosity.

Built With

Gemini, Manim, Google Cloud Speech-to-Text, Python