Inspiration

The inspiration for Notebook Manga came from a common frustration: Video is linear, but Learning is non-linear.

When learning how to fix a bicycle or write a React hook, we often find ourselves scrubbing through a 40-minute YouTube timeline just to find the 30 seconds of crucial information. We realized that the Japanese manga technique of panel layout (koma-wari) is actually a highly efficient information-compression technology: it lets the brain scan, skip, and focus at its own pace.

We hypothesized that the Information Density ($D$) of a learning medium can be expressed as:

$$ D_{learning} = \frac{\sum_{i=1}^{n} I(f_i)}{T_{consume}} $$

Where $I(f_i)$ is the information contained in a key frame and $T_{consume}$ is the time required to consume it. Since reading a comic panel takes a fraction of the time of watching a video segment ($T_{read} \ll T_{watch}$), transforming video to manga theoretically increases the learning efficiency by an order of magnitude.
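To make the claim concrete, here is an illustrative plug-in with assumed (not measured) numbers: suppose a tutorial has $n = 20$ key moments each carrying the same information $I$, reaching each moment by watching takes about $60$ s, and reading its panel takes about $5$ s. Then

$$ D_{video} = \frac{20 \cdot I}{20 \times 60\,\text{s}} = \frac{I}{60\,\text{s}}, \qquad D_{manga} = \frac{20 \cdot I}{20 \times 5\,\text{s}} = \frac{I}{5\,\text{s}} = 12 \, D_{video} $$

so under these assumptions the manga form is roughly an order of magnitude denser.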

What it does

Notebook Manga is a web application that takes any educational video file and automatically converts it into a paginated comic book.

  1. Smart Extraction: It automatically scans the video to find the most visually distinct and important "Key Frames."
  2. Visual Storytelling: It analyzes the context of each scene to generate narrative captions, speech bubbles, and even sound effects (like CLICK! or WHIRR!).
  3. Manga Stylization: It uses generative AI to redraw noisy, blurry video frames into crisp, black-and-white inked manga panels.
  4. Interactive Learning: The resulting comic isn't just a static image—it's interactive. Clicking any panel instantly plays the original video at that exact timestamp, bridging the gap between quick scanning and deep diving.
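The click-to-play bridge in point 4 boils down to mapping a panel back to a timestamp in the source video. A minimal sketch of that logic (the `Panel` type and function names are illustrative, not the actual Notebook Manga API):

```typescript
// Illustrative shape for a generated panel; the real app's schema may differ.
interface Panel {
  timestamp: number; // seconds into the source video this panel was taken from
  caption: string;
}

// Pure helper: resolve the seek target for a clicked panel, clamped to the
// video's duration so a trailing panel never seeks past the end.
function seekTargetFor(panel: Panel, videoDuration: number): number {
  return Math.min(Math.max(panel.timestamp, 0), videoDuration);
}

// In the browser, the click handler would then be roughly:
//   videoEl.currentTime = seekTargetFor(panel, videoEl.duration);
//   videoEl.play();
```

Keeping the timestamp on every panel is what makes the comic a navigation layer over the video rather than a replacement for it.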

How we built it

The architecture is designed as a pipeline that mimics a real Manga Studio (Editor, Writer, and Artist), powered entirely by the Gemini 2.5 Flash family.

  1. Frame Extraction (The Camera): We process the video entirely client-side using the HTML5 Canvas API, extracting frames at dynamic intervals based on video duration to fit within Gemini's context window.

  2. Narrative Construction (The Writer - gemini-2.5-flash): We use Adaptive Batching to feed frames into Gemini.

    • Warm-up Batch: We send a small batch (3 frames) first to generate the title and intro panels immediately ($t < 2s$).
    • Standard Batch: We then process larger chunks (10 frames) to maintain narrative flow. Gemini analyzes the temporal context and outputs a JSON structure defining the "script" (Captions, Dialogue, SFX).
  3. Artistic Rendering (The Artist - gemini-2.5-flash-image): Once the layout is decided, we use a Queue-based Async Loop to send individual frames to gemini-2.5-flash-image. We prompt it to "redraw" the noisy video frame into a clean, high-contrast ink style while strictly preserving the object composition.

  4. The Viewer (The UX): The React frontend renders a "Paper" UI with page-turning animations. We implemented a "Lazy Inking" system where the story (text/layout) appears first, and the images "ink" themselves (transform from video frame to manga style) in real-time as the API returns results.
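The sampling and batching described in steps 1 and 2 can be sketched as two pure functions. The constants here (frame budget, batch sizes) are illustrative assumptions chosen to match the warm-up/standard split above, not the app's exact tuning:

```typescript
// Assumed constants: a rough frame budget to stay inside the context
// window, plus the warm-up/standard batch sizes described above.
const TARGET_FRAMES = 60;
const WARMUP_BATCH = 3;
const STANDARD_BATCH = 10;

// Pick timestamps (in seconds) spread evenly over the video duration,
// so longer videos get a wider sampling interval.
function sampleTimestamps(durationSec: number): number[] {
  const interval = Math.max(1, durationSec / TARGET_FRAMES);
  const out: number[] = [];
  for (let t = 0; t < durationSec; t += interval) {
    out.push(Math.round(t * 10) / 10);
  }
  return out;
}

// Split sampled frames into a small warm-up batch (fast title/intro)
// followed by standard-size batches for narrative flow.
function makeBatches<T>(frames: T[]): T[][] {
  const batches: T[][] = [];
  if (frames.length === 0) return batches;
  batches.push(frames.slice(0, WARMUP_BATCH));
  for (let i = WARMUP_BATCH; i < frames.length; i += STANDARD_BATCH) {
    batches.push(frames.slice(i, i + STANDARD_BATCH));
  }
  return batches;
}
```

In the browser, each sampled timestamp would be captured by seeking a `<video>` element and drawing it to a canvas with `drawImage` before the batch is sent to Gemini.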

Challenges we ran into

  • The 429 (Rate Limit) Wall: Generating 20+ images for a comic strip hits API quotas instantly.

    • Solution: We built a Smart Queue System with exponential backoff. If we hit a 429, the app pauses the "Inking" process, shows a "Cooling down..." badge to the user, and retries automatically after a delay, ensuring the experience never crashes.
  • Video Context vs. Token Limits: Even with a 1M-token context window, sending every frame of a 60 fps video is impossible.

    • Solution: We implemented a dynamic sampling rate: for a 1-hour video we sample sparsely (one frame every 4 s), whereas for a short clip we sample densely. We also pass timestamp metadata in the text prompt so Gemini understands the passage of time between frames.
  • Multimodal "Gaps": We learned that while Gemini 2.5 Flash is excellent at reasoning ("What is happening?"), it needs very specific constraints to output consistent JSON for UI rendering. We had to implement a robust "JSON Repair" function to handle cases where the model got too creative with its output format.
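The retry behavior for the 429 wall can be sketched as a small wrapper around any API call. The delays, retry cap, and `onCooldown` hook are illustrative assumptions; the real app also pauses the whole inking queue and shows the "Cooling down..." badge:

```typescript
// Minimal sketch of retry-with-exponential-backoff for rate-limited calls.
// Assumes the thrown error carries an HTTP-style `status` field.
async function withBackoff<T>(
  task: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 1000,
  onCooldown: (delayMs: number) => void = () => {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err: any) {
      // Only retry rate limits, and only up to the cap.
      if (err?.status !== 429 || attempt >= maxRetries) throw err;
      const delay = baseDelayMs * 2 ** attempt; // 1s, 2s, 4s, ...
      onCooldown(delay); // e.g. surface a "Cooling down..." badge here
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Because the wrapper rethrows non-429 errors immediately, genuine failures still surface to the UI instead of being silently retried.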

Accomplishments that we're proud of

  • The "Lazy Inking" UX: We successfully decoupled the narrative generation from the image generation. This allows users to start reading the "Storyboard" (text + raw frames) almost immediately, while the AI "Inks" the final manga panels in the background. It turns a 2-minute wait into an instant interaction.
  • Robust Error Recovery: Building a system that gracefully handles API rate limits (429) and malformed JSON responses without crashing or showing generic error screens. The app feels stable and production-ready.
  • Aesthetic Consistency: Achieving a consistent visual style where the UI (fonts, colors, shadows) matches the generated manga content, creating an immersive experience.
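The malformed-JSON recovery mentioned above can be approximated by a small repair pass. The heuristics below (stripping markdown fences, removing trailing commas, falling back to the outermost object) are illustrative, not the exact rules Notebook Manga uses:

```typescript
// Hedged sketch of a "JSON Repair" pass for model output that is
// almost-but-not-quite valid JSON.
function repairJson(raw: string): unknown {
  const text = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, "") // leading ```json fence
    .replace(/```\s*$/, "")           // trailing fence
    .replace(/,\s*([}\]])/g, "$1");   // trailing commas before } or ]
  try {
    return JSON.parse(text);
  } catch {
    // Fall back to the outermost {...} span, if any, to shed
    // conversational framing around the JSON payload.
    const start = text.indexOf("{");
    const end = text.lastIndexOf("}");
    if (start !== -1 && end > start) {
      return JSON.parse(text.slice(start, end + 1));
    }
    throw new Error("Unrepairable model output");
  }
}
```

A repair pass like this is a fallback, not a substitute for constraining the prompt; it simply keeps the occasional creative response from crashing the renderer.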

What we learned

  • Gemini as a Director: We learned that the gemini-2.5-flash model is incredibly capable at "directing": understanding pacing, identifying key moments, and even suggesting sound effects based on visual context.
  • Prompting for Consistency: Getting an AI to keep a character's face consistent across panels is hard. We found that feeding the raw video frame as a strong reference (inlineData) was far more effective than text-only descriptions for maintaining visual continuity.
  • The Power of Multimodality: The ability to mix text, code, and vision in a single API call allowed us to build features that would have required 3-4 separate machine learning models just a year ago.

What's next for Notebook Manga

  • PDF/EPUB Export: Allowing users to download their generated manga to read offline on e-readers like Kindle or iPad.
  • Audio Transcription Integration: Using Gemini's audio capabilities to extract exact quotes from the video and place them into the speech bubbles, rather than generated summaries.
  • Style Customization: Letting users choose their art style—Shojo, Shonen, American Comic, or Technical Sketch.
  • Community Library: A platform where users can share their generated "Video Mangas" so others can learn without processing the video themselves.
