Chronos Cinema
Inspiration
I grew up learning from documentaries.
Channels like Discovery and National Geographic were my classroom outside school. A good documentary does something textbooks rarely achieve. It combines narration, imagery, pacing, and music to make complex ideas feel alive.
Most digital learning tools still miss that feeling.
Textbooks are static.
YouTube only works if someone has already made the video.
LLMs generate walls of text.
I wanted to see what would happen if a documentary experience could be generated on demand for any topic.
Chronos Cinema is my attempt to make that possible.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
What it does
Chronos Cinema generates a short 4-beat documentary experience on any topic.
The system orchestrates several multimodal models at once to produce a cinematic learning experience.
Narration
Gemini Live streams voice narration in real time, beat by beat.
Scenes
Gemini's image model generates still frames that animate using Ken Burns pan-zoom motion.
Music
Lyria composes a custom background score that automatically ducks under the narrator's voice and restores between beats.
Learning check
After the documentary, the narrator cues an upcoming quiz. Five questions are generated from what was actually said, and the quiz appears only after the narrator finishes speaking.
Offline replay
Completed sessions replay instantly from IndexedDB without any network calls.
The result feels closer to a mini documentary than a typical AI response.
The narrator never speaks to a blank screen. The first frame appears, the music rises, and the story begins.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Architecture
Chronos Cinema runs as a real-time multimodal pipeline.
Backend (FastAPI + Python)
Each user session creates a ChronosAgent tied to a WebSocket connection and a Gemini Live session on Vertex AI.
The agent coordinates parallel workers for slow generation tasks:
- All 4 scene images render simultaneously (semaphore-limited to prevent API bursts)
- Lyria music generation runs concurrently in the background
- Quiz generation runs concurrently with the closing reflection narration
A script blueprint is generated first using Gemini 2.5 Flash, which produces a 4-beat arc (Hook → Foundation → Mechanism → Reflection) with narration scripts and image prompts for all workers.
Each worker streams its result to the client the moment it finishes.
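A minimal sketch of how that fan-out could look, assuming hypothetical generate_scene_image, compose_music, and generate_quiz helpers plus a send callback that pushes messages over the session WebSocket (not the project's actual code):

```python
import asyncio

IMAGE_SEMAPHORE = asyncio.Semaphore(2)  # cap concurrent image calls to avoid API bursts

async def render_scene(beat, send):
    # Throttled image generation; the frame is streamed the moment it is ready.
    async with IMAGE_SEMAPHORE:
        image_b64 = await generate_scene_image(beat.image_prompt)  # hypothetical helper
    await send({"type": "scene_image", "beat": beat.index, "image": image_b64})

async def run_generation_workers(blueprint, send):
    # All four scene images, the Lyria score, and the quiz render in parallel;
    # each worker pushes its result to the client as soon as it finishes.
    await asyncio.gather(
        *(render_scene(beat, send) for beat in blueprint.beats),
        compose_music(blueprint.topic, send),   # hypothetical Lyria worker
        generate_quiz(blueprint, send),         # hypothetical quiz worker
    )
```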
I use two isolated Vertex AI clients: one with v1 for all generate_content calls (script, images, TTS fallback), and a separate one with v1beta1 exclusively for the Gemini Live WebSocket. Using the same API version for both caused 404 errors.
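A sketch of that two-client setup with the google-genai SDK (project and location values are placeholders):

```python
from google import genai

# v1 client: all generate_content calls (script blueprint, scene images, TTS fallback)
content_client = genai.Client(
    vertexai=True,
    project="your-gcp-project",   # placeholder
    location="us-central1",       # placeholder
    http_options={"api_version": "v1"},
)

# v1beta1 client: used exclusively for the Gemini Live WebSocket session
live_client = genai.Client(
    vertexai=True,
    project="your-gcp-project",
    location="us-central1",
    http_options={"api_version": "v1beta1"},
)
```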
Frontend (Next.js + Web Audio API)
The browser schedules Gemini Live PCM audio chunks inside an AudioContext using a manual scheduler (nextNarrationTime) to ensure gapless playback.
A separate gain node manages background music with exponential ramps for automatic voice ducking.
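A minimal sketch of that ducking behaviour; the gain levels and ramp times are assumptions, not the tuned values in the app:

```ts
const ctx = new AudioContext();
const musicGain = ctx.createGain();
musicGain.connect(ctx.destination);

const DUCKED_LEVEL = 0.15; // assumed level while the narrator speaks
const FULL_LEVEL = 0.6;    // assumed level between beats

function rampMusicTo(level: number, seconds: number): void {
  const now = ctx.currentTime;
  musicGain.gain.cancelScheduledValues(now);
  // Exponential ramps cannot start at or reach 0, so clamp away from it.
  musicGain.gain.setValueAtTime(Math.max(musicGain.gain.value, 0.001), now);
  musicGain.gain.exponentialRampToValueAtTime(Math.max(level, 0.001), now + seconds);
}

const duckForNarration = () => rampMusicTo(DUCKED_LEVEL, 0.4);
const restoreBetweenBeats = () => rampMusicTo(FULL_LEVEL, 0.8);
```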
During playback the system accumulates all generated assets locally:
- narration PCM
- visuals
- subtitles
- music
- quiz data
These assets are stored in IndexedDB, enabling full offline replay of the generated documentary.
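A minimal sketch of the persistence step, assuming a hypothetical sessions object store keyed by topic (the real schema and asset shapes may differ):

```ts
interface StoredSession {
  topic: string;
  narrationPcm: ArrayBuffer[];
  images: Blob[];
  subtitles: string[];
  music: ArrayBuffer | null;
  quiz: unknown;
}

function openDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("chronos-cinema", 1);
    req.onupgradeneeded = () =>
      req.result.createObjectStore("sessions", { keyPath: "topic" });
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function saveSession(session: StoredSession): Promise<void> {
  const db = await openDb();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction("sessions", "readwrite");
    tx.objectStore("sessions").put(session);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}
```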
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Key challenges
Gemini Live session dropping mid-documentary
The Live WebSocket timed out with a 1006 abnormal closure while waiting for images to generate before narration started. Gemini Live has a short idle timeout, so sending nothing for around 10 seconds kills the session.
I fixed this by replacing the single asyncio.timeout wait with a loop that sends a keepalive to the Live session every 5 seconds while waiting for the first image to be ready.
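A sketch of that waiting loop, assuming an asyncio.Event that is set when the first image lands and a hypothetical send_keepalive helper that pushes a lightweight message into the Live session:

```python
import asyncio

KEEPALIVE_INTERVAL = 5.0  # Gemini Live drops sessions idle for ~10s, so ping every 5s

async def wait_for_first_image(first_image_ready: asyncio.Event, live_session) -> None:
    # Instead of one long asyncio.timeout, wait in short slices and keep the
    # Live WebSocket warm between them.
    while not first_image_ready.is_set():
        try:
            await asyncio.wait_for(first_image_ready.wait(), timeout=KEEPALIVE_INTERVAL)
        except asyncio.TimeoutError:
            await send_keepalive(live_session)  # hypothetical helper
```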
Quiz appearing before narration finished
story_complete arrived from the backend while reflection audio chunks were still in flight over the WebSocket. Checking isNarrationActive() at that moment returned false because no chunks had been enqueued yet, so the quiz transitioned immediately.
I fixed this with a two-phase drain: first waiting until no new audio_chunk messages have arrived for 1 second (stream fully received), then waiting for isNarrationActive() to return false (playback fully finished).
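A sketch of the drain logic; lastChunkAt (stamped by the WebSocket handler on every audio_chunk message) and the polling interval are assumptions, while isNarrationActive() is declared here as an external provided by the audio scheduler:

```ts
declare function isNarrationActive(): boolean; // provided by the audio scheduler

let lastChunkAt = Date.now(); // updated by the WebSocket handler on every audio_chunk

const STREAM_QUIET_MS = 1000;
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function drainNarrationThenShowQuiz(showQuiz: () => void): Promise<void> {
  // Phase 1: the stream is only "fully received" once no chunk has arrived for 1s.
  while (Date.now() - lastChunkAt < STREAM_QUIET_MS) {
    await sleep(250);
  }
  // Phase 2: wait for everything already scheduled to finish playing.
  while (isNarrationActive()) {
    await sleep(250);
  }
  showQuiz();
}
```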
Gapless audio from streamed PCM
Web Audio does not provide a native streaming queue.
I implemented a manual scheduler that:
- decodes incoming Int16 chunks
- converts them to float32 audio buffers
- schedules playback at precise offsets using nextNarrationTime
Even very small timing gaps create audible clicks, so accurate sample math was critical.
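A condensed sketch of that scheduler; the 24 kHz mono format and the small initial latency cushion are assumptions:

```ts
const SAMPLE_RATE = 24000; // assumed PCM rate of the narration stream
let nextNarrationTime = 0;

function enqueuePcmChunk(ctx: AudioContext, destination: AudioNode, int16: Int16Array): void {
  // Convert Int16 PCM to float32 samples in [-1, 1].
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    float32[i] = int16[i] / 32768;
  }

  const buffer = ctx.createBuffer(1, float32.length, SAMPLE_RATE);
  buffer.copyToChannel(float32, 0);

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(destination);

  // Schedule each chunk exactly where the previous one ends; any gap is an audible click.
  const startAt = Math.max(nextNarrationTime, ctx.currentTime + 0.05);
  source.start(startAt);
  nextNarrationTime = startAt + buffer.duration;
}
```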
API version mismatch causing 404s everywhere
The genai.Client was initially initialized with api_version: v1alpha. This version does not support generate_content on Vertex AI, so every call to generate scripts, images, and TTS returned a 404.
The Live API requires v1beta1. These two requirements are mutually exclusive on a single client, so I solved this by creating two isolated clients with the correct version for each use case.
Lyria response key mismatch
The Lyria predict API returns audio under the key bytesBase64Encoded, not audio_bytes as documented in older references. My extractor was silently returning None, so no BGM was ever sent to the client.
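The fix reduced to reading the right key and failing loudly when it is absent; roughly the following, where the function name and error handling are mine:

```python
import base64

def extract_lyria_audio(prediction: dict) -> bytes:
    # Lyria's predict response carries the audio under "bytesBase64Encoded";
    # keep "audio_bytes" as a fallback, but never return None silently.
    encoded = prediction.get("bytesBase64Encoded") or prediction.get("audio_bytes")
    if encoded is None:
        raise KeyError(f"No audio field in Lyria prediction: {list(prediction)}")
    return base64.b64decode(encoded)
```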
Model routing
Different tasks are routed to different models:
- Gemini Live (gemini-live-2.5-flash-native-audio) for narration
- Gemini 2.5 Flash for script and quiz generation
- gemini-2.5-flash-image for scene imagery
- Lyria 002 for music generation
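Kept in one place, this routing is just a small mapping; the Lyria identifier string below is an assumption:

```python
MODEL_ROUTES = {
    "narration": "gemini-live-2.5-flash-native-audio",
    "script": "gemini-2.5-flash",
    "quiz": "gemini-2.5-flash",
    "scene_images": "gemini-2.5-flash-image",
    "music": "lyria-002",  # assumed Vertex AI identifier for Lyria 002
}
```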
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
What makes this interesting
Chronos Cinema demonstrates true multimodal interleaving.
While narration is streaming:
- the next scene image is pre-rendering
- the soundtrack is being composed by Lyria
- the quiz is being generated concurrently with the closing reflection
Three model families operate concurrently within the same session, all coordinated through a single WebSocket.
Instead of returning a static answer, the system generates a cinematic learning experience in real time.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
What's next
Chapter navigation
Allow viewers to jump to any beat in the documentary timeline.
Style presets
Generate documentaries in formats such as nature, true crime, or historical storytelling.
Multi-language narration
Leverage Gemini Live voices to generate documentaries in different languages.
Shareable documentaries
Serialize sessions from IndexedDB into shareable links or Cloud Storage objects.
Long-form generation
The current 4-beat structure is a design constraint. The architecture could support much longer documentary formats.
Cinematic video with Veo
Replace still images with Veo-generated cinematic clips. The architecture already supports this: each beat's image task can be upgraded to a Veo render that silently replaces the still once the clip is ready.
Note on Live Access:
To keep multimodal API costs manageable, the live production environment is currently restricted to private sessions. Please reach out and I'll be happy to share the link to the deployed project.