Inspiration
History is full of incredible stories, but reading about them often feels like homework. Articles about the fall of Constantinople, the discovery of penicillin, or the construction of the Panama Canal are rich with drama — but they sit flat on a page. We asked: what if you could hand an article to an AI filmmaker and get back a documentary you can watch, narrate, and talk to?
What it does
Give AI Historian a historical article. It reads it, researches every claim with Google Search, and starts generating a cinematic documentary — narration, photorealistic visuals, and a living historian you can interrupt mid-sentence to ask questions.
Drop in an article. Paste a URL or upload a PDF. AI Historian breaks it into narrative scenes and dispatches parallel research agents.
Watch the research unfold. The Expedition Log narrates the AI's process as a field journal — translating, dispatching agents, cross-referencing sources, composing scripts. Each agent appears as a living card that transitions through states in real time.
The documentary starts before research finishes. The first scene becomes playable in under 45 seconds. Remaining segments generate in the background while you watch.
A self-generating film. Imagen 3 photorealistic frames, Veo 2 video clips, Gemini-generated storyboard illustrations, Ken Burns animation that breathes with the narrator's voice, word-by-word caption reveals. An antique-styled map tracks locations as the story moves through geography.
Talk to the historian — anytime. Mid-documentary, just speak. The historian stops, answers your question with grounded evidence, and resumes. Ask a follow-up and the documentary branches — a mini-pipeline researches, scripts, and illustrates a new segment on the fly.
Choose your narrator. Three historian personas with different voices and styles. Each has an AI-generated portrait with canvas-based lip sync driven by audio energy analysis.
How we built it
11-Phase ADK Pipeline
The engine is a SequentialAgent built on Google's Agent Development Kit:
- Phase I — Translation & Scan: Document AI OCR, semantic chunking, narrative curation with Gemini 2.0 Pro
- Phase II — Field Research: ParallelAgent spawns N google_search research agents per scene, then an aggregator synthesizes findings
- Phase III — Synthesis: Gemini 2.0 Pro writes narration scripts with visual descriptions, segment by segment
- Phase IV — Creative Direction: Gemini TEXT+IMAGE generates storyboard illustrations alongside creative direction notes — reasoning about narrative and visuals together in a single call
- Phase V — Interleaved Composition: Pre-generates narration beats with TEXT+IMAGE for the player. Beat 0 ships immediately (fast path)
- Phase VI — Visual Interleave: Assigns each beat a visual type — illustration, cinematic still, or video clip
- Phase VII — Fact Validation: An LLM judge cross-references every narration sentence against the research evidence. Unsupported claims are removed or softened automatically
- Phase VIII — Geographic Mapping: Extracts locations, geocodes via Google Maps grounding, streams coordinates to the frontend map
- Phase IX — Visual Storyboard: Plans unique visual territory for each scene to avoid repetition
- Phase X — Visual Composition: A 6-stage pipeline researches period-accurate visual details — Google Search grounding discovers references, fetches and evaluates them, then merges era markers, color palettes, and negative prompts into a visual manifest
- Phase XI — Generation: Imagen 3 generates 4 photorealistic frames per segment. Veo 2 generates video clips asynchronously. Scene 0 runs first for fast playback; the rest stagger with concurrency control
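The Phase II fan-out pattern can be sketched with plain asyncio. This is a simplified stand-in, not the actual ADK code: a real implementation would use ADK's `ParallelAgent` with `google_search`-equipped sub-agents, and the function and field names here are illustrative.

```python
import asyncio

async def research_scene(scene: str, query: str) -> dict:
    # Stand-in for one ADK research agent; the real agent calls the
    # google_search tool and returns grounded findings.
    await asyncio.sleep(0)  # placeholder for network latency
    return {"scene": scene, "findings": f"evidence for {query}"}

async def run_field_research(scene: str, queries: list[str]) -> dict:
    # Fan out one research task per query, then aggregate the results —
    # mirroring ParallelAgent -> aggregator in Phase II.
    results = await asyncio.gather(*(research_scene(scene, q) for q in queries))
    return {"scene": scene, "evidence": [r["findings"] for r in results]}

report = asyncio.run(run_field_research("Siege begins", ["troop numbers", "dates"]))
```

The aggregator step is what lets Phase III consume one synthesized evidence bundle per scene rather than N raw search results.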
Gemini's Interleaved TEXT+IMAGE
Phases IV–V use response_modalities=["TEXT", "IMAGE"] — Gemini produces both narrative text and a storyboard illustration in one call. This is different from calling Imagen separately: the model reasons about story and image together, producing more coherent visual storytelling.
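A mixed TEXT+IMAGE response arrives as an ordered list of parts, so the consumer has to split narration text from image bytes. A minimal sketch of that split, assuming the part shape used by the google-genai SDK (text parts carry `.text`, image parts carry `.inline_data.data`); the demo objects are stubs, not real API responses:

```python
from types import SimpleNamespace as NS

def split_interleaved(parts) -> tuple[str, list[bytes]]:
    # Separate an interleaved TEXT+IMAGE response into narration text
    # and raw image bytes, preserving part order.
    text_chunks, images = [], []
    for part in parts:
        if getattr(part, "text", None):
            text_chunks.append(part.text)
        elif getattr(part, "inline_data", None):
            images.append(part.inline_data.data)
    return "".join(text_chunks), images

# Stubbed response parts standing in for a real Gemini reply.
demo = [
    NS(text="A storm gathers. ", inline_data=None),
    NS(text=None, inline_data=NS(data=b"\x89PNG...")),
    NS(text="The walls hold.", inline_data=None),
]
script, frames = split_interleaved(demo)
```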
Fact Validation
Phase VII classifies every sentence: SUPPORTED (keep), UNSUPPORTED SPECIFIC (remove and bridge), UNSUPPORTED PLAUSIBLE (soften the language), NON-FACTUAL (keep — rhetoric and atmosphere). The script is rewritten in place only when the segment structure is preserved.
Visual Research Pipeline
Before generating any image, Phase X grounds visual prompts in real historical detail. Google Search discovers sources, classifies them by type, fetches content (webpages, Wikipedia, PDFs, images), and Gemini evaluates each for period accuracy. The result: Imagen 3 generates from historically-researched descriptions, not generic guesses.
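The merge step at the end of Phase X can be sketched as an order-preserving union of per-source findings. The manifest keys here are hypothetical; the real schema is internal to the pipeline:

```python
def build_visual_manifest(findings: list[dict]) -> dict:
    # Merge per-source visual research into one prompt manifest,
    # de-duplicating while keeping first-seen order (Phase X).
    manifest = {"era_markers": [], "palette": [], "negative": []}
    for source in findings:
        for key in manifest:
            for item in source.get(key, []):
                if item not in manifest[key]:
                    manifest[key].append(item)
    return manifest

manifest = build_visual_manifest([
    {"era_markers": ["lateen sails"], "palette": ["ochre"], "negative": ["modern rigging"]},
    {"era_markers": ["lateen sails", "bronze cannon"], "palette": ["sea grey"]},
])
```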
Voice System
Gemini 2.5 Flash Native Audio runs via a Cloud Run WebSocket relay. The browser captures mic audio at 16kHz PCM through an AudioWorkletNode, streams it over WebSocket, and plays responses at 24kHz. Interruption is server-detected. Context window compression enables unlimited session length. Before each response, a retrieval endpoint searches the document's content and injects relevant context into Gemini.
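Capturing mic audio as 16 kHz PCM means converting the Web Audio float samples to signed 16-bit little-endian integers. The project does this in the browser's AudioWorklet; the same conversion is sketched here in Python for illustration:

```python
import struct

def float32_to_pcm16(samples: list[float]) -> bytes:
    # Clamp Web Audio float samples to [-1, 1], then scale to signed
    # 16-bit little-endian PCM — the wire format for 16 kHz mic input.
    clamped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clamped]
    return struct.pack(f"<{len(ints)}h", *ints)

chunk = float32_to_pcm16([0.0, 0.5, -1.2])  # -1.2 is clamped to -1.0
```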
Frontend as Cinema
The frontend is designed to feel like a documentary experience, not a web app:
- Expedition Log — The research pipeline is narrated as a field journal with typewriter text and staggered reveals, turning "loading" into the first act
- Iris reveal — A CSS @property-animated radial-gradient mask transitions from workspace to player like a camera iris
- Audio-reactive visuals — An AnalyserNode drives Ken Burns animation speed, glow intensity, and vignette spread in sync with the narrator's voice
- Living Portrait — Multi-layer canvas composites a portrait with audio-driven lip sync, natural blinks, and candlelight flicker
- Ambient color extraction — Each scene image's dominant color becomes the player's background atmosphere
- SSE drip buffer — Parallel agent events are released at 150ms intervals so the research panel flows smoothly instead of jumping
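The drip buffer's scheduling logic is simple: hold bursty events and release them on a fixed cadence. The frontend implements this in TypeScript; here is the same idea sketched in Python with asyncio (the demo uses a 1 ms interval instead of the production 150 ms so it runs instantly):

```python
import asyncio

async def drip(events: list[dict], emit, interval: float = 0.15) -> None:
    # Release buffered SSE events at a fixed cadence so bursty
    # parallel-agent output animates smoothly instead of jumping.
    for event in events:
        emit(event)
        await asyncio.sleep(interval)

released: list[dict] = []
burst = [{"agent": i, "state": "done"} for i in range(3)]
asyncio.run(drip(burst, released.append, interval=0.001))
```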
Challenges we ran into
- ADK's google_search tool can't combine with other tools in the same agent — we had to build single-purpose research agents orchestrated by a ParallelAgent
- Veo 2 takes 1–2 minutes per clip — progressive delivery solves this: Imagen 3 makes segments playable in ~5 seconds, Veo 2 video overlays arrive asynchronously
- 11 phases streaming SSE simultaneously — parallel agent events arriving in bursts needed drip buffering and careful event choreography to feel smooth
- Gemini TEXT+IMAGE isn't deterministic — graceful degradation ensures that if no image is returned, the scene falls back to Imagen 3 in Phase XI
- First-segment latency — Scene 0 gets fewer research sources, early exits from evaluation, and priority scheduling to hit the <45-second target
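The fast-path scheduling described above can be sketched with an asyncio semaphore: scene 0 finishes alone before the rest are even scheduled, and the remainder stagger under a concurrency cap. The generation call is a placeholder, not the real Imagen/Veo client:

```python
import asyncio

async def generate_scene(i: int, sem: asyncio.Semaphore, done: list[int]) -> None:
    # Placeholder for an Imagen 3 / Veo 2 generation call.
    async with sem:
        await asyncio.sleep(0)
        done.append(i)

async def generate_all(n_scenes: int, max_concurrent: int = 2) -> list[int]:
    done: list[int] = []
    sem = asyncio.Semaphore(max_concurrent)
    # Fast path: scene 0 completes before anything else is scheduled.
    await generate_scene(0, sem, done)
    # Remaining scenes stagger in the background under the semaphore.
    await asyncio.gather(*(generate_scene(i, sem, done) for i in range(1, n_scenes)))
    return done

order = asyncio.run(generate_all(4))
```

Completing scene 0 before scheduling the rest is what keeps the first playable segment inside the <45-second budget.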
Accomplishments that we're proud of
- The Expedition Log transforms loading time into storytelling — users watch the research happen, not a spinner
- The fact validator catches real hallucinations: unsupported specific claims are automatically removed or softened
- Live illustration generation during voice conversation through Gemini's non-blocking function calling
- Audio-reactive Ken Burns creates an unconscious link between the narrator's energy and visual movement
What we learned
- Gemini's interleaved TEXT+IMAGE output produces more coherent storyboards than separate text-then-image pipelines — the model reasons about narrative and visual together
- Progressive delivery changes perceived speed dramatically — the documentary feels instant even though full generation takes minutes
- ADK's SequentialAgent + ParallelAgent pattern enables genuine multi-stage AI workflows with clean separation of concerns
- Audio-reactive visuals create an unconscious bond between voice and imagery that makes the experience feel alive, not mechanical
- Turning the research pipeline into visible narrative (the Expedition Log) changes user perception from "waiting" to "participating"
What's next for AI Historian
- Multi-document cross-referencing — upload opposing accounts of the same event and the historian presents both perspectives
- Collaborative viewing — multiple users watch and steer the same documentary simultaneously
- Export as video — render the complete documentary with captions and credits as a shareable file
Built With
- cloud-run
- cloud-storage
- document-ai
- fastapi
- firestore
- gemini-2.0-flash
- gemini-2.0-pro
- gemini-2.5-flash-image
- gemini-2.5-flash-native-audio
- google-adk
- google-genai
- imagen-3
- maplibre
- motion
- node.js
- pub-sub
- python
- react
- secret-manager
- tailwind-css
- terraform
- typescript
- veo-2
- vertex-ai
- vite
- web-audio-api
- websocket
- zustand
