Inspiration
History is full of incredible stories, but reading about them often feels like homework. Articles about the fall of Constantinople, the discovery of penicillin, or the construction of the Panama Canal are rich with drama — but they sit flat on a page. We asked: what if you could hand an article to an AI filmmaker and get back a documentary you can watch, narrate, and talk to?
What it does
Give AI Historian a historical article. It reads it, researches every claim with Google Search, and starts generating a cinematic documentary — narration, photorealistic visuals, and a living historian you can interrupt mid-sentence to ask questions.
Drop in an article. Paste a URL or upload a PDF. AI Historian breaks it into narrative scenes and dispatches parallel research agents.
Watch the research unfold. The Expedition Log narrates the AI's process as a field journal — translating, dispatching agents, cross-referencing sources, composing scripts. Each agent appears as a living card that transitions through states in real time.
The documentary starts before research finishes. The first scene becomes playable in under 45 seconds. Remaining segments generate in the background while you watch.
A self-generating film. Imagen 3 photorealistic frames, Veo 2 video clips, Gemini-generated storyboard illustrations, Ken Burns animation that breathes with the narrator's voice, word-by-word caption reveals. An antique-styled map tracks locations as the story moves through geography.
Talk to the historian — anytime. Mid-documentary, just speak. The historian stops, answers your question with grounded evidence, and resumes. Ask a follow-up and the documentary branches — a mini-pipeline researches, scripts, and illustrates a new segment on the fly.
Choose your narrator. Three historian personas with different voices and styles. Each has an AI-generated portrait with canvas-based lip sync driven by audio energy analysis.
How we built it
11-Phase ADK Pipeline
The engine is a SequentialAgent built on Google's Agent Development Kit:
- Phase I — Translation & Scan: Document AI OCR, semantic chunking, narrative curation with Gemini 2.0 Pro
- Phase II — Field Research: ParallelAgent spawns N google_search research agents per scene, then an aggregator synthesizes findings
- Phase III — Synthesis: Gemini 2.0 Pro writes narration scripts with visual descriptions, segment by segment
- Phase IV — Creative Direction: Gemini TEXT+IMAGE generates storyboard illustrations alongside creative direction notes — reasoning about narrative and visuals together in a single call
- Phase V — Interleaved Composition: Pre-generates narration beats with TEXT+IMAGE for the player. Beat 0 ships immediately (fast path)
- Phase VI — Visual Interleave: Assigns each beat a visual type — illustration, cinematic still, or video clip
- Phase VII — Fact Validation: An LLM judge cross-references every narration sentence against the research evidence. Unsupported claims are removed or softened automatically
- Phase VIII — Geographic Mapping: Extracts locations, geocodes via Google Maps grounding, streams coordinates to the frontend map
- Phase IX — Visual Storyboard: Plans unique visual territory for each scene to avoid repetition
- Phase X — Visual Composition: A 6-stage pipeline researches period-accurate visual details — Google Search grounding discovers references, fetches and evaluates them, then merges era markers, color palettes, and negative prompts into a visual manifest
- Phase XI — Generation: Imagen 3 generates 4 photorealistic frames per segment. Veo 2 generates video clips asynchronously. Scene 0 runs first for fast playback; the rest stagger with concurrency control
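The Phase II fan-out pattern can be sketched with plain asyncio. This is a simplified stand-in, not the actual ADK code: a real implementation would use ADK's `ParallelAgent` with `google_search`-equipped sub-agents, and the function and field names here are illustrative.

```python
import asyncio

async def research_scene(scene: str, query: str) -> dict:
    # Stand-in for one ADK research agent; the real agent calls the
    # google_search tool and returns grounded findings.
    await asyncio.sleep(0)  # placeholder for network latency
    return {"scene": scene, "findings": f"evidence for {query}"}

async def run_field_research(scene: str, queries: list[str]) -> dict:
    # Fan out one research task per query, then aggregate the results —
    # mirroring ParallelAgent -> aggregator in Phase II.
    results = await asyncio.gather(*(research_scene(scene, q) for q in queries))
    return {"scene": scene, "evidence": [r["findings"] for r in results]}

report = asyncio.run(run_field_research("Siege begins", ["troop numbers", "dates"]))
```

The aggregator step is what lets Phase III consume one synthesized evidence bundle per scene rather than N raw search results.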
Gemini's Interleaved TEXT+IMAGE
Phases IV–V use response_modalities=["TEXT", "IMAGE"] — Gemini produces both narrative text and a storyboard illustration in one call. This is different from calling Imagen separately: the model reasons about story and image together, producing more coherent visual storytelling.
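A mixed TEXT+IMAGE response arrives as an ordered list of parts, so the consumer has to split narration text from image bytes. A minimal sketch of that split, assuming the part shape used by the google-genai SDK (text parts carry `.text`, image parts carry `.inline_data.data`); the demo objects are stubs, not real API responses:

```python
from types import SimpleNamespace as NS

def split_interleaved(parts) -> tuple[str, list[bytes]]:
    # Separate an interleaved TEXT+IMAGE response into narration text
    # and raw image bytes, preserving part order.
    text_chunks, images = [], []
    for part in parts:
        if getattr(part, "text", None):
            text_chunks.append(part.text)
        elif getattr(part, "inline_data", None):
            images.append(part.inline_data.data)
    return "".join(text_chunks), images

# Stubbed response parts standing in for a real Gemini reply.
demo = [
    NS(text="A storm gathers. ", inline_data=None),
    NS(text=None, inline_data=NS(data=b"\x89PNG...")),
    NS(text="The walls hold.", inline_data=None),
]
script, frames = split_interleaved(demo)
```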
Fact Validation
Phase VII classifies every sentence: SUPPORTED (keep), UNSUPPORTED SPECIFIC (remove and bridge), UNSUPPORTED PLAUSIBLE (soften the language), NON-FACTUAL (keep — rhetoric and atmosphere). The script is rewritten in place only when the segment structure is preserved.
Visual Research Pipeline
Before generating any image, Phase X grounds visual prompts in real historical detail. Google Search discovers sources, classifies them by type, fetches content (webpages, Wikipedia, PDFs, images), and Gemini evaluates each for period accuracy. The result: Imagen 3 generates from historically-researched descriptions, not generic guesses.
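The merge step at the end of Phase X can be sketched as an order-preserving union of per-source findings. The manifest keys here are hypothetical; the real schema is internal to the pipeline:

```python
def build_visual_manifest(findings: list[dict]) -> dict:
    # Merge per-source visual research into one prompt manifest,
    # de-duplicating while keeping first-seen order (Phase X).
    manifest = {"era_markers": [], "palette": [], "negative": []}
    for source in findings:
        for key in manifest:
            for item in source.get(key, []):
                if item not in manifest[key]:
                    manifest[key].append(item)
    return manifest

manifest = build_visual_manifest([
    {"era_markers": ["lateen sails"], "palette": ["ochre"], "negative": ["modern rigging"]},
    {"era_markers": ["lateen sails", "bronze cannon"], "palette": ["sea grey"]},
])
```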
Voice System
Gemini 2.5 Flash Native Audio runs via a Cloud Run WebSocket relay. The browser captures mic audio at 16kHz PCM through an AudioWorkletNode, streams it over WebSocket, and plays responses at 24kHz. Interruption is server-detected. Context window compression enables unlimited session length. Before each response, a retrieval endpoint searches the document's content and injects relevant context into Gemini.
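Capturing mic audio as 16 kHz PCM means converting the Web Audio float samples to signed 16-bit little-endian integers. The project does this in the browser's AudioWorklet; the same conversion is sketched here in Python for illustration:

```python
import struct

def float32_to_pcm16(samples: list[float]) -> bytes:
    # Clamp Web Audio float samples to [-1, 1], then scale to signed
    # 16-bit little-endian PCM — the wire format for 16 kHz mic input.
    clamped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clamped]
    return struct.pack(f"<{len(ints)}h", *ints)

chunk = float32_to_pcm16([0.0, 0.5, -1.2])  # -1.2 is clamped to -1.0
```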
Frontend as Cinema
The frontend is designed to feel like a documentary experience, not a web app:
- Expedition Log — The research pipeline is narrated as a field journal with typewriter text and staggered reveals, turning "loading" into the first act
- Iris reveal — A CSS @property-animated radial-gradient mask transitions from workspace to player like a camera iris
- Audio-reactive visuals — An AnalyserNode drives Ken Burns animation speed, glow intensity, and vignette spread in sync with the narrator's voice
- Living Portrait — Multi-layer canvas composites a portrait with audio-driven lip sync, natural blinks, and candlelight flicker
- Ambient color extraction — Each scene image's dominant color becomes the player's background atmosphere
- SSE drip buffer — Parallel agent events are released at 150ms intervals so the research panel flows smoothly instead of jumping
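The drip buffer's scheduling logic is simple: hold bursty events and release them on a fixed cadence. The frontend implements this in TypeScript; here is the same idea sketched in Python with asyncio (the demo uses a 1 ms interval instead of the production 150 ms so it runs instantly):

```python
import asyncio

async def drip(events: list[dict], emit, interval: float = 0.15) -> None:
    # Release buffered SSE events at a fixed cadence so bursty
    # parallel-agent output animates smoothly instead of jumping.
    for event in events:
        emit(event)
        await asyncio.sleep(interval)

released: list[dict] = []
burst = [{"agent": i, "state": "done"} for i in range(3)]
asyncio.run(drip(burst, released.append, interval=0.001))
```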
Challenges we ran into
- ADK's google_search tool can't combine with other tools in the same agent — we had to build single-purpose research agents orchestrated by a ParallelAgent
- Veo 2 takes 1–2 minutes per clip — progressive delivery solves this: Imagen 3 makes segments playable in ~5 seconds, Veo 2 video overlays arrive asynchronously
- 11 phases streaming SSE simultaneously — parallel agent events arriving in bursts needed drip buffering and careful event choreography to feel smooth
- Gemini TEXT+IMAGE isn't deterministic — graceful degradation ensures that if no image is returned, the scene falls back to Imagen 3 in Phase XI
- First-segment latency — Scene 0 gets fewer research sources, early exits from evaluation, and priority scheduling to hit the <45-second target
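The fast-path scheduling described above can be sketched with an asyncio semaphore: scene 0 finishes alone before the rest are even scheduled, and the remainder stagger under a concurrency cap. The generation call is a placeholder, not the real Imagen/Veo client:

```python
import asyncio

async def generate_scene(i: int, sem: asyncio.Semaphore, done: list[int]) -> None:
    # Placeholder for an Imagen 3 / Veo 2 generation call.
    async with sem:
        await asyncio.sleep(0)
        done.append(i)

async def generate_all(n_scenes: int, max_concurrent: int = 2) -> list[int]:
    done: list[int] = []
    sem = asyncio.Semaphore(max_concurrent)
    # Fast path: scene 0 completes before anything else is scheduled.
    await generate_scene(0, sem, done)
    # Remaining scenes stagger in the background under the semaphore.
    await asyncio.gather(*(generate_scene(i, sem, done) for i in range(1, n_scenes)))
    return done

order = asyncio.run(generate_all(4))
```

Completing scene 0 before scheduling the rest is what keeps the first playable segment inside the <45-second budget.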
Accomplishments that we're proud of
- The Expedition Log transforms loading time into storytelling — users watch the research happen, not a spinner
- The fact validator catches real hallucinations: unsupported specific claims are automatically removed or softened
- Live illustration generation during voice conversation through Gemini's non-blocking function calling
- Audio-reactive Ken Burns creates an unconscious link between the narrator's energy and visual movement
What we learned
- Gemini's interleaved TEXT+IMAGE output produces more coherent storyboards than separate text-then-image pipelines — the model reasons about narrative and visual together
- Progressive delivery changes perceived speed dramatically — the documentary feels instant even though full generation takes minutes
- ADK's SequentialAgent + ParallelAgent pattern enables genuine multi-stage AI workflows with clean separation of concerns
- Audio-reactive visuals create an unconscious bond between voice and imagery that makes the experience feel alive, not mechanical
- Turning the research pipeline into visible narrative (the Expedition Log) changes user perception from "waiting" to "participating"
What's next for AI Historian
- Multi-document cross-referencing — upload opposing accounts of the same event and the historian presents both perspectives
- Collaborative viewing — multiple users watch and steer the same documentary simultaneously
- Export as video — render the complete documentary with captions and credits as a shareable file
Built With
- cloud-run
- cloud-storage
- document-ai
- fastapi
- firestore
- gemini-2.0-flash
- gemini-2.0-pro
- gemini-2.5-flash-image
- gemini-2.5-flash-native-audio
- google-adk
- google-genai
- imagen-3
- maplibre
- motion
- node.js
- pub-sub
- python
- react
- secret-manager
- tailwind-css
- terraform
- typescript
- veo-2
- vertex-ai
- vite
- web-audio-api
- websocket
- zustand
