Inspiration

Kurzgesagt -- In a Nutshell takes 1,200 hours of human work to produce a single video. Researchers, writers, animators, voice directors, sound designers, video editors -- a full production team working across months. The result is some of the most-watched educational content on the internet. But that production model means only a handful of organizations in the world can make content at that quality. Everyone else gets worse education.

We asked one question: what if the bottleneck wasn't talent -- it was tooling?

What if a researcher, educator, or independent creator had access to a production pipeline that could do in minutes what used to require a team and months? That question became Project Atlas.

What it does

Project Atlas takes a single topic idea and produces a fully composed, broadcast-quality animated educational video -- narration, cinematic visuals, character animation, professional sound design, and multilingual subtitles -- entirely from one workflow.

It orchestrates 9 AI modalities across 8 ADK-powered agent stages:

  • Topic Discovery (gemini-2.5-flash): Scouts 9 data sources and returns 10 scored topic candidates
  • Deep Research (gemini-2.5-flash): Synthesizes a structured brief with source attribution and per-claim confidence scores
  • Script Writing (gemini-3.1-pro-preview): Writes a 9-part narrative script with ElevenLabs v3 prosody tags embedded directly
  • Voice Narration (ElevenLabs v3): Generates MP3 narration with word-level timestamps for frame-accurate subtitle sync
  • Scene Planning (gemini-3.1-pro-preview): Maps the script to a visual scene sequence timed to the voiceover
  • Frame Generation (Gemini 3 Pro Image): Generates start and end frames per scene; each scene's end frame becomes the style reference for the next scene's start frame
  • Video Generation (Veo 3.1 / Kling O3 / SeDance 1.5 Pro / Replicate variants): Renders 5--15 second clips per scene with AI-enriched motion direction. If generation fails: automatic Ken Burns pan/zoom via FFmpeg at $0 cost -- guaranteed output
  • Sound Design (gemini-2.5-flash + ElevenLabs SFX): Designs per-scene ambient soundscapes in natural language, generates them with AI, applies professional sidechain ducking automatically
  • Composition (FFmpeg): Assembles clips, voiceover, transitions, SFX, and subtitle burn-in into an export-ready H.264 MP4

Topic scoring is multi-dimensional. Each candidate is evaluated across 7 weighted criteria:

$$\text{Score} = 0.25 \cdot S_{\text{momentum}} + 0.20 \cdot S_{\text{edutainment}} + 0.15 \cdot S_{\text{visual}} + 0.15 \cdot S_{\text{curiosity}} + 0.10 \cdot S_{\text{evergreen}} + 0.10 \cdot S_{\text{facts}} + 0.05 \cdot S_{\text{feasibility}}$$
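As a rough illustration of the weighted sum above, here is a minimal TypeScript sketch. The weights are taken from the formula; the type names, the `scoreTopic` and `rankTopics` helpers, and the assumption that each criterion score is normalized to the 0..1 range are ours for illustration and are not taken from the Atlas codebase.

```typescript
// Criterion names mirror the subscripts in the scoring formula.
type Criterion =
  | "momentum"
  | "edutainment"
  | "visual"
  | "curiosity"
  | "evergreen"
  | "facts"
  | "feasibility";

// Weights copied from the formula; they sum to 1.
const WEIGHTS: Record<Criterion, number> = {
  momentum: 0.25,
  edutainment: 0.2,
  visual: 0.15,
  curiosity: 0.15,
  evergreen: 0.1,
  facts: 0.1,
  feasibility: 0.05,
};

// Assumption: every per-criterion score is already normalized to 0..1.
type CriterionScores = Record<Criterion, number>;

// Weighted sum of the per-criterion scores.
function scoreTopic(scores: CriterionScores): number {
  return (Object.keys(WEIGHTS) as Criterion[]).reduce(
    (total, key) => total + WEIGHTS[key] * scores[key],
    0,
  );
}

// Rank candidates from highest to lowest score without mutating the input.
function rankTopics<T extends { scores: CriterionScores }>(candidates: T[]): T[] {
  return [...candidates].sort(
    (a, b) => scoreTopic(b.scores) - scoreTopic(a.scores),
  );
}
```

Because momentum carries the largest weight, a candidate that scores highly only on momentum still outranks one that scores highly only on feasibility.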

The visual style is Kurzgesagt-inspired: cinematic illustration, vibrant color palettes, expressive characters with round eyes and no mouths -- expression through body language only.

How we built it

A Turborepo monorepo with 11 packages and 3 applications: @atlas/api (Express), @atlas/workers (BullMQ), and @atlas/web (Next.js).

Agent Framework -- Google ADK

All LLM-powered stages run through @google/adk v0.4.0 via a shared LlmAgent + InMemoryRunner abstraction. Eight distinct agent types split by task complexity:

  • gemini-2.5-flash -- Topic scoring, motion direction, audio design, script rewriting
  • gemini-3.1-pro-preview -- Script architecture, scene planning

Worker Architecture -- BullMQ + Redis

12 BullMQ workers process AI generation jobs asynchronously. Concurrency is deliberately constrained:

  • Frame Generation -- concurrency 1: Scene \( N \)'s end frame must complete before Scene \( N+1 \)'s start frame is generated
  • Video Generation -- concurrency 2: Independent per-scene, limited by API rate limits
  • Render / Compose -- concurrency 1: CPU-intensive FFmpeg operation
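The frame-chaining constraint above can be sketched as an ordered job plan. This is an illustrative model only: `FrameJob`, `frameKey`, `planFrameJobs`, and `isValidSequentialOrder` are hypothetical names, and the choice to have each end frame reference its own scene's start frame is an assumption (the write-up only specifies that an end frame feeds the next scene's start frame).

```typescript
type FrameKind = "start" | "end";

interface FrameJob {
  sceneIndex: number;
  kind: FrameKind;
  // Key of the frame used as the style reference, or null for the first frame.
  styleRef: string | null;
}

// Hypothetical key format used to identify a generated frame.
const frameKey = (sceneIndex: number, kind: FrameKind): string =>
  `scene-${sceneIndex}-${kind}`;

// Build the job list in the order a concurrency-1 worker would process it.
function planFrameJobs(sceneCount: number): FrameJob[] {
  const jobs: FrameJob[] = [];
  for (let i = 0; i < sceneCount; i++) {
    // A start frame is anchored to the previous scene's end frame.
    jobs.push({
      sceneIndex: i,
      kind: "start",
      styleRef: i === 0 ? null : frameKey(i - 1, "end"),
    });
    // Assumption: an end frame is anchored to its own scene's start frame.
    jobs.push({ sceneIndex: i, kind: "end", styleRef: frameKey(i, "start") });
  }
  return jobs;
}

// True when every style reference points at a job earlier in the list,
// which is the property that strictly sequential processing guarantees.
function isValidSequentialOrder(jobs: FrameJob[]): boolean {
  const completed = new Set<string>();
  for (const job of jobs) {
    if (job.styleRef !== null && !completed.has(job.styleRef)) return false;
    completed.add(frameKey(job.sceneIndex, job.kind));
  }
  return true;
}
```

Each job depends on the one before it, so running frame jobs in parallel would start a job before its style reference exists. That is the reason for the concurrency limit of 1.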

Visual Grounding System

Character consistency is enforced through canonical description injection -- a locked character brief is embedded in every scene image prompt, not just the first. Visual continuity is maintained by passing each scene's end frame as the style reference into the next scene's start frame prompt. Anti-hallucination rules (NEVER include text, NEVER open mouths) are applied across every generation call.
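A minimal sketch of canonical description injection, assuming a simple string-template prompt. The two rules are quoted from the text above; the `CharacterBrief` shape, the `buildImagePrompt` helper, and the prompt layout are illustrative assumptions rather than the actual Atlas prompt format.

```typescript
// Hypothetical shape for a locked character brief.
interface CharacterBrief {
  name: string;
  description: string;
}

// Rules quoted from the write-up; applied to every image prompt.
const GLOBAL_RULES: readonly string[] = [
  "NEVER include text",
  "NEVER open mouths",
];

// Build a scene prompt that always carries the same character descriptions
// and the same rules, regardless of which scene is being generated.
function buildImagePrompt(
  sceneDescription: string,
  characters: readonly CharacterBrief[],
): string {
  const characterBlock = characters
    .map((c) => `${c.name}: ${c.description}`)
    .join("\n");
  const ruleBlock = GLOBAL_RULES.map((rule) => `- ${rule}`).join("\n");
  return [
    `SCENE: ${sceneDescription}`,
    `CHARACTERS (canonical, do not alter):\n${characterBlock}`,
    `RULES:\n${ruleBlock}`,
  ].join("\n\n");
}
```

The point of the design is that the description string is never paraphrased per scene: the same text is concatenated into every prompt, so the model is anchored to identical wording each time.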

Audio Pipeline

ElevenLabs v3 prosody tags are written directly into the script by the Script Architect agent -- [excited], [calm, gentle] -- alongside em dashes for pauses and CAPS for emphasis. Word-level TTS timestamps enable frame-accurate subtitle positioning. Sidechain ducking is applied at mix time: background audio lowers dynamically during narration and recovers between lines. Final encode: AAC 192 kbps, H.264 CRF 18, 24 fps.
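One way to express sidechain ducking with FFmpeg is the `sidechaincompress` filter, keyed by the narration track. The sketch below only builds the argument list; it does not run FFmpeg. The input ordering (video, narration, ambience), the file names, the helper names, and the threshold, ratio, attack, and release values are all assumptions for illustration. Only the encode settings (H.264 CRF 18, 24 fps, AAC 192 kbps) come from the text above.

```typescript
interface DuckingOptions {
  threshold: number; // linear level at which compression starts
  ratio: number; // how strongly the background is reduced
  attackMs: number; // how quickly the background drops
  releaseMs: number; // how quickly the background recovers
}

// Illustrative starting values, not the tuned Atlas settings.
const DEFAULT_DUCKING: DuckingOptions = {
  threshold: 0.05,
  ratio: 8,
  attackMs: 20,
  releaseMs: 400,
};

// Assumes input 1 is narration and input 2 is the ambient background.
function buildDuckingFilter(opts: DuckingOptions = DEFAULT_DUCKING): string {
  const { threshold, ratio, attackMs, releaseMs } = opts;
  return [
    // Narration is needed twice: once in the mix, once as the sidechain key.
    "[1:a]asplit=2[narr][key]",
    `[2:a][key]sidechaincompress=threshold=${threshold}:ratio=${ratio}:attack=${attackMs}:release=${releaseMs}[ducked]`,
    "[narr][ducked]amix=inputs=2[aout]",
  ].join(";");
}

// Full argument list for a mix-and-encode pass.
function buildMixArgs(
  video: string,
  narration: string,
  ambience: string,
  output: string,
  opts: DuckingOptions = DEFAULT_DUCKING,
): string[] {
  return [
    "-i", video,
    "-i", narration,
    "-i", ambience,
    "-filter_complex", buildDuckingFilter(opts),
    "-map", "0:v",
    "-map", "[aout]",
    "-c:v", "libx264", "-crf", "18", "-r", "24",
    "-c:a", "aac", "-b:a", "192k",
    output,
  ];
}
```

The resulting array could be passed to a process spawner such as Node's `child_process.spawn("ffmpeg", args)`. A single global setting like this is only a starting point; per-segment tuning would mean building one filter per segment with its own `DuckingOptions`.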

Database and State Machine

PostgreSQL 16 via Prisma ORM, 19 models. The Project.status field tracks 23 lifecycle states:

draft → topics_generating → topics_ready → research_generating → research_ready → script_generating → script_ready → voice_generating → voice_ready → scenes_planning → scenes_ready → frames_generating → frames_ready → videos_generating → videos_ready → rendering → render_ready → exporting → complete

  • *failed error states at every generating stage

Every AI API call is recorded as a CostEvent with stage, vendor, model, units, and unitCost. Total cost per project:

$$C_{\text{project}} = \sum_{i=1}^{n} \text{units}_i \times \text{unitCost}_i$$

where \( n \) spans up to 17 tracked cost categories across 4 provider platforms.
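The cost total is a plain sum of units times unit cost over the recorded events. A minimal sketch follows, using the five field names listed above; the `totalCost` and `costByVendor` helpers are hypothetical, and any numbers fed into them in an example would be placeholders, not real provider pricing.

```typescript
// Field names follow the CostEvent description in the text.
interface CostEvent {
  stage: string;
  vendor: string;
  model: string;
  units: number;
  unitCost: number;
}

// Sum of units * unitCost over all events for a project.
function totalCost(events: readonly CostEvent[]): number {
  return events.reduce((sum, e) => sum + e.units * e.unitCost, 0);
}

// Hypothetical breakdown helper: the same sum, grouped by vendor.
function costByVendor(events: readonly CostEvent[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const e of events) {
    totals[e.vendor] = (totals[e.vendor] ?? 0) + e.units * e.unitCost;
  }
  return totals;
}
```

Recording one event per API call, rather than a running total, keeps the per-stage and per-vendor breakdowns derivable after the fact.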

Infrastructure

Deployed on Google Cloud at video.trao.ai. Seven Docker containers are orchestrated via Docker Compose: postgres, redis, api, workers, web, nginx, migrate. Nginx handles SSL termination with Let's Encrypt and 300-second proxy timeouts to accommodate long-running video generation. The frontend polls at 8-second intervals via React Query to surface live pipeline progress.

Challenges we ran into

Cross-scene visual continuity was the hardest problem in the visual pipeline. Without frame chaining, Gemini would subtly drift between scenes -- different character proportions, shifted color grading. The fix required restructuring frame generation to run at concurrency: 1, strictly sequential, passing each completed end frame as the style reference for the next scene's start frame before generation begins.

Character consistency at scale required canonical description injection. A character's appearance is locked once and embedded in every scene's image prompt as a hard constraint. The character looks the same in scene 1 and scene 14 because the exact same description string appears in both prompts.

Sidechain audio mixing was deceptively difficult. Getting the ducking envelope -- how quickly background audio lowers when narration begins, and how quickly it recovers between lines -- to feel natural rather than mechanical required per-segment tuning rather than a flat ratio applied globally.

Pipeline reliability across 5 video providers. Any provider can fail, rate-limit, or return clips outside the expected duration. The Ken Burns fallback, the Veo start-frame-only retry mode, and the \( \pm 10\% \) duration variance tolerance all exist because a pipeline that fails silently is worse than no pipeline at all.
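The duration check and fallback decision described above can be sketched as two small pure functions. The 10% tolerance and the Ken Burns fallback come from the text; the `Attempt` type, the function names, and the idea of evaluating attempts in priority order are illustrative assumptions.

```typescript
// Outcome of one provider attempt; a hypothetical shape for illustration.
type Attempt =
  | { provider: string; ok: true; durationSeconds: number }
  | { provider: string; ok: false; error: string };

const KEN_BURNS_FALLBACK = "ken-burns";

// True when the clip length is within the given fraction of the target.
function isWithinTolerance(
  actualSeconds: number,
  expectedSeconds: number,
  tolerance = 0.1,
): boolean {
  return Math.abs(actualSeconds - expectedSeconds) <= expectedSeconds * tolerance;
}

// Pick the first attempt that succeeded and has an acceptable duration.
// If none qualifies, fall back to the locally rendered pan/zoom clip.
function chooseClipSource(
  attempts: readonly Attempt[],
  expectedSeconds: number,
): string {
  for (const attempt of attempts) {
    if (attempt.ok && isWithinTolerance(attempt.durationSeconds, expectedSeconds)) {
      return attempt.provider;
    }
  }
  return KEN_BURNS_FALLBACK;
}
```

Because the function always returns a source, a scene can never end up without a clip. A successful response with the wrong duration is treated the same way as an outright failure.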

The 23-state project model. Keeping the UI and backend state machine in sync during long-running async jobs -- across 12 workers, 53 API endpoints, and React Query's 8-second polling cycle -- required a careful cache invalidation strategy and explicit auto-navigation logic so users don't get stranded mid-pipeline.
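A state validation guard for the happy path can be sketched as a lookup over the ordered status list. The 19 status names below are the ones listed in the Database and State Machine section; the failed states are deliberately left out because their exact names are not given, and `nextStatus` and `canTransition` are hypothetical helpers.

```typescript
// Happy-path statuses in pipeline order, as listed in the write-up.
const HAPPY_PATH = [
  "draft",
  "topics_generating",
  "topics_ready",
  "research_generating",
  "research_ready",
  "script_generating",
  "script_ready",
  "voice_generating",
  "voice_ready",
  "scenes_planning",
  "scenes_ready",
  "frames_generating",
  "frames_ready",
  "videos_generating",
  "videos_ready",
  "rendering",
  "render_ready",
  "exporting",
  "complete",
] as const;

type Status = (typeof HAPPY_PATH)[number];

// The status that follows `current`, or null at the end of the pipeline.
function nextStatus(current: Status): Status | null {
  const index = HAPPY_PATH.indexOf(current);
  return index >= 0 && index < HAPPY_PATH.length - 1
    ? HAPPY_PATH[index + 1]
    : null;
}

// Guard: only allow a move to the immediately following status.
function canTransition(from: Status, to: Status): boolean {
  return nextStatus(from) === to;
}
```

A guard like this, run on the backend before every status write, rejects out-of-order updates from late or duplicated worker callbacks. The UI can use the same list to decide which screen to navigate to after a poll.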

Accomplishments that we're proud of

The full pipeline works end-to-end. A topic goes in; a composed, watchable video comes out. Orchestrating 9 AI modalities across 8 agent stages, each with its own failure modes and latency profile, into a single reliable artifact is a genuine systems engineering problem -- not just a prompting exercise.

The sound design system. Atlas writes ambient soundscape descriptions in natural language, generates them with ElevenLabs SFX, and applies professional sidechain ducking at the mix stage -- automatically, without the user touching an audio editor. We have not seen this approach in any comparable tool.

Motion enrichment. Before generating video, a dedicated Gemini agent analyzes each scene's narrative purpose and writes detailed camera direction -- zoom arcs, reveal timing, parallax movement. The difference between a clip that pans randomly and one directed to the scene's story beat is visible in the final output.

Graceful degradation that actually works. Every failure point in the pipeline has an explicit fallback with defined behavior. The system always produces a result -- never a blank screen.

Accountable output. Source attribution on every research claim. Quality scoring across 8 script dimensions. Anti-hallucination rules enforced at the prompt level on every visual generation call. Atlas doesn't just produce output -- it produces output you can trace.

What we learned

Multi-modal orchestration is a sequencing and state management problem, not a prompting problem. The hard work was defining execution order, data contracts between pipeline stages, and recovery behavior when any stage fails -- not writing better prompts.

Prompt anchoring beats prompt quality for consistency. A mediocre prompt with a strong visual anchor -- the previous scene's end frame, the canonical character description -- produces more consistent output than a carefully crafted prompt with nothing to anchor to.

AI sound design is genuinely underexplored. Most AI video tools treat audio as an afterthought. Designing ambient soundscapes per scene from narrative content, then mixing them professionally, produces a qualitatively different viewer experience. The gap in the ecosystem here is larger than we expected.

The gap between a demo and a reliable pipeline is almost entirely error handling. Fallback logic, state validation guards, duration variance checks, and worker failure recovery took more total engineering time than the generative features themselves. That ratio surprised us.

What's next for Project Atlas

The pipeline is proven. The next phase builds in five directions.

  1. Custom Visual Style Training -- StyleBible

Atlas currently produces one aesthetic. The StyleBible model is already in the database schema. The next step is the training UI and injection mechanism: an organization uploads visual references -- a brand palette, an illustration style, a design system -- and Atlas extracts a visual DNA that propagates into every imagePrompt and videoPrompt it generates afterward. Every video produced becomes stylistically consistent with that brand, not Kurzgesagt's.

  2. Series Production with Persistent World Memory

Every Atlas video today is a standalone artifact. The next version introduces cross-episode continuity: persistent character designs, established visual world-building, narrative threads that develop over time. A creator defines their universe once via ChannelProfile. Atlas maintains that continuity across every video that follows. This is the architectural difference between a one-shot tool and a genuine creative co-pilot.

  3. Real-Time Voice-Directed Editing via Gemini Live

The current editing interface is visual -- a timeline editor, a ReactFlow scene graph, manual controls. The next version adds a Gemini Live voice agent running alongside the production pipeline. A creator speaks: "Make scene three more urgent -- the character should look more anxious." Gemini Live interprets intent, rewrites the script segment, adjusts voice stability and style parameters, triggers frame regeneration with updated emotional direction, and updates the timeline -- conversationally, in real time. This transforms Atlas from a form-based wizard into a director's chair.

  4. One-Click Multilingual Publishing

The subtitle infrastructure is built. Word-level timestamp sync is live. The next step is voice cloning across languages: a creator selects a voice once, and Atlas re-narrates the full video in Spanish, Hindi, French, and Mandarin without re-recording. The caption translation endpoint already exists. The voice layer completes the loop. One video becomes a global content library. The marginal cost scales as:

$$C_{\text{multilingual}} \approx C_{\text{TTS}} \times L$$

where \( L \) is the number of target languages -- a fraction of the cost of re-recording.

  5. LMS and EdTech Platform Integration

The problem Atlas solves for individual creators is identical to the problem facing universities, corporate training departments, and course platforms: producing high-quality explanatory video at scale is expensive and slow. A professor uploads a lecture outline. A training manager uploads a skills framework. Atlas produces the videos and publishes them directly to the LMS -- captioned, formatted, and ready. This is the path from a creator tool to infrastructure for how the world learns.

Built With

  • amazon-web-services
  • bullmq
  • crossref
  • digitalocean-spaces
  • better-auth
  • docker
  • docker-compose
  • elevenlabs-sfx
  • express.js
  • fal.ai
  • ffmpeg
  • gemini-2.5-flash
  • gemini-3-pro-image-preview
  • gemini-3.1-pro-preview
  • google-trends-api
  • hackernews-api
  • google-cloud
  • kling-o3
  • typescript
  • let's-encrypt
  • next.js
  • nginx
  • openalex
  • postgresql-16
  • prisma
  • radix-ui
  • react
  • react-query
  • reactflow
  • reddit
  • redis
  • replicate
  • brave-search
  • sedance-1.5-pro
  • semantic-scholar
  • sharp
  • tailwind-css
  • turborepo
  • google-adk (@google/adk)
  • veo-3.1
  • wikipedia-api
  • youtube-data-api-v3
  • elevenlabs-v3-tts
  • zod
  • zustand