Inspiration

Seeker grew out of the belief that learning should be a personal, adaptable journey. We were guided by the Rumi quote: "What you seek, is also seeking you."

We envisioned a tool that moves away from the "one-size-fits-all" model of education. Instead of a static textbook, we wanted to build a patient, adaptable tutor—a "Private Academy"—that connects users with knowledge in a way that resonates with their individual psychological profile and learning style.

Technical Architecture

Seeker operates as a distributed system across two repositories to handle the intense computational requirements of generative media.

  • Frontend: Handles the real-time "Live Classroom," audio streaming, and direct user interaction.
  • Backend: Handles long-running agentic workflows (Video Rendering, FFmpeg processing, Comic generation) that cannot run in a browser environment.

Client-Server Interaction Flow

The Frontend manages the immediate learning experience. When a lesson concludes, it dispatches asynchronous triggers to the Backend. The Backend generates high-fidelity assets (using Veo and Gemini) and uploads them to Supabase. The Frontend listens for database changes via WebSockets and updates the UI in real time once the media is ready.
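The Frontend side of this hand-off can be sketched as a small state reducer. This is a minimal illustration under stated assumptions: the `MediaJob` shape, the status values, and the function names are hypothetical and not taken from the Seeker codebase. In practice each `changed` row would arrive through the database change subscription.

```typescript
// Hypothetical row shape for a media job tracked in the database.
type MediaKind = "comic" | "podcast" | "video" | "slides";
type JobStatus = "queued" | "processing" | "ready" | "failed";

interface MediaJob {
  id: string;
  lessonId: string;
  kind: MediaKind;
  status: JobStatus;
  assetUrl: string | null; // filled in by the Backend after upload
}

type JobState = { [id: string]: MediaJob };

// Pure reducer: merge one changed row (as pushed over the WebSocket) into UI state.
function applyJobChange(state: JobState, changed: MediaJob): JobState {
  return { ...state, [changed.id]: changed };
}

// Assets the UI can display right now for a given lesson.
function readyAssets(state: JobState, lessonId: string): MediaJob[] {
  return Object.keys(state)
    .map((id) => state[id])
    .filter((job) => job.lessonId === lessonId && job.status === "ready" && job.assetUrl !== null);
}
```

Keeping the reducer pure means the same logic works whether updates arrive over a socket or from an initial fetch.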


What it does

Seeker is a hyper-personalized AI tutor that transforms raw information into an immersive, multimodal educational experience.

Live Multimodal Classroom

Using the Gemini Live API, Seeker provides a voice-first environment where a virtual teacher speaks, draws diagrams, and writes on a blackboard in real time.


Curriculum Architect

Users upload PDFs (textbooks, papers), and Seeker's "Grand Architect" (Gemini 3 Pro) restructures that raw data into a gamified dependency tree of modules and lesson plans.
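One way to turn such a dependency tree into an unlock order is a topological sort. The sketch below uses Kahn's algorithm; the `CourseModule` shape and the `unlockOrder` name are assumptions for illustration, since the actual schema is not shown in this write-up.

```typescript
// Hypothetical module shape; the real schema is not shown here.
interface CourseModule {
  id: string;
  title: string;
  prerequisites: string[]; // ids of modules that must be completed first
}

// Kahn's algorithm: returns module ids in an order that respects prerequisites.
// Throws if a prerequisite is unknown or the dependencies contain a cycle.
function unlockOrder(modules: CourseModule[]): string[] {
  const remaining = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const m of modules) {
    remaining.set(m.id, m.prerequisites.length);
    dependents.set(m.id, []);
  }
  for (const m of modules) {
    for (const p of m.prerequisites) {
      const list = dependents.get(p);
      if (!list) throw new Error(`Unknown prerequisite: ${p}`);
      list.push(m.id);
    }
  }
  const queue = modules.filter((m) => m.prerequisites.length === 0).map((m) => m.id);
  const order: string[] = [];
  while (queue.length > 0) {
    const id = queue.shift() as string;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const left = (remaining.get(next) ?? 0) - 1;
      remaining.set(next, left);
      if (left === 0) queue.push(next); // all prerequisites satisfied
    }
  }
  if (order.length !== modules.length) throw new Error("Cycle detected in prerequisites");
  return order;
}
```

Running the sort on the model's output also acts as a validation step: a cyclic or dangling prerequisite is rejected before it reaches the student.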


Media Pipeline

After a lesson concludes, the frontend triggers asynchronous generation requests. While the user reviews their notes, the system generates high-fidelity media:

1. The Comic Book Factory

Goal: Convert abstract concepts into a 5-page graphic novel.

Innovation: Separates the "Director" (Text/Layout) from the "Artist" (Image Generation).

  • Director Agent (Gemini 3 Pro): Reads the lesson notes and outputs a JSON manifest containing panel descriptions, dialogue, and a specific "Visual Anchor" (e.g., "Vintage Ink Style, 1920s setting").
  • Artist Agent (Gemini 3 Image): Iterates through the manifest, generating images that strictly adhere to the Director's visual anchors.
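The Director-to-Artist hand-off might look roughly like the sketch below. The manifest field names (`visualAnchor`, `panels`, and so on) and the prompt wording are assumptions for illustration, not the project's actual schema.

```typescript
// Hypothetical manifest shape produced by the Director agent.
interface ComicPanel {
  page: number;
  description: string;
  dialogue: string;
}

interface ComicManifest {
  visualAnchor: string; // e.g. "Vintage Ink Style, 1920s setting"
  panels: ComicPanel[];
}

// Build one image prompt per panel, repeating the same visual anchor each time
// so the Artist never has to infer the style from earlier panels.
function buildPanelPrompts(manifest: ComicManifest): string[] {
  return manifest.panels.map((panel) =>
    [
      `Style (must not change): ${manifest.visualAnchor}`,
      `Page ${panel.page}: ${panel.description}`,
      `Dialogue: "${panel.dialogue}"`,
    ].join("\n"),
  );
}
```

Because the anchor is injected by code rather than remembered by the model, every panel request carries the identical style instruction.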


2. The Podcast Creator

Goal: Convert a lesson into an engaging 2-person dialogue.

Innovation: Multi-speaker synthesis.

  • Scriptwriter: Converts notes into a transcript between "Alex" (curious) and "Sam" (expert).
  • Audio Synth: Uses Gemini 2.5 Flash TTS with multiSpeakerVoiceConfig to generate a single audio file with distinct voices.
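A rough sketch of the two steps follows, assuming the transcript uses "Alex:" and "Sam:" line prefixes. The nested field names under `multiSpeakerVoiceConfig` follow the public Gemini API documentation as best understood here and should be checked against the SDK version in use; the helper names are hypothetical, and any voice names passed in are placeholders.

```typescript
type Speaker = "Alex" | "Sam";

interface Turn {
  speaker: Speaker;
  text: string;
}

// Parse "Alex: ..." / "Sam: ..." lines; anything else is ignored.
function parseTranscript(transcript: string): Turn[] {
  const turns: Turn[] = [];
  for (const line of transcript.split("\n")) {
    const match = /^(Alex|Sam):\s*(.+)$/.exec(line.trim());
    if (match) turns.push({ speaker: match[1] as Speaker, text: match[2] });
  }
  return turns;
}

// Build the speech config object that maps each speaker label to a voice.
function buildSpeechConfig(voices: { Alex: string; Sam: string }) {
  const speakers: Speaker[] = ["Alex", "Sam"];
  return {
    multiSpeakerVoiceConfig: {
      speakerVoiceConfigs: speakers.map((speaker) => ({
        speaker,
        voiceConfig: { prebuiltVoiceConfig: { voiceName: voices[speaker] } },
      })),
    },
  };
}
```

Validating the transcript before synthesis catches lines with an unexpected speaker label, which would otherwise have no voice assigned.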


3. The Veo Cinematic Engine

Goal: Generate 8-second looping video explainers with consistent characters.

Innovation: Solves the "flickering character" problem in AI video by generating a "Character DNA" grid first.

  • Identity Agent: Analyzes the lesson to define a visual style and protagonist.
  • DNA Generator: Creates a 2x2 Character Reference Grid (Front, Side, 3/4, and Back views).
  • Veo Production: Passes the Reference Grid to Veo 3.1 for every scene generation to ensure the character looks the same in Scene 1 and Scene 4.


4. The Slide Generator

Goal: Create a traditional video lecture (.mp4) from text notes automatically.

Innovation: Programmatic video editing. We use Gemini to generate the assets, but use FFmpeg to handle the timing and text rendering for perfect synchronization.

  • Manifest Agent: Breaks the lesson into 30-second blocks.
  • Asset Generation: Parallel generation of TTS Audio (Gemini 2.5) and Background Images (Gemini 3 Image).
  • FFmpeg Rendering: Uses complex filter graphs (drawtext, overlay) to burn text onto the video at specific timestamps.
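The timed text overlay can be sketched as a filter-string builder. This is an illustrative sketch, not the project's actual filter graph: the `SlideBlock` shape, font size, and positioning are assumptions. It reads slide text from files via drawtext's `textfile` option to sidestep filtergraph escaping, and it assumes all paths are free of characters that need escaping.

```typescript
// Hypothetical description of one slide's text overlay.
interface SlideBlock {
  textFile: string; // path to a file containing the slide text
  start: number; // seconds
  end: number; // seconds
}

// Build a comma-separated drawtext chain, one filter per slide block.
// enable='between(t,a,b)' uses FFmpeg timeline editing to show text only in [a, b].
function buildDrawtextChain(blocks: SlideBlock[], fontFile: string): string {
  return blocks
    .map((b) =>
      [
        `drawtext=fontfile=${fontFile}`,
        `textfile=${b.textFile}`,
        "expansion=none", // treat the text literally, no %{...} expansion
        "fontsize=48",
        "fontcolor=white",
        "x=(w-text_w)/2",
        "y=h-text_h-80",
        `enable='between(t,${b.start},${b.end})'`,
      ].join(":"),
    )
    .join(",");
}
```

The resulting string would be passed to FFmpeg as the value of `-vf`. With 30-second blocks, the start and end times fall out directly from the block index.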


Socratic Examination

To ensure mastery, students must pass an oral exam where the AI challenges their logic before unlocking the next chapter.

How we built it

Seeker is built on a modern full-stack architecture designed for high-performance AI media processing:

  • Frontend: React (Vite) and TypeScript handle the real-time UI. We used WebSockets for low-latency communication and AudioContext for high-fidelity PCM audio streaming.
  • Backend: A NestJS server acts as the "Media Production Studio," orchestrating heavy FFmpeg tasks and AI generation requests.
  • The AI Engine:
    • Gemini 2.5 Flash (Native Audio): Powers the Live Classroom with <300ms latency.
    • Gemini 3 Pro (Reasoning): Acts as the "Architect" for curriculum design and "Showrunner" for video scripts.
    • Gemini 3 Image: Generates high-fidelity diagrams and comic panels.
    • Google Veo 3.1: Creates the cinematic video content with temporal consistency.
  • Infrastructure: Supabase manages the real-time database, media storage, and student XP/badge persistence.
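For the PCM streaming path, incoming audio chunks need to be converted to floating-point samples before the Web Audio API can play them. The following is a minimal sketch, assuming the stream delivers signed 16-bit little-endian PCM; the function name is hypothetical.

```typescript
// Convert signed 16-bit little-endian PCM bytes into Float32 samples in [-1, 1),
// the sample format that Web Audio buffers use.
function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const sampleCount = Math.floor(bytes.byteLength / 2); // drop a trailing odd byte
  const view = new DataView(bytes.buffer, bytes.byteOffset, sampleCount * 2);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    out[i] = view.getInt16(i * 2, true) / 32768; // true = little-endian
  }
  return out;
}
```

In the browser, the result could be copied into an AudioBuffer created with `audioContext.createBuffer(1, samples.length, 24000)` via `copyToChannel(samples, 0)`. The 24 kHz rate is an assumption based on the `-ar 24000` flag mentioned under Challenges.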

Gemini 3 Model Usage Map

| Feature | Gemini 3 Model | Specific Use Case | Why Gemini 3? | Code Reference |
|---|---|---|---|---|
| Curriculum Architect | Gemini 3 Pro (gemini-3-pro-preview) | Analyzes PDF textbooks and restructures them into pedagogical learning trees (Courses → Modules → Lessons) | High-reasoning capabilities needed to understand academic prerequisites and logical dependencies between concepts | CoursesPage.tsx (lines 74-115): Uses Gemini 3 Pro with generateContent to process uploaded PDFs and extract structured course data |
| Comic Book Director | Gemini 3 Pro (gemini-3-pro-preview) | Creates visual identity guides and 5-page comic storyboards with consistent art direction | Advanced reasoning to maintain narrative coherence and visual continuity across multiple panels | slides.service.ts (lines 353-391): Generates JSON manifest with thematic_era, style_guide, and visual_anchors for consistency |
| Comic Panel Artist | Gemini 3 Pro Image (gemini-3-pro-image-preview) | Renders individual comic panels following strict visual guidelines from the Director agent | Superior image fidelity and style adherence compared to previous generations | slides.service.ts (lines 395-419): Generates sequential panels with character consistency using responseModalities: ["IMAGE"] |
| Cinematic Script Writer | Gemini 3 Pro (gemini-3-pro-preview) | Generates 4-scene video narratives with detailed camera directions and character actions | Complex creative reasoning to balance educational content with cinematic storytelling | video.service.ts (lines 113-137): Outputs JSON with scene-by-scene breakdowns including action prompts and dialogue |
| Character DNA Generator | Gemini 3 Pro Image (gemini-3-pro-image-preview) | Creates 2x2 reference grids showing characters from 4 angles (front, side, 3/4, back) | High-detail consistency needed for multi-view character sheets that serve as reference for Veo 3.1 | video.service.ts (lines 86-110): Generates professional character reference sheets with multi-angle views |
| Scene Thumbnail Creator | Gemini 3 Pro Image (gemini-3-pro-image-preview) | Produces anchor frames for each video scene to guide Veo composition | Precise control over composition, lighting, and framing for cinematic quality | video.service.ts (lines 155-173): Creates key frames for video generation using scene action prompts |
| Educational Diagrams | Gemini 3 Pro Image (gemini-3-pro-image-preview) | Real-time diagram generation during live lessons (isolated cutouts on white backgrounds) | Fast generation with professional illustration quality for classroom clarity | LessonPage.tsx (lines 43-52): Function declaration for generate_educational_diagram tool with Gemini 3 Pro Image |
| Course Banner Artist | Gemini 3 Pro Image (gemini-3-pro-image-preview) | Creates high-fidelity academic cover images for course cards | Premium visual quality for professional course presentation | NotesPage.tsx (lines 196-210): Generates wide-format academic banners with IMAGE_MODEL_NAME constant |
| Notes Polishing Agent | Gemini 3 Pro (gemini-3-pro-preview) | Converts raw lesson transcripts into structured Markdown study notes with academic formatting | Advanced text structuring and academic writing capabilities | NotesPage.tsx (lines 216-228): Transforms unstructured transcripts using TEXT_MODEL_NAME with expert academic prompt |
| Slide Content Architect | Gemini 3 Pro (gemini-3-pro-preview) | Breaks lessons into 30-second presentation blocks with narration, bullets, and image prompts | Complex content decomposition and pedagogical timing | slides.service.ts (lines 70-80): Generates slide manifest with timing specifications using gemini-3-pro-preview |
| Podcast Script Generator | Gemini 3 Flash (gemini-2.0-flash) | Creates engaging 2-person dialogue scripts between "Alex" and "Sam" personalities | Natural dialogue generation optimized for multi-speaker TTS | slides.service.ts (lines 241-267): Outputs conversational podcast format with character-based dialogue |

Gemini 3 Innovation Highlights

🎯 The Director-Artist Pattern

We implemented a novel two-stage pipeline where Gemini 3 Pro acts as the Creative Director (defining style, continuity rules, and narrative structure) while Gemini 3 Pro Image acts as the Artist (executing the vision with pixel-perfect fidelity). This separation of concerns ensures:

  • Visual Consistency: The Director enforces strict style guides across all generated panels
  • Narrative Coherence: Story arcs maintain logical flow even with procedurally generated content
  • Production Scalability: The same Director manifest can be re-rendered with different art styles

🎬 Character DNA Technology

The 2x2 Character Reference Grid (generated by Gemini 3 Pro Image) solved the critical "shapeshifting protagonist" problem in AI video generation:

// Gemini 3 generates a multi-angle reference sheet
const charGridPrompt = `Professional 2x2 character reference:
- Top-left: Front view
- Top-right: Side profile
- Bottom-left: 3/4 view  
- Bottom-right: Back view
Art style: ${visualId.art_style}
Character: ${visualId.protagonist_description}`;

This grid is then passed to Veo 3.1 as a referenceImage for every scene, ensuring the character looks identical across all 8-second clips.
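Reusing one reference grid across scenes can be sketched as follows. The request shape is deliberately simplified and hypothetical: only the `referenceImage`, `bytesBase64Encoded`, and `mimeType` names come from this write-up, and a real Veo request carries additional fields.

```typescript
// Image payload shape as reported in this write-up for Veo 3.1.
interface VeoImage {
  bytesBase64Encoded: string;
  mimeType: string;
}

// Simplified, hypothetical per-scene request.
interface SceneRequest {
  prompt: string;
  referenceImage: VeoImage;
}

// Attach the same character grid to every scene prompt so that no scene
// is generated without the reference.
function buildSceneRequests(scenePrompts: string[], charGridB64: string): SceneRequest[] {
  const referenceImage: VeoImage = { bytesBase64Encoded: charGridB64, mimeType: "image/png" };
  return scenePrompts.map((prompt) => ({ prompt, referenceImage }));
}
```

Building all requests from a single grid value makes it structurally impossible for Scene 4 to receive a different reference than Scene 1.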

🧠 High-Reasoning Curriculum Design

Using Gemini 3 Pro's thinkingLevel: ThinkingLevel.HIGH configuration, the Curriculum Architect can:

  • Identify prerequisite relationships between topics
  • Create optimal learning sequences based on cognitive load theory
  • Generate age-appropriate difficulty progression

Challenges we ran into

1. Narrative Discontinuity (The "Shapeshifting" Protagonist)

Initially, generating 8 independent clips led to "Visual Hallucination." A character might look like a 3D general in Scene 1 and a watercolor robot in Scene 2.

  • The Fix: We implemented the "2x2 Character DNA Grid." Before filming, the AI generates a reference sheet (Front, Side, 3/4, and Back views). This grid is passed into every Veo request as a referenceImage, forcing visual continuity.

2. Audio Mismatch & Header Corruption

AI-generated audio (raw PCM) often lacks headers, causing FFmpeg to "screech" or fail when stitching media.

  • The Fix: We built an intermediate conversion step using -f s16le and -ar 24000 to declare the sample format (signed 16-bit little-endian PCM) and the sample rate (24 kHz) explicitly before merging. Furthermore, we moved to Veo 3.1 Native Audio Synthesis, using specific syntax ("Quotes" for dialogue and (Parentheses) for SFX) to generate synced audio inside the video file itself.
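The conversion step could be expressed as an argument builder like the one below. This is a sketch, not the project's actual command: the function name and file paths are hypothetical, and the mono channel count (`-ac 1`) is an assumption not stated in the write-up.

```typescript
// Build FFmpeg arguments that describe headerless PCM before the input is read.
// Format options must appear before -i so they apply to the raw input file.
function rawPcmToWavArgs(
  inputPath: string,
  outputPath: string,
  sampleRate: number = 24000,
  channels: number = 1, // assumption: mono audio
): string[] {
  return [
    "-y", // overwrite the output file if it exists
    "-f", "s16le", // input format: signed 16-bit little-endian PCM
    "-ar", String(sampleRate), // input sample rate in Hz
    "-ac", String(channels), // input channel count
    "-i", inputPath,
    outputPath, // container inferred from the extension, e.g. .wav
  ];
}
```

The array can be handed to a process spawner as-is. Writing a WAV file first gives later stitching steps an input with a proper header.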

3. The SDK Schema Trap

We faced 400: INVALID_ARGUMENT errors because Veo 3.1 and Gemini Pro expect different data structures.

  • The Fix: We discovered that while Gemini Pro uses inlineData, the Video engine requires a flattened object with the key bytesBase64Encoded.
// ✅ Veo 3.1 Correct Schema
image: { 
  bytesBase64Encoded: charGridB64, 
  mimeType: "image/png" 
}
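A small adapter can keep the two shapes from being mixed up at call sites. In this sketch the `inlineData` shape (`data`, `mimeType`) matches how Gemini represents inline binary parts, while the function and type names are hypothetical.

```typescript
// Inline binary part as used in Gemini content requests.
interface InlineDataPart {
  inlineData: { data: string; mimeType: string };
}

// Flattened image object reported in this write-up as accepted by Veo 3.1.
interface VeoImagePayload {
  bytesBase64Encoded: string;
  mimeType: string;
}

// Convert a Gemini-style part into the flattened Veo-style object.
function toVeoImage(part: InlineDataPart): VeoImagePayload {
  const { data, mimeType } = part.inlineData;
  if (data.length === 0) throw new Error("inlineData.data is empty");
  return { bytesBase64Encoded: data, mimeType };
}
```

Routing every image through one adapter means a schema change in either API only has to be handled in one place.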

4. The "Media Pipe" Error

Relying on ffmpeg-static proved insufficient as it lacked the drawtext filter.

  • The Fix: We implemented a system-level check (getSystemFont) to locate the full FFmpeg installation and hard-coded font paths (e.g., /usr/share/fonts/...) to ensure text overlays would render correctly on the classroom slides.
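A startup check for the filter could parse the text printed by `ffmpeg -filters`. The sketch below assumes the usual column layout of that output (flags, then filter name, then input/output types); the `hasFilter` name is hypothetical and is not the project's getSystemFont helper.

```typescript
// Returns true if `name` is listed as a filter in the output of `ffmpeg -filters`.
// Each filter line looks roughly like: " T.C drawtext   V->V   Draw text ..."
function hasFilter(filtersOutput: string, name: string): boolean {
  return filtersOutput.split("\n").some((line) => {
    const columns = line.trim().split(/\s+/);
    // columns[0] = capability flags, columns[1] = filter name, columns[2] = I/O types
    return columns.length >= 3 && columns[1] === name;
  });
}
```

On a server, the output text could be captured with Node's `child_process.execFileSync("ffmpeg", ["-hide_banner", "-filters"])`, and the service could refuse to start slide rendering if `hasFilter(output, "drawtext")` is false.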

Accomplishments that we're proud of

  • Cohesive Cinematic Pipeline: Successfully using Video Extension Chains so that Scene 2 physically "extends" the file from Scene 1, keeping backgrounds and camera positions identical.
  • The Showrunner AI: Moving from a simple "generator" to an AI that defines a "Visual Signature" (like 16mm grain) and maintains it throughout a production.
  • Multimodal Integration: Creating a system where a single PDF can be transformed into audio, video, text, and interactive quizzes without losing the "thread" of knowledge.
  • Temporal Consistency: Achieving character-locked video content that makes AI-generated media feel like a professional educational film.

What we learned

  • Schema Nuance: We learned that even within the same SDK, different models (Pro vs. Veo) have radically different data formatting requirements.
  • System-Level Dependencies: Building AI media tools requires deep knowledge of system tools like FFmpeg—static libraries aren't always enough for complex filters.
  • Thought Signature Circulation: To keep the AI from "losing its train of thought" across multiple media generation steps, we learned to save and recirculate the model's internal thought signatures in a history array.
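Recirculating thought signatures mostly comes down to not rebuilding the model's reply before sending it back. The following is a minimal sketch; the `thoughtSignature` field name follows the public Gemini API documentation as best understood here, and the type and function names are assumptions.

```typescript
// Simplified content types; real responses carry more fields.
interface Part {
  text?: string;
  thoughtSignature?: string; // opaque value returned by the model
}

interface Content {
  role: "user" | "model";
  parts: Part[];
}

// Append the model's reply with its parts copied field-for-field, so any
// thoughtSignature values travel back to the model with the next request.
function appendModelTurn(history: Content[], modelParts: Part[]): Content[] {
  return [...history, { role: "model", parts: modelParts.map((p) => ({ ...p })) }];
}

// Append the next user instruction to the same history.
function appendUserTurn(history: Content[], text: string): Content[] {
  return [...history, { role: "user", parts: [{ text }] }];
}
```

The common failure mode this avoids is extracting only the `text` of a reply and reconstructing a fresh part from it, which silently drops the signature.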

What's next for Seeker

  • Linear Skill Trees: We are currently developing a visual RPG-style Skill Tree for courses, allowing students to see their progress as a literal "map" of knowledge.
  • Collaborative Academies: Allowing students to invite friends into a "Group Blackboard" session where the AI tutor moderates a debate.
  • Veo 4K Extension: Once API constraints expand, we plan to upscale our cinematic explainer videos to 4K resolution while maintaining current continuity features.
  • Offline Mode: Edge-based AI processing to allow "Academy" access in low-connectivity environments.
