Seeker

Inspiration

Seeker grew out of a belief that learning should be a personal, adaptable journey. We were guided by the Rumi quote: "What you seek, is also seeking you."

We envisioned a tool that moves away from the "one-size-fits-all" model of education. Instead of a static textbook, we wanted to build a patient, adaptable tutor, a "Private Academy," that connects users with knowledge in a way that resonates with their individual psychological profile and learning style.

Technical Architecture

Seeker operates as a distributed system across two repositories to handle the intense computational requirements of generative media.

Frontend: Handles the real-time "Live Classroom," audio streaming, and direct user interaction.
Backend: Handles long-running agentic workflows (Video Rendering, FFmpeg processing, Comic generation) that cannot run in a browser environment.

Client-Server Interaction Flow

The Frontend manages the immediate learning experience. When a lesson concludes, it dispatches asynchronous triggers to the Backend. The Backend generates high-fidelity assets (using Veo and Gemini) and uploads them to Supabase. The Frontend listens for database changes via WebSockets to update the UI in real-time when the media is ready.
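
As a rough illustration of the last step, the sketch below shows how the Frontend might fold database change notifications into UI state. The row shape, the event shape, and every name here (MediaJobRow, applyMediaChange, readyAssets) are assumptions made for the example; the actual Supabase schema is not part of this write-up.

```typescript
// Hypothetical row shape for one generated asset; not the project's real schema.
type MediaKind = "comic" | "podcast" | "video" | "slides";
type MediaStatus = "queued" | "rendering" | "ready" | "failed";

interface MediaJobRow {
  id: string;
  lessonId: string;
  kind: MediaKind;
  status: MediaStatus;
  assetUrl: string | null;
}

// Hypothetical change notification, loosely modeled on a realtime database event.
interface MediaChangeEvent {
  eventType: "INSERT" | "UPDATE" | "DELETE";
  row: MediaJobRow;
}

type MediaState = Readonly<Record<string, MediaJobRow>>;

// Pure reducer: returns a new object so the UI layer can detect the change.
function applyMediaChange(state: MediaState, event: MediaChangeEvent): MediaState {
  const next: Record<string, MediaJobRow> = { ...state };
  if (event.eventType === "DELETE") {
    delete next[event.row.id];
  } else {
    next[event.row.id] = event.row;
  }
  return next;
}

// Lists the assets of one lesson that are ready to display.
function readyAssets(state: MediaState, lessonId: string): MediaJobRow[] {
  return Object.values(state).filter(
    (job) => job.lessonId === lessonId && job.status === "ready" && job.assetUrl !== null,
  );
}
```

Keeping the reducer free of network code means the same function can be driven by a live subscription callback or by recorded events in a test.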

What it does

Seeker is a hyper-personalized AI tutor that transforms raw information into an immersive, multimodal educational experience.

Live Multimodal Classroom

Utilizing the Gemini Live API, Seeker provides a voice-first environment where a virtual teacher speaks, draws diagrams, and writes on a blackboard in real-time.

Curriculum Architect

Users upload PDFs (textbooks, papers), and Seeker's "Grand Architect" (Gemini 3 Pro) restructures that raw data into a gamified dependency tree of modules and lesson plans.
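
To make the "dependency tree" idea concrete, here is a minimal sketch of how an unlock order could be derived from module prerequisites with a standard topological sort (Kahn's algorithm). The CourseModule shape and the unlockOrder function are hypothetical; the write-up does not show the structure the Grand Architect actually emits.

```typescript
// Hypothetical module shape; field names are assumptions for this example.
interface CourseModule {
  id: string;
  title: string;
  prerequisites: string[]; // ids of modules that must be completed first
}

// Kahn's algorithm: returns module ids so that every prerequisite comes first.
// Throws on an unknown prerequisite id or on a dependency cycle.
function unlockOrder(modules: CourseModule[]): string[] {
  const remaining = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const m of modules) {
    remaining.set(m.id, m.prerequisites.length);
    dependents.set(m.id, []);
  }
  for (const m of modules) {
    for (const p of m.prerequisites) {
      const list = dependents.get(p);
      if (list === undefined) throw new Error(`Unknown prerequisite: ${p}`);
      list.push(m.id);
    }
  }
  // Start with modules that have no prerequisites at all.
  const queue = modules.filter((m) => m.prerequisites.length === 0).map((m) => m.id);
  const order: string[] = [];
  while (queue.length > 0) {
    const id = queue.shift() as string;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const left = (remaining.get(next) ?? 0) - 1;
      remaining.set(next, left);
      if (left === 0) queue.push(next);
    }
  }
  if (order.length !== modules.length) throw new Error("Cycle detected in prerequisites");
  return order;
}
```

Running a check like this on model output is one way to reject a curriculum whose prerequisites loop back on themselves before it reaches the student.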

Media Pipeline

After a lesson concludes, the frontend triggers asynchronous generation requests. While the user reviews their notes, the system generates high-fidelity media:

1. The Comic Book Factory

Goal: Convert abstract concepts into a 5-page graphic novel.

Innovation: Separates the "Director" (Text/Layout) from the "Artist" (Image Generation).

Director Agent (Gemini 3 Pro): Reads the lesson notes and outputs a JSON manifest containing panel descriptions, dialogue, and a specific "Visual Anchor" (e.g., "Vintage Ink Style, 1920s setting").
Artist Agent (Gemini 3 Image): Iterates through the manifest, generating images that strictly adhere to the Director's visual anchors.
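
The Director/Artist split can be sketched as a validation step followed by a prompt builder that prefixes every panel with the same visual anchor. The manifest fields (visualAnchor, panels, page, description, dialogue) and both function names are illustrative assumptions, not the project's actual schema.

```typescript
// Hypothetical manifest shape produced by the Director agent.
interface ComicPanel {
  page: number;
  description: string;
  dialogue: string;
}

interface ComicManifest {
  visualAnchor: string;
  panels: ComicPanel[];
}

// Validates untrusted model output before any image request is made.
function parseManifest(json: string): ComicManifest {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) throw new Error("Manifest must be an object");
  const { visualAnchor, panels } = raw as { visualAnchor?: unknown; panels?: unknown };
  if (typeof visualAnchor !== "string" || visualAnchor.trim() === "") {
    throw new Error("Missing visualAnchor");
  }
  if (!Array.isArray(panels) || panels.length === 0) throw new Error("Manifest has no panels");
  return {
    visualAnchor,
    panels: panels.map((p: unknown, i: number) => {
      const panel = (p ?? {}) as Partial<ComicPanel>;
      if (typeof panel.description !== "string") throw new Error(`Panel ${i} lacks a description`);
      return {
        page: typeof panel.page === "number" ? panel.page : i + 1,
        description: panel.description,
        dialogue: typeof panel.dialogue === "string" ? panel.dialogue : "",
      };
    }),
  };
}

// Builds one Artist prompt per panel; the shared anchor always comes first.
function buildPanelPrompts(manifest: ComicManifest): string[] {
  return manifest.panels.map((panel) =>
    [
      `Style: ${manifest.visualAnchor}`,
      `Page ${panel.page}: ${panel.description}`,
      panel.dialogue !== "" ? `Dialogue: "${panel.dialogue}"` : "",
    ]
      .filter((line) => line !== "")
      .join("\n"),
  );
}
```

Because the anchor is injected by code rather than restated by the model for each panel, a single edit to the manifest changes the style of the whole comic.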

2. The Podcast Creator

Goal: Convert a lesson into an engaging 2-person dialogue.

Innovation: Multi-speaker synthesis.

Scriptwriter: Converts notes into a transcript between "Alex" (curious) and "Sam" (expert).
Audio Synth: Uses Gemini 2.5 Flash TTS with multiSpeakerVoiceConfig to generate a single audio file with distinct voices.
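
A minimal sketch of the two pieces this step needs: a speaker-labeled transcript and a per-speaker voice mapping. Only multiSpeakerVoiceConfig is named in this write-up; the nested field names (speakerVoiceConfigs, voiceConfig, prebuiltVoiceConfig, voiceName) reflect our understanding of the Gemini speech configuration and should be checked against the SDK in use. The voice names in the usage note are placeholders.

```typescript
// One line of dialogue from the Scriptwriter.
interface DialogueTurn {
  speaker: "Alex" | "Sam";
  text: string;
}

// Renders the transcript as "Speaker: line" rows, one per turn.
function toTtsPrompt(turns: DialogueTurn[]): string {
  return turns.map((t) => `${t.speaker}: ${t.text.trim()}`).join("\n");
}

// Builds a speech config object that maps each speaker label to a voice.
// The nesting is an assumption about the API shape, not confirmed by this document.
function buildSpeechConfig(voices: Record<"Alex" | "Sam", string>) {
  const speakers = Object.keys(voices) as Array<"Alex" | "Sam">;
  return {
    multiSpeakerVoiceConfig: {
      speakerVoiceConfigs: speakers.map((speaker) => ({
        speaker,
        voiceConfig: { prebuiltVoiceConfig: { voiceName: voices[speaker] } },
      })),
    },
  };
}
```

For example, buildSpeechConfig({ Alex: "Puck", Sam: "Kore" }) yields two speaker entries whose labels match the prefixes that toTtsPrompt writes into the transcript, which is what lets a single request return one file with two distinct voices.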

3. The Veo Cinematic Engine

Goal: Generate 8-second looping video explainers with consistent characters.

Innovation: Solves the "flickering character" problem in AI video by generating a "Character DNA" grid first.

Identity Agent: Analyzes the lesson to define a visual style and protagonist.
DNA Generator: Creates a 2x2 Character Reference Grid (Front, Side, 3/4, and Back views).
Veo Production: Passes the Reference Grid to Veo 3.1 for every scene generation to ensure the character looks the same in Scene 1 and Scene 4.

4. The Slide Generator

Goal: Create a traditional video lecture (.mp4) from text notes automatically.

Innovation: Programmatic video editing. We use Gemini to generate the assets, but use FFmpeg to handle the timing and text rendering for perfect synchronization.

Manifest Agent: Breaks the lesson into 30-second blocks.
Asset Generation: Parallel generation of TTS Audio (Gemini 2.5) and Background Images (Gemini 3 Image).
FFmpeg Rendering: Uses complex filter graphs (drawtext, overlay) to burn text onto the video at specific timestamps.
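
The timed text burn-in can be illustrated with a small builder that emits one drawtext filter per block and gates each with an enable expression. This is a sketch under simplifying assumptions: SlideBlock, sanitizeDrawtext, and buildDrawtextChain are hypothetical names, the font size and position are arbitrary, and the text handling is deliberately conservative because FFmpeg filter strings have several escaping levels.

```typescript
// Hypothetical description of one on-screen text block.
interface SlideBlock {
  text: string;
  startSec: number;
  endSec: number;
}

// Conservative cleanup: drops backslashes, swaps apostrophes for a typographic
// one to avoid nested quoting, and escapes the colon used as option separator.
function sanitizeDrawtext(text: string): string {
  return text
    .replace(/\\/g, "")
    .replace(/'/g, "\u2019")
    .replace(/:/g, "\\:");
}

// Returns a comma-separated chain of drawtext filters, one per block.
function buildDrawtextChain(blocks: SlideBlock[], fontFile: string): string {
  return blocks
    .map((b) => {
      if (b.endSec <= b.startSec) throw new Error(`Block "${b.text}" has no duration`);
      return [
        `drawtext=fontfile='${fontFile}'`,
        `text='${sanitizeDrawtext(b.text)}'`,
        "fontsize=48",
        "fontcolor=white",
        "x=(w-text_w)/2", // horizontally centered
        "y=h-text_h-80", // near the bottom edge
        `enable='between(t,${b.startSec},${b.endSec})'`, // visible only in this window
      ].join(":");
    })
    .join(",");
}
```

The resulting string would be passed to FFmpeg as a video filter argument. Note that between is inclusive at both ends, so adjacent blocks overlap for an instant at the shared boundary.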

Socratic Examination

To ensure mastery, students must pass an oral exam where the AI challenges their logic before unlocking the next chapter.

How we built it

Seeker is built on a modern full-stack architecture designed for high-performance AI media processing:

Frontend: React (Vite) and TypeScript handle the real-time UI. We used WebSockets for low-latency communication and AudioContext for high-fidelity PCM audio streaming.
Backend: A NestJS server acts as the "Media Production Studio," orchestrating heavy FFmpeg tasks and AI generation requests.
The AI Engine:
- Gemini 2.5 Flash (Native Audio): Powers the Live Classroom for <300ms latency.
- Gemini 3 Pro (Reasoning): Acts as the "Architect" for curriculum design and "Showrunner" for video scripts.
- Gemini 3 Image: Generates high-fidelity diagrams and comic panels.
- Google Veo 3.1: Creates the cinematic video content with temporal consistency.
Infrastructure: Supabase manages the real-time database, media storage, and student XP/badge persistence.
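
For the PCM streaming mentioned under Frontend, the core conversion is from 16-bit integer samples to the floating-point range that Web Audio buffers expect. The sketch below assumes little-endian, mono, 16-bit input; pcm16ToFloat32 is a hypothetical helper name, not a function from the project.

```typescript
// Converts raw 16-bit little-endian PCM bytes into samples in [-1, 1).
// A trailing odd byte, if any, is ignored.
function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const sampleCount = Math.floor(bytes.byteLength / 2);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Dividing by 32768 maps the int16 range onto [-1, 1).
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}
```

In a browser, the returned array could be copied into an AudioBuffer created at the stream's sample rate and scheduled on an AudioContext for playback.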

Gemini 3 Model Usage Map

Feature	Gemini 3 Model	Specific Use Case	Why Gemini 3?	Code Reference
Curriculum Architect	Gemini 3 Pro (`gemini-3-pro-preview`)	Analyzes PDF textbooks and restructures them into pedagogical learning trees (Courses → Modules → Lessons)	High-reasoning capabilities needed to understand academic prerequisites and logical dependencies between concepts	`CoursesPage.tsx` (lines 74-115) Uses Gemini 3 Pro with `generateContent` to process uploaded PDFs and extract structured course data
Comic Book Director	Gemini 3 Pro (`gemini-3-pro-preview`)	Creates visual identity guides and 5-page comic storyboards with consistent art direction	Advanced reasoning to maintain narrative coherence and visual continuity across multiple panels	`slides.service.ts` (lines 353-391) Generates JSON manifest with `thematic_era`, `style_guide`, and `visual_anchors` for consistency
Comic Panel Artist	Gemini 3 Pro Image (`gemini-3-pro-image-preview`)	Renders individual comic panels following strict visual guidelines from the Director agent	Superior image fidelity and style adherence compared to previous generations	`slides.service.ts` (lines 395-419) Generates sequential panels with character consistency using `responseModalities: ["IMAGE"]`
Cinematic Script Writer	Gemini 3 Pro (`gemini-3-pro-preview`)	Generates 4-scene video narratives with detailed camera directions and character actions	Complex creative reasoning to balance educational content with cinematic storytelling	`video.service.ts` (lines 113-137) Outputs JSON with scene-by-scene breakdowns including action prompts and dialogue
Character DNA Generator	Gemini 3 Pro Image (`gemini-3-pro-image-preview`)	Creates 2x2 reference grids showing characters from 4 angles (front, side, 3/4, back)	High-detail consistency needed for multi-view character sheets that serve as reference for Veo 3.1	`video.service.ts` (lines 86-110) Generates professional character reference sheets with multi-angle views
Scene Thumbnail Creator	Gemini 3 Pro Image (`gemini-3-pro-image-preview`)	Produces anchor frames for each video scene to guide Veo composition	Precise control over composition, lighting, and framing for cinematic quality	`video.service.ts` (lines 155-173) Creates key frames for video generation using scene action prompts
Educational Diagrams	Gemini 3 Pro Image (`gemini-3-pro-image-preview`)	Real-time diagram generation during live lessons (isolated cutouts on white backgrounds)	Fast generation with professional illustration quality for classroom clarity	`LessonPage.tsx` (lines 43-52) Function declaration for `generate_educational_diagram` tool with Gemini 3 Pro Image
Course Banner Artist	Gemini 3 Pro Image (`gemini-3-pro-image-preview`)	Creates high-fidelity academic cover images for course cards	Premium visual quality for professional course presentation	`NotesPage.tsx` (lines 196-210) Generates wide-format academic banners with `IMAGE_MODEL_NAME` constant
Notes Polishing Agent	Gemini 3 Pro (`gemini-3-pro-preview`)	Converts raw lesson transcripts into structured Markdown study notes with academic formatting	Advanced text structuring and academic writing capabilities	`NotesPage.tsx` (lines 216-228) Transforms unstructured transcripts using `TEXT_MODEL_NAME` with expert academic prompt
Slide Content Architect	Gemini 3 Pro (`gemini-3-pro-preview`)	Breaks lessons into 30-second presentation blocks with narration, bullets, and image prompts	Complex content decomposition and pedagogical timing	`slides.service.ts` (lines 70-80) Generates slide manifest with timing specifications using `gemini-3-pro-preview`
Podcast Script Generator	Gemini 3 Flash (`gemini-2.0-flash`)	Creates engaging 2-person dialogue scripts between "Alex" and "Sam" personalities	Natural dialogue generation optimized for multi-speaker TTS	`slides.service.ts` (lines 241-267) Outputs conversational podcast format with character-based dialogue

Gemini 3 Innovation Highlights

🎯 The Director-Artist Pattern

We implemented a novel two-stage pipeline where Gemini 3 Pro acts as the Creative Director (defining style, continuity rules, and narrative structure) while Gemini 3 Pro Image acts as the Artist (executing the vision with pixel-perfect fidelity). This separation of concerns ensures:

Visual Consistency: The Director enforces strict style guides across all generated panels
Narrative Coherence: Story arcs maintain logical flow even with procedurally generated content
Production Scalability: The same Director manifest can be re-rendered with different art styles

🎬 Character DNA Technology

The 2x2 Character Reference Grid (generated by Gemini 3 Pro Image) solved the critical "shapeshifting protagonist" problem in AI video generation:

// Gemini 3 generates a multi-angle reference sheet
const charGridPrompt = `Professional 2x2 character reference:
- Top-left: Front view
- Top-right: Side profile
- Bottom-left: 3/4 view  
- Bottom-right: Back view
Art style: ${visualId.art_style}
Character: ${visualId.protagonist_description}`;

This grid is then passed to Veo 3.1 as a referenceImage for every scene, ensuring the character looks identical across all 8-second clips.

🧠 High-Reasoning Curriculum Design

Using Gemini 3 Pro's thinkingLevel: ThinkingLevel.HIGH configuration, the Curriculum Architect can:

Identify prerequisite relationships between topics
Create optimal learning sequences based on cognitive load theory
Generate age-appropriate difficulty progression

Challenges we ran into

1. Narrative Discontinuity (The "Shapeshifting" Protagonist)

Initially, generating 8 independent clips led to "Visual Hallucination." A character might look like a 3D general in Scene 1 and a watercolor robot in Scene 2.

The Fix: We implemented the "2x2 Character DNA Grid." Before filming, the AI generates a reference sheet (Front, Side, 3/4, and Back views). This grid is passed into every Veo request as a referenceImage, forcing visual continuity.

2. Audio Mismatch & Header Corruption

AI-generated audio arrives as raw PCM with no container header, so FFmpeg cannot infer its format and either "screeches" or fails when stitching media.

The Fix: We built an intermediate conversion step using -f s16le and -ar 24000 to declare the sample format (signed 16-bit little-endian) and the sample rate (24 kHz) explicitly before merging. Furthermore, we moved to Veo 3.1 Native Audio Synthesis, using specific syntax ("Quotes" for dialogue and (Parentheses) for SFX) to generate synced audio inside the video file itself.
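
A minimal sketch of that conversion step, expressed as an argument builder for the FFmpeg command line. The -f s16le and -ar 24000 flags come from the fix described above; the -ac channel flag, the default of one channel, and the names RawPcmSpec and pcmToWavArgs are assumptions added for the example.

```typescript
// Describes the headerless input so FFmpeg does not have to guess.
interface RawPcmSpec {
  sampleRate: number;
  channels: number; // assumed mono by default; not stated in the write-up
}

// Builds the argument list for converting raw PCM into a file with a proper header.
// Input options must appear before "-i" so they apply to the raw stream.
function pcmToWavArgs(
  inputPath: string,
  outputPath: string,
  spec: RawPcmSpec = { sampleRate: 24000, channels: 1 },
): string[] {
  return [
    "-y", // overwrite the output if it exists
    "-f", "s16le", // signed 16-bit little-endian samples
    "-ar", String(spec.sampleRate),
    "-ac", String(spec.channels),
    "-i", inputPath,
    outputPath,
  ];
}
```

The array can be handed to a process spawner as the arguments of an ffmpeg invocation; keeping it as data makes the flag order easy to test without running FFmpeg.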

3. The SDK Schema Trap

We faced 400: INVALID_ARGUMENT errors because Veo 3.1 and Gemini Pro expect different data structures.

The Fix: We discovered that while Gemini Pro uses inlineData, the Video engine requires a flattened object with the key bytesBase64Encoded.

// ✅ Veo 3.1 Correct Schema
image: { 
  bytesBase64Encoded: charGridB64, 
  mimeType: "image/png" 
}
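
One way to keep the two shapes from being mixed up is a small adapter at the boundary between image generation and video generation. The sketch below assumes the image arrives in the inlineData form described above; the type and function names (InlineImagePart, VeoImage, toVeoImage) are hypothetical.

```typescript
// Shape used when an image travels as a content part.
interface InlineImagePart {
  inlineData: { data: string; mimeType: string };
}

// Flattened shape that the video request expected in our case.
interface VeoImage {
  bytesBase64Encoded: string;
  mimeType: string;
}

// Converts a content-part image into the flattened video input shape.
function toVeoImage(part: InlineImagePart): VeoImage {
  const { data, mimeType } = part.inlineData;
  if (data.length === 0) throw new Error("Image part has no data");
  if (!mimeType.startsWith("image/")) throw new Error(`Unexpected MIME type: ${mimeType}`);
  return { bytesBase64Encoded: data, mimeType };
}
```

Routing every reference image through one function like this means a schema change only has to be fixed in a single place.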

4. The "Media Pipe" Error

Relying on ffmpeg-static proved insufficient as it lacked the drawtext filter.

The Fix: We implemented a system-level check (getSystemFont) to locate the full FFmpeg installation and hard-coded font paths (e.g., /usr/share/fonts/...) to ensure text overlays would render correctly on the classroom slides.

Accomplishments that we're proud of

Cohesive Cinematic Pipeline: Successfully using Video Extension Chains so that Scene 2 physically "extends" the file from Scene 1, keeping backgrounds and camera positions identical.
The Showrunner AI: Moving from a simple "generator" to an AI that defines a "Visual Signature" (like 16mm grain) and maintains it throughout a production.
Multimodal Integration: Creating a system where a single PDF can be transformed into audio, video, text, and interactive quizzes without losing the "thread" of knowledge.
Temporal Consistency: Achieving character-locked video content that makes AI-generated media feel like a professional educational film.

What we learned

Schema Nuance: We learned that even within the same SDK, different models (Pro vs. Veo) have radically different data formatting requirements.
System-Level Dependencies: Building AI media tools requires deep knowledge of system tools like FFmpeg. Prebuilt static binaries aren't always enough for complex filters.
Thought Signature Circulation: To keep the AI from "losing its train of thought" across multiple media generation steps, we learned to save and recirculate the model's internal thought signatures in a history array.
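
The history-array idea can be sketched as follows: model parts are appended to the conversation unchanged, so any signature attached to them travels with the next request. The Part and Content shapes and the thoughtSignature field name are assumptions modeled on our understanding of the API; appendTurn is a hypothetical helper.

```typescript
// Simplified content shapes; real SDK types carry more fields.
interface Part {
  text?: string;
  thoughtSignature?: string; // opaque value returned by the model, passed back as-is
}

interface Content {
  role: "user" | "model";
  parts: Part[];
}

// Appends one user turn and the model's reply to the history.
// Model parts are copied whole so that signatures are never dropped or edited.
function appendTurn(history: Content[], userText: string, modelParts: Part[]): Content[] {
  return [
    ...history,
    { role: "user", parts: [{ text: userText }] },
    { role: "model", parts: modelParts.map((p) => ({ ...p })) },
  ];
}
```

The key design choice is to treat the model's parts as opaque: rebuilding them from the text alone would silently discard the signature.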

What's next for Seeker

Linear Skill Trees: We are currently developing a visual RPG-style Skill Tree for courses, allowing students to see their progress as a literal "map" of knowledge.
Collaborative Academies: Allowing students to invite friends into a "Group Blackboard" session where the AI tutor moderates a debate.
Veo 4K Extension: Once API constraints expand, we plan to upscale our cinematic explainer videos to 4K resolution while maintaining current continuity features.
Offline Mode: Edge-based AI processing to allow "Academy" access in low-connectivity environments.

Built With

ffmpeg
gemini3
nestjs
react
veo3.1
vite
