Project Cognito: The Flash-Powered AI Tutor

Inspiration

The inspiration for Cognito came from a simple but powerful question: What if I could create a comprehensive, high-quality course just by typing what I want to learn?

Traditional online learning is often a passive experience: watching endless videos without interaction. I wanted to build something that felt alive, a platform where content isn't just displayed but explained by an AI tutor who knows exactly when to pause, when to ask a question, and how to guide you through a syllabus tailored specifically to you.

What it does

Cognito is an AI-first learning platform that transforms static sources into interactive, structured lessons.

Topic Mode: Instantly generates a structured syllabus from a simple text prompt.
YouTube Mode: Takes any video URL and crafts a guided experience where the AI tutor, Ajibade, pauses the video at contextually relevant moments to explain complex concepts or quiz the learner.
PDF Mode: Processes documents into digestible learning units, extracting deep insights and ensuring the material is understood before moving forward.
Real-time Interaction: Using a sophisticated WebSocket bridge and Google Cloud Text-to-Speech, the AI "speaks" to you in real time, creating a classroom-like atmosphere from your browser.

How we built it (The Gemini 3 Difference)

We built Cognito using a modern, high-performance stack designed for low-latency AI interactions, with Google Gemini 3 Flash as the core intelligence engine.

🧠 Leveraging Gemini 3 Flash

We explicitly chose gemini-3-flash-preview for its specific capabilities:

Multimodal Video Analysis: In YouTube Mode, we don't just transcribe text; we use Gemini's native video understanding. By passing FileData (the video URI) and VideoMetadata (start/end offsets) directly to the model, Gemini "watches" one-minute segments of the video. It analyzes visual cues, on-screen text, and speaker emotion to determine natural pedagogical pause points.

Technical Detail: We reduced token usage and latency by slicing videos into granular contexts rather than feeding the model the entire file, which also lets lesson units be processed in parallel.
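To make this concrete, here is a minimal sketch of a segment-scoped call using the google-genai Java SDK. The class name VideoSegmentAnalyzer and the prompt text are our own for illustration, and exact builder fields (in particular the offset types on VideoMetadata) may differ across SDK versions:

```java
import com.google.genai.Client;
import com.google.genai.types.Content;
import com.google.genai.types.FileData;
import com.google.genai.types.GenerateContentResponse;
import com.google.genai.types.Part;
import com.google.genai.types.VideoMetadata;

public class VideoSegmentAnalyzer {

    // Reads GOOGLE_API_KEY from the environment by default.
    private final Client client = new Client();

    /** Asks Gemini to "watch" a single one-minute slice of a YouTube video. */
    public String analyzeSegment(String videoUrl, int startSeconds) {
        Part videoPart = Part.builder()
                .fileData(FileData.builder()
                        .fileUri(videoUrl)
                        .mimeType("video/mp4")
                        .build())
                // Scopes the model's attention to one segment; offset types
                // may be java.time.Duration or "60s"-style strings depending
                // on your SDK version.
                .videoMetadata(VideoMetadata.builder()
                        .startOffset(java.time.Duration.ofSeconds(startSeconds))
                        .endOffset(java.time.Duration.ofSeconds(startSeconds + 60))
                        .build())
                .build();

        Part prompt = Part.fromText(
                "Identify natural pedagogical pause points in this segment "
                        + "and explain the key concept shown on screen.");

        GenerateContentResponse response = client.models.generateContent(
                "gemini-3-flash-preview",
                Content.fromParts(videoPart, prompt),
                null);
        return response.text();
    }
}
```

Because each call carries only one segment, segments can be fanned out to parallel workers and their results merged in timestamp order.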

Structured Output (JSON Schema): We enforce strict JSON schemas (responseSchema) in our API calls. This ensures Gemini returns a deterministic structure containing:

pauseAtSeconds (Integer): Exact timestamps for video interruption.
textToSpeak (String): A conversational script for the TTS engine.
quizzesJson (Array): Multiple-choice questions for user verification.

By enforcing this schema, we eliminated parsing errors and created a robust "State Machine" where the frontend reacts predictably to the AI's output.

Reasoning with Thinking Config: We utilized ThinkingConfig (set to ThinkingLevel.Known.HIGH) for complex pedagogical decisions. This allows the model to internally "reason" about why a student might struggle with a concept before generating the explanation, resulting in much higher-quality, empathetic tutoring compared to standard LLM calls.
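As a sketch, this is roughly how the response schema and the thinking configuration wire together in the google-genai Java SDK. The class name LessonStepConfig is ours, and builder field names follow the SDK at the time of writing, so verify against your version:

```java
import com.google.genai.types.GenerateContentConfig;
import com.google.genai.types.Schema;
import com.google.genai.types.ThinkingConfig;
import com.google.genai.types.ThinkingLevel;
import com.google.genai.types.Type;
import java.util.List;
import java.util.Map;

public class LessonStepConfig {

    /** Strict JSON schema for one lesson step, plus high-effort thinking. */
    public static GenerateContentConfig build() {
        Schema lessonStep = Schema.builder()
                .type(Type.Known.OBJECT)
                .properties(Map.of(
                        "pauseAtSeconds", Schema.builder().type(Type.Known.INTEGER).build(),
                        "textToSpeak", Schema.builder().type(Type.Known.STRING).build(),
                        // Each quiz entry is itself an object (question, options, answer).
                        "quizzesJson", Schema.builder()
                                .type(Type.Known.ARRAY)
                                .items(Schema.builder().type(Type.Known.OBJECT).build())
                                .build()))
                .required(List.of("pauseAtSeconds", "textToSpeak", "quizzesJson"))
                .build();

        return GenerateContentConfig.builder()
                .responseMimeType("application/json")
                .responseSchema(lessonStep)
                .thinkingConfig(ThinkingConfig.builder()
                        .thinkingLevel(ThinkingLevel.Known.HIGH)
                        .build())
                .build();
    }
}
```

With required fields enforced, the frontend can treat every lesson step as a known shape and drive its state transitions directly off the parsed JSON.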

🏗️ Backend Architecture

Framework: Spring Boot 3.4 provides a robust, multi-threaded environment.
Async Processing: We use Spring's @Async executors to handle AI generation in the background while the main thread serves user traffic.
Data Layer: PostgreSQL handles persistence, while Redis acts as a high-speed priority queue. Lesson steps are pushed to Redis as they are generated, decoupling AI generation speed from user consumption speed (see the producer sketch below).
Real-Time Streaming: A custom WebSocket handler (LessonSessionWsHandler) streams binary audio chunks (from Google Cloud TTS) and JSON instruction sets directly to the React frontend, keeping perceived latency under 200ms.
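For illustration, a minimal sketch of the producer side of that queue, assuming Spring Data Redis; the class name LessonStepProducer, the executor name aiExecutor, and the key layout are our own conventions:

```java
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

@Service
public class LessonStepProducer {

    private final StringRedisTemplate redis;

    public LessonStepProducer(StringRedisTemplate redis) {
        this.redis = redis;
    }

    /**
     * Runs on a background executor so syllabus generation never blocks
     * request threads; each generated step is appended to a per-session
     * Redis list that the WebSocket layer drains at the learner's pace.
     */
    @Async("aiExecutor")
    public void generateAndQueue(String sessionId, String unitPrompt) {
        String stepJson = callGemini(unitPrompt); // structured-output call, as above
        redis.opsForList().rightPush("lesson:" + sessionId + ":steps", stepJson);
    }

    private String callGemini(String unitPrompt) {
        // ... Gemini 3 Flash structured-output call elided ...
        return "{}";
    }
}
```

The consumer simply pops from the same list, so a fast model run buffers ahead while a slow learner never feels backpressure.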

Challenges we ran into

Building a "teacher" persona like Ajibade was our greatest hurdle. We faced several deep technical challenges:

Agentic Behavior: Designing how the AI would structure a class and decide when to intervene required sophisticated prompt engineering. Gemini 3 Flash shone here: its ability to follow complex system instructions allowed us to build a reliable persona that transitions seamlessly between "Explaining" and "Testing."
The Synchronicity Paradox: Keeping the Text-to-Speech (TTS) engine perfectly in sync with the Gemini output and the video player required custom "choreography" logic on the frontend. We tried simple delays, but they were unreliable. We solved this with a WebSocket event loop that signals the frontend exactly when the audio stream ends, triggering the quiz UI only then (sketched below, after the accomplishments list).
Live API vs. Control: We initially considered the Gemini Multimodal Live API, but we found that maintaining the lesson session and curriculum with it was beyond our scope. Instead, we chose Gemini 3 Flash with structured output. This gave us the best of both worlds: the "live" feel of a real-time tutor with the reliability of a structured curriculum.

Accomplishments that we're proud of

Flash-Speed Generation: Leveraging Gemini 3 Flash, we reduced syllabus generation time from minutes to seconds, letting users start learning almost immediately.
Seamless Async Delivery: We successfully implemented a "Netflix-like" buffering system where the AI starts teaching the first unit of a course while the rest of the syllabus is still generating in the background.
Dynamic YouTube Interruption: Creating a system that can accurately pause a video at precise timestamps to inject AI explanations felt like moving from "video player" to "digital classroom."
Deployment Resilience: We successfully scaled the platform from a local environment to a GCP VM, overcoming complex database issues and cloud-native authentication mismatches.
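To make the Synchronicity Paradox fix concrete, here is a stripped-down sketch of the control-event pattern in LessonSessionWsHandler, using Spring's WebSocket API. The real handler also streams JSON instruction sets; the method body here is illustrative:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import org.springframework.web.socket.BinaryMessage;
import org.springframework.web.socket.TextMessage;
import org.springframework.web.socket.WebSocketSession;
import org.springframework.web.socket.handler.BinaryWebSocketHandler;

public class LessonSessionWsHandler extends BinaryWebSocketHandler {

    /**
     * Streams TTS audio to the browser, then emits an explicit AUDIO_END
     * event; the React frontend shows the quiz UI only on this signal,
     * replacing the unreliable fixed-delay approach described above.
     */
    public void streamAudio(WebSocketSession session, Iterable<byte[]> audioChunks,
                            String stepId) throws IOException {
        for (byte[] chunk : audioChunks) {
            session.sendMessage(new BinaryMessage(ByteBuffer.wrap(chunk)));
        }
        // Control frame: tells the frontend the audio for this step is done.
        session.sendMessage(new TextMessage(
                "{\"type\":\"AUDIO_END\",\"stepId\":\"" + stepId + "\"}"));
    }
}
```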

What we learned

This was our first time building something at this scale. We learned that Gemini 3 Flash is a game-changer for agentic workflows where speed is synonymous with user experience. We also gained deep experience in modern system architecture, from managing WebSocket heartbeats to handling binary LOB streams in PostgreSQL. Most importantly, we learned that building for education requires a unique balance of AI creativity (the "Spark") and strict logical flow (the "Structure").

What's next for Cognito

The journey for Cognito has just begun. Based on early feedback from students, we are already planning our next leap into Generative Visual Education.

Beyond the Whiteboard: The current implementation lets Ajibade draw diagrams on an HTML5 Canvas (for example, sketching the complex structure of a human pelvic bone during a medical lesson), but we want to move toward real-time image generation.
Generative Diagrams: We plan to integrate Imagen 4 Fast so the AI can generate high-fidelity visual aids on the fly. Instead of just "drawing" with Canvas code, the AI will be able to show as well as tell, creating medical-grade anatomical renders or engineering blueprints as the lecture progresses.
Immersive Sandbox: We envision an "Infinite Canvas" where tutor and student interact in a shared 3D space, moving from 2D video into a fully interactive spatial learning environment.
Voice Pool: We are exploring Google TTS integration that would let students "clone" the voice of their favorite Agent, making the learning experience feel even more personal and familiar.

Built With

Gemini 3 Flash · Spring Boot 3.4 · PostgreSQL · Redis · React · WebSockets · Google Cloud Text-to-Speech · GCP
