Pandio — YouTube-Native AI Tutor for Vietnamese Learners


What it does

There are dozens of YouTube summarizer tools: NoteGPT, Eightify, YouTube Summary with ChatGPT, and others. They all do the same thing: fetch a transcript, run it through an LLM, and hand you a wall of translated text. You read it, maybe skim the bullet points, and move on. That's passive consumption, and passive review is a weak way to learn compared with being questioned and having to recall the material yourself.

Pandio is fundamentally different. It doesn't summarize — it teaches.

When a Vietnamese student pastes a YouTube URL into Pandio, they don't get a summary. They get a live, interactive tutoring session with "Thầy Gấu Trúc" (Teacher Panda) — an AI tutor who has deeply analyzed the video content, prepared a structured lesson plan, and is ready to walk them through it step by step, entirely in Vietnamese.

Here's what that looks like:

  1. Paste a YouTube URL — Any English-language educational video: a lecture on Newton's Laws, a CS tutorial, a biology explainer, a math proof — anything.
  2. AI prepares a real lesson — Pandio doesn't just translate. It fetches the transcript, translates it to Vietnamese with a consistent technical glossary, analyzes the content for learning objectives and difficulty level, and generates a structured multi-segment lesson plan — the same way a real teacher would prepare before class.
  3. Interactive tutoring session — A whiteboard-based session launches where Thầy Gấu Trúc teaches the content using slides with rich visuals: markdown explanations, LaTeX math formulas, flowcharts and diagrams, and styled callout boxes — all in Vietnamese. Students interact by typing chat messages (voice input coming soon), and the tutor adapts in real time.
  4. Socratic teaching, not lecturing — Thầy Gấu Trúc never just gives the answer. He asks "Theo em, tại sao...?" (In your opinion, why...?), gives narrowing hints, offers binary choices, and only reveals the answer after guiding the student to think through it. If a student goes quiet, the tutor automatically simplifies — switching to examples or multiple-choice to re-engage them.
  5. Vietnamese voice narration — Thầy Gấu Trúc doesn't just type — he speaks. Using ElevenLabs with a native Vietnamese voice, explanations are narrated aloud, making the experience feel like sitting with a real tutor. This matters especially for younger students and auditory learners who struggle with text-heavy interfaces.
  6. Study toolkit — After the session, students can generate quizzes (multiple-choice + short-answer) to test their understanding, flashcards for spaced review, and visual mindmaps to see how concepts connect. Each tool is grounded in the actual content taught — not generic questions.
  7. Progress tracking — Streaks, per-concept mastery levels, study time tracking, and subject breakdowns keep students motivated and coming back.

The difference in one sentence: NoteGPT gives you a translated summary you'll forget in an hour. Pandio gives you a patient Vietnamese tutor who makes sure you actually understand.

Why it matters: 17 million Vietnamese students consume English-language YouTube content for learning. Most don't fully understand the material due to language barriers. Private tutoring costs $15–30/hour. Pandio gives every Vietnamese student access to a patient, knowledgeable AI tutor, with a generous free tier so that cost is not a barrier.


How we built it

Architecture: Turborepo monorepo with 4 apps and 11 shared packages, all TypeScript.

Frontend — Next.js 16 (App Router, React 19) with Tailwind CSS v4 and 57+ shadcn/ui components. The whiteboard is a custom-built frame-based board system with:

  • Rich markdown rendering with KaTeX math, GFM tables, and emoji-detected callout boxes (formulas, tips, warnings, examples)
  • Mermaid-powered flowcharts, mindmaps, and process diagrams (lazy-loaded for performance)
  • ELK-based automatic graph layout with SVG rendering for complex concept maps
  • Keyboard navigation, slide animations, and touch pinch-zoom support

Backend — Fastify v5 with a plugin-based architecture: Clerk JWT auth, auto user sync with monthly quota reset, rate limiting, OpenAPI docs, and WebSocket support for real-time tutoring.

AI Pipeline — the core of Pandio:

This is where OpenRouter became essential. Our pipeline requires multiple AI models working together, each chosen for what it does best. OpenRouter let us route to the right model for each stage through a single API: no juggling multiple API keys or SDKs, and the flexibility to swap models as better ones emerge with a config change instead of a code rewrite.

The 6-stage pipeline:

  1. Extract — Fetch transcript + metadata from YouTube (title, channel, duration)
  2. Chunk — Token-based splitting (~1,500 tokens/chunk, respects sentence boundaries so no concept gets cut in half)
  3. Translate — Sequential Vietnamese translation using Gemini 2.5 Flash (via OpenRouter). We chose Gemini Flash here because translation is a high-throughput task — a 30-minute video produces 10–15 chunks, and Flash handles them fast and cheaply while maintaining quality. Each chunk also produces glossary entries that feed into the next chunk, ensuring "gradient descent" is consistently translated as "hạ gradient" throughout the entire session.
  4. Analyze — Deep content analysis using Claude Sonnet 4 (via OpenRouter). This is where we need strong reasoning — the model extracts learning objectives, assesses difficulty level, identifies prerequisites, and maps out key concepts. Claude Sonnet 4 consistently produces the most pedagogically sound analysis.
  5. Plan — Structured lesson plan generation using Claude Sonnet 4 (via OpenRouter). The model generates 3–8 teaching segments (based on video length), each with its own learning objectives, teaching strategy, Bloom's taxonomy level, visual approach, analogies, common mistakes to address, and comprehension check questions. This mirrors how experienced Vietnamese teachers prepare their bài giảng (lesson plans).
  6. Enrich & Cache — Link lesson plan to video, cache everything. If another student watches the same video, the analysis is instant.
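The Chunk stage above can be sketched roughly as follows. This is a minimal illustration, not Pandio's actual code: the function names are hypothetical, the token count is approximated as about four characters per token (a real tokenizer would be more accurate), and the sentence splitter is naive about decimals and abbreviations.

```typescript
// Rough token estimate, assuming ~4 characters per token (an assumption).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Naive sentence splitter: a run of text followed by its closing punctuation.
// It will also split on decimals such as "1.5", which is acceptable for a sketch.
function splitSentences(text: string): string[] {
  const matches = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Greedily pack whole sentences into chunks of at most maxTokens.
// A single sentence longer than maxTokens becomes its own oversized chunk.
function chunkTranscript(text: string, maxTokens = 1500): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const sentence of splitSentences(text)) {
    const tokens = estimateTokens(sentence);
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      chunks.push(current.join(" "));
      current = [];
      currentTokens = 0;
    }
    current.push(sentence);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```

Because a chunk is only closed between sentences, no sentence is ever split across two translation calls.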

Real-time tutoring engine — WebSocket-based, also powered by Claude Sonnet 4 via OpenRouter for the conversational loop. We specifically chose Claude Sonnet 4 for tutoring because it excels at maintaining a consistent persona (Thầy Gấu Trúc's patient, encouraging teaching style) while doing complex pedagogical reasoning (when to hint vs. reveal, when to advance vs. review). The board rendering step — converting the tutor's visual intent into structured whiteboard JSON — uses Gemini 2.5 Flash via OpenRouter, since it's a structured transformation task where speed matters more than depth.
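One way to picture the per-stage routing is a single lookup table from pipeline stage to model. The sketch below is illustrative only: the stage names, the `buildRequest` helper, and the exact model slugs are assumptions, and the request body simply follows the OpenAI-compatible chat format that OpenRouter accepts.

```typescript
type Stage = "translate" | "analyze" | "plan" | "tutor" | "boardRender";

// Model slugs are assumed; check the provider's model list before relying on them.
const MODEL_BY_STAGE: Record<Stage, string> = {
  translate: "google/gemini-2.5-flash",     // high throughput, low cost
  analyze: "anthropic/claude-sonnet-4",     // deeper reasoning
  plan: "anthropic/claude-sonnet-4",
  tutor: "anthropic/claude-sonnet-4",
  boardRender: "google/gemini-2.5-flash",   // structured transformation, speed first
};

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Builds the JSON body for a chat completion request.
// `overrides` lets an experiment swap one stage's model without editing the table.
function buildRequest(
  stage: Stage,
  messages: ChatMessage[],
  overrides: Partial<Record<Stage, string>> = {},
): { model: string; messages: ChatMessage[] } {
  return { model: overrides[stage] ?? MODEL_BY_STAGE[stage], messages };
}
```

Keeping the mapping in one place is what makes comparing models for a single stage cheap: only the table entry (or an override) changes.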

The teaching loop follows a segment state machine: each segment moves through PENDING → ACTIVE → CHECKING → COMPLETED. The tutor streams responses in real time, detects when students go silent (and auto-simplifies), and handles reconnections by restoring the full session state.
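A small sketch of that segment state machine, with the transitions stored in a table. The four state names come from the description above; the `transition` helper and the CHECKING → ACTIVE "review again" edge are assumptions added for illustration.

```typescript
type SegmentState = "PENDING" | "ACTIVE" | "CHECKING" | "COMPLETED";

// Allowed next states for each state.
const ALLOWED: Record<SegmentState, SegmentState[]> = {
  PENDING: ["ACTIVE"],
  ACTIVE: ["CHECKING"],
  CHECKING: ["COMPLETED", "ACTIVE"], // back to ACTIVE is an assumed review path
  COMPLETED: [],
};

// Returns the new state, or throws if the move is not in the table.
function transition(from: SegmentState, to: SegmentState): SegmentState {
  if (ALLOWED[from].indexOf(to) === -1) {
    throw new Error(`Invalid segment transition: ${from} -> ${to}`);
  }
  return to;
}
```

Rejecting illegal moves at this layer means a confused model response cannot skip a comprehension check by jumping straight from PENDING to COMPLETED.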

Vietnamese voice: ElevenLabs with the eleven_flash_v2_5 model and a native Vietnamese voice. The tutor's spoken explanations are generated per response and queued for playback. We chose ElevenLabs specifically because their Vietnamese voice quality is noticeably more natural than alternatives. The intonation and dấu (tonal marks) pronunciation actually sound like a Vietnamese teacher, not a robotic TTS. This is critical for our audience: if the voice sounds unnatural, students disengage immediately.

Database — Drizzle ORM with 10 tables (SQLite for dev, PostgreSQL for production), covering users, videos, lesson plans, sessions, messages, segment states, quizzes, progress, payments, and a full AI call trace log for debugging.

Payments — Stripe integration with checkout sessions and subscription management. Freemium model with a generous free tier for students.

Key tech stack: Next.js 16 · Fastify v5 · TypeScript 5.9 · Drizzle ORM · OpenRouter (Claude Sonnet 4 + Gemini 2.5 Flash) · ElevenLabs TTS · Vercel AI SDK · Clerk Auth · Stripe · React Query · Zustand · Zod · Sentry · Vitest


Challenges we ran into

Transcript translation consistency — Translating technical English content to Vietnamese while maintaining consistent terminology across chunks was surprisingly hard. A naive approach translates "gradient descent" differently in every chunk. We solved this with a glossary accumulation pattern: each chunk's translation produces new glossary entries that feed into the next chunk's context, building a consistent vocabulary that grows as the translation progresses. OpenRouter made iterating on this painless — we tested multiple models for translation quality and settled on Gemini 2.5 Flash for the best speed-to-quality ratio on Vietnamese.
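The glossary accumulation pattern can be sketched as a sequential loop in which each call sees the terms fixed by earlier chunks. All names here are hypothetical, and the `translate` callback stands in for the real model call; the "first translation wins" rule is one possible policy, assumed for the example.

```typescript
type Glossary = Map<string, string>; // English term -> Vietnamese term

interface ChunkTranslation {
  text: string;
  newTerms: Array<[string, string]>; // terms the model proposes for this chunk
}

// Stand-in for the LLM call: receives the chunk plus the glossary so far.
type TranslateFn = (chunk: string, glossary: Glossary) => Promise<ChunkTranslation>;

async function translateSequentially(
  chunks: string[],
  translate: TranslateFn,
): Promise<{ translated: string[]; glossary: Glossary }> {
  const glossary: Glossary = new Map();
  const translated: string[] = [];

  for (const chunk of chunks) {
    // Pass a copy so the callback cannot mutate the shared glossary.
    const result = await translate(chunk, new Map(glossary));
    translated.push(result.text);
    for (const [en, vi] of result.newTerms) {
      // First translation wins, so earlier chunks fix the terminology.
      if (!glossary.has(en)) glossary.set(en, vi);
    }
  }
  return { translated, glossary };
}
```

The loop is deliberately sequential: translating chunks in parallel would be faster, but later chunks would not see the terms chosen by earlier ones.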

Two-stage board rendering — Getting an LLM to simultaneously teach AND produce structured whiteboard JSON degraded both outputs. We split it into two stages: the tutor LLM focuses purely on pedagogy and outputs a text description of what should be visualized, then a separate, faster model call converts that description into precise board sections (markdown blocks, diagram nodes/edges, connection overlays). This separation improved both teaching quality and visual consistency.
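As a rough sketch of the two-stage split, the tutor call returns only a message and a plain-text visual intent, and a second call turns that intent into board sections. The type names, field names, and the dangling-edge check are assumptions made for illustration; both model calls are passed in as callbacks.

```typescript
interface MarkdownSection { kind: "markdown"; body: string }
interface DiagramSection {
  kind: "diagram";
  nodes: { id: string; label: string }[];
  edges: { from: string; to: string }[];
}
type BoardSection = MarkdownSection | DiagramSection;

interface TutorTurn { message: string; visualIntent: string }

type TutorModel = (studentInput: string) => Promise<TutorTurn>;
type BoardModel = (visualIntent: string) => Promise<BoardSection[]>;

// Edges that reference a node id the diagram does not define.
function danglingEdges(d: DiagramSection): { from: string; to: string }[] {
  const ids = d.nodes.map((n) => n.id);
  return d.edges.filter((e) => ids.indexOf(e.from) === -1 || ids.indexOf(e.to) === -1);
}

async function runTurn(
  studentInput: string,
  tutor: TutorModel,
  board: BoardModel,
): Promise<{ message: string; sections: BoardSection[] }> {
  const turn = await tutor(studentInput);            // stage 1: pedagogy only
  const sections = await board(turn.visualIntent);   // stage 2: structure only
  // Drop diagrams that would not render; keep everything else.
  const valid = sections.filter((s) => s.kind !== "diagram" || danglingEdges(s).length === 0);
  return { message: turn.message, sections: valid };
}
```

Since the second stage only sees the visual intent, its output can be validated and retried without re-running the slower tutoring call.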

Socratic teaching that doesn't frustrate — Balancing "never give the answer directly" with "don't let the student get stuck for too long" required careful prompt engineering. We implemented a 5-level scaffolding escalation: open question → narrowing hint → binary choice → multiple choice → guided self-explanation. Combined with silent round detection that automatically simplifies when students go quiet, the tutor adapts to each student's pace without becoming annoying.
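The escalation ladder could be modeled as an index into the five levels, moved by what the student just did. The level order is taken from the description above; the identifiers and the specific rules (reset on a correct answer, jump to at least multiple choice on silence) are assumptions, not the actual prompt logic.

```typescript
const SCAFFOLD_LEVELS = [
  "open_question",
  "narrowing_hint",
  "binary_choice",
  "multiple_choice",
  "guided_self_explanation",
] as const;

type StudentSignal = "correct" | "incorrect" | "silent";

const MULTIPLE_CHOICE = 3;                      // index of "multiple_choice"
const LAST_LEVEL = SCAFFOLD_LEVELS.length - 1;

// Returns the index of the scaffolding level to use for the next tutor turn.
function nextLevel(current: number, signal: StudentSignal): number {
  if (signal === "correct") return 0;           // start the next question open-ended
  if (signal === "silent") {
    // Assumed rule: a silent round skips ahead to at least multiple choice.
    return Math.min(LAST_LEVEL, Math.max(current + 1, MULTIPLE_CHOICE));
  }
  return Math.min(LAST_LEVEL, current + 1);     // incorrect: one step more support
}
```

For example, a student who answers an open question incorrectly gets a narrowing hint next, while a student who says nothing is moved straight to a multiple-choice prompt.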

Session resilience — Students lose connection on mobile networks, switch tabs, or close laptops. We built a full session restore system: on reconnect, the server replays the complete conversation history, board state, segment progress, and current position — the student picks up exactly where they left off, no repeated content.
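A minimal sketch of the restore path, assuming the server keeps one snapshot per session and sends it back in a single message on reconnect. The `SessionStore` class, the `session_restore` message type, and the field names are hypothetical; a real implementation would persist to the database instead of an in-memory map.

```typescript
type SegmentState = "PENDING" | "ACTIVE" | "CHECKING" | "COMPLETED";

interface SessionSnapshot {
  messages: { role: "tutor" | "student"; text: string }[];
  boardFrames: string[];           // serialized board JSON, one entry per slide
  segmentStates: SegmentState[];
  currentSegment: number;
}

type RestorePayload = { type: "session_restore" } & SessionSnapshot;

class SessionStore {
  private sessions = new Map<string, SessionSnapshot>();

  save(sessionId: string, snapshot: SessionSnapshot): void {
    this.sessions.set(sessionId, snapshot);
  }

  // Builds the message sent to a reconnecting client, or null if the session is unknown.
  restore(sessionId: string): RestorePayload | null {
    const s = this.sessions.get(sessionId);
    if (!s) return null;
    return {
      type: "session_restore",
      messages: s.messages.map((m) => ({ ...m })), // copies, so the client view cannot alter server state
      boardFrames: s.boardFrames.slice(),
      segmentStates: s.segmentStates.slice(),
      currentSegment: s.currentSegment,
    };
  }
}
```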

Vietnamese math rendering — LLMs produce LaTeX in wildly varied formats. Mixing Vietnamese text with inline math formulas ($F = ma$ inside a Vietnamese sentence) required a normalization layer that handles every LaTeX variant we've encountered and converts them into a consistent format for rendering.
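To illustrate the kind of normalization involved, the sketch below handles just two common variants: `\( ... \)` for inline math and `\[ ... \]` for display math, both rewritten to dollar delimiters. The function name and the choice of `$`/`$$` as the target format are assumptions; the real layer covers more variants than shown here.

```typescript
// Rewrites \[ ... \] to $$...$$ and \( ... \) to $...$, trimming inner whitespace.
// Text that already uses dollar delimiters is left untouched.
function normalizeMath(text: string): string {
  return text
    .replace(/\\\[([\s\S]+?)\\\]/g, (_match, body: string) => "$$" + body.trim() + "$$")
    .replace(/\\\(([\s\S]+?)\\\)/g, (_match, body: string) => "$" + body.trim() + "$");
}
```

Vietnamese text around the formula passes through unchanged, since only the delimiters and the whitespace just inside them are rewritten.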


Accomplishments that we're proud of

  • Not another summarizer — We built a genuine tutoring experience. The AI doesn't dump information — it teaches through conversation, adapts to the student's level, and verifies understanding before moving on. This is the difference between reading a textbook and having a teacher.

  • Custom educational whiteboard — A frame-based board purpose-built for teaching: math formulas, flowcharts, diagrams, styled callout boxes, and slide-by-slide progression. Not a generic canvas — every rendering choice was made to support how Vietnamese students actually learn.

  • Structured lesson planning — Each session has a real lesson plan with segments, learning objectives, teaching strategies, and Bloom's taxonomy levels. This mirrors how experienced teachers prepare — and it means every session follows a pedagogically sound progression, not random AI rambling.

  • Production-grade architecture — 10-table database schema, 12 backend services, comprehensive error handling with Vietnamese error messages, rate limiting, observability, API documentation, 27 automated tests, payment integration, and a design system with 57+ components. This isn't a hackathon prototype — it's a product ready for real users.

  • Solo indie hacker — One person. Every line of code, every prompt, every UI component, every architectural decision. From zero to a fully functional AI tutoring platform.


What we learned

  • Structured AI output is everything — Forcing LLMs to output structured JSON via Zod schemas rather than freeform text dramatically improved consistency. Every tutor response follows a strict schema: the teaching message, a description of what to visualize, whether to speak, and how to adjust difficulty. This made the entire downstream pipeline (board rendering, audio, UI) predictable and debuggable.

  • Glossary accumulation is essential for translation — Naive chunk-by-chunk translation produces inconsistent terminology. Passing an evolving glossary between chunks — where each chunk contributes new terms — maintains coherence across long transcripts. This is a pattern we haven't seen documented elsewhere.

  • The right model for each job — Not every task needs the most powerful model. Translation and structured JSON transformation benefit from speed (Gemini 2.5 Flash). Pedagogical reasoning, content analysis, and Socratic tutoring need depth (Claude Sonnet 4). OpenRouter's model routing let us optimize each pipeline stage independently — a single config change to swap models, test, and compare.

  • Vietnamese EdTech is massively underserved — Throughout development, we couldn't find a single product that does what Pandio does: take English YouTube content and turn it into an interactive Vietnamese tutoring session. YouTube summarizers exist. Translation tools exist. But an AI that actually teaches the content in Vietnamese, with visual aids and adaptive difficulty? That gap is wide open.

  • Voice transforms the experience — Adding ElevenLabs Vietnamese TTS changed how testers interacted with the product. Reading text on screen feels like a tool. Hearing a patient Vietnamese teacher explain a concept feels like... having a teacher. Students stayed in sessions longer and engaged more with the Socratic questions when voice was enabled.
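To make the structured-output lesson above concrete, here is a sketch of validating a tutor response before anything downstream touches it. Pandio uses Zod schemas for this; the example swaps in a hand-written check so it runs without dependencies, and the field names and the three difficulty values are assumptions based on the four parts listed above.

```typescript
interface TutorResponse {
  message: string;                          // what the tutor says
  visualIntent: string;                     // plain-text description of what to draw
  speak: boolean;                           // whether to narrate this turn
  difficulty: "easier" | "same" | "harder"; // how to adjust the next turn
}

// Parses raw model output and returns a typed response, or null if it does not fit.
function parseTutorResponse(raw: string): TutorResponse | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;

  const { message, visualIntent, speak, difficulty } = data as Record<string, unknown>;
  if (typeof message !== "string" || typeof visualIntent !== "string") return null;
  if (typeof speak !== "boolean") return null;
  if (difficulty !== "easier" && difficulty !== "same" && difficulty !== "harder") return null;

  return { message, visualIntent, speak, difficulty };
}
```

A null result is the signal to retry or fall back, which keeps malformed model output from reaching board rendering, audio, or the UI.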


What's next for Pandio

  • Beyond YouTube — Right now Pandio works with single YouTube videos. Next: support for entire playlists (the AI builds a connected curriculum across videos), uploaded documents (PDFs, Word files, lecture slides), and even live lecture recordings. Any learning material, any format — Pandio turns it into an interactive tutoring session.
  • Voice input — Students will be able to speak their answers instead of typing, making the tutoring conversation feel truly natural — especially for younger learners who type slowly.
  • Mobile app — Learning on the bus, in bed, between classes. Offline session caching so students in areas with spotty internet can keep studying.
  • Classroom mode — Teachers paste a YouTube link, Pandio generates a lesson plan they can review and customize, then assign to an entire class with individual progress tracking per student.
  • Spaced repetition — Automatically resurface concepts from past sessions based on forgetting curves. If you scored low on "F = ma" two weeks ago, Pandio brings it back before you forget.
  • Multi-language expansion — Same platform, new languages. Thai, Indonesian, and Filipino students face the exact same problem: learning from English YouTube content without fully understanding it. The architecture is already built for this.
  • Community lesson library — Let teachers and students share their analyzed videos and lesson plans, building a crowdsourced curriculum of high-quality, localized educational content.
  • Drawing recognition — Students sketch a diagram on the whiteboard, and the AI tutor sees it, understands it, and responds. "Em vẽ đúng rồi, nhưng hãy thêm lực ma sát vào!" (You drew it right, but add the friction force!)

Built With

  • elevenlabs
  • fastify
  • nextjs
  • openai
  • openrouter