๐Ÿ’ก Inspiration

Earlier this year, I tried to build something similarvideolearnai.com โ€” a web-based version of NanoTutor. But it faced serious challenges: proxy requirements for downloading YouTube transcripts, LLM API credit management, and a poor user experience that forced people to leave YouTube entirely.

Iโ€™ve always wanted a tool that could turn passive video watching into active learning โ€” making YouTube videos interactive, engaging, and better for knowledge retention.

When Gemini Nano became available locally in Chrome, that dream finally became possible: no servers, no API costs, and no data leaving the browser.

โš™๏ธ What It Does

NanoTutor is a Chrome extension that transforms any YouTube video into an interactive AI tutor, powered entirely by Gemini Nano, Chromeโ€™s built-in AI model.

It allows users to:

  • ๐Ÿ’ฌ Chat with videos โ€” Ask questions about any part of the content.
  • ๐Ÿ“ Generate contextual quizzes โ€” (Prototype feature) Currently supports True or False quizzes to test comprehension.
  • ๐Ÿ” Smart transcript search โ€” Uses RAG (retrieval-augmented generation) to locate relevant info in long videos.
  • ๐Ÿ”’ Full privacy โ€” Everything runs locally. No data ever leaves your browser.

For long videos, NanoTutor uses a local RAG pipeline with WebGPU-powered embeddings (Transformers.js v3) to fetch only relevant transcript segments per query โ€” ensuring fast, context-aware answers.

โš ๏ธ Prototype Notice: The quiz system is still in an early prototype stage โ€” limited to True or False questions. But it lays the groundwork for richer quiz formats like multiple choice, fill-in-the-blank, and spaced repetition.


๐Ÿ› ๏ธ How I Built It

Tech Stack

  • Plasmo (Parcel) โ€” Chrome extension framework
  • React โ€” UI and state management
  • Gemini Nano โ€” On-device AI inference via Chromeโ€™s Prompt API
  • Transformers.js v3 โ€” WebGPU-accelerated embeddings for RAG
  • DOM manipulation โ€” Direct transcript extraction from YouTubeโ€™s native UI

Architecture

The extension interacts directly with YouTubeโ€™s DOM, automatically clicking the transcript button and parsing its text โ€” no APIs or proxies required. The transcript is then chunked intelligently for RAG processing, and Gemini Nano responses are streamed directly into the UI for a smooth experience.

The quiz generator uses the same transcript context to create simple, focused True/False questions that reinforce learning โ€” with plans to evolve into richer formats soon.


๐Ÿšง Challenges I Ran Into

1. Getting Transformers.js v3 to work with Plasmo (Parcel) The library relied on Node.js APIs unavailable in the browser. I had to write scripts to build configs and tweak module resolution to make it fully compatible inside a Chrome extension.

2. Implementing a streaming UI I wanted the real-time token streaming effect like in Vercelโ€™s AI SDK. Iโ€™ve achieved object-level streaming (partial responses streaming as theyโ€™re completed), but true token-level streaming is still in progress. Even so, itโ€™s already a huge improvement over waiting for full responses before rendering!


๐Ÿ† Accomplishments Iโ€™m Proud Of

โœ… Fully local AI โ€” Gemini Nano, RAG, and embeddings all run entirely on-device. โœ… Built in 12 days โ€” Intense but highly productive sprint. โœ… Live quiz generation โ€” Questions appear progressively, improving UX. โœ… Fast embeddings(using webgpu) โ€” Smooth performance even on longer videos.


๐Ÿ“š What I Learned

I learned how to design and optimize fully local AI workflows โ€” from on-device inference to RAG and streaming inside the browser. Also gained deeper insights into Transformers.js v3 and React streaming UIs for dynamic AI interactions.


๐Ÿš€ Whatโ€™s Next for NanoTutor

The current RAG implementation works but is still naive, and the quiz system is just a prototype โ€” thereโ€™s lots of room to grow.

Near-Term Goals:

  • ๐Ÿ“Š Better quizzes โ€” Add multiple choice, fill-in-the-blank, and spaced repetition. Improve UX and visual design.
  • ๐Ÿง  Smarter RAG โ€” Experiment with Graph RAG and Agentic RAG for deeper understanding.
  • โšก Token-level streaming โ€” True real-time text generation for smoother UX.

Future Ideas:

  • ๐Ÿ–ผ๏ธ Visual context โ€” Use video preview frames for visual understanding.
  • ๐ŸŽ™๏ธ Voice interaction โ€” Hands-free AI tutoring via speech-to-text.
  • โฑ๏ธ Automatic chapters โ€” AI-generated timestamps and navigation.

Hybrid AI Mode:

  • ๐Ÿ”„ Client + Cloud options โ€” Choose between fully local AI or cloud models when needed.
  • ๐Ÿ† Social learning โ€” Optional online features like leaderboards and quiz challenges.

Built With

Share this project:

Updates