Inspiration

The idea came from a simple frustration: watching a lecture and hitting a concept you don't understand, then having to pause, open five tabs, and piece together an explanation while the video sits frozen. A good tutor would just answer the question in context. Quality tutoring has always been expensive or geography-dependent. Atlas is the attempt to fix that.

What it does

Atlas lets you paste any YouTube URL and immediately start a conversation about it. The video gets indexed by TwelveLabs using multimodal AI that reads the visual frames, audio, and transcription together. When you ask a question, Atlas searches that index for the most relevant clips, passes those timestamped segments to Claude as context, and streams back an explanation grounded in the exact moments of the video. Every response includes clickable timestamp chips that jump the video player to the source clip so you can verify the answer yourself. Atlas also generates a full study suite automatically: structured notes, flashcards exportable to Quizlet, practice problems downloadable as a formatted PDF, and a chapter breakdown of the video. Bold terms in any response are checked against Wikipedia and the video transcript, and only become clickable if a real definition is found.
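As a rough illustration of the step where timestamped segments are passed to Claude as context, here is a minimal sketch. The `Clip` type, the `build_context` helper, and the bracketed label format are assumptions made for this example, not Atlas's actual code.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical shape of one retrieved segment."""
    start: float  # seconds from the start of the video
    end: float
    text: str     # transcript snippet or description for the segment


def format_timestamp(seconds: float) -> str:
    """Render seconds as M:SS, or H:MM:SS for videos over an hour."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def build_context(clips: list[Clip]) -> str:
    """Label each clip with its time range so the model can cite it."""
    lines = []
    for i, clip in enumerate(clips, start=1):
        span = f"{format_timestamp(clip.start)}-{format_timestamp(clip.end)}"
        lines.append(f"[clip {i} | {span}] {clip.text}")
    return "\n".join(lines)
```

Keeping the time range in every label is what would let the frontend turn a citation in the answer back into a clickable timestamp chip.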

How we built it

The backend is FastAPI with SQLite. It uses TwelveLabs v1.3 for video ingestion and semantic search, and Anthropic Claude Sonnet 4.6 for all generated content, which is streamed to the browser over SSE. The frontend is Next.js with a custom HLS video player, KaTeX for math rendering, and a typewriter effect on responses so text arrives character by character rather than in chunks. The split screen is draggable so students can resize the video and chat panels. Study materials are pre-generated in the background the moment a video finishes indexing, so by the time the student opens the flashcards or notes tab the content is already waiting.
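The background pre-generation could look something like the sketch below. The `generate` callable, the in-memory `cache` dict, and the `MATERIALS` names are placeholders for this example; the real project presumably persists results in SQLite and calls Claude inside the generator.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical names for the four study artifacts described above.
MATERIALS = ("notes", "flashcards", "problems", "chapters")


async def pregenerate(
    video_id: str,
    generate: Callable[[str, str], Awaitable[str]],
    cache: dict,
) -> None:
    """Generate every study artifact concurrently once indexing is done."""

    async def run(kind: str) -> None:
        try:
            cache[(video_id, kind)] = await generate(video_id, kind)
        except Exception:
            # Leave the entry missing so the tab can fall back to
            # generating on demand instead of showing a broken result.
            pass

    await asyncio.gather(*(run(kind) for kind in MATERIALS))
```

In a FastAPI app this coroutine could be scheduled with `asyncio.create_task` or a `BackgroundTasks` hook when the indexing job reports completion, so the request that triggered indexing does not block on it.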

Challenges we ran into

Getting TwelveLabs search to reliably return the right clips took the most debugging. The v1.3 API uses a rank field instead of a score field, the video_ids filter was silently ignored server-side, and search options had to be sent as multipart form data rather than JSON. We ended up fetching 50 results per query and filtering by video ID in Python. The other major challenge was keeping Claude grounded in the video rather than drifting into general-knowledge answers, which required careful system prompt design and explicit instructions to signal when it is teaching beyond what the retrieved clips contain.
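The client-side workaround can be sketched in a few lines. This assumes each search hit has already been parsed into a dict with `video_id` and `rank` keys, and that a lower rank means a better match; both are assumptions for illustration, and `filter_results` is a hypothetical helper name.

```python
def filter_results(results: list[dict], video_id: str, limit: int = 8) -> list[dict]:
    """Keep only hits from the requested video, best rank first."""
    own = [hit for hit in results if hit.get("video_id") == video_id]
    # Assumption: lower rank is better. Hits with no rank sort last.
    own.sort(key=lambda hit: hit.get("rank", float("inf")))
    return own[:limit]
```

Over-fetching (50 results here) matters because the filter runs after the search: if most hits belong to other videos in the index, a small page could leave nothing for the current one.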

Accomplishments that we're proud of

The timestamp citation system works exactly the way we wanted: every answer is traceable back to a specific moment in the video, and clicking a chip seeks the player to that second instantly. The study suite generating in the background while the user is still on the loading screen means there is no wait when they open the flashcards or problems tab. The bold term definition feature silently pre-checks every bolded word against Wikipedia and the video's own transcript after a message finishes, and only makes a word interactive if a real source exists for it, so nothing ever opens a fabricated tooltip.
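The gating logic for bold terms might be sketched as follows. The regular expression, the helper names, and the idea of passing lookups as plain callables are assumptions for this example; the real lookups would query Wikipedia and the stored transcript.

```python
import re
from typing import Callable, Optional

# Matches Markdown bold spans such as **eigenvalue**.
BOLD = re.compile(r"\*\*(.+?)\*\*")


def extract_bold_terms(markdown: str) -> list[str]:
    """Return bolded terms in order of appearance, without duplicates."""
    terms, seen = [], set()
    for match in BOLD.finditer(markdown):
        term = match.group(1).strip()
        if term and term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    return terms


def resolve_terms(
    terms: list[str],
    lookups: list[Callable[[str], Optional[str]]],
) -> dict[str, str]:
    """Map each term to the first real definition found; drop the rest."""
    resolved = {}
    for term in terms:
        for lookup in lookups:
            definition = lookup(term)
            if definition:
                resolved[term] = definition
                break
    return resolved
```

Only the keys of the returned dict would be rendered as interactive; any bold term with no source stays as plain bold text.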

What we learned

Multimodal video search is genuinely powerful but needs careful post-processing. TwelveLabs retrieves relevant clips across the visual, audio, and transcription signals simultaneously, but the raw results need filtering, deduplication, and ranking before they are useful as LLM context. We also learned that the system prompt matters more than the model for keeping answers grounded. Claude is capable of staying within the retrieved context when the instructions are specific, but the default behavior is to fill gaps with general knowledge, which looks helpful but erodes the core value of a video-grounded tutor.
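One simple form of the deduplication step is to merge clips whose time ranges overlap, so the model does not see the same moment twice. The sketch below is an illustration under that assumption, not necessarily how Atlas deduplicates; `merge_overlapping` and its `gap` parameter are hypothetical.

```python
def merge_overlapping(
    clips: list[tuple[float, float]],
    gap: float = 0.0,
) -> list[tuple[float, float]]:
    """Merge (start, end) ranges that overlap or sit within `gap` seconds."""
    merged: list[tuple[float, float]] = []
    for start, end in sorted(clips):
        if merged and start <= merged[-1][1] + gap:
            # Extend the previous range instead of adding a near-duplicate.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

For example, `merge_overlapping([(10, 20), (15, 30), (40, 50)])` returns `[(10, 30), (40, 50)]`.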

What's next for Atlas

First, direct video file upload, so students can index recordings that live on institutional platforms rather than YouTube. Second, multilingual support, since the students who benefit most from free tutoring are often studying in languages where free educational video content is thinner. Third, a bias and retrieval quality audit across accents and subject domains to measure how often thin transcription causes wrong answers. Finally, user accounts with private conversation history, because right now all sessions share the same database, and that is not acceptable for real students.

Built With

  • anthropic-claude-sonnet-4.6
  • fastapi
  • hls.js
  • katex
  • next.js
  • python
  • react
  • react-markdown
  • sqlite
  • sqlmodel
  • sse-starlette
  • tailwind
  • twelvelabs-v1.3
  • typescript
  • yt-dlp