Inspiration
In an era of information overload, YouTube has become a vast library of knowledge. However, finding specific information within a video can be tedious and time-consuming. Users often scrub through hours of content to find a single answer. We were inspired to solve this problem by creating a "conversational search engine" for any YouTube video. We envisioned a tool where users could simply ask questions in natural language and get immediate, accurate answers with citations, transforming passive video consumption into an interactive learning experience.
What it does
Chatpye is an intelligent AI assistant that lets you have a conversation with any YouTube video. Simply paste a YouTube URL, and the application instantly prepares the video for analysis. You can then ask specific questions—from "What were the key arguments made in the first half?" to "Summarize the main conclusion"—and receive precise, real-time answers. The system is built on a two-pronged approach:

- RAG (Retrieval-Augmented Generation): When a transcript is available, it is processed, chunked, and stored as vector embeddings in a MongoDB database. When a user asks a question, the system retrieves the most relevant transcript segments to produce a highly accurate, citation-backed answer.
- Direct Multimodal Analysis: If no transcript is available, the system seamlessly falls back to the Gemini 2.5 Pro model's native video understanding. It analyzes the video's audio and visual content directly to answer the user's question, ensuring no video is left behind.
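The retrieval half of this approach can be sketched in a few self-contained functions. This is a minimal illustration, not Chatpye's actual implementation: the fixed word-window chunking, the chunk size, and the in-memory cosine-similarity ranking are all assumptions standing in for the real embedding pipeline and vector store.

```typescript
// Sketch of the RAG retrieval path: chunk a transcript, then rank chunks
// against a query embedding by cosine similarity. Embeddings are assumed to
// be produced elsewhere (e.g. by an embedding model); here they are plain
// number arrays.

type Chunk = { text: string; embedding: number[] };

// Split a transcript into fixed-size word windows (window size is an
// illustrative parameter, not Chatpye's actual setting).
function chunkTranscript(transcript: string, wordsPerChunk = 200): string[] {
  const words = transcript.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    chunks.push(words.slice(i, i + wordsPerChunk).join(" "));
  }
  return chunks;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Return the k chunks most relevant to the query embedding; these become the
// citation-backed context passed to the LLM.
function topKChunks(chunks: Chunk[], queryEmbedding: number[], k = 3): Chunk[] {
  return [...chunks]
    .sort(
      (x, y) =>
        cosineSimilarity(y.embedding, queryEmbedding) -
        cosineSimilarity(x.embedding, queryEmbedding)
    )
    .slice(0, k);
}
```

In the production system the similarity search would run inside the vector database rather than in application memory; the in-memory version above just makes the ranking logic explicit.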
How we built it
Chatpye is a full-stack application built with a modern tech stack:

- Frontend: Next.js and React with TypeScript, using Tailwind CSS for styling.
- Backend: Next.js API routes running on a Node.js server.
- AI & Machine Learning: Google Gemini 2.5 Pro as the core LLM for both the RAG pipeline and direct multimodal video analysis, plus the youtube-transcript library for fetching and parsing video transcripts.
- Database: MongoDB Atlas serves as the vector store for transcript embeddings, enabling efficient similarity searches for the RAG system.
- Infrastructure: The application is containerized with Docker and designed for scalable deployment on cloud platforms.

The architecture is centered on a robust API that handles video processing, job status tracking, and the core conversational chat logic. A backend service identifies the video context for every user query, ensuring the AI's responses are always grounded in the right source material.
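On the database side, MongoDB Atlas similarity lookups are expressed as an aggregation pipeline with a `$vectorSearch` stage. The sketch below shows the general shape of such a pipeline; the index name (`transcript_index`), field names (`embedding`, `jobId`, `text`, `startTime`), and tuning numbers are illustrative assumptions, not Chatpye's actual schema.

```typescript
// Sketch of a MongoDB Atlas Vector Search aggregation pipeline for fetching
// the transcript chunks closest to a query embedding, scoped to one video's
// processing job. All names here are assumed for illustration.
function buildVectorSearchPipeline(queryEmbedding: number[], jobId: string) {
  return [
    {
      $vectorSearch: {
        index: "transcript_index", // assumed Atlas vector index name
        path: "embedding",         // document field holding the embedding
        queryVector: queryEmbedding,
        numCandidates: 100,        // candidates considered before ranking
        limit: 5,                  // top matches returned
        filter: { jobId },         // restrict results to this video's job
      },
    },
    // Keep only what the chat layer needs for a citation-backed answer.
    { $project: { text: 1, startTime: 1, _id: 0 } },
  ];
}
```

The pipeline would then be passed to `collection.aggregate(...)` in the API route; the `filter` clause is what keeps retrieval scoped to a single video.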
Challenges we ran into
Our biggest challenge was a critical and elusive bug we called "context mixing": in early versions, the AI would occasionally answer questions using knowledge from a previously viewed video, completely breaking the user's trust. Debugging it was a multi-day effort that took us through the entire stack:

- Ensuring Context Integrity: We first built a "single source of truth" API endpoint (/api/video/resolve-job) to guarantee that every chat session was locked to the correct video jobId.
- Fighting AI Hallucination: Even with the correct ID, the AI would sometimes hallucinate when it didn't have a transcript. Our initial text-based prompts were insufficient.
- Dependency Hell: When we found the official, robust method for multimodal prompting in the Gemini API documentation, we hit a wall: our installed version of the @google/generative-ai library was too old to support the feature. Updating it triggered a cascade of peer dependency conflicts with our testing framework (vitest), which we had to resolve carefully without breaking our test environment.
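The "single source of truth" idea behind the resolve-job endpoint can be illustrated with a minimal resolver. This is a sketch under stated assumptions: the in-memory Map stands in for the real job store, and the function names are hypothetical, not the actual /api/video/resolve-job implementation.

```typescript
// Sketch of a resolve-job style lookup: every chat request resolves the
// video ID to exactly one jobId before any retrieval happens, so an answer
// can never be grounded in a previously viewed video's context.
// The Map is a stand-in for the persistent job store.

const jobsByVideoId = new Map<
  string,
  { jobId: string; status: "processing" | "ready" }
>();

function registerJob(videoId: string, jobId: string): void {
  jobsByVideoId.set(videoId, { jobId, status: "processing" });
}

function markReady(videoId: string): void {
  const job = jobsByVideoId.get(videoId);
  if (job) job.status = "ready";
}

// Single source of truth: chat handlers use only the jobId returned here.
// A missing or unready job is a hard error, never a silent fallback to
// stale context.
function resolveJob(videoId: string): string {
  const job = jobsByVideoId.get(videoId);
  if (!job) throw new Error(`No job found for video ${videoId}`);
  if (job.status !== "ready") throw new Error(`Job ${job.jobId} is not ready`);
  return job.jobId;
}
```

Failing loudly on a missing or unready job is the key design choice: it turns a silent context-mixing bug into an immediate, debuggable error.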
Accomplishments that we're proud of
- Solving Context Mixing: Eradicating the context-mixing bug was a major victory. It forced us to build a much more resilient and architecturally sound application, and we are proud of the robust, multi-layered system we engineered to ensure the AI's reliability.
- Dual-Pronged AI Strategy: Implementing both a RAG pipeline and a direct-analysis fallback makes the application versatile and resilient, providing the best available analysis method for any given video.
- Mastering the Tech Stack: We didn't just use these technologies; we pushed them to their limits. From deep dives into the Gemini API's type definitions to debugging Node.js dependency conflicts, we came away with a much deeper understanding of our entire stack.
What we learned
- The Devil Is in the Details (and Dependencies): An AI model is only as good as the data you give it, and how you prompt the model matters as much as what you prompt it with. Keeping dependencies up to date isn't just about security; it's critical for accessing the latest and most powerful features.
- Architecture Over Quick Fixes: Our initial attempts to fix the context bug were localized patches. Critical issues like this often point to deeper architectural flaws; the real solution was to step back and re-architect the flow of data around an unambiguous "source of truth."
- The Power of Multimodality: We learned firsthand how powerful modern multimodal models like Gemini 2.5 are. Moving from a text-only RAG system to one that can also see and hear the video content directly was a game-changer.
What's next for Chatpye
We're just getting started! Here are some of our exciting next steps:

- Chat History: Persist chat history for each video so users can pick up their conversations right where they left off.
- Support for More Sources: Expand beyond YouTube to other video platforms (Vimeo, Dailymotion) and even direct file uploads.
- Advanced Caching: Implement explicit caching strategies for the Gemini API to reduce latency and costs for popular videos that are analyzed frequently.
- Proactive Summaries: Automatically generate a concise summary and a list of key topics as soon as a video is processed, giving users a starting point for their conversations.