Inspiration

  • We are overloaded with video content (lectures, documentaries, TED Talks)
  • Finding specific information across large video libraries is slow and inefficient
  • Current video-sharing platforms don’t support deep search or citation-like navigation

What it does

  • Built-in video chatbot interface for interactive exploration
  • Automatically generated timeline breakdown
  • Natural-language queries that return the video timestamps matching the user’s question
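The timestamped results above can be rendered as clickable links back into the video. A minimal sketch of that formatting step (the hit format, function name, and `video_id` are illustrative assumptions, not the project's actual code):

```python
# Turn vector-search hits into timestamped YouTube links for the chatbot UI.
# Illustrative sketch: the (start_seconds, snippet) hit shape is an assumption.

def format_hits(video_id, hits):
    """hits: list of (start_seconds, snippet) pairs ranked by relevance."""
    lines = []
    for start, snippet in hits:
        mm, ss = divmod(int(start), 60)
        url = f"https://youtu.be/{video_id}?t={int(start)}"
        lines.append(f"[{mm:02d}:{ss:02d}] {snippet} ({url})")
    return "\n".join(lines)
```

Keeping the raw seconds in the URL lets the browser jump straight to the matching moment.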

How I built it

  • Implemented RAG pipelines for video and audio
  • Processed, chunked, and embedded the transcript and video clips, identifying important scene changes
  • Leveraged GPT-4 to generate timestamped section breakdowns and answer user queries, using vector-search results as context
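The chunking step above can be sketched in a few lines. This is a hypothetical illustration (segment format, window size, and overlap are assumptions): overlapping windows of transcript segments are merged into chunks, each keeping its start timestamp so search hits can be mapped back to positions in the video.

```python
# Chunk a timestamped transcript into overlapping windows for embedding.
# Sketch only: window/overlap sizes and the segment tuple format are assumptions.

def chunk_transcript(segments, window=3, overlap=1):
    """segments: list of (start_seconds, text) tuples, e.g. from a
    YouTube transcript extractor. Returns list of (start_seconds, chunk_text)."""
    chunks = []
    step = window - overlap
    for i in range(0, len(segments), step):
        group = segments[i:i + window]
        if not group:
            break
        start = group[0][0]
        text = " ".join(t for _, t in group)
        chunks.append((start, text))
        if i + window >= len(segments):
            break
    return chunks
```

Each chunk's text would then be embedded (e.g. with text-embedding-3-small) and stored in ChromaDB alongside its start timestamp as metadata.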

Challenges I ran into

  • Frontend deployment with Vercel
  • Rate limits when extracting YouTube transcripts via third-party APIs
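One common way to soften such rate limits is an exponential-backoff retry wrapper. The sketch below is a generic pattern, not the project's actual code; the `fetch` callable and its error type are placeholders.

```python
import time

# Retry a flaky call with exponential backoff.
# Illustrative sketch: fetch() and the broad exception handling are placeholders.

def with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure, wait base_delay * 2**attempt and retry.
    Re-raises the last error once retries are exhausted."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` as a parameter makes the backoff schedule easy to test without real delays.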

What's next for Multimodal Video Analysis Tool

  • Grouping queries for YouTube playlists
  • Exploring open-source embedding models and LLMs

Built With

  • chromadb
  • clip
  • gpt
  • langchain
  • react-native
  • text-embedding-3-small
  • vercel
  • yt-dlp