Inspiration
- We are overloaded with video content (lectures, documentaries, TED Talks)
- Finding specific information across large video libraries is slow and inefficient
- Current video-sharing platforms don’t support deep search or citation-like navigation
What it does
- Built-in video chatbot interface for interactive exploration
- Automatically generated timeline breakdown
- Natural language search that returns the video timestamps matching the user's query
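The timestamp-matching feature above can be sketched as similarity search over timestamped transcript chunks. This is a minimal illustration only: it uses a toy bag-of-words embedding and made-up chunk data in place of the real embedding model and vector store.

```python
from collections import Counter
import math

# Illustrative transcript chunks: (start_time_seconds, text).
CHUNKS = [
    (0.0, "introduction to neural networks and their history"),
    (95.0, "backpropagation and gradient descent explained"),
    (210.0, "convolutional networks for image recognition"),
]

def embed(text):
    """Bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def matching_timestamps(query, top_k=2):
    """Return (timestamp, text) pairs ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(CHUNKS, key=lambda c: cosine(q, embed(c[1])), reverse=True)
    return ranked[:top_k]
```

A query like "how does gradient descent work" ranks the 95-second chunk first, so the UI can jump straight to that timestamp.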
How I built it
- Implemented RAG pipelines for video and audio
- Processed, chunked, and embedded the transcript and video clips, identifying important scene changes
- Leveraged GPT-4 to generate timestamped section breakdowns and answer user queries, with retrieved vector-store chunks as context
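The chunking step above can be sketched as follows. This is a simplified, self-contained version under assumed inputs: `segments` is a list of `(start_seconds, text)` pairs such as a transcript API returns, and each chunk keeps the start time of its first segment so answers can cite a timestamp.

```python
def chunk_transcript(segments, max_chars=200, overlap=1):
    """Group (start_seconds, text) transcript segments into chunks.

    Each chunk is capped near `max_chars` characters and tagged with the
    start time of its first segment. `overlap` repeats the last N segments
    of a chunk at the start of the next one, preserving context across
    chunk boundaries (a common RAG chunking heuristic).
    """
    chunks, cur, cur_len = [], [], 0
    for seg in segments:
        _, text = seg
        if cur and cur_len + len(text) > max_chars:
            # Flush the current chunk, keyed by its first segment's time.
            chunks.append((cur[0][0], " ".join(t for _, t in cur)))
            cur = cur[-overlap:] if overlap else []
            cur_len = sum(len(t) for _, t in cur)
        cur.append(seg)
        cur_len += len(text)
    if cur:
        chunks.append((cur[0][0], " ".join(t for _, t in cur)))
    return chunks
```

In the real pipeline each chunk's text would then be embedded (e.g. with text-embedding-3-small) and stored in ChromaDB alongside its timestamp.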
Challenges I ran into
- Frontend deployment with Vercel
- Rate limits when extracting YouTube transcripts via third-party APIs
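One standard way to soften third-party rate limits is retrying with exponential backoff and jitter. A minimal sketch, assuming `fetch` is any zero-argument callable wrapping the transcript request (the actual API wrapper is not shown here):

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call `fetch`, retrying on failure with exponential backoff.

    Waits base_delay * 2**attempt seconds (plus small random jitter)
    between attempts, and re-raises the last error once retries are
    exhausted. `fetch` is assumed to raise on a rate-limit response.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The jitter keeps parallel workers from retrying in lockstep and hammering the API at the same instant.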
What's next for Multimodal Video Analysis Tool
- Grouping queries for YouTube playlists
- Exploring open-source embedding models and LLMs
Built With
- chromadb
- clip
- gpt
- langchain
- react-native
- text-embedding-3-small
- vercel
- yt-dlp