A tool that searches videos by semantic content has obvious use cases, from finding videos online to finding videos shot on one's own phone or other devices. We think a video search tool that shows exactly where in a clip a search query matches would be immensely useful for video editors. Imagine a documentary film editor whose job is to sift through hundreds of hours of footage. If the editor could quickly find every moment in the clips that contains a "sunrise", it would save considerable organizational time and effort, and could greatly accelerate the creative process.
What it does
This application allows content-based search of a video library using speech input. A web UI is provided for making queries and viewing the relevant videos.
How I built it
- TorchAudio for converting the spoken query to text
- Text embedding extraction, using spaCy for preprocessing and fastText for the embeddings
- TorchHub for pre-trained computer-vision models and utilities
- An object detection model applied to individual video frames
- A mapping from audio queries to a set of predefined classes compatible with the object detector, which returns the indices of frames containing that object
- A frontend that passes the output of each step above to the next and presents the video clips containing the queried object
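The query-to-class mapping and frame lookup described in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny hand-written vectors in `TOY_VECTORS` stand in for real fastText embeddings, `DETECTOR_CLASSES` is a hypothetical class list, the per-frame detections are assumed to be already computed, and the `to_segments` helper (merging frame indices into time ranges) is an added assumption about how results might be shown to the user.

```python
import math

# Hypothetical 3-d word vectors standing in for fastText embeddings.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.1],
    "truck": [0.1, 0.8, 0.2],
    "person": [0.1, 0.1, 0.9],
}

# Hypothetical set of classes the object detector can report.
DETECTOR_CLASSES = ["dog", "car", "person"]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_class(query_word, vectors=TOY_VECTORS, classes=DETECTOR_CLASSES):
    """Return the detector class closest to the query word, or None if the word is unknown."""
    if query_word not in vectors:
        return None
    q = vectors[query_word]
    return max(classes, key=lambda c: cosine(q, vectors[c]))


def frames_with_class(detections, target):
    """Return indices of frames whose detected label set contains the target class."""
    return [i for i, labels in enumerate(detections) if target in labels]


def to_segments(frame_indices, fps, max_gap=1):
    """Merge sorted frame indices into (start_sec, end_sec) segments.

    Frames at most max_gap apart are treated as one continuous segment.
    """
    segments = []
    for idx in frame_indices:
        if segments and idx - segments[-1][1] <= max_gap:
            segments[-1][1] = idx
        else:
            segments.append([idx, idx])
    return [(start / fps, (end + 1) / fps) for start, end in segments]


# Example: one set of detected labels per frame, sampled at 2 frames per second.
detections = [{"dog"}, {"dog", "person"}, set(), {"car"}, {"dog"}, {"dog"}]
target = nearest_class("puppy")
hits = frames_with_class(detections, target)
segments = to_segments(hits, fps=2)
```

With real embeddings, a query word outside the detector's vocabulary (here "puppy") would still resolve to the closest supported class, which is one way to soften the limited-vocabulary problem of a fixed-class detector.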
Challenges I ran into
- Audio-text model performance on real-world speech
- Suitability of the object detection model for the task, including the wide range of object scales, domain transfer problems, and a limited class vocabulary
Accomplishments that I'm proud of
- Finishing this application
What I learned
- Getting a multi-module application working in a short time using pre-trained PyTorch models and the provided utilities
What's next for Intelligent Video Search
- Adapting the models to our specific application through retraining and fine-tuning
- Building a more efficient and robust inference engine that better handles our queries and video data