Monkey Search - Intelligent Video Search with PyTorch

Inspiration

A tool that searches videos by semantic content has obvious use cases, from finding videos online to finding footage shot on one's own phone or other devices. We think a video search tool that shows exactly where in a clip a query matches would be immensely useful for video editors. Imagine a documentary film editor whose job is to sift through hundreds of hours of footage. If the editor could quickly find every moment in the clips that contains a "sunrise", it would save a lot of organizational time and effort, and could greatly accelerate the creative process.

What it does

This application allows content-based search of a video library using speech input. A web UI is provided for making queries and viewing the relevant videos.

How I built it

Uses the TorchAudio package for audio-to-text conversion
Extracts text embeddings, using spaCy for preprocessing and fastText for embedding
Uses TorchHub for pre-trained computer-vision models and utilities
Applies an object detection model to individual video frames
Maps audio queries to a set of predefined classes compatible with the object detector, and returns the indices of frames containing that object
Provides a frontend that passes inputs and outputs between the steps above and presents the video clips containing the queried object
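
The frame lookup in the last backend step can be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes the detector output has already been reduced to one set of class labels per frame, and the names `frames_with_object` and `group_into_clips`, along with the `max_gap` parameter, are hypothetical.

```python
from typing import List, Set, Tuple


def frames_with_object(frame_labels: List[Set[str]], target: str) -> List[int]:
    """Return the indices of frames whose detected labels include `target`."""
    return [i for i, labels in enumerate(frame_labels) if target in labels]


def group_into_clips(indices: List[int], max_gap: int = 1) -> List[Tuple[int, int]]:
    """Merge sorted frame indices into (start, end) ranges.

    Two consecutive hits belong to the same clip when they are at most
    `max_gap` frames apart (an assumed rule, chosen for illustration).
    """
    clips: List[Tuple[int, int]] = []
    for idx in indices:
        if clips and idx - clips[-1][1] <= max_gap:
            # Extend the current clip to cover this frame.
            clips[-1] = (clips[-1][0], idx)
        else:
            # Start a new clip at this frame.
            clips.append((idx, idx))
    return clips


# Hypothetical per-frame detections for a six-frame video.
detections = [{"dog"}, {"dog", "person"}, set(), {"car"}, {"dog"}, {"dog"}]
hits = frames_with_object(detections, "dog")
clips = group_into_clips(hits)
```

Grouping hits into ranges is one way a frontend could present a handful of clips instead of a long list of individual frame numbers.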

Challenges I ran into

Audio-to-text model performance on real-world speech
Suitability of the object detection model for the task, including the wide range of object scales, domain transfer problems, and a limited vocabulary

Accomplishments that I'm proud of

Finishing this application

What I learned

Getting a multi-module application working in a short time using pre-trained PyTorch models and the provided utilities

What's next for Intelligent Video Search

Preparing models for our specific application using retraining and fine-tuning techniques
Building a more efficient and robust inference engine that better captures our queries and video data

Built With

amazon-web-services
numpy
php
python
pytorch
scipy
torchaudio
torchtext
torchvision

Submitted to

PyTorch Summer Hackathon at Menlo Park

Created by

Designed and implemented the main backend program that integrated all of the arms of our project. Received video analysis data from the torchvision engineer, language embedding data from the torchtext engineer, and speech-to-text data from the torchaudio engineer, and architected the backend system and algorithms so that all of the modules communicate with each other and output a formatted result for the frontend engineer to use. Also implemented a systematic process for checking cosine similarity between the search query embedding and the known classes, in order to allow flexible rather than rigid search terms.
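
The similarity check described above might look like the following sketch. It is an illustration under stated assumptions, not the project's implementation: the three-dimensional vectors stand in for real word embeddings, and the `best_class` helper and the 0.5 threshold are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_class(
    query_vec: Sequence[float],
    class_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, not a value from the project
) -> Optional[str]:
    """Return the known class most similar to the query, or None if no
    class reaches the similarity threshold."""
    best_name, best_score = None, -1.0
    for name, vec in class_vecs.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Toy embeddings: a query close to "dog" and one unrelated to any class.
class_vecs = {"dog": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
match = best_class([0.9, 0.1, 0.0], class_vecs)
no_match = best_class([0.0, 0.0, 1.0], class_vecs)
```

Returning `None` below the threshold lets the caller report that the detector has no class close enough to the query, instead of forcing a poor match.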

Alexander Gao
Built the video inference module for use with a pre-trained SSD model. It includes post-processing and storage of inference results for all videos in the library.

Aaron Long
I did the UI/UX, the backend, and the integration with the PyTorch AI engine.

Ivan Lozano
I like to win hackathons and hunt dragons
I did search query embedding and class embedding extraction for comparing and searching for similar words.

Private user
Qing Yin