With the recent dramatic advances in image recognition, we can piece together technologies from multiple fields to begin tackling the complexity of extracting information from videos in a broad and useful manner, namely with text. This was first made possible by the research success of Stanford researchers Andrej Karpathy and Fei-Fei Li, who trained a model, dubbed NeuralTalk, to describe the scenes depicted in images. Image recognition had been making steady leaps in classification accuracy, but here, finally, was a way to capture verbs and deduce actions.
Our hack uses this technology to caption successive frames of a video and then match the best one to a query. After sampling the video, each frame is first encoded by a 16-layer convolutional neural network, which yields a high-level "strong" representation. These representations are then projected into a multimodal embedding space and unrolled through a long short-term memory (LSTM) recurrent neural network that maps them to a language model. Lucky for us, great pretrained models exist for all of these architectures!
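The sampling step above can be sketched in a few lines. This is a minimal illustration, not our exact code: the stride logic is the real idea, while `caption_frame` stands in for the whole CNN-plus-LSTM captioning pipeline and is a hypothetical placeholder.

```python
def sample_frame_indices(total_frames, fps, seconds_between_samples=1.0):
    """Pick the indices of frames to caption, one every
    `seconds_between_samples` seconds of video."""
    step = max(1, int(fps * seconds_between_samples))
    return list(range(0, total_frames, step))

def caption_frames(frames, caption_frame):
    """Run the (pretrained) captioning model over each sampled frame.
    `caption_frame` is a stand-in for the CNN -> LSTM pipeline."""
    return [caption_frame(f) for f in frames]
```

For a 25 fps clip sampled once per second, `sample_frame_indices(100, 25)` returns `[0, 25, 50, 75]`, so a few-minute video only needs a couple hundred captioning passes.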
Incorporating the successful object-recognition capabilities of the Imagga API, we then only had to match our extracted information to a user's query. Here we used MetaMind's semantic similarity feature, which compares the query with each generated caption. By analyzing the sentences' contexts, word vectors, synonyms, and syntactic roles (basically everything short of dark magic), MetaMind's semantic similarity lets us match NeuralTalk's neutral phrasing against a diverse set of user queries. If any frame's caption clears a similarity threshold against the query, we accept that frame; otherwise we report that the query was not found.
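The thresholded matching works roughly as follows. As a sketch, the toy `cosine_similarity` below is a bag-of-words stand-in for MetaMind's much richer semantic similarity API; only the accept/reject logic mirrors what we actually did.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a, b):
    """Toy bag-of-words cosine similarity; in the app this role was
    played by MetaMind's semantic similarity feature."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query, captions, threshold=0.3):
    """Return (frame_index, score) for the best-matching caption,
    or None if no caption clears the threshold."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(captions)]
    idx, score = max(scored, key=lambda t: t[1])
    return (idx, score) if score >= threshold else None
```

With captions like `["a dog runs on grass", "a man rides a bike"]`, the query `"dog on grass"` matches frame 0, while `"purple elephant"` falls below the threshold and returns `None`, i.e. "query not found".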
Why is this app cool? It is a testament to how various disciplines of smart, high-level data analysis can come together to interpret our hardest and most elaborate environments. Plus, there's no shortage of demand for searching video content these days.
TL;DR: A preprocessing step and many neural networks later, like magic, scenes in a video are represented as text and compared against a search query.
Update: We ended up winning third prize at GreyLock Hackfest 2015, and a bunch of Oculus Rifts to hack on.