We can search videos on YouTube by title, description, uploader, and so on, but why can't we search by what's actually IN the video? Current search strategies rely too heavily on metadata, which often doesn't correlate with the most important factor in a video search: the actual content. If I want a YouTube video of a bicycle that isn't necessarily titled "Bicycle", I should be able to find it by searching the visual and audio content itself, and get relevant matches down to the second.
What it does
InTube uses Google's Cloud Vision machine learning libraries to analyze frames in YouTube videos for their actual content.
When the user submits a video for processing, InTube fetches the video's storyboard, which is essentially a collection of frames. It feeds these frames into the Google Cloud Vision API, which tags each frame with semantic labels describing what's on screen. The results are then sent to Algolia, which indexes the labels, the videos, and the timestamps where each label appears, along with some additional data about the video. The transcript of the video's audio is also parsed and timestamped, so spoken words can be indexed too.
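To make the indexing step concrete, here is a minimal sketch of how per-frame labels could be flattened into searchable records. The function name `build_records`, the field names, and the assumption that frames are evenly spaced are all illustrative and not taken from InTube's code. The only Algolia-specific detail is `objectID`, the attribute Algolia uses to identify a record.

```python
def build_records(video_id, frame_labels, seconds_per_frame):
    """Flatten per-frame labels into one record per (frame, label) pair.

    frame_labels: list where item i holds the labels for storyboard frame i.
    seconds_per_frame: assumed constant spacing between storyboard frames.
    """
    records = []
    for index, labels in enumerate(frame_labels):
        timestamp = index * seconds_per_frame
        for label in labels:
            records.append({
                # Hypothetical ID scheme: unique per video, frame, and label.
                "objectID": f"{video_id}-{index}-{label}",
                "video_id": video_id,
                "label": label,
                "timestamp": timestamp,
                "source": "vision",  # transcript words could use "transcript"
            })
    return records
```

Storing one record per label occurrence keeps each search hit tied to a single timestamp, at the cost of more records per video.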
Finally, the website provides a quick and easy search experience where the user can search for the places in a video where a term appears. For example, if the user searches "man", the web interface brings up the exact frames where a man shows up in YouTube videos. The user can see exactly where a man is seen or heard within the video and can click to jump straight to that spot.
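Jumping to a spot can be as simple as linking to the video with a start time. The sketch below builds such a link using YouTube's `t` query parameter. The helper name `jump_url` is hypothetical, and this is only one way to do it (an embedded player could seek instead).

```python
from urllib.parse import urlencode


def jump_url(video_id, seconds):
    """Build a YouTube watch URL that starts playback at the given second."""
    # Fractional seconds are truncated, since the link uses whole seconds.
    query = urlencode({"v": video_id, "t": f"{int(seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"
```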
How we built it
We built InTube with Flask on the backend, and Bootstrap and Handlebars for rendering the frontend. The Google Cloud Vision API is used heavily to process and tag the videos. Video processing runs in background tasks using Celery and Redis. Algolia indexes and searches all of the data. Google App Engine deploys and hosts the project.
Challenges we ran into
The very first challenge was grabbing frames from the video. It turns out YouTube doesn't have a good API for this. After poking around, we realized we could use the frames that YouTube generates for the seek bar preview shown when jumping around in a video. YouTube stores these as mosaics of storyboard images, and we found we could grab them by scraping the site. We had a similar experience when scraping the text transcript from YouTube.
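Splitting a storyboard mosaic back into individual frames is mostly arithmetic. The sketch below assumes the tiles sit on a regular grid in row-major order and that frames are evenly spaced across the video. The grid size and both function names are illustrative, since the real layout depends on what YouTube serves for a given video.

```python
def tile_boxes(sheet_width, sheet_height, columns, rows):
    """Return (left, top, right, bottom) boxes for each tile, row by row."""
    tile_w = sheet_width // columns
    tile_h = sheet_height // rows
    boxes = []
    for row in range(rows):
        for col in range(columns):
            left = col * tile_w
            top = row * tile_h
            boxes.append((left, top, left + tile_w, top + tile_h))
    return boxes


def frame_timestamps(duration_seconds, frame_count):
    """Approximate the timestamp of each frame, assuming even spacing."""
    interval = duration_seconds / frame_count
    return [round(i * interval, 2) for i in range(frame_count)]
```

The boxes use the same (left, upper, right, lower) order that Pillow's `Image.crop` expects, so each one could be passed to it directly to cut a frame out of the mosaic.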
Another challenge was writing good code to run the processing in the background in Flask and report progress back to the frontend while a video is being processed. We learned to use Celery for background tasks, and added an endpoint that the frontend can query to get the state of a background operation.
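As a rough sketch of the status endpoint's logic, the helper below turns a task state into a small JSON-friendly payload. It is written as a plain function so it runs without Celery installed. In a real app the inputs would likely come from Celery's `AsyncResult(task_id).state` and `.info`. The `PROGRESS` state and the `current`/`total` keys are assumptions: a custom state the task itself would have to report, not something Celery sets on its own.

```python
def status_payload(state, info=None):
    """Map a background task's state to a progress payload for the frontend."""
    if state in ("PENDING", "STARTED", "RETRY"):
        # Task is queued or running but has not reported progress yet.
        return {"state": state, "percent": 0, "done": False}
    if state == "PROGRESS":
        # Hypothetical custom state carrying {"current": n, "total": m}.
        meta = info if isinstance(info, dict) else {}
        current = meta.get("current", 0)
        total = meta.get("total", 0)
        percent = int(100 * current / total) if total else 0
        return {"state": state, "percent": percent, "done": False}
    if state == "SUCCESS":
        return {"state": state, "percent": 100, "done": True}
    # Anything else is treated as a failure; info is usually the exception.
    error = str(info) if info is not None else "unknown error"
    return {"state": state, "percent": 0, "done": True, "error": error}
```

A Flask route could return this dictionary as JSON, with the frontend polling it every second or so until `done` is true.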
Accomplishments that we're proud of
The project works! It's a smooth way to find out exactly where something occurs in a YouTube video, and as far as we know it hasn't really been done before. The user can instantly see where a term occurs and jump straight to it.
What we learned
This was our first time using Google's machine learning APIs. We now have a really good understanding of how they work! We also solidified our understanding of Flask, particularly using it alongside background operations.
What's next for InTube
We plan to expand the search capability by indexing many more videos (collected through user submissions), and to improve the UI significantly. We also plan to reduce processing time by making better use of the Cloud Vision API. Most importantly, we want to add "fuzzy search", where related or synonymous keywords also show up in results: for example, "cats" when the user searches "kitten."