Inspiration

When we study for physics, it's frustrating to sit through an entire 60-minute lecture or manually scrub through a video for the one topic we need, so we made Ctrl + Frame.

What it does

It searches through a video using both audio and image recognition based on the given search terms and returns the timestamps where those terms occur. The user can click on a timestamp and the website will automatically play the video from that point.
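As an illustration, the timestamp lookup can be sketched like this. The data shapes and function names below are hypothetical, not our actual code; assume search results arrive as a map from detected labels to the seconds where they appear.

```python
# Sketch of turning detected labels into clickable timestamp buttons.
# `matches` maps each detected label to the seconds where it appears.

def format_timestamp(seconds: int) -> str:
    """Render a second offset as an M:SS label for a timestamp button."""
    return f"{seconds // 60}:{seconds % 60:02d}"

def search_video(matches: dict, term: str) -> list:
    """Return formatted timestamps where the term was detected, in order."""
    return [format_timestamp(s) for s in sorted(matches.get(term.lower(), []))]

matches = {"pendulum": [75, 412], "friction": [130]}
print(search_video(matches, "Pendulum"))  # -> ['1:15', '6:52']
```

Clicking a button would then seek the player to the raw second offset behind the label.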

How we built it

We split the task into four parts: the web interface, image classification, audio classification, and web design. The web interface was designed to be simple and easy to use; it was built with HTML, CSS, and JavaScript and is integrated with the image and audio classification by passing keywords as search parameters. The image classification is based on a Clarifai machine-learning model trained on Clarifai's general dataset (containing everyday objects). The video is split into its frames, every second's worth of frames is grouped together, and the objects in each group of frames are classified. The start time of each group is recorded, and that is the value the user searches against. The audio classification was built with Google Cloud Speech: the video is split into intervals of audio, and each interval is transcribed and its time noted.
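The frame-grouping step described above can be sketched roughly as follows. Here `classify` stands in for the Clarifai API call, and all names and data are illustrative, not our production code:

```python
# Sketch of the frame-grouping step: frames are bucketed into one-second
# groups, each group is classified, and the group's start time is kept
# so the resulting labels can be searched later.

def group_frames(frames, fps):
    """Split a flat list of frames into one-second groups."""
    return [frames[i:i + fps] for i in range(0, len(frames), fps)]

def index_video(frames, fps, classify):
    """Map each detected label to the seconds at which it appears."""
    index = {}
    for second, group in enumerate(group_frames(frames, fps)):
        for label in classify(group):
            index.setdefault(label, []).append(second)
    return index

# Toy stand-in classifier: a "frame" here is just a set of object names.
fake_classify = lambda group: set().union(*group)
frames = [{"whiteboard"}] * 30 + [{"whiteboard", "pendulum"}] * 30
print(index_video(frames, 30, fake_classify))
```

In the real pipeline the second offsets in the index are exactly what the web interface's search feeds on.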

Challenges we ran into

We faced many challenges along the way, such as APIs not functioning properly or not integrating cleanly with our code. We had to try many alternatives before everything worked. The most challenging problem was combining several different APIs into one coherent product that is simple and effective.

Accomplishments that we're proud of

The achievement we are most proud of is how our webapp can be useful in a wide variety of cases: it can analyze long lectures so you can easily search for the material you need the most practice with; it can automatically filter through many videos to find the ones with relevant content; and, in a more serious context, Ctrl + Frame can scan uploaded content for threatening or otherwise inappropriate material.

As far as features go, we're most excited about the range of searches our webapp can perform at once; Ctrl + Frame is able to search by single words, phrases, and images simultaneously using its wide range of video analysis tools. We are also proud of how intuitively search results are displayed: they are shown to the user as timestamp buttons, each of which jumps to the exact spot in the video where the search matched.
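Running the word, phrase, and image searches at once amounts to merging the audio and image indexes so one query checks both. A minimal sketch, assuming both indexes map labels to lists of seconds (the names and data are illustrative):

```python
# Sketch of merging the audio and image indexes into one searchable map.

def merge_indexes(*indexes):
    """Union several {label: [seconds]} maps, deduplicating and sorting."""
    merged = {}
    for index in indexes:
        for label, seconds in index.items():
            merged.setdefault(label, set()).update(seconds)
    return {label: sorted(s) for label, s in merged.items()}

image_index = {"pendulum": [75]}
audio_index = {"pendulum": [74, 412], "gravity": [10]}
print(merge_indexes(image_index, audio_index))
```

A single query against the merged map then yields timestamp buttons regardless of whether the match came from the audio track or the frames.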

More generally, we were excited to have applied the skills we are learning in our Computer Science classes at school, including Parallel Computing, Artificial Intelligence, and Computer Vision. We feel that our project is not only useful, but was a valuable learning experience for all four of us.

What we learned

We learned a great deal about natural language processing and image recognition. We were able to learn from our mistakes and ultimately deliver a finished, working project.

What's next for Ctrl + Frame

We hope to further improve the efficiency of our image recognition and our natural language processing. We would also like to package our tool as a Chrome extension to provide a unique, useful utility for everyone. Ultimately, we want to help others save some of their valuable time.
