SummaVid

Web Interface
A flowchart displaying how the app operates
Whisper Backend
An example of the summary provided by the app
An example of the important dates from the file
An example of the transcript from the file

What it does

We built a fully functioning app that takes in any video or audio file up to roughly 2 hours long (about 12,000 words) and produces a set of bullet-point summaries tailored to the context (e.g. for lectures: a basic summary and a list of important assignment dates). It works with surprisingly low-quality audio, including audio recorded from the back row of a lecture hall.

Inspiration

As university students, we were constantly required to watch pre-lecture videos and re-watch lecture recordings to fully understand concepts. These videos totalled tens of hours per course and took a lot of time away from other, more enjoyable things in life. We still wanted to learn from them, but in less time, so that we could spend the rest more meaningfully.

How we built it

The backend of SummaVid was written in Python, using Whisper and OpenAI's GPT API; the frontend was built with HTML and CSS and served through Flask.

The uploaded file passes through the Whisper speech-to-text model, which produces a transcript. That transcript is then sent to OpenAI's GPT via the API, and the bullet-point summary that GPT returns is passed back to the frontend and shown to the user.
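The flow above can be sketched roughly as follows. This is a minimal illustration, not our actual code: the prompt wording, the `PROMPTS` table, and the function names are all hypothetical, and the speech-to-text and GPT steps are passed in as plain callables so the sketch does not depend on any particular library version.

```python
from typing import Callable, Dict

# Hypothetical prompt templates, one per kind of output the user asks for.
PROMPTS: Dict[str, str] = {
    "summary": "Summarize the following transcript as concise bullet points:",
    "dates": "List every important date or deadline mentioned in the "
             "following transcript as bullet points:",
}


def build_prompt(transcript: str, mode: str = "summary") -> str:
    """Combine the instruction for `mode` with the transcript text."""
    if mode not in PROMPTS:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{PROMPTS[mode]}\n\n{transcript.strip()}"


def summarize_file(
    path: str,
    transcribe: Callable[[str], str],  # speech-to-text step (e.g. Whisper)
    complete: Callable[[str], str],    # text-completion step (e.g. GPT)
    mode: str = "summary",
) -> Dict[str, str]:
    """Run file -> transcript -> bullet points and return both texts."""
    transcript = transcribe(path)
    bullets = complete(build_prompt(transcript, mode))
    return {"transcript": transcript, "summary": bullets}
```

With the open-source `whisper` package, `transcribe` could be something like `lambda p: whisper.load_model("base").transcribe(p)["text"]`, and `complete` would wrap a call to OpenAI's chat completion endpoint. Returning both the transcript and the summary lets the frontend display either one.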

Challenges we ran into

One challenge we couldn't solve was adding timestamps to the summary. Whisper can output several file types, including .vtt, which contains timestamps. However, a timestamped transcript was often too long to run through GPT, which limited the video length to about 30 minutes. We eventually abandoned the idea because that restriction undermined the main benefit of the app: condensing long recordings.
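To illustrate why the timestamped output is so much longer, here is a small sketch that strips the cue timings from a .vtt transcript and measures how much of the file they take up. The function names are hypothetical, and character count is used only as a rough proxy for the token count that GPT actually limits.

```python
import re

# Matches a WebVTT cue timing line such as "00:04.000 --> 00:09.500".
# The hours field is optional in WebVTT.
CUE_TIMING = re.compile(
    r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)


def vtt_to_text(vtt: str) -> str:
    """Drop the header, cue timings, and blank lines; keep the spoken text."""
    kept = []
    for line in vtt.splitlines():
        line = line.strip()
        if not line or line.startswith("WEBVTT") or CUE_TIMING.match(line):
            continue
        kept.append(line)
    return " ".join(kept)


def timestamp_overhead(vtt: str) -> float:
    """Fraction of the .vtt characters that are not spoken text."""
    if not vtt:
        return 0.0
    return 1 - len(vtt_to_text(vtt)) / len(vtt)


SAMPLE = """WEBVTT

00:00.000 --> 00:04.000
Welcome to the lecture.

00:04.000 --> 00:09.500
Assignment one is due next Friday.
"""

print(vtt_to_text(SAMPLE))
print(f"overhead: {timestamp_overhead(SAMPLE):.0%}")
```

Because every short cue carries its own timing line, the overhead grows with the number of cues rather than staying constant, which is consistent with the shorter video limit we hit.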

Other than that, we ran into smaller challenges caused by our inexperience with the tools we needed, from APIs to Flask. We spent a significant amount of time figuring out how each tool could be used, and debugging took longer than it should have as a result.

Accomplishments that we're proud of

The app can process a video or audio file and produce an accurate summary of its contents. It can also extract more specific details from the file, such as important dates mentioned during lectures, which makes it more versatile.

What we learned

First, we learned to work with APIs, open-source models, and Flask. We also learned to bounce ideas off each other to refine them and to streamline the debugging process. Finally, we learned to tweak our product based on testing results and to add options, such as language-specific models, that improve the quality of the summary.

What's next for SummaVid

For future expansion, we could use vector databases to attach timestamps to the bullet points, allowing users to jump back to that point in their original file if they choose. Another idea is to host the app on a web server so that users can access it without downloading anything beforehand.

Built With

css
flask
html
openai
python
whisper

Submitted to

Hack The Classroom
- Winner Third Overall

Created by

I was responsible for coming up with the idea and used my previous knowledge of front-end development to assist the team. I also integrated the API calls and Whisper so that they worked together cohesively, wrote code for both of those components, and documented the project.

Ivan Ng
I mainly worked on the backend, using Python and object-oriented programming to ensure that outputs from Whisper and GPT are processed properly, matched to the user-specified video type, and sent to the frontend. I was also involved in minor parts of the webpage, creating the textboxes that display summaries and transcriptions, and I designed the app logo.

Justin Cheung
My main contribution was the presentation. As a student majoring in Business and Computer Science, I was comfortable presenting while still understanding most of the technical work. I wrote up the project story, storyboarded the entire demo video, and edited it to be presentable.

I also listened to the other members' ideas and helped the team iterate on them to find the best approach.

Hayden Chan