What it does
We built a fully functioning app that is able to take in any video or audio of lengths up to around 2 hours (12000 words), and provide a series of different bullet point summaries depending on the context (e.g. for lectures: basic summary, important assignment dates). It works with surprisingly low quality audio (literally audio recorded from the back row of a lecture hall!)
Inspiration
As students studying in university, we were constantly required to watch pre-lecture videos and re-watch lecture videos to fully understand concepts. These videos totalled tens of hours per course, and it took a lot of time away from other, more enjoyable things in life. We wanted to still learn from these videos, but not use as much time so that we could spend our time more meaningfully.
How we built it
The backend of SummaVid was made with Python using a combination of Whisper, and OpenAI's GPT API; the frontend of SummaVid was made with HTML, Flask, and CSS.
The inputted file passes through the Whisper Speech-To-Text model and produces a transcript. That transcript is then sent to OpenAI's GPT via the API, and the summarized bullet points are produced, returned by GPT, and returned to the front end to the user.
Challenges we ran into
One challenge we couldn't solve was the addition of timestamps to the summary. We know that Whisper is capable of outputting many types of files since we could output file types such as .vtt, which included timestamps. However, this often resulted in a transcript too long to run through GPT, which limited the video length to ~30 minutes. We eventually abandoned this idea as the restrictions outweighed the benefits of being able to condense long recordings.
Other than that, there were more trivial challenges related to inexperience with the tools we needed to use, from APIs to Flask. We spent a significant amount of time figuring out how each of these tools could be used, and debugging took longer than it should've had as a result of our inexperience.
Accomplishments that we're proud of
The app is able to process any video and produce a very accurate summary of the file. In addition, the app can also extract more specific details from the file, such as important dates said during lectures, making its usage more versatile.
What we learned
Firstly, we learned to work with APIs, open-source models, and Flask. Furthermore, we also learned to bounce ideas off of each other to refine our ideas and streamline the debugging process. Finally, we learned to tweak and optimize our product based on testing results and add additional options such as language-specific models to optimize the quality of the summary.
What's next for SummaVid
For future expansion, we can include the timestamps for the bullet points, allowing the user to revisit that point from their original file if they so choose by using vector databases. Another idea would be to put this app onto a website on a web server, allowing the user to access the app without the need to download anything beforehand.
Log in or sign up for Devpost to join the conversation.