We're four college freshmen who were expecting new experiences with interactive and engaging professors in college; however, COVID-19 threw a wrench in that (and a lot of other plans). Since all of us are currently learning online through various video lecture platforms, we found that these lectures sometimes move too fast or are just flat-out boring. Summaread is our solution: it transforms video lectures into an easy-to-digest format.
What it does
Summaread captures lecture content and uses an AI NLP pipeline to automatically generate a condensed note outline. All you need to do is provide a YouTube link to the lecture or a transcript, and the corresponding outline is quickly generated for reading. Summaread currently generates outlines that are about 10% of the original transcript length. The outline can also be downloaded as a PDF for annotation. In addition, our tool uses the Google Cloud API to generate a list of Key Topics with links to Wikipedia to encourage further exploration of lecture content.
How we built it
Our project is composed of several interconnected components, which we detail below:
Our product automatically detects when lecture slides change, which improves the NLP model's summarization results. This tool uses the Google Cloud Platform API to detect changes in lecture content and records timestamps accordingly.
We use the Hugging Face summarization pipeline to automatically summarize groups of text whose word counts fall within a set range. This is repeated for every group of text previously generated in the Lecture Detection step.
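As a rough illustration of how recorded slide-change timestamps can be used, the sketch below groups transcript segments by the slide they were spoken over. It assumes the timestamps have already been obtained; the function name `group_by_slide`, the `(start_seconds, text)` segment format, and the sample data are hypothetical and not taken from our actual code.

```python
from bisect import bisect_right


def group_by_slide(segments, slide_changes):
    """Group (start_seconds, text) transcript segments by slide.

    slide_changes holds the timestamps (in seconds) at which a new slide
    appears. Both inputs are hypothetical stand-ins for the real data.
    """
    boundaries = sorted(slide_changes)
    groups = [[] for _ in range(len(boundaries) + 1)]
    for start, text in segments:
        # bisect_right counts the slide changes at or before this segment's
        # start time, which is the index of the slide it belongs to.
        groups[bisect_right(boundaries, start)].append(text)
    # Join each slide's segments and drop slides with no speech.
    return [" ".join(group) for group in groups if group]


# Made-up sample data for illustration only.
segments = [
    (0.0, "Welcome to the lecture."),
    (12.5, "Today we cover viruses."),
    (60.0, "A virus needs a host cell."),
    (95.0, "Vaccines train the immune system."),
]
slide_text = group_by_slide(segments, slide_changes=[55.0, 90.0])
```

Each entry of `slide_text` is then one group of text that can be summarized on its own.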
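A minimal sketch of this step follows. It caps each group at a word limit and summarizes every resulting chunk. The 400-word limit and the helper names (`chunk_words`, `summarize_groups`, `make_hf_summarizer`) are assumptions for illustration, and the sketch relies on the default model that the `transformers` `summarization` pipeline loads, in library versions that provide that task, rather than our exact configuration.

```python
def chunk_words(text, max_words=400):
    """Split text into pieces of at most max_words words (assumed limit)."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_groups(groups, summarize, max_words=400):
    """Apply a summarize(text) -> str callable to every chunk of every group."""
    summaries = []
    for group in groups:
        for chunk in chunk_words(group, max_words):
            summaries.append(summarize(chunk))
    return summaries


def make_hf_summarizer():
    """Wrap the Hugging Face summarization pipeline as a plain callable."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    summarizer = pipeline("summarization")  # loads the library's default model

    def summarize(chunk):
        # The pipeline returns a list with one dict per input text.
        return summarizer(chunk)[0]["summary_text"]

    return summarize
```

With the library and a model available, `summarize_groups(slide_groups, make_hf_summarizer())` would return one summary per chunk, where `slide_groups` stands for the groups of text from the previous step. Passing the summarizer in as a callable also makes it easy to swap models.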
Post-Processing and Formatting
Once the summarized content is generated, the text is split into sentences using Natural Language Processing techniques and processed into a set of coherent bullet points. The text is also formatted for easy reading by including “sub-bullet” points that further explain the main bullet point.
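One simple way to picture this formatting step is sketched below: the first sentence of each summary becomes a main bullet and the remaining sentences become sub-bullets. The regex-based sentence split and the function name `to_outline` are simplifying assumptions for illustration, not the NLP techniques we actually used.

```python
import re


def to_outline(summaries):
    """Turn each summary into a main bullet followed by indented sub-bullets."""
    lines = []
    for summary in summaries:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = [
            s.strip()
            for s in re.split(r"(?<=[.!?])\s+", summary.strip())
            if s.strip()
        ]
        if not sentences:
            continue  # skip empty summaries
        lines.append(f"- {sentences[0]}")
        lines.extend(f"    - {s}" for s in sentences[1:])
    return "\n".join(lines)


outline = to_outline([
    "A virus needs a host cell. It cannot replicate alone.",
    "Vaccines train the immune system.",
])
```

A regex split like this will trip over abbreviations such as "e.g.", which is one reason a proper sentence tokenizer is a better fit for real transcripts.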
Key Concept Suggestions
To generate key concepts, we use the Google Cloud Platform API to scan the condensed notes our model generates and provide Wikipedia links accordingly. Examples of Key Concepts for a COVID-19-related lecture would be medical institutions, famous researchers, and related diseases.
The front end of our website was set up with Flask and Bootstrap. This allowed us to quickly and easily integrate our Python scripts and NLP model.
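The selection of Key Concepts can be sketched as a filter over detected entities. The sketch assumes the entity analysis has already run and that each entity is available as a dictionary with a name, a salience score, and an optional Wikipedia URL; the field names, the `key_concepts` function, and the sample entities are illustrative assumptions rather than the API's exact response format.

```python
def key_concepts(entities, limit=5):
    """Return up to `limit` (name, wikipedia_url) pairs, most salient first."""
    # Keep only entities that came back with a Wikipedia link.
    linked = [e for e in entities if e.get("wikipedia_url")]
    linked.sort(key=lambda e: e.get("salience", 0.0), reverse=True)

    seen, concepts = set(), []
    for entity in linked:
        url = entity["wikipedia_url"]
        if url not in seen:  # the same page may be detected more than once
            seen.add(url)
            concepts.append((entity["name"], url))
    return concepts[:limit]


# Made-up sample entities for illustration only.
sample_entities = [
    {"name": "COVID-19", "salience": 0.6,
     "wikipedia_url": "https://en.wikipedia.org/wiki/COVID-19"},
    {"name": "lecture", "salience": 0.2},
    {"name": "World Health Organization", "salience": 0.3,
     "wikipedia_url": "https://en.wikipedia.org/wiki/World_Health_Organization"},
    {"name": "coronavirus disease", "salience": 0.1,
     "wikipedia_url": "https://en.wikipedia.org/wiki/COVID-19"},
]
concepts = key_concepts(sample_entities)
```

Sorting by salience keeps the most central topics at the top of the list, and the `limit` parameter keeps the Key Concepts section short.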
Challenges we ran into
Text summarization is extremely difficult: while there are many powerful algorithms for turning articles into paragraph summaries, there is essentially nothing on shortening conversational sentences, like those found in a lecture, into bullet points.
Our NLP model is quite large, which made it difficult to host on cloud platforms.
Accomplishments that we're proud of
1) Making a multi-faceted application that combines a variety of machine learning and non-machine learning techniques
2) Working on an unsolved machine learning problem (lecture simplification)
3) Real-time text analysis to determine new elements
What we learned
1) First time for multiple members using Flask and doing web development
2) First time using Google Cloud Platform API
3) Running deep learning models makes my laptop run very hot
What's next for Summaread
1) Improving our summarization model with better data pre-processing techniques and a shorter run time
2) Adding more functionality to generated outlines for a better user experience
3) Allowing users to set how much the lecture is condensed