Have you ever tried searching for videos discussing AVL trees in college-level detail on YouTube or MIT OpenCourseWare? Chances are that the information you want is buried within a 50-minute lecture about Binary Search Trees, and you will have to scroll past several entries to find it if the video is not explicitly tagged with "AVL Tree". Even if you do find the lecture video, you still have to sift through it to find where the professor discusses AVL trees. Should you really have to trawl through search results and a whole video just for that snippet?
What it does
Our portal solves exactly that problem. We use machine learning tools to transcribe the audio of each lecture, so that topics discussed in the lecture but absent from the title, description, or tags can be indexed and made searchable. To deepen our analysis, we also parse the PowerPoint slides that accompany each lecture to infer what topics it covers. We treat topics as phrases, groups of words rather than single words, to deliver the best results. The capstone of our implementation is a scalable, asynchronous, automated system that orchestrates all processing of file inputs, letting users easily upload the lectures they want transcribed and analyzed. That is our vision for Lecture Scope: comfortable to use, for both teachers and students.
How I built it
Our system comprises two parts:
1. Server: A Dropwizard server that returns relevant videos for a given query. To resolve the query, we use the tf-idf full-text search functionality of our Elasticsearch cluster, which stores the transcripts of the audio files as well as text phrases extracted from the slide PDFs.
2. Content Builder: An asynchronous collection of services orchestrated using RabbitMQ. The audio processor consumes items from the queue, converts the audio format using SoX, and uses Mozilla's DeepSpeech API to extract the transcript and timestamps. The slides processor uses PyMuPDF to extract each piece of text along with its font style, because our system treats headings in the slides as more important than body text.
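The server-side query described above can be sketched as a small query builder. This is a minimal illustration of a tf-idf-backed full-text search against Elasticsearch; the index and field names (`transcript`, `slide_text`) and the boost factor are assumptions, not the project's actual schema:

```python
def build_lecture_query(user_query, size=10):
    """Build an Elasticsearch query body matching the user's phrase
    against both the audio transcript and the slide text.
    Field names and the slide-text boost are illustrative."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": user_query,
                # boost slide text, since slides state topics explicitly
                "fields": ["transcript", "slide_text^2"],
            }
        },
    }
```

The resulting body would be passed to the cluster's `_search` endpoint; Elasticsearch scores hits with its tf-idf-based relevance model out of the box.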
Challenges I ran into
1. Preprocessing of audio files: Mozilla DeepSpeech works optimally when transcribing short (3-10 s), monophonic, 16 kHz .wav files, and its effectiveness falls off precipitously with the slightest deviation from these specifications. If you try to transcribe a file longer than 30 s, DeepSpeech gets stuck almost every time; if the file is not monophonic and 16 kHz, the transcription is poor. We overcame this challenge by using pydub to split the long audio files into smaller ones, with cuts made at points of silence to maintain maximum integrity of the results. We met the other specifications by learning to use FFmpeg and SoX to preprocess the files before transcription.
2. Maven dependency hell: Dropwizard depends on one version of the Jackson library; Elasticsearch ships with another. Dropwizard uses Logback; Elasticsearch uses Log4j. While wiring these projects together in Gradle, we frequently hit issues where some jar would be missing or conflicting.
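The audio preprocessing step above can be sketched as follows. The SoX invocation converts any input to the mono, 16 kHz, 16-bit WAV that DeepSpeech expects; the exact flags our pipeline used may have differed, and the pydub silence-split parameters shown in the comments are illustrative defaults, not our tuned values:

```python
def sox_command(src, dst):
    """Build the SoX command line that resamples `src` to a mono,
    16 kHz, 16-bit WAV at `dst` (output-format flags precede the
    output file in SoX syntax). The caller runs it via subprocess."""
    return ["sox", src, "-r", "16000", "-c", "1", "-b", "16", dst]

# Splitting at silence, roughly as we did with pydub (parameters illustrative):
# from pydub import AudioSegment
# from pydub.silence import split_on_silence
# chunks = split_on_silence(AudioSegment.from_wav("lecture.wav"),
#                           min_silence_len=500, silence_thresh=-40)
```

Cutting at silence keeps each chunk within DeepSpeech's comfortable length while avoiding cuts mid-word, which would corrupt the transcript at every boundary.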
Accomplishments that I'm proud of
1) Mozilla DeepSpeech API: With our limited background in machine learning, we were at first worried that incorporating machine learning would push our project's scope and difficulty beyond what we could handle. Companies like Veritone invest hundreds of millions of dollars and an entire infrastructure of employees to implement speech-to-text services, and here we were trying to build a prototype as students in 36 hours. Nevertheless, we learned to use the DeepSpeech API effectively, and we got what we wanted: reliable and accurate speech transcription. Our use of machine learning also broadens the project's potential applications, since we can retrain the model for specific purposes or subjects. The default model is trained on conversational speech, so it is pretty much incapable of recognizing the word "cache"; it will almost always interpret the speech as "cash". If we trained our model specifically on computer science-related speech, which we are capable of doing, we could potentially make it more accurate for transcribing computer science lectures than production-level speech-to-text web services like Azure, AWS, and Google Cloud.
2) Setting up custom mappings for Elasticsearch indices: We wrote our own custom analyzers and filters for Elasticsearch, including a stop-word filter so that common English words are ignored in queries.
3) PyMuPDF PDF extraction: We extracted the text along with its font information using the PyMuPDF library in Python, then normalised the font sizes to identify the relevant headings in each document.
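An index definition with a custom analyzer like the one described in point 2 might look like this. It is a minimal sketch using Elasticsearch's built-in `standard` tokenizer, `lowercase` filter, and `_english_` stop-word list; the analyzer and field names are illustrative, not our production mapping:

```python
# Illustrative Elasticsearch index settings: a custom analyzer that
# lowercases tokens and drops common English stopwords, applied to
# the transcript field. Names ("lecture_analyzer") are assumptions.
LECTURE_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "lecture_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stop"],
                }
            },
            "filter": {
                "english_stop": {"type": "stop", "stopwords": "_english_"}
            },
        }
    },
    "mappings": {
        "properties": {
            "transcript": {"type": "text", "analyzer": "lecture_analyzer"}
        }
    },
}
```

With this in place, a query like "what is an AVL tree" effectively matches on "avl" and "tree" rather than on "what", "is", and "an".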
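The font-size normalisation in point 3 boils down to comparing each text span's size against a document-wide baseline. Here is a minimal sketch over `(text, font_size)` pairs, such as those recoverable from PyMuPDF's `page.get_text("dict")` output; the 1.25 cutoff is an illustrative threshold, not our tuned value:

```python
def extract_headings(spans, factor=1.25):
    """Given (text, font_size) pairs pulled from a PDF, treat any span
    whose size is well above the document's median font size as a
    heading. `factor` controls how far above the median counts."""
    sizes = sorted(size for _, size in spans)
    median = sizes[len(sizes) // 2]
    return [text for text, size in spans if size >= factor * median]
```

Normalising against the median (rather than a fixed point size) makes the heuristic robust across slide decks with different base font sizes.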
What I learned
System design, Mozilla DeepSpeech, asynchronous processing, N-gram phrase matching, and lots of other stuff.
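The N-gram phrase matching mentioned above, which lets multi-word topics like "binary search tree" be indexed as single units, can be sketched in a few lines (the function name and cutoff are illustrative):

```python
def ngrams(tokens, n_max=3):
    """Generate every contiguous 1- to n_max-word phrase from a token
    list, so multi-word topics are indexed as phrases rather than as
    unrelated single words."""
    phrases = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[i:i + n]))
    return phrases
```

For example, `ngrams(["avl", "tree"], n_max=2)` yields `["avl", "tree", "avl tree"]`, so a search for the exact phrase "avl tree" can hit directly.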
What's next for Lecture-Scope
Well, we have already pitched the idea to some of our professors. Our vision is to create a widely adopted educational portal that can also surface trending lectures on a particular topic and host private lectures.