Last semester, I had to review numerous recorded lectures to prepare for an upcoming exam, amounting to well over ten hours of material. As I worked through the recordings, a recurring problem emerged: a significant portion of every lecture consisted of "dead time", where the lecturer was writing on the board or pondering quietly, or where the recording had simply not started or ended yet. I could get around this by fast-forwarding through those parts or by watching at a faster playback rate, but each approach came with its own drawbacks. The former pulled my attention away from the recording, and the latter made the material harder to understand. We set out to create a program that makes watching lectures, and therefore studying, more effective.
What it does
BLecOpS, the Boilermaker's Lecture Optimizing System, is a program intended to make viewing digital lectures more efficient by cutting out the portions of the video where nothing is said. It is accompanied by a website that handles the submission process and serves the converted videos.
How I built it
BLecOpS applies a statistical sampling methodology known as _bootstrapping_ to determine the quietest speaking volume in a given lecture. We assume certain properties of the volume distribution and lean on the asymptotic convergence of bootstrapping: since lectures are primarily composed of speaking, we expect the (likely multimodal) distribution to have a peak corresponding to speaking volume. By bootstrapping for the maximum, we expect to obtain values centered around this peak, the talking volume. Then, to obtain the _quietest_ speaking volume, we take the minimum of these maximums. With this parameter, the program processes the video and marks the sections of the lecture where the volume is quieter than the speaking volume, which we assume to be the wasted parts of the lecture. This step is made efficient with Python's numpy.
BLecOpS processes the audio by converting the video into a .wav file through ffmpeg, called from Python. Using ffmpeg, BLecOpS also converts the video into a more suitable encoding, .mpeg, instead of the more commonly used .mp4 encoding. Once our program has determined a suitable volume and has marked the sections of the video, it again uses ffmpeg's API to split the video into segments of speaking portions, removing the sections of the lecture where nothing is being spoken. BLecOpS uses ffmpeg a final time to stitch these clips together into a merged lecture.
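The "minimum of bootstrapped maximums" idea described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the per-window `volumes` array, and the default values for `n_resamples` and `sample_size` are all assumptions made for the example.

```python
import numpy as np

def bootstrap_threshold(volumes, n_resamples=200, sample_size=20, seed=0):
    """Estimate a 'quietest speaking volume' as the minimum of bootstrapped maxima.

    volumes: 1-D array of per-window loudness values (hypothetical input).
    n_resamples, sample_size: illustrative defaults, not the project's values.
    """
    rng = np.random.default_rng(seed)
    volumes = np.asarray(volumes, dtype=float)
    # Draw all resamples at once (with replacement) as an index matrix.
    idx = rng.integers(0, volumes.size, size=(n_resamples, sample_size))
    # One maximum per resample, then the smallest of those maxima.
    maxima = volumes[idx].max(axis=1)
    return maxima.min()

def quiet_mask(volumes, threshold):
    """Boolean mask marking windows quieter than the threshold."""
    return np.asarray(volumes, dtype=float) < threshold
```

Keeping `sample_size` small leaves some spread in the maxima, so the minimum lands near the lower edge of the speaking volumes rather than near the loudest moments of the lecture.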
Using Python Flask, BLecOpS hosts a website on Google Cloud Compute at http://sice.fun that allows videos to be uploaded and optimized. The converted file then shows up in the list of files and can be viewed through the site. Additionally, using Chart.js, we visualize the amount of time saved by BLecOpS.
Challenges I ran into
While bootstrapping has asymptotic convergence, we did not want to work with the peak mode of the sound distribution. Our initial attempt at choosing the sample size led to one that was too large: with large resamples, nearly every maximum lands among the loudest values, so the resulting volume threshold was too high. Instead, we aimed for some variability in the sampling distribution so that taking a minimum volume would be meaningful.
Manipulating the video file, by cropping and concatenating, was made difficult by the encoding of .mp4 files. There seems to be some internal organization that keeps the file accurate as a standalone segment, but not once it is merged. Clips that were meant to be x seconds long displayed as y seconds long in the file explorer; indeed, when played, video players would report that the clip was y seconds long but then cut off at x seconds. Although this was not a problem in isolation, the merging process produced significant amounts of overlap, repetition and odd changes to the playback rate. We resolved this by working with a video encoding at a far lower level, .mpeg, which enabled us to work at the byte level and splice far more accurately.
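As a rough illustration of the cut-and-merge step, the sketch below builds ffmpeg command lines as argument lists. It assumes the ffmpeg command-line tool rather than whichever Python binding the project used, and the helper names and file names are hypothetical. The `concat:` protocol joins files end to end at the file level, which suits .mpeg streams and matches the byte-level splicing described above.

```python
def cut_command(src, start, end, dst):
    """Build an ffmpeg command that extracts [start, end] seconds of src into dst.

    Writing to a .mpeg destination lets ffmpeg pick its default MPEG encoding.
    """
    return ["ffmpeg", "-i", src,
            "-ss", f"{start:.3f}", "-to", f"{end:.3f}", dst]

def concat_command(clips, dst):
    """Build an ffmpeg command that joins .mpeg clips without re-encoding."""
    return ["ffmpeg", "-i", "concat:" + "|".join(clips),
            "-c", "copy", dst]

# The lists can be passed to subprocess.run(cmd, check=True) on a machine
# with ffmpeg installed; nothing is executed here.
cmd = concat_command(["clip0.mpeg", "clip1.mpeg"], "merged.mpeg")
```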
Accomplishments that I'm proud of
We were surprised by the results of applying BLecOpS to J. Chen's MA 261 Calc 3 lectures. In particular, the audio was very smooth even across spliced video segments. Additionally, the length of the lecture video was cut down by over 50%.
What I learned
Leveraging statistical properties of the sound distribution, although difficult at first, gave us a deterministic volume parameter for each and every lecture. There is something to be said for fitting the program to the data: it leads to more consistent results than simply guessing what a speaking volume ought to be.
Additionally, since we were operating at an extremely low level with respect to the video files, it was crucial to be efficient. Processing a single array more than once led to extreme drops in performance, and entire sections of code had to be rewritten to run more quickly. We learned to eke out as much information as possible from each piece of data.
What's next for BLecOpS
Certain parts of the code were done purely by heuristic rather than in a principled way. To remove parts of the video that were very short (less than a second), owing to, say, a clap or a desk being hit, we mandated that clips had to be at least a second long. On that note, we also required a gap between segments to be at least three seconds long before it was cropped out, to avoid choppiness between sentences. It would be interesting to explore whether these constants can be derived, or at least fine-tuned.
At the beginning of the project, there was some interest in using sound and signal processing to make parts of BLecOpS more efficient. In particular, we considered using the fast Fourier transform and sound filters to uniquely identify the frequency of the lecturer's voice. This could then be used to identify a minimum speaking volume deterministically or, more ambitiously, to remove all non-speaking parts of the lecture. Due to time constraints, this was out of the scope of the project, but it may be worth looking into in the future.
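The two heuristics above could look something like the following sketch. It assumes speaking segments are given as `(start, end)` pairs in seconds, and that short gaps are merged before short clips are dropped; that ordering, like the function name `tidy_segments`, is an assumption rather than something taken from the project.

```python
def tidy_segments(segments, min_clip=1.0, min_gap=3.0):
    """Apply the two heuristics to a list of (start, end) speaking segments.

    min_clip: clips shorter than this many seconds are dropped.
    min_gap: silences shorter than this many seconds are kept, not cropped.
    """
    merged = []
    for start, end in sorted(segments):
        # A gap shorter than min_gap is not worth cutting: extend the last clip.
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Drop isolated blips such as a clap or a desk being hit.
    return [(s, e) for s, e in merged if e - s >= min_clip]

clips = tidy_segments([(0.0, 5.0), (6.0, 9.0), (20.0, 20.3), (30.0, 34.0)])
```

Here the one-second pause between the first two segments is kept, while the 0.3-second blip at the 20-second mark is discarded. Exposing `min_clip` and `min_gap` as parameters makes it easy to experiment with other values.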