Inspiration
We were inspired to build this to make automatic closed captioning more accessible to people who are hard of hearing. Although closed captioning exists on sites like YouTube, other popular tools like Discord and Skype don't offer it, and many websites with video content have no captions at all, so we wanted to create something that could help address that gap.
What it does
AutoCC captures all audio being output by the system (conference calls, Google Hangouts, YouTube videos, Netflix, and so on) and produces automatic closed captions for it. The captions are displayed in a black box overlaid at the bottom of the screen.
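For illustration, here's a minimal sketch of what that kind of overlay could look like in PyGame; the window size, position, font, and colors are assumptions, not AutoCC's exact settings.

```python
# Hypothetical caption overlay: a borderless black strip near the bottom of the screen.
import os
import pygame

os.environ["SDL_VIDEO_WINDOW_POS"] = "0,1000"   # assumed position near the bottom of a 1080p display
pygame.init()
screen = pygame.display.set_mode((1920, 80), pygame.NOFRAME)  # borderless black strip
font = pygame.font.SysFont("arial", 32)

def show_caption(text):
    """Draw white caption text centered on a black background."""
    screen.fill((0, 0, 0))
    rendered = font.render(text, True, (255, 255, 255))
    screen.blit(rendered, rendered.get_rect(center=screen.get_rect().center))
    pygame.display.flip()

show_caption("Hello, this is what a caption would look like.")
```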
How we built it
We used PyAudio (and a fair bit of programming chicanery) to record audio directly from the system rather than from the microphone, and we built our textbox with PyGame. To transcribe the audio while losing as little of it as possible, one thread records audio in short segments and pushes them onto a queue, while a second thread running simultaneously pops the oldest segment off the queue, transcribes it with Google Cloud Speech-to-Text, and feeds the resulting text to the textbox. This method introduces some lag between what's actually being said and what the viewer is reading, but the immense upside over other approaches is that very little of the audio is lost, so the user gets as much of it as possible.
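A rough sketch of that producer/consumer setup is below, assuming the Google Cloud Speech-to-Text client library; the segment length, sample rate, and input device index are illustrative guesses rather than our exact values.

```python
# Two-thread pipeline: one thread records short segments of system audio,
# the other transcribes them in order so that little audio is lost.
import queue
import threading

import pyaudio
from google.cloud import speech

RATE = 16000          # assumed sample rate, must match the recognizer config
SEGMENT_SECONDS = 3   # short segments keep the caption lag small
CHUNK = 1024

audio_segments = queue.Queue()   # raw PCM segments waiting to be transcribed
caption_texts = queue.Queue()    # transcripts waiting to be drawn by the textbox

def record_loop(device_index):
    """Thread 1: continuously record short segments from the loopback device."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, input_device_index=device_index,
                     frames_per_buffer=CHUNK)
    while True:
        frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SEGMENT_SECONDS))]
        audio_segments.put(b"".join(frames))

def transcribe_loop():
    """Thread 2: pop the oldest segment and send it to Cloud Speech-to-Text."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code="en-US",
    )
    while True:
        segment = audio_segments.get()
        response = client.recognize(config=config,
                                    audio=speech.RecognitionAudio(content=segment))
        for result in response.results:
            caption_texts.put(result.alternatives[0].transcript)

threading.Thread(target=record_loop, args=(2,), daemon=True).start()  # 2 = hypothetical 'Stereo Mix' index
threading.Thread(target=transcribe_loop, daemon=True).start()
```

The main thread would then pop transcripts off `caption_texts` and hand them to the PyGame textbox to draw.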
Challenges we ran into
The two main challenges we ran into were getting audio data from the system and minimizing the lag across all the steps of retrieving the audio, converting it to text, and displaying it for the user. Getting the audio data was actually a lot harder than it seemed: unlike microphone input, there were no good libraries or APIs for capturing audio coming from the system itself, and trying to find a way to record it ate up most of our time.
To capture the audio properly, we needed to run an input buffer on the audio being sent to the speakers. To do this, we routed the speaker output through the computer's 'Stereo Mix' virtual device. After recording for a few seconds, the program saves the segment to a file and sends it to the Google Cloud speech recognition API for transcription. When the transcript comes back, the original audio file is deleted and the text is kept.
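For illustration, here's roughly how the 'Stereo Mix' device can be located with PyAudio's device enumeration; the helper function name is ours, not part of any library.

```python
# Hypothetical helper: scan PyAudio's devices for the 'Stereo Mix' loopback
# input so we can record what the speakers are playing.
import pyaudio

def find_stereo_mix_index():
    pa = pyaudio.PyAudio()
    try:
        for i in range(pa.get_device_count()):
            info = pa.get_device_info_by_index(i)
            if "stereo mix" in info["name"].lower() and info["maxInputChannels"] > 0:
                return i
    finally:
        pa.terminate()
    return None  # Stereo Mix is disabled or unavailable on this machine
```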
After we got that done, we had to figure out how to minimize display lag, or else the user would get disjointed messages.
Accomplishments that we're proud of
Honestly, we're pretty proud of the fact that we didn't give up. Most of the day was spent trying to find a way to get system audio from the computer and to display text on the screen, and the hours spent failing to get past basically the first steps were pretty disheartening.
We're also really proud of learning new skills during the project, such as using PyGame to display text on the screen and using multithreading to run several tasks at once to reduce lag.
What's next for AutoCC
Our next steps would be to optimize our threading even further to remove as much of the lag as possible, and to add translation functionality.