Our main inspiration came from how Bumblebee speaks in the Transformers movies, hence "Bumble-B". In the movies, Bumblebee talks through the car's radio, splicing audio segments together to form sentences.
What it does
We decided to build an application that does exactly that: it pulls audio from many different online sources and splices the pieces together to form a sentence. A user types in a desired phrase, and Bumble-B searches a word index in its database, finds an audio clip for each word, and splices the clips together.
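The lookup-and-splice step can be sketched roughly as follows. This is a minimal illustration, not our production code: the function name splice_phrase is hypothetical, the index is assumed to be a plain dict mapping a lowercase word to the bytes of one WAV clip (a stand-in for the database and cloud storage), and every clip is assumed to share the same sample rate, sample width, and channel count.

```python
import io
import re
import wave


def splice_phrase(phrase, index):
    """Concatenate one indexed WAV clip per word of `phrase` into a single WAV.

    `index` is a hypothetical in-memory stand-in for the word index:
    a dict mapping a lowercase word to the raw bytes of a WAV clip.
    """
    # Lowercase the phrase and keep only letters and apostrophes.
    words = re.findall(r"[a-z']+", phrase.lower())
    if not words:
        raise ValueError("phrase contains no words")

    # Fail early if any word has no clip in the index.
    missing = [w for w in words if w not in index]
    if missing:
        raise KeyError("no clip indexed for: " + ", ".join(missing))

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        for position, word in enumerate(words):
            with wave.open(io.BytesIO(index[word]), "rb") as src:
                if position == 0:
                    # Assumption: every clip shares the first clip's format.
                    dst.setnchannels(src.getnchannels())
                    dst.setsampwidth(src.getsampwidth())
                    dst.setframerate(src.getframerate())
                # Append this word's audio frames to the output.
                dst.writeframes(src.readframes(src.getnframes()))
    return out.getvalue()
```

Calling splice_phrase("Hi, welcome!", index) would return the bytes of a single WAV file containing the "hi" clip followed by the "welcome" clip, provided both words are in the index.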
How we built it
Our main focus was using Google's Speech-to-Text engine to index audio files quickly. Speech-to-Text has an optional parameter, enable_word_time_offsets, which returns the start and end times of each word it recognizes. This let us automate the splitting of large audio files, such as an Obama speech: Speech-to-Text analyzes the recording and reports each word's start and stop times, and simple Python code splits out each individual word and uploads it to Google Cloud Storage. Every word is then indexed in a Django RESTful API so the front end can easily query for and pull the relevant audio snippets.
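The splitting step might look something like the sketch below. It assumes the word timings have already been pulled out of a Speech-to-Text response (requested with enable_word_time_offsets) into simple (word, start_seconds, end_seconds) tuples, and that the source audio is an uncompressed WAV file. The function name split_words is hypothetical, and the upload and indexing steps are left out.

```python
import io
import wave


def split_words(wav_bytes, word_timings):
    """Cut one WAV clip per word out of a longer WAV recording.

    `word_timings` is an iterable of (word, start_seconds, end_seconds)
    tuples, assumed to come from a speech recognizer's word time offsets.
    Returns a list of (lowercase_word, clip_wav_bytes) pairs.
    """
    clips = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for word, start_s, end_s in word_timings:
            # Convert seconds to frame positions, clamped to the file length.
            first = min(int(round(start_s * rate)), total)
            last = min(int(round(end_s * rate)), total)
            src.setpos(first)
            frames = src.readframes(max(last - first, 0))

            # Write the slice out as its own small WAV file in memory.
            out = io.BytesIO()
            with wave.open(out, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            clips.append((word.lower(), out.getvalue()))
    return clips
```

Each returned clip could then be uploaded to storage and registered in the word index under its lowercase word.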
Challenges we ran into
The main challenge, which we didn't foresee, was that Speech-to-Text's word time offsets are fairly imprecise. The start and stop times for each word are rounded to the nearest tenth of a second, which is a long time at normal speaking speed. As a result, Speech-to-Text would often transcribe the words correctly, but the timings caused our app to cut clips that were too long rather than a single clean word. For instance, we indexed the "Hi, welcome to Chili's!" Vine, and while Speech-to-Text named all of the words accurately, the coarse start and stop times made the trimmed "welcome" clip also include "hi" and "to".
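To give a feel for how much audio a tenth-of-a-second grid can add to a clip, here is a small arithmetic sketch. It assumes timestamps are rounded to the nearest 100 ms, as we observed, and the word boundaries in the usage note are made-up numbers for illustration, not measurements from the Chili's clip. Both function names are hypothetical.

```python
def quantize_ms(t_ms, step_ms=100):
    """Round a timestamp in milliseconds to the nearest multiple of step_ms."""
    return ((t_ms + step_ms // 2) // step_ms) * step_ms


def bleed_ms(true_span, step_ms=100):
    """Return (before, after): extra milliseconds of neighboring audio
    captured when a word's true (start, end) is snapped to the grid."""
    start, end = true_span
    q_start = quantize_ms(start, step_ms)
    q_end = quantize_ms(end, step_ms)
    # Only count audio that falls outside the true word boundaries.
    return max(start - q_start, 0), max(q_end - end, 0)
```

For a hypothetical word that truly spans 340 ms to 760 ms, bleed_ms((340, 760)) gives (40, 40): the cut starts 40 ms early and ends 40 ms late, so 80 ms of the neighboring words ends up in a 420 ms clip.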
Accomplishments that we're proud of
While our auto-indexing system doesn't produce great results because of the imprecise start and end times that Speech-to-Text returns, we're proud as a team that we ventured into unknown territory and used a framework none of us had experience with. We're also proud of how quickly and effectively the front-end and back-end frameworks came together.