During hackathons, we are constantly transferring files, but it's usually a boring affair of passing a flash drive around. We swap URLs and websites too, but those are a dreary pain to type over and over. Peer-to-peer messaging would be really fast, but we don't have it set up on our computers, so we're stuck sending plain old emails to each other. Then we thought, "There has got to be another way, specifically, a more entertaining way!"
Our solution? It’s a file transfer system via audio. But we don’t use just any old boring audio. For our data encoding we carefully hand-picked the most unique, unusual, and downright silly sounds we could find. We call it HiFi Messenger!
What it does
Simply put, it’s a peer-to-peer data transfer system that uses different sound effects to encode the data for transmission.
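The core idea can be sketched as a simple two-way mapping between characters and sound-effect labels: the sender plays the sound for each character, and the receiver turns recognized sounds back into text. This is an illustrative sketch only; the label names (`duck`, `boing`, etc.) and function names are hypothetical, not the actual sound set or code.

```typescript
// Hypothetical character-to-sound mapping; the real app uses its own
// hand-picked library of silly sound effects.
const charToSound = new Map<string, string>([
  ["h", "duck"],
  ["i", "boing"],
  ["t", "slide-whistle"],
  ["p", "cowbell"],
]);

// Invert the map for the receiving side.
const soundToChar = new Map(
  [...charToSound].map(([ch, label]) => [label, ch])
);

// Sender: turn text into the sequence of sounds to play.
function encode(message: string): string[] {
  return [...message].map((ch) => {
    const label = charToSound.get(ch);
    if (label === undefined) throw new Error(`no sound for "${ch}"`);
    return label;
  });
}

// Receiver: turn recognized sound labels back into text.
function decode(labels: string[]): string {
  return labels.map((label) => soundToChar.get(label) ?? "?").join("");
}
```

A round trip such as `decode(encode("hip"))` should return the original string, which is the property the whole pipeline has to preserve.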
How we built it
The main component of HiFi Messenger is a machine learning model that converts audio signals into text. We started the design process by evaluating several TensorFlow audio models, but after extensive testing we found that while they may be well suited to speech recognition, they didn't work for sound effects.
From there, we switched gears and started working with images. We modified an npm spectrogram module to stream microphone data from the user and convert it into a visual representation. We chose a spectrogram over a more traditional waveform because it does a better job of distinguishing the pitch of a sound, not just its volume.
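The reason a spectrogram separates sounds that a waveform cannot is that each column of the image is a frequency decomposition of a short frame of audio. A minimal sketch of that decomposition, using a deliberately naive O(N²) DFT rather than the FFT a real module would use, shows how a pure tone shows up as a peak at its frequency bin:

```typescript
// Naive DFT magnitude spectrum of one audio frame, for illustration only.
// A real spectrogram module would use an FFT (or the browser's AnalyserNode).
function dftMagnitudes(frame: number[]): number[] {
  const N = frame.length;
  const mags: number[] = [];
  for (let k = 0; k <= N / 2; k++) {        // bins up to the Nyquist frequency
    let re = 0, im = 0;
    for (let n = 0; n < N; n++) {
      const angle = (-2 * Math.PI * k * n) / N;
      re += frame[n] * Math.cos(angle);
      im += frame[n] * Math.sin(angle);
    }
    mags.push(Math.hypot(re, im));
  }
  return mags;
}

// A 1 kHz sine sampled at 8 kHz: the spectral peak should land at
// bin = freq * N / sampleRate = 1000 * 256 / 8000 = 32.
const N = 256, sampleRate = 8000, freq = 1000;
const frame = Array.from({ length: N }, (_, n) =>
  Math.sin((2 * Math.PI * freq * n) / sampleRate)
);
const mags = dftMagnitudes(frame);
const peakBin = mags.indexOf(Math.max(...mags));
```

Two sounds at the same volume but different pitches produce peaks in different bins, which is exactly the distinction a raw waveform hides.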
Once we knew we could convert audio into image data, we used Google AutoML to build a model that matches a sound to a label. While this was no easy task (see the challenges section), we finally got a model trained and working. Then, based on the sounds the model detected with the highest accuracy, we built a web app using React, TypeScript, and Material UI that converts each sound to an image, then to a label, and finally to the transmitted data.
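The "highest accuracy" step above amounts to picking the best-scoring candidate label per detected sound and discarding anything below a confidence threshold. Here is a hedged sketch of that selection step; the `Prediction` shape and the 0.7 threshold are illustrative assumptions, not the actual AutoML response format or our tuned value:

```typescript
// Hypothetical shape of one candidate label from the detection model.
interface Prediction {
  label: string;
  score: number; // model confidence, 0..1
}

// Keep the single best candidate if it clears the confidence threshold;
// otherwise report no reliable detection for this sound.
function bestLabel(
  candidates: Prediction[],
  threshold = 0.7
): string | null {
  let best: Prediction | null = null;
  for (const p of candidates) {
    if (best === null || p.score > best.score) best = p;
  }
  return best !== null && best.score >= threshold ? best.label : null;
}
```

Returning `null` rather than a low-confidence guess lets the app drop garbled sounds instead of injecting wrong characters into the message.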
Once this was working, we built two main ways to transmit sounds from keystrokes: a web app and a Java program. The web app offers a focused experience for users who just want to occasionally send and receive messages, while the Java program lets users stream everything they type, for example at a meeting or a workshop.
Challenges we ran into
Choosing the right model and gathering the data to train it was an ordeal in and of itself given the time pressure of the hackathon. We knew we needed the machine learning model to convert an image of a spectrogram into a label, but we had a few options for how to do that. The simplest approach from a training perspective would have been to send only one sound wave at a time to the model, but then we would have needed a way on the front end to determine where one sound ended and the next began, and that was simply not feasible. Next we considered multi-label images, but those would have lost the ability to send consecutive letters (for example, "http" would have read as "htp"). That was a compromise we were not willing to make, so we needed an object detection model rather than a recognition model. Unfortunately, this required us to put a bounding box around every individual sound wave, which made gathering the training data one image at a time a very slow and painful process.
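The "http" problem above can be made concrete: a recognition-style model only reports which labels are present, so repeated letters collapse, while object detection returns one bounding box per occurrence, so repeats survive. The data shapes below are illustrative, not the real model outputs:

```typescript
// Detection-style output: one box per sound occurrence, positioned along
// the time (x) axis of the spectrogram. Coordinates are normalized 0..1.
interface Detection {
  label: string;
  xMin: number;
  xMax: number;
}

// Recognition-style output: only the set of labels present in the image
// (a JS Set preserves insertion order but drops duplicates).
function fromLabelSet(labels: Set<string>): string {
  return [...labels].join("");
}

// Detection-style decode: sort boxes left to right, one character per box.
function fromDetections(detections: Detection[]): string {
  return [...detections]
    .sort((a, b) => a.xMin - b.xMin)
    .map((d) => d.label)
    .join("");
}

// "http": the label set cannot represent the second "t"...
const asSet = fromLabelSet(new Set(["h", "t", "t", "p"]));
// ...but four bounding boxes keep every occurrence.
const asBoxes = fromDetections([
  { label: "h", xMin: 0.0, xMax: 0.2 },
  { label: "t", xMin: 0.25, xMax: 0.45 },
  { label: "t", xMin: 0.5, xMax: 0.7 },
  { label: "p", xMin: 0.75, xMax: 0.95 },
]);
```

This is the whole trade-off in miniature: the detection model preserves count and order at the cost of needing a bounding box drawn for every training example.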
To speed this process up, we edited all of our possible audio files into one big clip, then used a screen recording program to record the spectrogram of the entire clip. We then captured an image from the screen recording each time a new sound was played.
We next needed to extract the spectrogram itself from each captured image. Our original plan was to crop them all in Photoshop, but with 250 images that would have taken forever. Instead, we wrote a UiPath sequence to crop, save, and close each image. This let us open about 150 images at once while UiPath cropped them all identically, far faster than we could by hand. Finally, we went through manually, determined the bounding box for each sound, saved the boxes to a CSV, and loaded it into Google AutoML to build our model.
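Generating that CSV is mostly mechanical once the boxes are known. The sketch below emits rows in the short form of the AutoML Vision object detection import format as we understand it (set, Cloud Storage path, label, then normalized corner coordinates with blanks for the two implied corners); the field names and the example path are assumptions worth checking against the current AutoML documentation:

```typescript
// One manually determined bounding box for a sound on a spectrogram image.
interface LabeledBox {
  imagePath: string; // e.g. a gs:// URI to the uploaded spectrogram crop
  label: string;
  xMin: number; // all coordinates normalized to 0..1
  yMin: number;
  xMax: number;
  yMax: number;
}

// Build one CSV row: x_min,y_min,,,x_max,y_max,, — the blank columns stand
// in for the two corners AutoML can infer from an axis-aligned box.
function toCsvRow(box: LabeledBox, set = "TRAIN"): string {
  const { imagePath, label, xMin, yMin, xMax, yMax } = box;
  return [set, imagePath, label, xMin, yMin, "", "", xMax, yMax, "", ""].join(",");
}
```

One row per bounding box, many rows per image, and the whole file uploads in a single shot.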
Accomplishments that we're proud of
Despite the long process of gathering the training data and training the Google AutoML model, we got everything we wanted done, and we're REALLY happy about it!
What we learned
This was our first time building an image detection machine learning model, and our first time combining machine learning with audio, so we learned a huge amount about both. We also learned a lot more about Google AutoML and TensorFlow in general.
What's next for HiFi Messenger
We'll be working on adding more training data to the model so it can support different environments and speaker configurations. We also started a Java app that lets users send data without being restricted to the web app's single textbox, but we ran into difficulties connecting it to TensorFlow. We'd like to get that up and running in the future.