In the era of analytics, data explosion and IT 2.0, cameras, sensors and photoelectric barriers have been deployed en masse to collect the most powerful currency of the 21st century: data. One specific type of information, though, has been especially neglected, and that is auditory information!

Inspired by Infineon, one of Hackaburg 2022's sponsors, my friends and I started discussing the nearly unlimited source of information that is sound. After all, depending on the sample rate and number of channels of the file, one second of audio can be anything from a meaningless bleep to billions of sine waves intermeshed together, a gold mine for anyone willing to invest the time to decipher their meaning.

From the beginning of the hackathon, our goal was to demonstrate how rich auditory information really is, and we believed monotonous vehicle noises were perfect for that. Even before the hackathon, we had a clear plan, and even some proof, of what sort of information could be extracted from something as innocent as a car driving by:

  • The speed and acceleration of the vehicle (derivatives of the loudness contour)
  • The shape and size of the vehicle (especially when the audio is recorded from inside)
  • The engine type of the vehicle (the largest noise source)
  • The type of road and weather (the second-largest noise source: tire impact)
  • The position relative to the microphone and, given enough stationary microphones and a long enough timeframe, complete echolocation of the environment

We attempted to implement every single one of the aforementioned points, some of them extremely successfully: classifying the shape of a car with over 93% accuracy on validation data, classifying the state of the environment, estimating vehicle speed and, most importantly, classifying different types of engines.

But how?

Complete life cycle of an audio file

First and foremost comes the preprocessing of our data. Our team favored quantity over quality when collecting data, knowing we could leave the prettifying to post-processing steps. This led to surprisingly good results, thanks in part to the 40 GB of relevant data we managed to work through during the hackathon.

Some of our data began as long audio recordings or audio files procured from YouTube videos. This data then went onto the chopping block, where it was sliced into 3-second parts. This length proved optimal for our results and made collecting data a little easier (compared with using 10-second samples). Anything shorter than 3 s had silence prepended so as not to waste useful information, and the 3-second snippets were then sent further down the cleaning pipeline. The pipeline included synchronizing the sample rates and channel counts between samples, as well as applying random time shifts to randomize the "true beginning" of each sample and help the model generalize faster:

def time_shift(self, shift_limit: float = 0.4) -> None:
    # shift by a random fraction (up to shift_limit) of the signal length;
    # `random` here is random.random from the standard library
    _, sig_len = self.signal.shape
    shift_amt = int(random() * shift_limit * sig_len)
    self.signal = self.signal.roll(shift_amt, dims=1)  # roll along the time axis, wrapping around from the start
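The padding and channel-synchronization steps mentioned above can be sketched in the same style. The function names, the 44.1 kHz sample rate and the defaults here are our illustrative assumptions, not the exact hackathon code; sample-rate synchronization itself would additionally use something like torchaudio.transforms.Resample:

```python
import torch

TARGET_SR = 44100            # assumed common sample rate
TARGET_LEN = 3 * TARGET_SR   # 3-second snippets

def pad_or_trim(signal: torch.Tensor, target_len: int = TARGET_LEN) -> torch.Tensor:
    """Prepend silence to short clips, trim long ones to target_len samples.
    signal has shape (channels, samples)."""
    channels, sig_len = signal.shape
    if sig_len < target_len:
        silence = torch.zeros(channels, target_len - sig_len)
        return torch.cat([silence, signal], dim=1)  # silence goes in front
    return signal[:, :target_len]

def to_stereo(signal: torch.Tensor) -> torch.Tensor:
    """Duplicate a mono channel so every sample ends up with two channels."""
    return signal if signal.shape[0] == 2 else signal.repeat(2, 1)
```

With every snippet forced to the same shape, the whole dataset can be stacked into uniform batches later on.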

This was only the first of the augmentations we applied before the actual training began, as we firmly believe the success of a machine learning algorithm depends mostly on the data and how it is presented. Before augmenting further, though, we switched from representing the data as audio to representing it as a picture, namely a Mel spectrogram. From experience, we knew that the Mel spectrogram is the most suitable way of feeding the information into our convolutional neural network and, more importantly, how to augment the spectrogram to achieve great results. Unlike ordinary pictures, a spectrogram cannot be rotated or flipped without its meaning changing completely: rotated by some angle, it would represent an entirely different sound than the original file. A well-known way of augmenting it instead is masking blocks along one of its axes. Masking blocks of data doesn't change the meaning of the spectrogram by much, but enough that it not only improves how quickly the algorithm generalizes, but also multiplies the effective size of a medium-sized dataset by artificially adding variance to every single data point.

Vertical blocks represent silence, while horizontal ones represent missing frequencies
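The masking idea can be sketched directly on a spectrogram tensor. torchaudio ships equivalent transforms (FrequencyMasking and TimeMasking), but a minimal hand-rolled version makes the mechanics explicit; the function name and defaults below are illustrative, not our exact training configuration:

```python
import torch

def spectro_augment(spec: torch.Tensor, max_mask_pct: float = 0.1,
                    n_freq_masks: int = 1, n_time_masks: int = 1) -> torch.Tensor:
    """Mask random horizontal (frequency) and vertical (time) bands of a
    spectrogram of shape (channels, n_mels, time_steps), filling them with
    the spectrogram's mean so the masked region looks like 'nothing'."""
    _, n_mels, n_steps = spec.shape
    mask_value = spec.mean()
    aug = spec.clone()
    for _ in range(n_freq_masks):
        width = int(torch.randint(0, int(max_mask_pct * n_mels) + 1, (1,)))
        start = int(torch.randint(0, n_mels - width + 1, (1,)))
        aug[:, start:start + width, :] = mask_value   # horizontal band: frequencies missing
    for _ in range(n_time_masks):
        width = int(torch.randint(0, int(max_mask_pct * n_steps) + 1, (1,)))
        start = int(torch.randint(0, n_steps - width + 1, (1,)))
        aug[:, :, start:start + width] = mask_value   # vertical band: a slice of silence
    return aug
```

Because the masks are drawn fresh on every pass, the same underlying clip presents slightly differently each epoch, which is where the artificial variance comes from.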

While we knew that further augmentations were available (such as pitch shifting or speeding up/slowing down the playback), extensive testing showed that modifying the pitch or playback speed hurt results on real-life data more than the added diversity was worth.

This leads us directly to the training part, where data points were submitted in batches of 16 and normalized against each other to make their unique traits stand out. Each convolutional layer then filters the image in its own specific way (reducing width and height, increasing depth) across four layers in total. The output of the final layer is pooled, flattened and fed to a linear layer, which is ultimately responsible for the prediction.

Originally, we wanted to train the model locally on our own hardware, but after starting the process we realized our hardware could not deliver results in a timely manner; after two hours of running the job, we had to abort. So we went searching for a solution that would speed up the training process and decided on a virtual machine hosted on Google Cloud Platform, specifically a machine type optimized for machine learning. The chosen instance ran 8 virtual CPUs and 32 GB of RAM on the AMD Milan platform, which is designed for high-performance computing, simulations and machine learning. This cut training time down to about ten minutes.

# Collect the convolution blocks in a list, then wrap them at the end
conv_layers = []

# First Convolution Block with ReLU and Batch Norm.
# Use Kaiming initialization
self.conv1 = nn.Conv2d(2, 8, (5, 5), (2, 2), (2, 2))
self.relu1 = nn.ReLU()
self.bn1 = nn.BatchNorm2d(8)
nn.init.kaiming_normal_(self.conv1.weight, a=0.1)
conv_layers += [self.conv1, self.relu1, self.bn1]

# Second Convolution Block
self.conv2 = nn.Conv2d(8, 16, (3, 3), (2, 2), (1, 1))
self.relu2 = nn.ReLU()
self.bn2 = nn.BatchNorm2d(16)
nn.init.kaiming_normal_(self.conv2.weight, a=0.1)
conv_layers += [self.conv2, self.relu2, self.bn2]

# Third Convolution Block
self.conv3 = nn.Conv2d(16, 32, (3, 3), (2, 2), (1, 1))
self.relu3 = nn.ReLU()
self.bn3 = nn.BatchNorm2d(32)
nn.init.kaiming_normal_(self.conv3.weight, a=0.1)
conv_layers += [self.conv3, self.relu3, self.bn3]

# Fourth Convolution Block
self.conv4 = nn.Conv2d(32, 64, (3, 3), (2, 2), (1, 1))
self.relu4 = nn.ReLU()
self.bn4 = nn.BatchNorm2d(64)
nn.init.kaiming_normal_(self.conv4.weight, a=0.1)
conv_layers += [self.conv4, self.relu4, self.bn4]

# Linear Classifier
self.ap = nn.AdaptiveAvgPool2d(output_size=1)
self.lin = nn.Linear(in_features=64, out_features=9)

# Wrap the Convolutional Blocks
self.conv = nn.Sequential(*conv_layers)
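To show how those layers chain together at inference time, here is the architecture condensed into one self-contained module with the forward pass the text describes. The class name is our placeholder and the per-block setup is abbreviated into a loop, but the layer shapes mirror the code above (first block 5×5 kernel, later blocks 3×3, all with stride 2):

```python
import torch
import torch.nn as nn

class AudioClassifier(nn.Module):
    """Sketch of the four-block CNN above (class name assumed)."""
    def __init__(self, n_classes: int = 9):
        super().__init__()
        conv_layers = []
        channels = [2, 8, 16, 32, 64]
        for i in range(4):
            kernel, pad = ((5, 5), (2, 2)) if i == 0 else ((3, 3), (1, 1))
            conv = nn.Conv2d(channels[i], channels[i + 1], kernel, (2, 2), pad)
            nn.init.kaiming_normal_(conv.weight, a=0.1)
            conv_layers += [conv, nn.ReLU(), nn.BatchNorm2d(channels[i + 1])]
        self.conv = nn.Sequential(*conv_layers)
        self.ap = nn.AdaptiveAvgPool2d(output_size=1)
        self.lin = nn.Linear(in_features=64, out_features=n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of 2-channel Mel spectrograms, shape (N, 2, n_mels, time)
        x = self.conv(x)             # four conv blocks -> (N, 64, h, w)
        x = self.ap(x)               # average pool     -> (N, 64, 1, 1)
        x = x.view(x.shape[0], -1)   # flatten          -> (N, 64)
        return self.lin(x)           # logits for the output classes
```

Each stride-2 block halves the spectrogram's width and height while doubling its depth, which is exactly the "width and height reductions, depth increase" described earlier.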

This now has to happen over and over again. A training function is responsible for that: the number of epochs is set, and the training data is classified again and again until we feel the CNN is starting to converge. Raw accuracy is a deceptively simple metric, since we know how easy it is to pay too much attention to success on the training data and end up overfitting. Several of our models reached a 95% success rate on training data but failed to live up to expectations on the validation set or in real life. Optimal results without overfitting seemed to cap out at 92% accuracy, which we are still very proud of, especially knowing that none of our classification models fell below 90% on validation data.
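A training function of the kind described could look roughly like this. It is a generic sketch, not our exact hackathon code: the key point it illustrates is optimizing cross-entropy on the training set while judging progress only by validation accuracy, the guard against overfitting discussed above.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs: int = 20, lr: float = 1e-3) -> float:
    """Train for a fixed number of epochs; return the best validation accuracy."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    best_val_acc = 0.0
    for _ in range(epochs):
        model.train()
        for specs, labels in train_loader:       # batches of spectrograms
            opt.zero_grad()
            loss = loss_fn(model(specs), labels)
            loss.backward()
            opt.step()
        # evaluate on held-out data only; training accuracy is not tracked
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for specs, labels in val_loader:
                preds = model(specs).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        best_val_acc = max(best_val_acc, correct / max(total, 1))
    return best_val_acc
```

Reporting the best validation score, rather than the final training score, is what keeps the 92% figure honest.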

Visualisation of the code-block above

Surely it wasn't that easy?

While we did have a ton of expertise in machine learning and data modelling, the percentages don't tell the whole story! Designing the frontend and visualising the data was just as much work, if not more. We were also absolutely shocked to hear that no data would be provided to us, which ended up being both the most fun and the most challenging part of the challenge. Gathering enough data to train a machine learning model is possible, but ensuring its quality and variety turned out to be anything but trivial. The constant wind and rain outside added unwanted noise to all of our manually collected data, and manual labelling turned out to be impossible in cases where we ourselves couldn't tell what kind of engine was making the sound.

The process of getting the data into our trained model in order to produce results wasn't straightforward either. We wanted to automate every single step of the way, so that one would only have to record audio with a microphone and receive a result on the frontend a few seconds later.

Our automated data pipeline starts with a Raspberry Pi onto which we soldered a microphone; this is the entry point for all of our manually collected data. We set up the Raspberry Pi with the scripts provided by Infineon and then customized the Python scripts further using pydub to collect even cleaner audio samples. After recording a .wav file on the Raspberry Pi, we looked for a way to automatically send the file over to our machine learning model, preferably simple and maintenance-free, which pointed us towards a cloud solution. Our solution starts with a Python script on the Raspberry Pi that uploads the file to a Google Cloud Storage bucket. This bucket serves as the central storage for all of our raw files and as the jumping-off point for the GCP part of the pipeline. The next step is a Cloud Function, a serverless solution for running small scripts, which can be triggered every time a new file is uploaded to the bucket. The function, written in Python, takes the file, encodes it in Base64 and sends it to an API endpoint in our backend, where all the aforementioned preprocessing, evaluation, logging and, of course, the classification itself take place.
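The shape of such a storage-triggered Cloud Function is sketched below. The entry-point name, bucket contents and backend URL are placeholders, not our real deployment; only the general pattern (download the new object, Base64-encode it, POST it onward) reflects the pipeline described above.

```python
import base64
import json

def build_payload(filename: str, wav_bytes: bytes) -> str:
    """Base64-encode the raw .wav bytes into the JSON body a
    hypothetical /classify endpoint would expect."""
    return json.dumps({
        "filename": filename,
        "audio_b64": base64.b64encode(wav_bytes).decode("ascii"),
    })

def on_file_uploaded(event, context):
    """Background Cloud Function entry point for a Storage trigger.
    `event` carries the bucket and object name of the new upload."""
    from google.cloud import storage  # available in the Cloud Functions runtime
    import requests

    client = storage.Client()
    blob = client.bucket(event["bucket"]).blob(event["name"])
    wav_bytes = blob.download_as_bytes()
    requests.post(
        "https://backend.example.com/classify",  # placeholder endpoint
        data=build_payload(event["name"], wav_bytes),
        headers={"Content-Type": "application/json"},
    )
```

Because the trigger fires on every new object, the Raspberry Pi only ever has to upload a file; everything downstream happens on its own.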

As for the frontend, we wanted to create a visualization dashboard for all the data we can already extract, as well as data we may extract in the future. We therefore chose a scrollable container layout to which new data can easily be added. To create an interactive experience, we implemented a 3D environment displaying models of the cars we identified: the user can inspect a model from every perspective while getting an overview of the data the machine learning produced, and can also listen to the originally recorded audio file. The application itself was implemented in ReactJS, with ThreeJS powering the 3D environment. Combining all the elements was a challenge, but we are happy with how seamlessly they fit together.

We store all the data in our database for future evaluation; more importantly, the database serves all the results to our frontend, where they can be properly visualized.


While we didn't manage to check everything off our TODO list, we're still extremely proud of the final product. The PyTorch-trained models are highly accurate, Google Cloud responds fast enough that the communication already feels professional, and our frontend conveys, in a beautiful manner, most of the information we believe would be interesting or challenging to read out of a simple sound file. We not only got to show our strengths in the fields we specialize in, but also learned a lot, mostly from interacting with the parts built by our teammates.

What's next?

One of the great ideas we would love to implement after the hackathon is detecting faulty or damaged engines, or even predicting how soon they might break. For a task like this, the dataset would be one of the biggest challenges, and we would love to work with a company that could support us further on this journey. Even the current state of our project has quite a few uses. One of the more creative ones we discussed is a city-based loyalty program built on data about the car types currently on the road, using that information to synchronize traffic lights: either to reward people in a rush (say, with a coupon) or, more seriously, to adjust traffic so police and ambulances reach their destinations faster. Our framework has been polished for exactly this kind of detection, and we're already looking forward to working on it in the future.
