Inspiration
In this rapidly advancing AI world, human-computer interaction (HCI) is of extreme importance. With more and more physical and virtual robots being integrated into people's daily lives, a robot that understands emotions could help accomplish everyday tasks, from caring for the elderly to assessing the effectiveness of a marketing campaign. Understanding human emotions paves the way to understanding people's needs better and, ultimately, providing better service.
What it does
It's an audio emotion detector AI that outputs the most likely emotion by combining a speech-to-emotion model with a text-to-emotion model. With the help of XGBoost and a few pre-existing datasets, we were able to create an AI that uses a simple but powerful algorithm to merge the two models' predictions into a single, more accurate emotion.
How we built it
We first implemented a pre-existing speech-to-emotion model that analyzes an audio file and outputs one of the following emotions: Fear, Happy, Neutral, Sad, Anger, or Disgust. The model analyzes pitch and volume and extracts an MFCC (mel-frequency cepstral coefficients) array, a compact numerical representation of the audio, which is then mapped to a specific emotion.

We then implemented the text-to-emotion model, which is based on a Python package called LeXmo. This model receives a text, tokenizes it, scans through the words, and assigns each word to an emotion using the NRC emotion word lexicon. It then outputs a score for each of ten categories: eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) plus two sentiments (positive and negative).

Since the two models don't output the same set of emotions, we standardized the outputs by grouping the text model's extra categories into the speech model's label set. Finally, we created the combined AI model, which uses XGBoost to merge both models' outputs and produce a more accurate final emotion.
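As a concrete sketch of the grouping and combination steps, here is a minimal Python version. The specific category-to-emotion mapping below is an illustrative assumption (our actual grouping may differ), and the feature layout is simplified:

```python
# LeXmo-style output: one score per NRC category (eight emotions + two sentiments).
lexmo_scores = {
    "anger": 0.10, "fear": 0.05, "anticipation": 0.20, "trust": 0.15,
    "surprise": 0.05, "sadness": 0.25, "joy": 0.10, "disgust": 0.10,
    "positive": 0.30, "negative": 0.70,
}

SPEECH_LABELS = ("Fear", "Happy", "Neutral", "Sad", "Anger", "Disgust")

# Hypothetical grouping of LeXmo's categories into the speech model's labels.
GROUPING = {
    "anger": "Anger",
    "fear": "Fear",
    "sadness": "Sad",
    "disgust": "Disgust",
    "joy": "Happy",
    "anticipation": "Happy",  # assumption: folded into the closest emotion
    "trust": "Happy",         # assumption
    "surprise": "Fear",       # assumption
    # "positive" / "negative" are sentiments, not emotions; dropped here.
}

def group_lexmo_scores(scores):
    """Sum LeXmo category scores into the speech model's emotion labels."""
    grouped = {label: 0.0 for label in SPEECH_LABELS}
    for category, value in scores.items():
        target = GROUPING.get(category)
        if target is not None:
            grouped[target] += value
    return grouped

def build_feature_vector(mfcc_features, grouped_text_scores):
    """Concatenate speech features with grouped text scores.

    The combined model (XGBoost, in our case) is trained on vectors like this
    so it can weigh both models' evidence when predicting the final emotion.
    """
    return list(mfcc_features) + [grouped_text_scores[label] for label in SPEECH_LABELS]
```

In practice the MFCC features would come from the speech model's audio analysis, and the resulting vectors (with ground-truth emotion labels) would be fed to an XGBoost classifier.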
In addition, the front end of this project was implemented using a simple, user-friendly React package that records live audio and lets users download the recording to their computer.
Challenges we ran into
Most of us were unfamiliar with the basics of machine learning, so we had to learn many fundamental AI concepts in a short period of time. Furthermore, we ran into technical issues while setting up the environment; for example, we had trouble installing the appropriate packages to run our Flask server.
In addition, it was difficult to find appropriate datasets that we could use to effectively train our models. We needed data that had a variety of sentences spoken in multiple tones.
Another important challenge we faced was sending the recorded audio file from the front end to the back end. The front end would save the recording into a Blob, an entity containing multimedia data. However, we had trouble decoding the Blob back into an audio file once it reached the back end. This is still a feature we are working to implement.
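For reference, one way a Flask back end can receive such a recording is as an ordinary multipart file upload (the front end would wrap the Blob in a FormData object). This is only a sketch of that approach, not our final implementation; the route and field names are hypothetical:

```python
# Minimal Flask sketch: accept the recorded audio Blob as a multipart upload.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload_audio():
    # The Blob arrives as a regular file when sent via FormData from the browser.
    file = request.files.get("audio")
    if file is None:
        return jsonify(error="no audio file provided"), 400
    # Persist the upload, then the emotion pipeline can analyze the saved file.
    file.save("recording.wav")
    return jsonify(status="received")
```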
Accomplishments that we're proud of
Over the course of 24 hours, we are very proud to have learned the basics of machine learning, as well as how to use popular ML tools such as XGBoost. We are happy with our idea of constructing an AI model from two pre-existing models. Although not perfect, our current model can output the emotion behind an audio file by analyzing both the sound of the recording and the words that were spoken. We believe this is a huge accomplishment.
What we learned
As previously mentioned, we learned the fundamentals of machine learning and a few of its tools, such as XGBoost. We also used this project to polish and expand our React skills, and we learned about API calls between a client and its server.
What's next for Audio Emotion Detector
There are many improvements that could be made to the project. In the future, we would definitely like to train our model on more datasets with a greater variety of emotions to improve accuracy. We would also like to turn our AI into a multi-label classification model instead of the multi-class model we have now; this would help detect the multiple emotions someone might feel at once. Furthermore, we want to make the front end communicate with the back end more efficiently.
Built With
- ai
- flask
- javascript
- npm
- python
- react
- xgboost
