Emotion Classification of Voice

Inspiration

We have noticed that many people who seem to be living normal lives are hiding serious problems from others. Their grief often changes how their speech sounds, but not everyone can pick up on another person's emotions from voice alone. That gap can strain relationships, so to promote a friendlier community, we decided to develop a machine learning model that recognizes people's emotions from their voice.

What it does

Our model predicts an individual's emotion from the sound of their voice, classifying it based on tone, speed, and intonation.

How we built it

We started by compiling our training data: 50 different words and phrases, each spoken in 7 emotions. We loaded the files into a Jupyter Notebook and used Librosa to convert each audio file into an array of sound frequencies, then stored the arrays in a pandas DataFrame alongside the emotion label for each recording. We ran drop_duplicates() to remove duplicate rows and called the DataFrame's info() method to confirm that no values were missing.

Once we confirmed the data was clean, we ran train_test_split() to keep 75% of the data for training and hold out 25% for testing. We then built a neural network in TensorFlow with Conv1D, MaxPooling1D, Dropout, Flatten, and Dense layers. Finally, we compiled the model and fitted it on the training data, which gave us the model's accuracy.
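A rough sketch of the split and the network might look like the following. The layer types and the 75/25 split come from our description above; everything else (filter count, kernel size, dropout rate, optimizer, epoch count, the 40-feature input, and the random placeholder data) is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models


def build_model(n_features, n_classes):
    """Small 1D CNN using the layer types named above (sizes are illustrative)."""
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # labels are integer class ids
        metrics=["accuracy"],
    )
    return model


# Placeholder data: 350 rows (50 phrases x 7 emotions) of random noise,
# so the accuracy it produces is meaningless.
rng = np.random.default_rng(0)
X = rng.normal(size=(350, 40)).astype("float32")
y = np.repeat(np.arange(7), 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Conv1D expects input shaped (samples, steps, channels), so add a channel axis.
X_train = X_train[..., np.newaxis]
X_test = X_test[..., np.newaxis]

model = build_model(n_features=40, n_classes=7)
history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=2, batch_size=32, verbose=0,
)
```

After fitting, history.history holds the per-epoch training and validation accuracy, which is where a figure like ours would be read from.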

Challenges we ran into

We initially tried classical scikit-learn classifiers: RandomForestClassifier, LogisticRegression, and VotingClassifier. Validation accuracy was low for all three: 52.8% for RandomForestClassifier, 27.7% for LogisticRegression, and 48.3% for VotingClassifier. We used Matplotlib to plot the validation accuracy of RandomForestClassifier and look for its maximum, but the highest value on the plot was 52%. We used GridSearchCV to search for the best LogisticRegression model, but its validation accuracy was still 27.7%. VotingClassifier never exceeded 48.3%, regardless of which models it combined.
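For reference, a baseline comparison of this kind could be set up along these lines. It is only a sketch: the data is synthetic (generated with make_classification), and the parameter grid, estimator settings, and soft voting are assumptions, so the scores it prints will not match the numbers reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 7-class stand-in for the audio feature table.
X, y = make_classification(
    n_samples=350, n_features=40, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Grid search over the regularization strength C (an illustrative grid).
search = GridSearchCV(
    LogisticRegression(max_iter=2000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
logreg = search.best_estimator_

# Soft voting averages the predicted class probabilities of its members.
voter = VotingClassifier(
    estimators=[("rf", forest), ("lr", logreg)], voting="soft"
)

scores = {}
for name, clf in [
    ("random_forest", forest),
    ("logistic_regression", logreg),
    ("voting", voter),
]:
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_val, y_val)  # accuracy on the held-out split

print(scores)
```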

Accomplishments that we're proud of

Our neural network achieved a validation accuracy of 81.61%, meaning it correctly classified 81.61% of the recordings it had never seen before (those outside our training data).

What we learned

We learned that layers like Conv1D, MaxPooling1D, Dropout, Flatten, and Dense are very helpful when training audio models. Conv1D is effective at capturing patterns in audio waveforms and frequencies, the latter of which we included in our DataFrame. MaxPooling1D reduces dimensionality so that only the most important features are kept. Dropout helps reduce overfitting, and Flatten reshapes the pooled output into a single vector that the Dense layers can classify. In our experience, deep learning was important for audio classification: it clearly outperformed the classical models we tried.
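To make the dimensionality reduction concrete, here is a small sketch of how the sequence length changes through Conv1D, MaxPooling1D, and Flatten. The formulas are the standard ones for "valid" and "same" padding; the input length of 40, kernel size of 5, pool size of 2, and 64 filters are hypothetical values, not our actual settings.

```python
import math


def conv1d_length(length, kernel_size, stride=1, padding="valid"):
    """Output length of a 1D convolution for 'valid' or 'same' padding."""
    if padding == "same":
        return math.ceil(length / stride)
    return (length - kernel_size) // stride + 1


def maxpool1d_length(length, pool_size, stride=None):
    """Output length of 1D max pooling; the stride defaults to the pool size."""
    stride = pool_size if stride is None else stride
    return (length - pool_size) // stride + 1


steps, filters = 40, 64                                   # hypothetical sizes
after_conv = conv1d_length(steps, kernel_size=5)          # (40 - 5) // 1 + 1
after_pool = maxpool1d_length(after_conv, pool_size=2)    # (36 - 2) // 2 + 1
flattened = after_pool * filters                          # Flatten: steps x filters

print(after_conv, after_pool, flattened)  # 36 18 1152
```

Pooling halves the number of time steps here, which is why the Dense layer after Flatten sees 1152 inputs instead of 2304.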

What's next for Emotion Classification of Voice

We can develop our model into a user-driven app that lets people record their voices and classifies their emotion, so that the app can respond accordingly.

Built With

jupyter
librosa
matplotlib
python
scikit-learn
tensorflow

Updates

Atharva Berde started this project — Apr 28, 2024 04:59 PM EDT
