Speech Emotion Recognition for Customer Service

Inspiration

Over the past decade, chatbots have become a staple of customer service, thanks to advances in natural language processing (NLP) and machine learning (ML). While these technologies help bots interpret emotion and context from text, they still struggle with signals that are carried by the voice rather than the words, such as sarcasm. Vocal cues like tone, pitch, and volume are important for creating empathetic interactions. This project was inspired by the need to close that gap and build systems that are not only intelligent but also emotionally aware.

What We Learned

Through this project, we discovered that listening to how people speak reveals much more about their emotions than reading their words alone. Some key takeaways:

Combining audio features with deep learning helps the system recognize emotions more accurately.

Extracting features like tone, pitch, and volume is essential for picking up subtle cues, such as frustration or excitement.

Augmenting the data with variations in speed, pitch, and background noise helps the model handle real-world conditions better.

How We Built It

Data Preparation:

Used four datasets (RAVDESS, CREMA-D, TESS, and SAVEE), which together contain 18,000 audio samples.

Extracted emotion labels and applied data augmentation: time stretching, pitch shifting, and adding background noise.
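The augmentations named above could be sketched roughly as follows. This is a minimal NumPy illustration, not the project's code: the function names, the 20 dB signal-to-noise ratio, and the speed rates are all assumptions. The speed change here is plain resampling, which shifts tempo and pitch together; Librosa's `librosa.effects.time_stretch` and `librosa.effects.pitch_shift` change one without the other and are likely a better fit for a real pipeline.

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 shortens (speeds up) the clip.

    Note: unlike a phase-vocoder time stretch, this also shifts the pitch.
    """
    n_out = int(round(len(signal) / rate))
    old_positions = np.arange(len(signal))
    new_positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_positions, old_positions, signal)


def augment(signal, rng):
    """Return the original clip plus three augmented variants."""
    return [
        signal,
        add_noise(signal, snr_db=20, rng=rng),   # assumed noise level
        change_speed(signal, rate=1.1),          # assumed speed-up
        change_speed(signal, rate=0.9),          # assumed slow-down
    ]


# Usage on a synthetic one-second tone (stand-in for a real recording).
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
variants = augment(tone, np.random.default_rng(0))
```

Each variant keeps the emotion label of the clip it was derived from, so the training set grows without new labeling work.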

Feature Extraction:

Used Librosa to extract MFCCs, chroma, mel spectrograms, pitch, and volume features.

These features capture the tone, stress, and intensity of speech, critical for detecting frustration or satisfaction.
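To make the volume-related features concrete, here is a minimal NumPy sketch of two frame-level measurements: RMS energy (a proxy for loudness) and zero crossing rate. It is an illustration rather than the project's extraction code. The frame and hop sizes are assumptions chosen to match Librosa's defaults (2048 and 512), and the framing is simplified with no padding, so the values will not match `librosa.feature.rms` or `librosa.feature.zero_crossing_rate` exactly.

```python
import numpy as np


def frame_signal(signal, frame_length=2048, hop_length=512):
    """Slice a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    return np.stack([signal[s:s + frame_length] for s in starts])


def rms_energy(frames):
    """Root-mean-square amplitude per frame, a proxy for loudness."""
    return np.sqrt(np.mean(frames ** 2, axis=1))


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    crossings = signs[:, 1:] != signs[:, :-1]
    return np.mean(crossings, axis=1)


def summarize(signal):
    """Average each feature over time to get a fixed-length vector."""
    frames = frame_signal(signal)
    return np.array([rms_energy(frames).mean(),
                     zero_crossing_rate(frames).mean()])


# Two synthetic clips: a quiet low tone and a loud high tone.
sr = 16000
t = np.arange(sr) / sr
quiet_low = 0.1 * np.sin(2 * np.pi * 110 * t)
loud_high = 0.8 * np.sin(2 * np.pi * 880 * t)
print(summarize(quiet_low), summarize(loud_high))
```

The louder, higher tone scores higher on both features. Averaging over frames is one simple way to turn a variable-length clip into a fixed-length input for a classifier.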

Model Development:

Built a convolutional neural network (CNN) for audio emotion recognition using Python and TensorFlow/Keras.

Model architecture:

Convolutional layers detect pitch/volume patterns

Max pooling to downsample the feature maps and dropout for regularization

Fully connected layers + softmax classifier for 8 emotion categories
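The layer list above could be laid out in Keras roughly as follows. This is a sketch under assumptions, not the project's actual model: the 1-D convolutions over a per-clip feature vector, the input length of 162, the filter counts, kernel sizes, and dropout rate are all illustrative choices. Only the overall pattern (convolution, max pooling, dropout, fully connected layers, 8-way softmax) comes from the description.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 162   # assumed length of the per-clip feature vector
NUM_CLASSES = 8      # emotion categories, per the description above


def build_model(num_features=NUM_FEATURES, num_classes=NUM_CLASSES):
    """Small 1-D CNN: two conv/pool stages, dropout, then dense layers."""
    model = keras.Sequential([
        keras.Input(shape=(num_features, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Run a dummy batch through the untrained model to check the shapes.
model = build_model()
dummy_batch = np.zeros((4, NUM_FEATURES, 1), dtype="float32")
probabilities = model.predict(dummy_batch, verbose=0)
```

The softmax output gives one probability per emotion category for each clip, and each row sums to 1.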

Evaluation & Optimization:

Visualized performance with Matplotlib and handled data with Pandas and NumPy.

Tuned hyperparameters and optimized feature selection to improve accuracy.

Challenges

Human speech variability: Different accents, speaking speeds, and background noise made it difficult for the model to consistently recognize emotions.

Class imbalance: Some emotions were underrepresented in the datasets, requiring careful balancing to avoid biased predictions.

Generalization: Ensuring the model works well on new, unseen audio samples, rather than just memorizing the training data, was a key concern.

Feature extraction limitation: We trained the model on five audio features: Zero Crossing Rate, Chroma_STFT, MFCC, RMS (Root Mean Square), and Mel Spectrogram. Many other features could help, but we kept the set small for this project and did not explore which ones were optimal for our dataset.
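For the class imbalance challenge noted above, one common option is to weight each class inversely to its frequency during training. The writeup does not say which balancing method was used, so the sketch below is only an illustration: the function name and the label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical label distribution, for illustration only.
labels = (["angry"] * 400 + ["happy"] * 400
          + ["surprised"] * 100 + ["calm"] * 100)
weights = balanced_class_weights(labels)
```

Here the two rare classes receive a weight of 2.5 and the two common ones 0.625, so errors on underrepresented emotions count more in the loss. In Keras, a dictionary like this, keyed by integer class index rather than label string, can be passed to `model.fit` through its `class_weight` argument.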

Results

Achieved 73% accuracy in detecting emotions.

Demonstrated that integrating acoustic and linguistic features significantly improves emotion recognition over text-only methods.

This project shows how speech emotion recognition can make technology feel more human:

Virtual assistants that really understand you: Bots can pick up on emotions and respond in a more thoughtful, empathetic way.

Better customer service: Systems can sense frustration or satisfaction, so interactions feel less mechanical.

Mental health support: Tools could detect signs of stress or anxiety in someone’s voice.

More natural interactions: Overall, this helps people and machines communicate in a way that feels closer to a real conversation.