- Recently there has been a surge in people using voice assistants such as Google Home or Alexa for almost everything, from asking about the weather to ordering something on the internet. Though these devices capture what is being said, they fail to comprehend the speaker's real mood. We feel it is important to capture the emotion in a person's speech to provide better services to the end customer.
- For reporters and journalists, it makes posting real-time updates on events much faster.
- For aspiring actors and actresses, our emotion recognition / prediction model can help them improve their acting by giving feedback on how well they express emotions.
What it does
- Understands the mood of the speaker via deep learning classification
- Transcribes voice input as text using Google speech-to-text API
- Summarizes the voice input using sequence-to-sequence modelling and produces a short snippet the user can post on social media, including relevant hashtags and a smiley based on his/her mood
- Chooses an image based on the voice content and mood, including the corresponding tags for images (not implemented due to lack of time)
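The summarization step can be illustrated with a simple frequency-based extractive stand-in. Our actual model is a trained sequence-to-sequence network; the function name and scoring scheme below are purely illustrative assumptions, not our implementation:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Illustrative stand-in for the seq2seq summarizer: score each
    sentence by the corpus frequency of its words, keep the top n."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: -sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
    )
    top = set(scored[:n_sentences])
    # Emit the chosen sentences in their original order
    return " ".join(s for s in sentences if s in top)

summary = extractive_summary("Dogs bark. Cats meow. Dogs bark loudly and dogs run.")
print(summary)  # Dogs bark. Dogs bark loudly and dogs run.
```

A real seq2seq model generates new phrasing rather than selecting sentences, but the input/output contract (long transcript in, short snippet out) is the same.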
How I built it
- Speech recognition - Google speech-to-text API to transcribe the speaker's words
- Emotion recognition from speech - We trained our own deep learning model to predict emotions based on the way the speaker sounds
- Text summarization - We trained a sequence-to-sequence model to summarize the spoken text into a couple of sentences. We used the Cotentpool API to get news sources
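To sketch the feature-engineering side of the emotion model: per-frame energy and zero-crossing rate are typical prosodic cues fed to speech-emotion classifiers. The exact features and framing parameters below are assumptions for illustration, not the ones we shipped:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames
    (400 samples / 160 hop = 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def prosodic_features(signal):
    """Per-frame log-energy and zero-crossing rate, summarized into a
    fixed-length vector a classifier can consume."""
    frames = frame_signal(signal)
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

# Example: one second of synthetic 16 kHz audio
rng = np.random.default_rng(0)
feats = prosodic_features(rng.standard_normal(16000))
print(feats.shape)  # (4,)
```

In practice such hand-crafted features are usually combined with MFCCs before being passed to the deep network.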
Challenges I ran into
- There were few labelled datasets available apart from the Berlin emotions dataset and the Ryerson audio-visual dataset. These datasets have limitations such as the number of spoken sentences and the number of distinct speakers, so we had to build our model on the existing data.
- We tried to build our own dataset for emotion recognition, but we were not as good as the actors/actresses from the existing datasets at expressing emotions. We felt our recordings would add more noise than signal, so we left them out rather than risk hurting model accuracy.
- Streaming audio to the web server and processing it there was difficult because of an incorrect byte format used while streaming and reading the chunks
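The byte-format pitfall above can be sketched as follows. Assuming LINEAR16 audio (little-endian int16 PCM, the format Google speech-to-text accepts), network chunks can split a two-byte sample across a boundary, so the decoder must join chunks before interpreting them; the helper name is an illustrative assumption:

```python
import numpy as np

def chunks_to_waveform(chunks):
    """Join raw LINEAR16 (little-endian int16 PCM) chunks into a float
    waveform in [-1, 1]. Joining first matters: a chunk boundary can
    fall in the middle of a sample."""
    buf = b"".join(chunks)
    usable = len(buf) - (len(buf) % 2)          # int16 needs an even byte count
    samples = np.frombuffer(buf[:usable], dtype="<i2")
    return samples.astype(np.float32) / 32768.0

# Simulated stream: two chunks that split a sample across the boundary
pcm = np.array([0, 16384, -16384], dtype="<i2").tobytes()
wave = chunks_to_waveform([pcm[:3], pcm[3:]])
print(wave)  # [ 0.   0.5 -0.5]
```

Decoding each chunk independently would have raised size errors or produced garbage samples, which matches the symptom we hit.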
Accomplishments that I'm proud of
- State-of-the-art accuracy in predicting emotions, much better than we expected after a long struggle with feature engineering
- Some of us started learning natural language processing and were able to summarize a long speech into a condensed snippet.
- We integrated multiple machine learning models, and after one full day of hacking we are very proud of our final project.
- Getting a python Flask web app deployed on a Google Cloud instance for REST API interfacing
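A minimal sketch of the Flask REST interface, assuming a JSON endpoint; the route name, payload shape, and canned response below are illustrative assumptions, not our deployed app:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Placeholder: the real endpoint would run the emotion model on the
    # uploaded audio; here we echo the input with a canned label.
    data = request.get_json(force=True)
    return jsonify({"text": data.get("text", ""), "emotion": "neutral"})

# Exercise the endpoint locally with Flask's built-in test client
client = app.test_client()
resp = client.post("/predict", json={"text": "hello"})
print(resp.get_json())  # {'emotion': 'neutral', 'text': 'hello'}
```

On Google Cloud the same app runs behind a production WSGI server (e.g. gunicorn) rather than Flask's development server.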
What I learned
- Some of us enhanced our natural language processing skills and learnt new machine learning frameworks
- The benefits of collaboration with like-minded people
What's next for Souns
- Get more data and improve the accuracy of the model
- Implement the combination of text and image processing so users can easily post their ideas on social media.