Today, anyone can reach more people than ever before: platforms such as YouTube and other social media give millions of people a stage of their own. Speaking convincingly, meaningfully, and effectively has therefore never been more important. Yet most of us have had little practice giving speeches, and even less in understanding how our speeches are perceived by an audience. Playing a recording back to yourself gives a rough idea of how you are doing, but an effective speech requires precise intonation, emotion, diction, and clarity. We created Speech++ as a tool that provides an objective overview of all of this information.
What it does
Speech++ is a tool that helps users improve their speeches by understanding them on a deeper and more objective level. After reaching the site, the user uploads an audio file of their speech. Once our back-end has taken a short time to process the audio, the user can view the full transcript of their speech, split into automatically detected phrases. Each phrase comes with an emotional analysis based on the speaker's intonation, a sentiment analysis based on word choice, and additional information, all reported on a scale from -100 to 100, where a higher number means happier emotion or happier word choice. The user can also see their speech visualized in a bar graph and a line graph, both built from the sentiment analysis and emotional analysis scores.
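As a concrete sketch of the scale: if a model emits a positive-class probability between 0 and 1, it can be mapped onto the -100 to 100 range as shown below. The `to_score` helper is illustrative, not our actual code:

```python
def to_score(p_positive: float) -> int:
    """Map a model's positive-class probability (0.0..1.0) onto the
    -100..100 scale shown in the UI: -100 is fully negative, 0 is
    neutral, and 100 is fully positive/happy."""
    return round((p_positive - 0.5) * 200)
```

A probability of 0.5 (the model is unsure) maps to a neutral 0, while 0.9 maps to 80.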
How we built it
We built the project with a Python back-end and an HTML front-end. The back-end consists of two recurrent neural networks, one for sentiment analysis and one for emotional analysis (the emotional analysis model was heavily based on a pre-existing model we found online). For the audio input, we needed to split the recording into distinct phrases before feeding it to the machine learning models. Using SciPy, we process .wav files into an array of wave amplitudes. We then mask all data points whose amplitude is below 0.05% of the maximum, since we can safely assume those parts of the wave are silence. We smooth out any noisy maskings and take each remaining block of audio as a phrase. The sentiment analysis model, created with TensorFlow, receives text and returns the sentiment associated with it. The emotional analysis model receives speech in the form of .wav files and returns the emotions associated with it. Once this information is processed, Flask sends it to our front-end website, which was created with Bootstrap's framework. Bootstrap's theme and data visualization features were used to create the line graph and bar graph that appear on the analytics page after users upload their speech.
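The phrase-splitting step can be sketched roughly as follows. The silence threshold (0.05% of peak amplitude) matches the description above, but the function name, smoothing window, and minimum-length parameters are illustrative assumptions, not our production code:

```python
import numpy as np
from scipy.io import wavfile

def split_into_phrases(path, silence_ratio=0.0005, min_gap=0.3, min_len=0.5):
    """Split a .wav file into phrases by masking near-silent samples
    (below silence_ratio * peak amplitude), smoothing the mask, and
    returning (start, end) sample indices of each remaining block."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                       # multi-channel -> mono
        samples = samples.mean(axis=1)
    samples = samples.astype(float)
    loud = np.abs(samples) >= silence_ratio * np.abs(samples).max()
    # Smooth noisy maskings: treat a sample as speech if any loud sample
    # falls within min_gap seconds of it.
    window = max(1, int(min_gap * rate))
    speech = np.convolve(loud.astype(float), np.ones(window), mode="same") > 0
    # Take each contiguous block of speech as a phrase, dropping blocks
    # shorter than min_len seconds.
    padded = np.concatenate(([False], speech, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(int)))
    starts, ends = edges[::2], edges[1::2]
    return [(s, e) for s, e in zip(starts, ends) if (e - s) / rate >= min_len]
```

Each returned (start, end) pair can then be sliced out of the sample array and handed to the emotional analysis model, or transcribed and handed to the sentiment model.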
Challenges we ran into
One major challenge we encountered was our input audio analysis. Originally we wanted to separate speeches by sentence, but our speech-to-text model was not strong enough to do this, so we instead spliced the audio at pauses, which separates the speech by phrasing rather than by sentence. In general, we felt that the accuracy and capability of the speech-to-text tool we used could be improved to make our website more reliable. Another challenge was error in our machine learning models. Although both models had relatively high success rates of above 70%, the cases in which they erred significantly disrupted practical use. By taking a running average over multiple points of a phrase to compute the final sentiment and emotional scores, we were able to partially mitigate this problem.
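The running-average fix described above can be sketched as follows; `smooth_scores` is a hypothetical name and the window size is an assumption:

```python
import numpy as np

def smooth_scores(scores, window=3):
    """Running average over the per-point scores within a phrase, so a
    single misclassified point cannot dominate the phrase's final score."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(scores, dtype=float), kernel, mode="valid")
```

For example, if one point of an otherwise positive phrase is misclassified as strongly negative, averaging pulls that outlier back toward its neighbors instead of letting it flip the whole phrase.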
Accomplishments that we're proud of
We are extremely happy with the final front-end design of our project. We feel our website is extremely user-friendly: for example, the user can easily see how our models perceived each phrase of their speech on the analysis page. Our graphs also show how their emotion and word choice changed throughout the speech, in terms of how positive or negative each was. Supplemented by our bar charts, this data gives the user a very practical sense of how positive or negative the speech will sound to the audience, and whether any changes to intonation or diction are needed. Additionally, we were very happy with the accuracy of our machine learning models in categorizing speech. Our recurrent neural network for sentiment analysis had a success rate of just over 80% on an outside data set, while our recurrent neural network for emotional analysis had a success rate of around 70%. Although these numbers are not perfect, they show that our models clearly work and can produce at the very least a somewhat descriptive report of the user's speech.
What we learned
One thing we learned was how to interface with audio files: using SciPy, .wav files can be converted into an array of samples that our models and filtering code can analyze. We also learned about multi-channel audio and the various speech-to-text tools that are available. Furthermore, we learned about recurrent neural networks and their advantages, especially in sequence prediction problems, and about the specifics of implementing them in both TensorFlow and scikit-learn while making our models. Finally, we learned about additional features of Bootstrap that we had not used before, including its data visualization tools, which we used to make our graphs.
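For instance, loading a file and collapsing multi-channel audio down to mono looks roughly like this (the `load_mono` helper is an illustrative name, not from our codebase):

```python
import numpy as np
from scipy.io import wavfile

def load_mono(path):
    """Read a .wav file and return (sample_rate, mono_samples).
    SciPy returns a 2-D array of shape (n_samples, n_channels) for
    multi-channel audio; averaging across channels yields a mono track."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)
    return rate, samples
```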
What's next for Speech++
Now that we have built the foundation for Speech++, we see a lot of potential for improving this tool. We could add more analytical features, such as confidence measurement and grammar checking, and we could expand on our current ones. Right now, we rate sentiment and emotion only on a scale from -100 to 100, corresponding to negative and positive connotation. We could add more helpful suggestions and observations, such as reporting the actual emotions the machine learning model detects, instead of a purely numerical analysis.