Inspiration

We created Stage Hand in hopes of equipping people with the skills to become poised and confident speakers. Upon realizing how widespread the fear of public speaking is, we wanted to help people cope with that fear while building the skills necessary to become better presenters. Stage Hand is our solution: a real-time feedback tool that helps users rapidly improve their speaking while overcoming their fears.

What it does

Stage Hand is a web application that gives users real-time feedback on their speaking pace, expression, and coherence as they record videos of themselves delivering speeches. The goal is to help users pinpoint and improve the weak aspects of their public speaking. As you practice with Stage Hand, it coaches you on how to polish your speeches by tracking and displaying vital statistics such as your average speaking speed, the main emotion your facial expression conveys, and how many filler words you have used. We hope that making this information available in real time will let users learn to adapt while speaking, craft stronger speeches, and become holistically better public speakers.
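As a rough illustration (not the actual Stage Hand code), statistics like speaking speed and filler-word count could be derived from a timed transcript along these lines; the function name and the filler-word set here are our own illustrative assumptions:

```javascript
// Illustrative sketch: compute speaking speed and filler-word count
// from a list of transcribed words and the recording duration.
const FILLERS = new Set(["um", "uh", "like", "so"]);

function speechStats(words, durationSeconds) {
  // words: lowercase tokens returned by a speech-to-text service
  const fillerCount = words.filter((w) => FILLERS.has(w)).length;
  const wordsPerMinute =
    durationSeconds > 0 ? (words.length / durationSeconds) * 60 : 0;
  return { fillerCount, wordsPerMinute };
}
```

A four-word utterance over a minute, for instance, would report a pace of 4 words per minute along with however many of those words were fillers.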

How we built it

The Stage Hand interface was built with the React library, using the MediaDevices API to access the user’s video camera. The data we collect from each recording is sent to Microsoft’s Cognitive Services suite: we use the Bing Speech API for speech-to-text and develop our own custom language and acoustic models through the Custom Speech Service. In addition, frames from various points in the recording are passed to the Microsoft Emotion API, which we use to analyze the speaker’s emotional expression at any given moment. Throughout the recording, live feedback is given to the user so they can continue to adapt and build the skills needed to become an excellent public speaker.
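To sample "various points" of a recording for emotion analysis, one simple approach is to pick evenly spaced timestamps and grab a frame at each. This is a minimal sketch of that idea, not the actual Stage Hand implementation; the function name and interval are assumptions:

```javascript
// Illustrative sketch: choose evenly spaced timestamps (in seconds)
// at which video frames would be captured and sent for emotion analysis.
function sampleTimestamps(durationSeconds, intervalSeconds) {
  const times = [];
  for (let t = 0; t < durationSeconds; t += intervalSeconds) {
    times.push(t);
  }
  return times;
}
```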

Challenges we ran into

The biggest challenge we faced in our final rounds of debugging was a recurring 403 error when calling the Bing Speech API, which powers the core speech-to-text functionality of our program. We raised the issue with a Microsoft mentor and determined that the root cause was on Microsoft’s end. We ended up pinging a Microsoft employee in India who had run into a similar issue recently, and we are awaiting a response. In the meantime, we found a temporary workaround: we were advised to call the REST API instead of the WebSocket API, even though REST only accepts about fifteen seconds of audio per request. This let us build a working product that we can demo, although it is less efficient, and we hope to restore the streaming approach once we hear back.

Another chief issue was recognizing filler words in normal speech, because most speech-to-text APIs automatically filter fillers out, making them nearly impossible to detect. We overcame this by developing custom language and acoustic models with Microsoft’s Custom Speech Service. These models let us detect and count filler words so we can give the user feedback on how to minimize fillers in their speech.
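The fifteen-second limit on the REST workaround means a longer recording has to be split into short segments before transcription. A hedged sketch of that segmenting logic, with an assumed function name and limit, might look like:

```javascript
// Illustrative sketch: split a recording into [start, end] boundaries
// (in seconds), each no longer than the REST API's per-request limit.
function segmentBoundaries(durationSeconds, maxSeconds = 15) {
  const segments = [];
  for (let start = 0; start < durationSeconds; start += maxSeconds) {
    segments.push([start, Math.min(start + maxSeconds, durationSeconds)]);
  }
  return segments;
}
```

Each segment's audio would then be sent as a separate REST request, which is why this path is less efficient than a single streaming WebSocket connection.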

Accomplishments that we're proud of

We are especially proud of our team’s ability to adapt to the variety of issues we came across while building the application. Every time we made a breakthrough, it seemed like we ran into a new issue; for example, just as we were finishing the backbone of the application, we hit problems calling the Bing Speech API. Regardless of the challenges thrown at us, we always found a way to overcome them. In that sense, we are most proud of our team’s resilience.

What we learned

From this experience, we learned how to incorporate many technologies that none of us had previously used. For example, we learned to use Microsoft’s Cognitive Services to perform tasks like detecting emotion and converting speech to text for the internal processing of our application, and we learned how to develop custom language and acoustic models with Microsoft’s Custom Speech Service.

What's next for Stage Hand

Because we ran into difficulty with the WebSocket API, we had trouble opening a tunnel from our client to the server, which bottlenecked our ability to stream the user’s video data directly to Microsoft Cognitive Services. In the future, we hope to overcome these issues and create an app that offers truly live, interactive feedback.
