Inspiration
Having used Zoom nearly every single day for the past half a year, I know exactly what kind of problems one faces when using such an interface. Whether it's asking a question in class without interrupting the professor, forgetting to turn off your video when you temporarily go off screen, or somehow communicating that your microphone is broken without speaking (and praying your professor checks the chat), there are a ton of small optimization issues with modern video conference software. There have been a few attempts to make video conference learning more accessible and enjoyable (Zoom filters, Snapchat Lens Studio, etc.), but these have all been relatively lackluster.
Additionally, tens of thousands of students with disabilities across the nation lack access to an online learning platform that helps them interact with others in the classroom. I wanted to build something that makes online video learning more streamlined, accessible, and fun to use. That's how the idea for EduPose was born!
What it does
EduPose applies transfer learning to a pretrained convolutional image classifier (MobileNet) to classify each individual frame of a webcam video. If the frame matches one of the gestures or poses the model was trained on, a corresponding action runs on a locally hosted JavaScript webcam view in the browser. For example, if I hold a fist near my mouth, a text bubble pops up saying "One sec please, I'm having mic issues."
Then, using virtual camera software, users can select the locally hosted browser webcam from my program as their Zoom camera. This lets them modify what Zoom displays, so the actions triggered by their gestures are visible to everyone in the video conference.
Now, I can toss my hands up in celebration at the end of class, and fireworks will explode in my virtual camera, for all in the video conference call to behold. How neat!
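The gesture-to-action step described above can be sketched roughly as follows. This is a minimal illustration, not EduPose's actual code: the label names (`fist_near_mouth`, `hands_up`), the `{ label, confidence }` prediction shape, and the confidence and frame-count thresholds are all assumptions. Requiring several matching frames in a row is one possible way to keep a single misclassified frame from triggering a popup.

```javascript
// Hypothetical gesture labels mapped to overlay actions (names are illustrative).
const GESTURE_ACTIONS = {
  fist_near_mouth: { type: "text", value: "One sec please, I'm having mic issues." },
  hands_up: { type: "animation", value: "fireworks" },
};

// Returns a function that takes one per-frame prediction and returns an action
// only once the same known gesture has been seen for `minFrames` frames in a row.
function createGestureTrigger({ minConfidence = 0.9, minFrames = 5 } = {}) {
  let lastLabel = null;
  let streak = 0;

  return function onPrediction({ label, confidence }) {
    const known = Object.prototype.hasOwnProperty.call(GESTURE_ACTIONS, label);
    if (!known || confidence < minConfidence) {
      // Unknown or low-confidence frame: reset the streak.
      lastLabel = null;
      streak = 0;
      return null;
    }
    streak = label === lastLabel ? streak + 1 : 1;
    lastLabel = label;
    // Fire exactly once, on the frame where the streak reaches the threshold.
    return streak === minFrames ? GESTURE_ACTIONS[label] : null;
  };
}
```

In the browser, the returned function would be called once per classified frame with the classifier's top result, and any non-null action would be drawn over the webcam view.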
How I built it
The machine learning model was trained through transfer learning in TensorFlow.js on a dataset of a couple thousand images I took with my webcam. After testing and validating the model, I created a JavaScript-based interface to interact with the model and locally host a webcam view in my browser. The interface shows specific text popups, images, and GIFs when I trigger them with the corresponding gesture. Finally, I used virtual webcam software to expose my browser-based webcam as an actual webcam for Zoom conference calls.
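As a rough illustration of the validation step, per-gesture accuracy on held-out images could be computed with a helper like the one below. This is only a sketch: the `{ expected, predicted }` record shape and the function name are assumptions, and it is not necessarily how the model was actually validated.

```javascript
// Per-class accuracy from held-out validation predictions.
// `results` is an array of { expected, predicted } label pairs (hypothetical shape).
function perClassAccuracy(results) {
  const stats = new Map();
  for (const { expected, predicted } of results) {
    if (!stats.has(expected)) stats.set(expected, { correct: 0, total: 0 });
    const entry = stats.get(expected);
    entry.total += 1;
    if (expected === predicted) entry.correct += 1;
  }
  // Convert the counts into a plain { label: accuracy } object.
  const accuracy = {};
  for (const [label, { correct, total }] of stats) {
    accuracy[label] = correct / total;
  }
  return accuracy;
}
```

Looking at accuracy per gesture, rather than one overall number, makes it easier to spot a single pose that the classifier confuses with another.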
Challenges I ran into
One of the hardest challenges was getting my machine learning model to work with a JavaScript-based camera while keeping the image classification accurate and without degrading the resolution of the displayed video too much. I trained the model on a low-resolution image dataset (this saved a lot of time and computing power), but as a result, it didn't perform well on higher-resolution webcam video. I ended up creating two separate instances of the webcam feed: one to be classified by the image classifier, and the other to be displayed to the virtual camera. Additionally, I ran into microphone and audio issues when connecting the virtual camera to my browser-based camera, which I spent a few hours debugging and finally (thankfully) resolved.
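One way to feed the classifier a low-resolution copy while leaving the displayed video untouched is to center-crop and downscale each frame into a small offscreen canvas. The sketch below is a variant of the two-feed idea, not necessarily how EduPose does it: the 224x224 input size, the function names, and the use of a canvas are assumptions. `ctx` is expected to be a 2D canvas context and `source` a video element.

```javascript
// Assumed classifier input size; MobileNet is commonly used with 224x224 inputs.
const CLASSIFIER_SIZE = 224;

// Largest centered square inside a frame of the given dimensions.
function centerSquare(width, height) {
  const size = Math.min(width, height);
  return {
    sx: Math.floor((width - size) / 2),
    sy: Math.floor((height - size) / 2),
    size,
  };
}

// Draw a center-cropped, downscaled copy of the frame into a small canvas context.
// The full-resolution source is left as-is for the displayed feed.
function drawClassifierFrame(ctx, source, width, height) {
  const { sx, sy, size } = centerSquare(width, height);
  ctx.drawImage(source, sx, sy, size, size, 0, 0, CLASSIFIER_SIZE, CLASSIFIER_SIZE);
}
```

The small canvas is then what gets passed to the classifier, so the model sees images closer to the resolution it was trained on.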
Accomplishments that I'm proud of
I'm pretty amazed that I was able to finish given the number of initial problems with my original implementation, and the fact that I had no teammates while developing this. In the end, however, I was able to create a fully functional program that I am definitely proud of!
What I learned
I learned a lot about working with webcams in JavaScript, interfacing them with images, and ML model optimization and validation throughout the course of this project. Machine learning definitely has a steep learning curve, but luckily the internet really helps simplify things. I think the experiences I got from this hackathon will help me tremendously in the future.
What's next for EduPose
EduPose, while fully functional, has specific limitations due to the nature of its machine learning model. It technically uses an image classifier (since I figured it would take way too long to train and validate a gesture object detection model in PyTorch). This means, however, that the poses may not trigger a response for other users, so each user may need their own personalized model for my program to work. I would like to expand this project by building a general gesture detection model that anyone can use without per-person training.
I think this program also has a TON of applications. Different versions can be made for students and teachers according to their needs. A version specifically designed for accessibility can be made as well. There are lots of directions to take this project in the future, and this is just the beginning.
Built With
- javascript
- ml5.js
- tensorflow.js
