Because of COVID-19, people around the world cannot learn under the personal mentorship of a trainer, whether a fitness coach or a martial arts instructor. They are stuck at home, watching videos and trying to follow along. Videos offer no feedback mechanism, which is critical to growth. We wanted to make personal training accessible from anywhere, so we created trAIner.
The idea originally came to us while we were thinking about the popular arcade game Dance Dance Revolution. Using your entire body to control a game led us to consider building Dance Dance Revolution with pose estimation, but we wanted to go further and make something more general and more applicable to the world today.
What it does
trAIner is an AI-based personal trainer that uses videos of trainers from across the web to make sure you are exercising safely and effectively. We compare a live feed of a user working out to a video of a trainer, and extract poses and other semantic information from the frames to give the user automated feedback in real time. We also designed a novel algorithm for inclusive, generalized rep counting that is exercise-type agnostic.
Any trainer can upload videos, which are automatically tagged by GCP and made searchable through MongoDB Atlas’s Full-Text Search. A user can then follow along with any video and receive automated feedback.
How I built it
Normalization and Synchronization of User/Trainer Poses
Trainers and users may have different heights and body structures, so we normalize the pose data before comparing it. Specifically, we take a particular corner of each pose’s bounding box and set it as the origin, which translates all poses to the same region. Then we apply L2 normalization to scale the poses so that their coordinates are on a comparable scale.
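The two normalization steps above can be sketched as follows. This is a minimal illustration, not the project’s actual code: the function name `normalize_pose`, the choice of the minimum-x, minimum-y corner as the origin, and the representation of a pose as a list of (x, y) tuples are all assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def normalize_pose(keypoints: List[Point]) -> List[Point]:
    """Translate a pose to a bounding-box corner, then L2-normalize it."""
    # Assumption: the (min x, min y) corner of the bounding box is the origin.
    min_x = min(x for x, _ in keypoints)
    min_y = min(y for _, y in keypoints)
    shifted = [(x - min_x, y - min_y) for x, y in keypoints]

    # L2 norm of the flattened coordinate vector.
    norm = math.sqrt(sum(x * x + y * y for x, y in shifted))
    if norm == 0.0:
        return shifted  # degenerate pose: all keypoints coincide
    return [(x / norm, y / norm) for x, y in shifted]
```

With this scheme, a pose and a translated, uniformly scaled copy of it normalize to the same coordinates, which is what makes a tall trainer and a shorter user comparable.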
Now we can meaningfully compare poses. When poses from the trainer and user are streamed to the system, we have to be careful with synchronization. If the trainer is doing jumping jacks, for example, the user’s reaction delay means the user’s frames may lag behind the trainer’s. To handle this, we keep the most recent trainer pose and compare it to a set of the user’s poses from the next few frames. Using a distance function for poses that takes into account the confidence scores of the different keypoints (coordinates for body parts), we find the user pose that is closest to the trainer pose. This effectively synchronizes our two data feeds and allows us to provide better feedback to the user.
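A rough sketch of this matching step is shown below. The exact distance function is not described in this writeup, so the weighting here (each keypoint weighted by the lower of the two confidence scores) is an assumption, as are the names `weighted_distance` and `best_match` and the (x, y, confidence) tuple layout.

```python
import math
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence), assumed layout


def weighted_distance(a: Sequence[Keypoint], b: Sequence[Keypoint]) -> float:
    """Confidence-weighted mean Euclidean distance between two poses."""
    weighted_sum = 0.0
    total_weight = 0.0
    for (ax, ay, ac), (bx, by, bc) in zip(a, b):
        # Assumption: trust a keypoint pair only as much as its weaker detection.
        weight = min(ac, bc)
        weighted_sum += weight * math.hypot(ax - bx, ay - by)
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else math.inf


def best_match(
    trainer_pose: Sequence[Keypoint], user_window: List[Sequence[Keypoint]]
) -> Tuple[int, float]:
    """Return (index, distance) of the user pose closest to the trainer pose.

    user_window holds the user poses from the next few frames; it must be
    non-empty.
    """
    distances = [weighted_distance(trainer_pose, pose) for pose in user_window]
    index = min(range(len(distances)), key=distances.__getitem__)
    return index, distances[index]
```

The returned index says how many frames the user is lagging within the window, and the matched pose is the one that would be passed on to the feedback step.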
Feedback is automatically generated by looking at the difference between each feature of the user’s pose and the corresponding feature of the trainer’s pose. The features include the distance between the pose’s center of mass and keypoints such as the shoulders and wrists, along with the distance between left and right points, such as the distance between the feet. Once these features are calculated, we use a simple thresholding algorithm to determine what needs to change, then prioritize the feedback based on safety and the effectiveness of the workout.
The rep counting algorithm looks at a user’s center of mass and wrist positions over a period of time. If a cyclic motion is detected, we use percentile crossings and frequency-domain features from a fast Fourier transform to count reps and track their frequency. It is essentially a specialized, lightweight machine learning model trained on the last 5 to 25 seconds of data to predict reps. These features could also be passed into a random forest model to determine the exercise type and so estimate the number of calories burned more accurately.
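The feature comparison and thresholding could look roughly like the sketch below. The keypoint names, the threshold value, the `PRIORITY` table, and the use of an unweighted mean of keypoints as the center of mass are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Pose = Dict[str, Point]  # keypoint name -> normalized (x, y)

# Hypothetical priority table: lower number = more safety-critical.
PRIORITY = {"feet_width": 0, "com_to_left_shoulder": 1, "com_to_right_shoulder": 1}

TRACKED = ("left_shoulder", "right_shoulder", "left_wrist", "right_wrist")


def center_of_mass(pose: Pose) -> Point:
    """Approximate the center of mass as the unweighted mean of keypoints."""
    n = len(pose)
    return (
        sum(x for x, _ in pose.values()) / n,
        sum(y for _, y in pose.values()) / n,
    )


def pose_features(pose: Pose) -> Dict[str, float]:
    """Distances from the center of mass, plus the left/right foot spacing."""
    com = center_of_mass(pose)
    feats = {f"com_to_{name}": math.dist(com, pose[name]) for name in TRACKED}
    feats["feet_width"] = math.dist(pose["left_ankle"], pose["right_ankle"])
    return feats


def feedback(user: Pose, trainer: Pose, threshold: float = 0.1) -> List[Tuple[str, float]]:
    """Return (feature, signed difference) pairs that exceed the threshold.

    Results are ordered by priority first, then by how far off the user is.
    """
    user_f, trainer_f = pose_features(user), pose_features(trainer)
    issues = [
        (name, user_f[name] - trainer_f[name])
        for name in trainer_f
        if abs(user_f[name] - trainer_f[name]) > threshold
    ]
    issues.sort(key=lambda item: (PRIORITY.get(item[0], 99), -abs(item[1])))
    return issues
```

A positive difference on `feet_width`, for instance, would translate into a message like "bring your feet closer together". The mapping from features to messages is not shown here.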
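To make the two signal features concrete, here is a simplified sketch. It is not the project’s model: it counts reps with a low/high percentile hysteresis and estimates rep frequency with a naive discrete Fourier transform (a stand-in for an FFT, to stay within the standard library). The function names, the 25th/75th percentile defaults, and the single-channel input are assumptions.

```python
import cmath
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolated percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)


def count_reps(signal: Sequence[float], low_q: float = 25, high_q: float = 75) -> int:
    """Count one rep each time the signal goes from the low band to the high band."""
    low, high = percentile(signal, low_q), percentile(signal, high_q)
    reps, armed = 0, False
    for value in signal:
        if value <= low:
            armed = True  # reached the bottom of the motion
        elif value >= high and armed:
            reps += 1  # reached the top after a bottom: one full rep
            armed = False
    return reps


def dominant_frequency(signal: Sequence[float], sample_rate: float) -> float:
    """Frequency (Hz) of the strongest non-DC component, via a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * i / n) for i, c in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate / n
```

Because the thresholds are percentiles of the recent window rather than fixed positions, the same code applies to a squat, a jumping jack, or a curl. A real implementation would run this over a sliding window of the last several seconds and first check that the motion is actually cyclic.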
Video Labeling and Searching
When a training video is uploaded, we first use the Google Cloud Video Intelligence API to tag the video with labels, then store those labels in a MongoDB Atlas collection. The full-text search uses these tags to dynamically fetch a list of relevant training videos as the user types.
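On the search side, the query could be expressed as an Atlas Search aggregation pipeline along these lines. This sketch only builds the pipeline; it assumes the labels have already been extracted and stored, and the field names `labels` and `title`, the index name `default`, and the helper `build_search_pipeline` are hypothetical. The `text` operator is used here for simplicity, although other operators may suit search-as-you-type better.

```python
from typing import Any, Dict, List


def build_search_pipeline(query: str, limit: int = 10) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that searches video labels for `query`."""
    return [
        {
            "$search": {
                "index": "default",  # assumed Atlas Search index name
                "text": {"query": query, "path": "labels"},
            }
        },
        {"$limit": limit},
        # Return only the fields a results list would need.
        {"$project": {"_id": 0, "title": 1, "labels": 1}},
    ]
```

With a driver such as PyMongo, the returned list would be passed to the collection’s `aggregate` method each time the search box changes.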
Challenges I ran into
- Making accessible and inclusive AI requires a generalizable model. We overcame this by normalizing poses and making our rep counting model completely independent of predefined motions.
- The data we are dealing with is sensitive, at least on the client side, since it streams video of the user in their home. It was a challenge to shift our pre-existing mindset of doing all processing in the cloud, but we managed to build everything so that the user’s video is never sent to a centralized server. Everything, from pose estimation to personal feedback, happens locally in the browser, so trAIner is privacy-preserving.
- We needed a way to make videos searchable. MongoDB Atlas’s Full-Text Search already solves this problem for text data, so we needed a way to extract textual information from videos. Transcripts could work, but they may be unreliable. Luckily, the GCP Video Intelligence API allows us to extract labels and tags for each video.