swim-net

Inspiration

Recruiting is hard. Like really difficult. Sifting through endless piles of times, meets, and races can be exhausting, and can often lead to a lot of missed potential. The three of us are on the Princeton swim team and have gone through this (stressful) college athletic recruiting process. Coaches scour the web for these raw times, attempting to gauge which athletes would best fit their team. It's unfortunately not as simple as sorting by the fastest swimmer, as many swimmers are simply gearing up to become much faster in college. However, coaches have limited time and often miss these "hidden gems": athletes who haven't quite hit the peak of their careers.

Our goal was to automate this process for coaches and to detect the subtle correlations that, taken together, point to these "hidden gem" recruits.

What it does

swim-net takes an athlete's best high school times in each event at each age (so the 50 Freestyle would have four entries, one each for ages 15, 16, 17, and 18) and outputs a metric of their expected performance in college. This could be their expected percentage improvement or their projected power points in their top events (power points are a standardized measure of speed across swimming).
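To make the "percentage improved" metric concrete, here is a minimal sketch of one way it could be computed. The formula (time dropped relative to the best high school time), the function name, and all of the times are illustrative assumptions, not taken from the project code.

```python
def percent_improvement(high_school_best: float, college_best: float) -> float:
    """Percentage by which a time dropped (positive means the swimmer got faster)."""
    if high_school_best <= 0:
        raise ValueError("times must be positive")
    return (high_school_best - college_best) / high_school_best * 100.0


# Made-up 50 Freestyle best times in seconds, one per age (15 through 18).
bests_by_age = {15: 23.9, 16: 23.1, 17: 22.6, 18: 22.4}
high_school_best = min(bests_by_age.values())  # fastest swim is the lowest time
college_best = 21.5  # hypothetical best time in college

improvement = percent_improvement(high_school_best, college_best)
print(f"{improvement:.2f}% faster in college")
```

Lower times are better in swimming, so a positive result here means the swimmer improved.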

How we built it

There are quite a few layers to this project. We'll break it down into three (somewhat) succinct steps:

We first used web scraping to obtain a large list of swimmers (roughly 1,500) who had graduated college (and thus would offer a good measure of improvement from high school to college). This data was mainly taken from a platform known as Swimcloud.
We then used this list of swimmers to scrape another site (usaswimming.org) for their best times at each age and their power index (a numerical value given to the performance of a swim). Through some magic, we grabbed the resulting Excel spreadsheets and reorganized them into CSV input for the DNN (deep neural network) in the next step.
We used a DNN model for this regression problem (continuous input mapped to continuous output). Without getting too in-depth on what is going on under the hood, this model is essentially a beefed-up multilayer perceptron (a feed-forward-only series of layers of neurons) that takes an input of size 56 (14 events times 4 age categories) and outputs the desired metric.
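The shape of the model described above can be sketched in plain Python. This is only an illustration under stated assumptions: the event list is a guess at which 14 events were used, the hidden layer sizes and the zero fill for missing swims are arbitrary, the weights are random and untrained, and the dropout step simply mirrors the overfitting fix mentioned under "Challenges we ran into". The real project presumably used a deep learning library.

```python
import math
import random

# Assumed event list: the write-up only states that there are 14 events.
EVENTS = [
    "50 Free", "100 Free", "200 Free", "500 Free", "1000 Free", "1650 Free",
    "100 Back", "200 Back", "100 Breast", "200 Breast",
    "100 Fly", "200 Fly", "200 IM", "400 IM",
]
AGES = [15, 16, 17, 18]
INPUT_SIZE = len(EVENTS) * len(AGES)  # 14 * 4 = 56


def build_input(best_times):
    """Flatten {(event, age): seconds} into 56 values; missing swims become 0.0."""
    return [best_times.get((event, age), 0.0) for event in EVENTS for age in AGES]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Weighted sum plus bias for every neuron in the layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def forward(x, layers, rng=None, drop_rate=0.0):
    """Feed-forward pass; dropout is applied only when an rng is passed (training)."""
    for layer in layers[:-1]:
        x = relu(dense(x, layer))
        if rng is not None and drop_rate > 0.0:
            x = dropout(x, drop_rate, rng)
    return dense(x, layers[-1])[0]  # single continuous output (regression)


rng = random.Random(0)
sizes = [INPUT_SIZE, 32, 16, 1]  # hidden sizes are illustrative guesses
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

# Made-up swimmer with only three recorded best times.
swimmer = {("50 Free", 17): 22.6, ("50 Free", 18): 22.4, ("100 Free", 18): 49.1}
features = build_input(swimmer)
prediction = forward(features, layers)  # untrained, so the value is meaningless
```

In practice the inputs would likely be normalized before training, since raw times for a 50 and a 1650 differ by orders of magnitude.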

Challenges we ran into

Getting data for this was difficult, to say the least. The only publicly available database was a rather poorly organized web client on USA Swimming. To even access most of the data, we had to enter a swimmer's name, download an Excel spreadsheet, wait for the download to complete, and then move on to the next swimmer (each swimmer took around 6-7 seconds, but doing this 1,260 times took a little over 3 hours). To get data for a swimmer, we also needed to know their name in advance. We improvised by web scraping these names from a different website called Swimcloud.
The deep neural network had significant trouble with overfitting at first (that is, it was significantly more accurate on the training data than on the testing data). After a few hours of experimentation, this was mostly mitigated by adding dropout layers and performing F-score feature selection on the events.
Getting our video to be under 2 minutes in length (we just had so much to share!)
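For readers unfamiliar with F-score feature selection, here is a minimal sketch of the idea for a regression target. It assumes the univariate F-statistic derived from the Pearson correlation r between one feature and the target, F = r^2 / (1 - r^2) * (n - 2), and keeps the k highest-scoring features. The function names and all data are made up for illustration, and the project may have computed the score differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def f_score(feature, target):
    """Univariate regression F-statistic: r^2 / (1 - r^2) * (n - 2)."""
    r2 = pearson(feature, target) ** 2
    if r2 >= 1.0:
        return float("inf")  # perfectly correlated feature
    return r2 / (1.0 - r2) * (len(feature) - 2)


def select_top_k(columns, target, k):
    """Names of the k feature columns with the highest F-scores."""
    scores = {name: f_score(values, target) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up high school times (seconds) for six swimmers, and made-up college power points.
columns = {
    "100 Breast": [62.0, 60.5, 59.0, 58.0, 57.2, 56.0],
    "50 Free": [22.0, 23.5, 21.8, 23.0, 22.4, 22.9],
}
college_points = [600, 650, 700, 740, 770, 820]

best_features = select_top_k(columns, college_points, k=1)
print(best_features)
```

Features that carry little signal about the target get low scores and are dropped, which shrinks the input and gives the network less noise to memorize.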

Accomplishments that we're proud of

It actually works! We believe that is something to be proud of after all of this hard work. We'll go into some specific details about how well it did here.

Let's look at the progression of the loss function over the epochs for the power points output metric.

[Figure: Top 3 Power Points Graph]

In the end, based solely on times from high school, our DNN was able to predict the average power points of a swimmer's events after completing college to within plus or minus 75 points. Keep in mind that power points begin at 0 and can go above 1,200! This is rather astounding! With no data other than a swimmer's best times throughout high school, our DNN was able to draw significant correlations to their performance in college.

There are a few other graphs in the GitHub repo if you would like to explore some of the other output metrics that we measured (such as predicting the amount improved to within 2%).

What we learned

To even approach this, a significant amount of data needed to be collected. Overall, data for over 1,260 graduated collegiate swimmers was collected, saved, and then used to train the neural network. There were some very interesting correlations in the data, confirming a lot of trends swimmers see in the pool every day. Below is an interesting example of this:

[Figure: 50 Fly, 100 Breast, 200 IM Diag Chart]

The chart shows pairwise scatter plots of three different events (100 Fly, 100 Breast, and 200 IM), while along the diagonal is a kernel density estimation plot (displaying the rough distribution of times in each event). There is an amazingly clear correlation between speed in the 200 IM and in the 100 Breaststroke (seen in the top left and bottom right corners). Swimmers who are faster in the IM are generally significantly better breaststrokers (and vice versa). Being able to see this correlation confirmed through a scatter plot such as this was truly amazing. There were many more correlations like the one above, but we won't cloud this description with every cool thing we uncovered.
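The two ingredients of a chart like this, the pairwise correlations and the kernel density estimates on the diagonal, can be sketched in a few lines. Everything below is an illustrative assumption: the times are made up, the bandwidth is arbitrary, and the actual chart was presumably produced with a plotting library rather than code like this.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        # Sum of Gaussian bumps, one centered on each observed time.
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


# Made-up best times in seconds for five swimmers.
times = {
    "100 Fly": [52.1, 50.3, 54.0, 51.2, 49.8],
    "100 Breast": [58.0, 60.2, 56.5, 61.0, 57.3],
    "200 IM": [112.4, 116.0, 110.1, 117.5, 111.2],
}
events = list(times)

# Off-diagonal panels: one correlation per pair of events.
corr = {(a, b): pearson(times[a], times[b]) for a in events for b in events}

# Diagonal panels: a smooth estimate of how times are distributed in one event.
density_200_im = gaussian_kde(times["200 IM"], bandwidth=2.0)

for a in events:
    print(a, [round(corr[(a, b)], 2) for b in events])
```

A correlation near 1 between two events means swimmers who are fast in one tend to be fast in the other, which is the pattern described above for the 200 IM and 100 Breast.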

What's next for swim-net?

Is there potential for swim-net outside of this hackathon? Absolutely, no question about it. Even given the short time span and the extremely limited access to a good database, we were still able to uncover rather impressive correlations. One simple way to improve in the future is to change the output metric (in the graphs above, the power points and ratio improved from the top three events were used). We could also investigate a swimmer's overall performance in all events, points scored at college competitions, or other (possibly endless) output metrics.

Not to mention, despite the rather specific name, swim-net can also be expanded into similar sports. This includes track and field, where athletes are measured on an individual, timed basis in a racing environment.