Home Run Harmony

Inspiration:

We wanted more exposure to hands-on data science opportunities. All of us are very interested in technology, problem-solving, and analytical thinking, which made this Datathon especially exciting. Through the sleepless nights spent on this project, we gained the most by learning new ways to use R and Python and by building our first-ever data visualizations.

What It Does:

It analyzes travel metrics, chiefly the distance traveled and the time elapsed since the previous game, to draw conclusions about how travel relates to team performance.

How We Built It:

First Thoughts: Initially, we sought to normalize for inherent team strength. We considered historical game odds and game-by-game SRS data, but these datasets proved hard to obtain. We therefore resorted to calculating relative EWP before and after games from the data we had wrangled.

Independent variables: We considered various factors, including weather conditions and mode of transportation, but finally narrowed our focus to two key metrics: the distance traveled and the time elapsed since the last game.

First Experiment: We first used scatter plots to look for correlations between distance/time and the difference between pre- and post-game relative EWP. We found no apparent correlation, so we simplified our approach.

Second Experiment: We investigated the total number of runs scored per game against the distance and time metrics, revealing a subtle negative correlation.

This suggested there might be more to uncover in the dynamics of travel and team performance.
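As a rough illustration of how a correlation like this can be quantified rather than judged from a scatter plot alone, here is a minimal Pearson correlation sketch in plain Python. The sample distances and run totals are made up for the example and are not the project's data, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up example values (NOT the project's data).
distance_km = [0, 350, 1200, 2500, 3900]
total_runs = [10, 9, 9, 8, 7]
r = pearson_r(distance_km, total_runs)
print(f"r = {r:.2f}")  # negative when runs fall as distance grows
```

A value of r near 0 means no linear relationship, while values near -1 or 1 mean a strong one; a "subtle" negative correlation would sit only slightly below 0.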

One of our most interesting findings is that the direction of travel seems to play a role in performance. Teams traveling west tend to show a decrease in performance. This could be due to the challenge of adjusting to a new time zone, as players may tire earlier in the evening. One of the most striking conclusions is that staying put and avoiding travel altogether is the most beneficial for performance.

The visualizations illustrate distinct patterns between home and away performances, with away games showing greater variability in results, suggesting that external factors associated with travel may influence game outcomes.

Notably, the quartile ranges for away games are broader than for home games, particularly in the lower quartiles, indicating a more pronounced spread of lower performance scores when the team is not playing on familiar ground.

The visualizations uncover an unexpected trend: longer breaks between games are associated with reduced performance, emphasizing the need for a balanced approach that avoids excessive travel fatigue while also ensuring regular play, particularly as extended stays away from home can hinder practice opportunities.

Since we worked on the Astros track, we recommend that the Astros continue investing in R&D and consider collecting data such as modes of transportation and players' personal health while traveling. This could help optimize scheduling and travel plans to minimize their impact on player performance, helping the Astros dominate their games.

One thing we explored was the difference in means between subgroups of the data, defined by the distance traveled before a game. We split our data into 4 subgroups and compared them to other samples from the population.

The “close” group covers all occurrences where the team had to travel up to 400 km, a little more than the distance from Houston to Dallas. We thought this could be a distance where the team would opt to take a bus rather than fly. The “medium” group covers travel in the range of 400 to 4000 km, approximately the distance from LA to New York City, so it covers all of the in-country travel distances.
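The grouping can be sketched as a simple threshold function. Only the “close” (up to 400 km) and “medium” (400 to 4000 km) ranges come from the description above; the “none” and “far” labels, the treatment of zero travel as its own group, and the handling of exact boundary values are assumptions made for illustration.

```python
def distance_group(km):
    """Assign a travel distance in kilometers to a named subgroup.

    "close" and "medium" follow the write-up; "none" and "far" are
    hypothetical names for the two groups the text does not describe.
    """
    if km < 0:
        raise ValueError("distance cannot be negative")
    if km == 0:
        return "none"    # assumption: no travel since the last game
    if km <= 400:
        return "close"   # up to 400 km, roughly bus range
    if km <= 4000:
        return "medium"  # in-country travel
    return "far"         # assumption: anything beyond 4000 km

for km in (0, 362, 1800, 9000):
    print(km, distance_group(km))
```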

We ran a Mann-Whitney U test and, as you can see here, for the “close” subset we got a large U statistic and a p-value indicating that “close” games scored significantly lower than the population as a whole, consistent with the lower mean we observed. This graph attempts to show exactly why the mean is lower. We plotted the distance groups we created against time, with color-coded values indicating the average score at each distance and time point. The size of the points represents their frequency. Here are some other visualizations we made that show the same information, split into home and away teams, with the quartile range of each group shown.
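For readers unfamiliar with the test, here is a minimal pure-Python sketch of how a Mann-Whitney U statistic is computed from ranks. It is an illustration under simplifying assumptions, not the project's code: the p-value uses the normal approximation with no tie or continuity correction, the helper names are hypothetical, and the sample scores are made up.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """U statistic for sample x and an approximate two-sided p-value."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, p

# Made-up scores (NOT the project's data).
close_scores = [1, 2, 3]
other_scores = [4, 5, 6]
u, p = mann_whitney_u(close_scores, other_scores)
print(u, round(p, 3))
```

In practice a library routine such as `scipy.stats.mannwhitneyu` is the safer choice, since it handles ties and small samples more carefully. Note that the test compares rank distributions, so a significant result says one group tends to score lower, which usually but not always coincides with a lower mean.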

Challenges We Ran Into:

We initially struggled with data wrangling. It was especially difficult to create data for the distance traveled between games. To solve this problem, we learned how to use the Nominatim API and Geopy to get the coordinates of world cities where baseball is played. We also struggled to create a good metric to represent team performance. After playing around with the data a bit, we made observations that allowed us to create one.
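Once city coordinates are available, the distance between two game locations can be estimated with the haversine (great-circle) formula. The sketch below is an illustration, not the project's actual code: it hard-codes approximate coordinates for Houston and Dallas instead of calling Nominatim, and assumes a spherical Earth with a mean radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))

# Approximate city-center coordinates, hard-coded for the example.
HOUSTON = (29.7604, -95.3698)
DALLAS = (32.7767, -96.7970)

d = haversine_km(HOUSTON, DALLAS)
print(f"Houston to Dallas: {d:.0f} km")
```

With Geopy installed, `geopy.distance.geodesic(HOUSTON, DALLAS).km` gives a similar figure using an ellipsoidal Earth model.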

In addition, we lost a teammate who became severely ill, which increased our workload considerably. Nevertheless, we worked as a unit and are extremely proud of what we accomplished over the last 36 hours.

Accomplishments That We're Proud Of:

The color scatter plot with point sizes based on frequency took a lot of time to figure out, and it turned out pretty well. It is important to note that point size scales logarithmically with frequency, so slightly larger points actually indicate much larger frequencies. From this visualization, we can see that there are a lot of values slightly below average, bringing the mean down.
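To show what logarithmic point sizing means in practice, here is a minimal sketch. The base-10 logarithm, the `+ 1` offset (so a frequency of zero maps to size zero), and the `scale` factor are all assumptions chosen for illustration; the plot described above may have used a different formula.

```python
import math

def point_size(frequency, scale=20.0):
    """Map a non-negative count to a marker size that grows with log10(count + 1)."""
    if frequency < 0:
        raise ValueError("frequency cannot be negative")
    return scale * math.log10(frequency + 1)

for freq in (9, 99, 999):
    print(freq, round(point_size(freq), 1))
```

Under this mapping, each tenfold increase in frequency adds the same fixed amount to the point size, which is why a marginally larger point can represent far more games.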

What We Learned:

We are all basically beginners, so we had a lot to learn about data analysis in Python and R. We experimented with different statistical tests and models as well as visualization techniques. One big thing we learned was how hard data science can be, but also how rewarding it is to solve a problem and fix an error effectively.

What's Next For Home Run Harmony:

Home Run Harmony could easily be expanded because there are so many directions to take the data. One very interesting next step would be to train a neural network to make predictions about performance. Other aspects of the flight data could also be examined.

Built With

jupyter
nominatim
python
r
timezonedb

Submitted to

Rice Datathon 2024

Created by

I worked on wrangling data, running statistical tests, and creating visualizations.
Mac Tucker

I worked on wrangling data, evaluating performance metrics, and independent variables.
Kushal Gupta

I worked on visualizing data by using boxplots and scatterplots, cleaning the data to find the distance between home and away team cities and venues, as well as researching machine learning data models. This is my first time ever using software like RStudio and Tableau. I learned so much and believe I significantly improved my R skills. I also facilitated organizing project details and assigning team roles. Overall, I am very happy about all the new knowledge I gained from this experience.
Melody Dao

Updates

Mac Tucker started this project — Jan 21, 2024 09:40 AM EST
