Inspiration

The Houston Astros were recently found guilty of cheating during their World Series-winning 2017 season by stealing pitching signs from opposing teams at home games. The Boston Red Sox have also been accused, though the allegations are not yet proven, of stealing signs at home games during their World Series-winning 2018 season. Baseball players and analysts believe that stealing signs gives the batter a huge advantage: a batter who knows what type of pitch is coming can prepare for it, which takes the element of surprise away from the pitcher. The Astros were found to have relayed stolen signs at home games by banging on trash cans to warn their batters that an off-speed pitch (not a fastball) was coming. We wanted to analyze the effectiveness of the Astros' cheating and use this analysis to provide insight into the allegations against the Red Sox.

What it does

In this project, we first used sabermetrics to investigate how effective the Astros' cheating was, that is, how much more successful the Astros were at hitting in home games during the 2017 season. Then we used the same sabermetrics to check whether the Red Sox showed similar success, in order to gain insight into whether they actually cheated during the 2018 season. Next, we looked at pitch data from 2015-2018 for anomalies in the Astros' swinging tendencies and success against different types of pitches (especially off-speed pitches). Finally, we built a model trained on pitch data from before the Astros cheated (pre-2017) to predict their success at hitting pitches. We then used this model to predict their expected hitting success in 2017 (the year they cheated) and compared it with their actual results to estimate how much cheating improved their hitting.
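The home versus away comparison described above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not our actual notebook code: the record layout `(season, is_home, hit)` and every value in `at_bats` are hypothetical and are not taken from the Kaggle or ESPN data.

```python
from collections import defaultdict

# Hypothetical at-bat records: (season, is_home, 1 if the at-bat ended in a hit).
# The layout and the values are illustrative assumptions only.
at_bats = [
    (2016, True, 1), (2016, True, 0), (2016, True, 0), (2016, True, 0),
    (2016, False, 1), (2016, False, 0), (2016, False, 0), (2016, False, 0),
    (2017, True, 1), (2017, True, 1), (2017, True, 0), (2017, True, 0),
    (2017, False, 1), (2017, False, 0), (2017, False, 0), (2017, False, 0),
]


def batting_average_split(records):
    """Return {season: (home_avg, away_avg)} from per-at-bat records.

    Assumes every season has at least one home and one away at-bat.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for season, is_home, hit in records:
        hits[(season, is_home)] += hit
        counts[(season, is_home)] += 1
    seasons = sorted({season for season, _, _ in records})
    return {
        s: (hits[(s, True)] / counts[(s, True)],
            hits[(s, False)] / counts[(s, False)])
        for s in seasons
    }


splits = batting_average_split(at_bats)
for season, (home, away) in splits.items():
    print(season, f"home={home:.3f} away={away:.3f} gap={home - away:+.3f}")
```

A season whose home average sits well above its away average, relative to other seasons, is the kind of anomaly we were looking for.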

How we built it

We found the MLB pitch data on Kaggle (https://www.kaggle.com/pschale/mlb-pitch-data-20152018#pitches.csv). We had to wrangle the data and join several tables before we could analyze it effectively. We also scraped team statistics from ESPN.com to get the remaining data we needed on team hitting success, and wrangled that data as well. We then performed extensive EDA and visualized several hypothesis tests. Finally, we trained a logistic regression model to predict the number of hits we would expect a team to record based on historical pitch data. The model did not perform especially well, with an accuracy score of 0.6390; with more time to tune its parameters and try different models, we could likely have built a more successful one. Nonetheless, our model predicted that the Astros would get 777 hits at home during the 2017 season, whereas they actually recorded 856. This gap of 79 hits suggests that the Astros' cheating helped them become a better hitting team in 2017.
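To illustrate the modeling step, here is a small sketch of logistic regression fitted by plain gradient descent, with no external libraries. It is not our actual model: the single `is_offspeed` feature, the training rows, and the idea of summing predicted probabilities to get an expected hit count are all assumptions made for the example. In practice a library such as scikit-learn would be the more natural choice. Only the 777 and 856 figures at the end come from our results.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, features):
    """Probability of a hit for one pitch; weights[0] is the bias term."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit logistic regression weights by batch gradient descent on log loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for features, label in zip(X, y):
            err = predict_proba(weights, features) - label
            grad[0] += err
            for j, x in enumerate(features):
                grad[j + 1] += err * x
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


# Hypothetical pre-2017 pitches with one feature: is_offspeed (1) or fastball (0).
# Labels are 1 for a hit and 0 otherwise; all values are made up.
train_X = [[0]] * 10 + [[1]] * 10
train_y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
weights = fit_logistic(train_X, train_y)

# Hypothetical 2017 home pitches. Expected hits are taken here as the sum of
# predicted hit probabilities, which is an assumption of this sketch.
pitches_2017 = [[0]] * 10 + [[1]] * 10
expected_hits = sum(predict_proba(weights, p) for p in pitches_2017)
print(f"expected hits on the toy data: {expected_hits:.2f}")

# Figures reported in this writeup: model prediction vs. actual home hits in 2017.
predicted_home_hits, actual_home_hits = 777, 856
gap = actual_home_hits - predicted_home_hits
pct_above = 100 * gap / predicted_home_hits
print(f"actual exceeded predicted by {gap} hits ({pct_above:.1f}%)")
```

Because the toy model has one binary feature plus a bias, its fitted probabilities simply recover the hit rate of each pitch group in the training rows, which makes the example easy to check by hand.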

Challenges we ran into

The datasets were very hard to navigate, so we had to spend a lot of time understanding the data and figuring out how best to visualize it and draw meaning from it.

Accomplishments that we're proud of

We created some very cool-looking plots using sabermetrics and heatmaps that helped us analyze the effectiveness of the Astros' cheating. From these plots, we observed that the Astros had an abnormal amount of hitting success in home games during the 2017 postseason. Similarly, we observed that the Red Sox had an abnormal amount of hitting success in home games during the 2018 regular season. This provides some meaningful insight, as it suggests that the Red Sox's abnormal hitting success during the 2018 season may have been due to sign stealing.
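One way to check whether a home versus away gap like this is larger than chance would suggest is a one-sided two-proportion z-test. The sketch below is an illustration under assumptions rather than the exact test we ran: the function name `two_proportion_z_test` and the hit and at-bat counts are hypothetical.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """One-sided z-test of whether group A's hit rate exceeds group B's.

    Returns (z, p_value), where p_value is the upper-tail normal probability.
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: hits and at-bats at home vs. away for one team-season.
z, p_value = two_proportion_z_test(hits_a=300, n_a=1000, hits_b=250, n_b=1000)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
```

A small p-value says only that the home hit rate is unlikely to equal the away rate by chance; it does not by itself say why the rates differ.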

What we learned

  • The Astros achieved an abnormal amount of success at hitting during the 2017 season due to their cheating behaviors
  • The Red Sox seem to have cheated in 2018 as well, judging by their abnormal hitting success during the 2018 season
  • The Astros were much more effective at hitting against off-speed pitches during the 2017 season due to their cheating tendencies
  • Using predictive modeling, we were able to show that the Astros performed abnormally well at hitting during the 2017 season

What's next

Build more sophisticated models using one of the following algorithms: Decision Tree Algorithm, Random Forest Algorithm, Naive Bayes Classifier, k-Nearest Neighbor, Artificial Neural Network, Deep Neural Network, Organic GMO-Free Grass-fed Neural Network. We also need to study the data some more and improve our parameter selection for the models.
