Whenever I watch a baseball game, I always hear about the importance of "pitch sequences," how this pitch was a "setup pitch," and things of that nature. I wanted to do some research to see whether a pitch could be predicted from the pitches that came before it in the at bat.
What it does
The program takes in a large CSV file containing pitch-by-pitch data for a single pitcher. It creates a frequency table that shows which pitch type is most likely to be thrown given the previous n pitches in the at bat, where n is a specified constant. The program then takes in an additional CSV file of pitch-by-pitch data on which we test our findings. We would hope that predictions based on the frequency table are more accurate than predictions made by randomly selecting a pitch type. (Here, "random" means drawing each guess from the pitcher's overall pitch mix. For example, if a pitcher throws fastballs 50% of the time, sliders 30% of the time, and changeups 20% of the time, a random guess for any pitch is a fastball with 50% probability, a slider with 30% probability, and a changeup with 20% probability.) Finally, the program compares the accuracy of the random guesses with the accuracy of the predictions based on the frequency table.
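To make the frequency table idea concrete, here is a minimal sketch in Python. It is not our actual code: the function names, the two-letter pitch codes (FF, SL, CH), and the rule of falling back to a default pitch when a sequence was never seen in training are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_frequency_table(at_bats, n):
    """Count which pitch follows each run of n previous pitches in an at bat."""
    table = defaultdict(Counter)
    for pitches in at_bats:
        for i in range(n, len(pitches)):
            context = tuple(pitches[i - n:i])
            table[context][pitches[i]] += 1
    return table


def predict_next(table, previous, fallback):
    """Return the most frequent follower of the context, or the fallback if unseen."""
    counts = table.get(tuple(previous))
    if not counts:
        return fallback
    return counts.most_common(1)[0][0]


def accuracy(table, at_bats, n, fallback):
    """Fraction of test pitches (with n pitches of context) predicted correctly."""
    hits = total = 0
    for pitches in at_bats:
        for i in range(n, len(pitches)):
            total += 1
            if predict_next(table, pitches[i - n:i], fallback) == pitches[i]:
                hits += 1
    return hits / total if total else 0.0


# Toy data: each inner list is one at bat, in the order the pitches were thrown.
training = [
    ["FF", "FF", "SL", "FF"],
    ["FF", "FF", "SL"],
    ["FF", "SL", "FF"],
    ["SL", "FF", "FF", "CH"],
]
testing = [["FF", "FF", "SL"], ["SL", "FF", "FF"], ["FF", "FF", "CH"]]

table = build_frequency_table(training, n=2)
prediction = predict_next(table, ["FF", "FF"], fallback="FF")
test_accuracy = accuracy(table, testing, n=2, fallback="FF")
print(prediction, round(test_accuracy, 2))
```

In this sketch the first n pitches of each at bat are never predicted, because they do not have n pitches of context yet. How to handle those pitches is a design choice that the description above leaves open.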
How we built it
We prepared for the hackathon by creating a script that pulls the necessary data from a website (Brooks Baseball) and exports it into a usable CSV file. This script reads a pitcher's game logs from baseballreference.com, pulls data for every game the pitcher appeared in based on those logs, and consolidates the data into a single file. The frequency table is built from that data. From the frequency table, we use Markov chain techniques to learn which pitch is most likely to follow a given series of pitches.
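Before any counting can happen, the consolidated file has to be split into at bats. The sketch below shows one way that step might look. The column names game_id, at_bat, and pitch_type are hypothetical (the real export may use different headers), and it assumes the rows are already in the order the pitches were thrown.

```python
import csv
import io
from itertools import groupby


def load_at_bats(csv_file):
    """Group pitch rows into per-at-bat lists of pitch types.

    Assumes hypothetical columns game_id, at_bat, and pitch_type, and that
    rows for the same at bat are adjacent and in chronological order.
    """
    reader = csv.DictReader(csv_file)

    def at_bat_key(row):
        return (row["game_id"], row["at_bat"])

    return [
        [row["pitch_type"] for row in rows]
        for _, rows in groupby(reader, key=at_bat_key)
    ]


# A tiny in-memory stand-in for the consolidated CSV file.
sample = io.StringIO(
    "game_id,at_bat,pitch_type\n"
    "g1,1,FF\n"
    "g1,1,SL\n"
    "g1,2,FF\n"
    "g2,1,CH\n"
    "g2,1,FF\n"
)
at_bats = load_at_bats(sample)
print(at_bats)
```

With a real file, the same function could be called on an open file handle, for example one opened with open(path, newline="").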
Challenges we ran into
Putting the data into an accessible format was a challenge, as was determining the best metric with which to choose a pitch.
Accomplishments that we're proud of
It works! We tested our program on all regular-season and postseason pitches thrown by Justin Verlander, CC Sabathia, and Clayton Kershaw from 2011 to the present (as listed on Brooks Baseball). We compared our program, with n set to 2, against randomly generated series of pitches based on each pitcher's tendencies, run through 1000 trials. These were the results:
Verlander: Randomly guessing through 1000 trials: Mean: 35.89%, Standard deviation: 0.04%. With our program: 51.66%
Sabathia: Randomly guessing through 1000 trials: Mean: 28.81%, Standard deviation: 0.02%. With our program: 47.54%
Kershaw: Randomly guessing through 1000 trials: Mean: 39.99%, Standard deviation: 0.05%. With our program: 46.16%
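For readers who want to see how a random baseline like this can be simulated, here is a minimal sketch. It is an illustration, not our actual code: the function names, the fixed seed, and the toy 50/30/20 pitch mix are assumptions. It also computes the accuracy a weighted random guesser should reach on average, which is the sum over pitch types of (share of guesses) times (share of actual pitches).

```python
import random
import statistics
from collections import Counter


def random_baseline(train_pitches, test_pitches, trials=1000, seed=0):
    """Guess each test pitch at random, weighted by the training pitch mix.

    Returns the mean and standard deviation of accuracy across the trials.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    counts = Counter(train_pitches)
    types = list(counts)
    weights = [counts[t] for t in types]
    accuracies = []
    for _ in range(trials):
        guesses = rng.choices(types, weights=weights, k=len(test_pitches))
        hits = sum(g == actual for g, actual in zip(guesses, test_pitches))
        accuracies.append(hits / len(test_pitches))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def expected_accuracy(train_pitches, test_pitches):
    """Analytic expectation: sum of P(guess = t) * P(actual = t) over pitch types."""
    train = Counter(train_pitches)
    test = Counter(test_pitches)
    return sum(
        (train[t] / len(train_pitches)) * (test[t] / len(test_pitches))
        for t in train
    )


# Toy pitch mix: 50% fastballs, 30% sliders, 20% changeups.
pitches = ["FF"] * 50 + ["SL"] * 30 + ["CH"] * 20
mean_acc, stdev_acc = random_baseline(pitches, pitches)
expected = expected_accuracy(pitches, pitches)  # 0.5**2 + 0.3**2 + 0.2**2 = 0.38
print(round(mean_acc, 3), round(stdev_acc, 3), round(expected, 3))
```

For the 50/30/20 example, the expected accuracy works out to 0.25 + 0.09 + 0.04 = 0.38, so a pitcher with that mix would be guessed correctly about 38% of the time by chance alone.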
What we learned
We learned about Markov chains.
What's next for Pitch Prediction
Right now our program takes in large swaths of data and analyzes them. It would be cool if the user could enter a pitch sequence and see what pitch is likely to be thrown next. It would also be useful if the program could give a more detailed breakdown of its analysis (e.g. results for individual pitch types or individual at bats).