Predicting NBA Over/Unders Using an LSTM Model

Victor Chang - vchang2, William Allstetter - wallstet, Justin Chu - jchu17

Introduction:

We are implementing an existing paper: http://cs229.stanford.edu/proj2018/report/3.pdf. The paper's objective is to predict the total number of points scored in a game. We chose this paper because we want to perform a similar task, predicting NBA over/unders, and because we all like basketball. We treat the final over/under call as a classification problem.

Related Work:

In “Analysis of basketball games using neural networks” by Ivankovic, Racković, Markoski, Radosav, and Ivkovic at the University of Novi Sad, the researchers analyzed player data from 890 games played between 2005 and 2010 to predict the winner of each game. To do this, they looked at the players' free throw, field goal, and three-point percentages, as well as their rebounds, steals, turnovers, blocks, and assists. They also divided each half court into six sections to take into account where each shot was made. These inputs were passed through a feedforward neural network with one hidden layer. They then measured how much each input node affected the output value and found that two-pointers taken from the box were the most important input. In the end, they achieved 66.4% accuracy when training on shooting percentages alone, and 80.96% accuracy when also incorporating rebounds, steals, turnovers, blocks, and assists.

Data:

For sportsbook data, we will use SportsBook Review Online, which provides odds data in Excel files for every NBA season going back to 2007-2008. Because the NBA has changed significantly since 2007, we will likely use only a limited timespan (we are currently considering starting with the 2016-17 season). We will need to preprocess this dataset to clean it and extract the information we want: date, matchup, and the over/under.

For game data, we will collect data either from Basketball Reference using a web crawler or from NBA.com using their API. Our project will differ from the Stanford paper by examining additional variables, such as pace, offensive rating, defensive rating, and three-point percentage. We will experiment with different combinations of these variables to see whether certain combinations give better results.
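One simple way to organize the variable experiments is to enumerate every subset of the candidate variables and train one model per subset. The sketch below only shows the enumeration step; the feature names are illustrative labels for the variables listed above, and the `feature_subsets` helper is hypothetical rather than part of any library.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# Illustrative names for the extra variables mentioned above (assumed labels).
CANDIDATE_FEATURES = ["pace", "offensive_rating", "defensive_rating", "three_point_pct"]


def feature_subsets(features: Sequence[str], min_size: int = 1) -> List[Tuple[str, ...]]:
    """Return every subset of `features` containing at least `min_size` items."""
    subsets: List[Tuple[str, ...]] = []
    for size in range(min_size, len(features) + 1):
        subsets.extend(combinations(features, size))
    return subsets


subsets = feature_subsets(CANDIDATE_FEATURES)
print(len(subsets))  # 15 non-empty subsets of four variables
for subset in subsets[:3]:
    print(subset)
```

With only four candidate variables the full search is small (15 models), but the count doubles with each added variable, so a larger candidate list would need a more selective search.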

Methodology:

Because teams can have hot and cold stretches, we will use an LSTM neural network to generate our predictions. The difficult parts of our project will be making sure our dataset is accurate, finding good hyperparameters (window size, learning rate), deciding how many layers to use in our architecture, and choosing which variables to include in our dataset. Another difficulty is that the data from Basketball Reference covers the entire season, whereas in sports betting the odds for a game can only be computed from data available before that game. For instance, we should avoid predicting the result of game 1 using data from the entire season. If possible, each example in our dataset should reflect only the data collected up to that game.
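The two ideas above, using only data available before each game and feeding the LSTM a fixed window of recent games, can be sketched in plain Python. This is a minimal illustration under our own assumptions, not the paper's pipeline: the helper names `pregame_averages` and `make_windows` are hypothetical, and the point totals are made-up numbers.

```python
from typing import List, Optional, Sequence, Tuple


def pregame_averages(values: Sequence[float]) -> List[Optional[float]]:
    """Average of a stat over all games played BEFORE each game.

    Entry i uses only values[0:i], so game i never sees its own result.
    The first game has no history and gets None.
    """
    averages: List[Optional[float]] = []
    running_total = 0.0
    for games_played, value in enumerate(values):
        averages.append(running_total / games_played if games_played else None)
        running_total += value
    return averages


def make_windows(
    features: Sequence[float], targets: Sequence[float], window_size: int
) -> List[Tuple[List[float], float]]:
    """Pair each game's target with the features of the `window_size` games before it."""
    samples: List[Tuple[List[float], float]] = []
    for i in range(window_size, len(targets)):
        samples.append((list(features[i - window_size:i]), targets[i]))
    return samples


# Made-up combined point totals for one team's first five games.
totals = [210.0, 220.0, 200.0, 230.0, 215.0]

print(pregame_averages(totals))
# [None, 210.0, 215.0, 210.0, 215.0]

print(make_windows(totals, totals, window_size=3))
# [([210.0, 220.0, 200.0], 230.0), ([220.0, 200.0, 230.0], 215.0)]
```

The window size here is the same hyperparameter mentioned above; each window would become one input sequence for the LSTM, with the following game's total as the label.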

Metrics:

We plan to measure success through two main metrics: Mean Squared Error (MSE) and model accuracy percentage. For the first metric, we can calculate the MSE between the point total predicted by the model and the true point total observed in the game. Since we can also calculate the MSE between the sportsbooks' over/under lines and the observed point totals, we can see how our model performs relative to the sportsbooks. Additionally, we can calculate the percentage of games in which the model's predicted point total falls on the correct side (above or below) of the number provided by the sportsbooks. This accuracy number will give a sense of how often the model “wins” a bet. The authors of the paper “The Bank is Open: AI in Sports Gambling” were hoping to achieve a model with a win rate above 50%. With the same goal in mind, we would also like to create a model that beats the house prediction. Our base goal is to create a functioning model that is able to train and predict NBA Over-Under values. Our target goal is to create a model with a win rate above 50%. Our reach goal is to create a model with a win rate above 52% (the level needed to turn a profit once sports betting fees are taken into account).

Ethics:

“Why is Deep Learning a good approach to this problem?” Because so many statistics are collected each game, sports are well suited to machine learning: the algorithm has plenty of data to train on. Further, with a binary win/loss classification, the model can more easily make meaningful predictions without succumbing to noise.

“Who are the major ‘stakeholders’ in this problem, and what are the consequences of mistakes made by your algorithm?” With sports betting still illegal in a number of states, predicting and betting on sports games is still a contentious idea. Because machine learning is seen in popular culture as a magical, all-knowing black box, a model like ours could be harmful to people with gambling problems. There are also many financial stakeholders in the outcomes of NBA games, with over five million dollars being bet on the 2020 NBA Finals alone. Further, teams might be interested in the outcomes of the model, as it can help them identify which statistics matter most for winning games.