usasoccR: An Interactive Web Application for Soccer

Inspiration

Decision making in sports often comes down to yes/no questions: Should I sign this player? Should I make this trade? When making these decisions, it is useful to have a single metric for each player that can be used to quickly compare players, compute win probabilities, and more. In other sports, such as basketball and hockey, "adjusted plus-minus" statistics, which are built from play-by-play data using regression analysis, are widely used. In contrast, adjusted plus-minus statistics have not become widespread in soccer. This project aims to fill this gap by developing an adjusted plus-minus statistic for soccer based on play-by-play data. We compare our statistic with other single-number measures, such as FIFA ratings and market values, in an interactive web application. The application also includes spatial visualizations, team rankings, and win probability tools.

How we built it

Our data-analysis pipeline starts by scraping play-by-play and commentary data from ESPN.com. We then process the play-by-play data into a form that can be fed into a high-dimensional machine learning algorithm. Finally, we compute our adjusted plus-minus statistics using the ridge regression approach proposed by Kharrat et al. (2017). Our R package includes a vignette that explains precisely how our software computes adjusted plus-minus from the processed ESPN data.
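To make the last step concrete, here is a minimal sketch of ridge-regression plus-minus on toy data. It is not the code in our R package: the function names, the made-up segments, the use of raw goal difference as the response, and the penalty value are all assumptions chosen for illustration. Each row of the design matrix is one stretch of play with a fixed line-up, with +1 for home players on the pitch and -1 for away players.

```python
def build_design_matrix(segments, players):
    """One row per segment: +1 for home players on the pitch, -1 for away."""
    index = {p: j for j, p in enumerate(players)}
    X, y = [], []
    for seg in segments:
        row = [0.0] * len(players)
        for p in seg["home"]:
            row[index[p]] = 1.0
        for p in seg["away"]:
            row[index[p]] = -1.0
        X.append(row)
        y.append(float(seg["goal_diff"]))  # home goals minus away goals
    return X, y


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_plus_minus(X, y, lam):
    """Ridge estimate: solve (X'X + lam * I) beta = X'y."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(A, b)


# Hypothetical toy data: player E replaces player B in the third segment.
players = ["A", "B", "C", "D", "E"]
segments = [
    {"home": ["A", "B"], "away": ["C", "D"], "goal_diff": 2},
    {"home": ["A", "B"], "away": ["C", "D"], "goal_diff": 1},
    {"home": ["A", "E"], "away": ["C", "D"], "goal_diff": -1},
]

X, y = build_design_matrix(segments, players)
ratings = dict(zip(players, ridge_plus_minus(X, y, lam=1.0)))
for name, value in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:+.3f}")
```

In this toy example players C and D are always on the pitch together, so the data cannot tell them apart and the ridge penalty assigns them identical ratings. Without the penalty (`lam=0`) the system would be singular, which is why a regularized fit is needed when line-ups rarely change.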

What it does

Our interactive web application opens with several spatial visualizations of a sample of OPTA data, to give a flavor of the visualizations that OPTA data make possible. The "Player Profile" panel provides a sortable, searchable table that displays our adjusted plus-minus statistic alongside other single-number player measures, such as FIFA ratings and market values. The "Team Profile" panel compares teams based on an aggregation of these single-number statistics. Finally, the win probability panel computes win probabilities based on team profiles.

Challenges we ran into

We ran into several challenges. First, processing the ESPN play-by-play data into the design matrix required by the adjusted plus-minus algorithm was very time-consuming. Second, visualizing large amounts of data with Shiny proved difficult because it does not scale well, especially given the size of the OPTA data. Third, whenever we needed to merge data sets, such as the ESPN data and FIFA ratings, we had to use record-linkage techniques to account for spelling differences, special characters, inclusion or omission of middle names, and so on. Finally, substitutions are sparse in soccer relative to basketball and hockey, which makes the adjusted plus-minus statistic harder to compute: players who are almost always on the pitch together produce strong collinearity in the data.
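The record-linkage problem can be illustrated with a small sketch. This is not the code we used. It assumes a simple strategy of stripping accents and punctuation, keeping only the first and last name tokens, and scoring candidates with the standard library's `difflib.SequenceMatcher`. The function names, the 0.85 threshold, and the player names are all hypothetical.

```python
import difflib
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(cleaned.lower().split())


def first_last(name):
    """Keep only the first and last tokens, so middle names are ignored."""
    tokens = normalize_name(name).split()
    if len(tokens) > 1:
        return tokens[0] + " " + tokens[-1]
    return " ".join(tokens)


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score) for the closest name, or (None, score)."""
    key = first_last(name)
    best, best_score = None, 0.0
    for cand in candidates:
        score = difflib.SequenceMatcher(None, key, first_last(cand)).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Hypothetical names standing in for two sources that spell players differently.
source_a = ["José Luis García", "Jon Smith", "Alex Novak"]
source_b = ["Jose Garcia", "John Smith"]

for name in source_a:
    match, score = best_match(name, source_b)
    print(f"{name!r} -> {match!r} (score {score:.2f})")
```

A fixed threshold trades missed links against false links, so in practice borderline pairs are worth reviewing by hand, and extra fields such as team or birth date can help break ties.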

Accomplishments that we're proud of

We are proud of how much we accomplished within the hackathon’s time constraints: web scraping, data visualization, back-end machine learning, and matchup win probabilities, all brought together in an interactive web application.

What we learned

We all learned a lot about how programming, modeling, and visualization interact across the entire data-analysis process. We also learned how to divide work effectively, so that each of us became an expert on a particular aspect of the analysis.

What's next for Random Walkers USA Soccer

There are many areas of our data-analysis pipeline to polish. First, soccer has far fewer substitutions than the NBA or hockey, which limits the adjusted plus-minus statistic. We would love to supplement it with metrics such as goals and assists from other data sources. We would also like to build a more complete study comparing the predictive accuracy of adjusted plus-minus against other single-number metrics. Second, we would love to spend time polishing the ESPN scraping code, adding unit tests and more rigorous documentation. Finally, we would like to explore ways to further scale our web application's visualizations. Currently, the Shiny framework is not very fast when the sample data set exceeds approximately 100,000 rows. Implementing the web application in a more scalable visualization tool would greatly enhance usability.