Inspiration

In football (soccer), anything can happen right up to the final minute. A single deflection, penalty, or last-minute goal can change a match and even an entire tournament. For this hackathon, I wanted to see whether I could quantify the predictability of football using data science. I was inspired to build a tool that doesn't guess blindly but instead uses statistical models to estimate the probability of every outcome, all packaged into a single Python executable.

What it does

WorldCupML is a desktop application that predicts FIFA World Cup match outcomes in real time.

The app grabs live tournament fixtures, lineups, and group standings, and compares them against decades of historical international match data. Using an XGBoost machine learning model, the engine calculates the Expected Goals (xG) for both teams based on their historical Elo ratings, recent form, and where they're playing. It then runs up to 100 million Monte Carlo simulations, drawing each team's goals from a Poisson distribution, to estimate the probability of every possible scoreline and win condition.
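To make the Poisson step concrete, here is a minimal sketch of how two xG values turn into scoreline and outcome probabilities. It is not WorldCupML's actual code: the function names, the 10-goal cap, and the assumption that the two teams' goal counts are independent are all mine, chosen for illustration. It computes the probabilities in closed form, which is what the Monte Carlo estimates should converge to as the number of simulations grows.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when the expected goal count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def scoreline_grid(xg_home: float, xg_away: float, max_goals: int = 10):
    """Return grid[h][a] = P(home scores h and away scores a).

    Assumes the two goal counts are independent (an illustrative
    simplification) and ignores scores above max_goals.
    """
    home = [poisson_pmf(h, xg_home) for h in range(max_goals + 1)]
    away = [poisson_pmf(a, xg_away) for a in range(max_goals + 1)]
    return [[ph * pa for pa in away] for ph in home]


def outcome_probabilities(grid):
    """Collapse the grid into (home win, draw, away win) probabilities."""
    home_win = sum(p for h, row in enumerate(grid)
                   for a, p in enumerate(row) if h > a)
    draw = sum(grid[i][i] for i in range(len(grid)))
    away_win = sum(p for h, row in enumerate(grid)
                   for a, p in enumerate(row) if h < a)
    return home_win, draw, away_win
```

For example, `outcome_probabilities(scoreline_grid(1.6, 1.1))` returns three numbers that sum to just under 1, because the tiny probability of either side scoring more than 10 goals is cut off.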

How I built it

The entire application is built natively in Python.

For the Frontend/UI elements I used CustomTkinter to build a clean, modern dark interface. I really wanted it to look like a web app or a modern desktop application, so I took inspiration from Google's design elements.

For the Live Data elements I integrated the ESPN API to fetch live fixtures, team rosters, formations, player headshots, and group standings asynchronously.

For Data Engineering I used pandas to clean the data and engineer features such as Elo ratings, which I calculated from a large dataset of international football results going back decades.
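As a rough illustration of the Elo feature, here is a minimal sketch of the standard Elo update applied to a chronological list of results. The starting rating of 1500, the K-factor of 20, and the function names are assumptions for the example, not the values the project uses, and plain tuples stand in for the rows of a pandas DataFrame. Each match records the ratings from before kickoff, so the feature never leaks the result it is meant to help predict.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: the share of points team A should take."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=20.0):
    """Return both new ratings; score_a is 1 for a win, 0.5 draw, 0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate_matches(matches, start=1500.0, k=20.0):
    """Walk matches in date order and collect pre-match ratings.

    matches: iterable of (home, away, home_goals, away_goals) tuples,
    already sorted chronologically (a hypothetical input format).
    Returns (features, ratings), where features holds one
    (home, away, home_rating, away_rating) tuple per match.
    """
    ratings = {}
    features = []
    for home, away, home_goals, away_goals in matches:
        r_home = ratings.get(home, start)
        r_away = ratings.get(away, start)
        features.append((home, away, r_home, r_away))
        if home_goals > away_goals:
            score = 1.0
        elif home_goals == away_goals:
            score = 0.5
        else:
            score = 0.0
        ratings[home], ratings[away] = update_elo(r_home, r_away, score, k)
    return features, ratings
```

Variants used for international football often add a home-advantage bonus or weight matches by importance. This sketch leaves those out to keep the core update visible.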

For Machine Learning I used scikit-learn for data preprocessing and XGBoost regressors to train two models: one that predicts the Expected Goals (xG) for the home team and one for the away team.

Lastly, for Statistical Simulations I used NumPy to vectorize the Monte Carlo simulation, which lets the engine run about a million Poisson-distributed match scenarios in under a second.
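Here is a minimal sketch of what a vectorized simulation along these lines could look like. It is an illustration under my own assumptions rather than the project's engine: `simulate_match`, its parameters, and the returned dictionary are hypothetical, and packing each scoreline into a single integer for `np.bincount` is just one possible way to tally results without a Python loop.

```python
import numpy as np


def simulate_match(xg_home, xg_away, n_sims=1_000_000, seed=0):
    """Estimate outcome and scoreline probabilities by simulation."""
    rng = np.random.default_rng(seed)

    # Draw every simulated match at once instead of looping in Python.
    home = rng.poisson(xg_home, size=n_sims)
    away = rng.poisson(xg_away, size=n_sims)

    # Pack each (home, away) pair into one integer so a single bincount
    # call can tally every scoreline.
    base = int(max(home.max(), away.max())) + 1
    codes = home * base + away
    counts = np.bincount(codes, minlength=base * base).reshape(base, base)

    return {
        "home_win": float(np.mean(home > away)),
        "draw": float(np.mean(home == away)),
        "away_win": float(np.mean(home < away)),
        # score_probs[h, a] is the estimated probability of an h-a result.
        "score_probs": counts / n_sims,
    }
```

Calling `simulate_match(1.6, 1.1)` gives three outcome probabilities that sum to 1 plus a scoreline table. With more simulations the estimates tighten, at the cost of more memory for the two sample arrays.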

Challenges I ran into

I hit a few major technical hurdles along the way, starting with messy API data. The live feed kept crashing my app by unpredictably sending back strings instead of dictionaries, which forced me to write a lot of fallback logic to keep things stable. UI thread safety was another issue: fetching data and running heavy ML models would freeze the screen completely until I used Python's threading module to move the heavy lifting into the background. On top of that, running millions of Monte Carlo simulations in pure Python was really slow. I ended up refactoring the core engine to use vectorized NumPy arrays and a neat integer hashing trick, which cut my execution time dramatically.
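The first two hurdles can be sketched together. This is a minimal illustration, not the app's real code: `as_dict` and `run_in_background` are hypothetical helper names, and the exact shape of the ESPN payloads is not something I'm reproducing here. The idea is to coerce whatever the feed returns into a dictionary, and to run slow work on a worker thread that hands its result back through a queue.

```python
import json
import queue
import threading


def as_dict(payload, default=None):
    """Coerce an API payload to a dict, falling back to a default.

    Handles the case where a field arrives as a JSON string (or as
    plain text) when a dictionary was expected.
    """
    if isinstance(payload, dict):
        return payload
    if isinstance(payload, str):
        try:
            parsed = json.loads(payload)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            return parsed
    return {} if default is None else default


def run_in_background(task, results):
    """Run task() on a worker thread and post the outcome to a queue.

    The worker never touches any widgets; it only puts a
    ("ok", value) or ("error", exception) tuple on the queue.
    """
    def worker():
        try:
            results.put(("ok", task()))
        except Exception as exc:  # report failures instead of dying silently
            results.put(("error", exc))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

In a Tkinter app, the UI thread would typically poll the queue with `results.get_nowait()` from a callback scheduled through the widget's `after` method, so all widget updates stay on the main thread while the window remains responsive.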

Accomplishments that I'm proud of

Despite the many roadblocks I hit, I'm really proud of what I built. I took Tkinter, which is usually pretty iffy, and turned it into a good-looking UI. I also optimized my simulation engine to the point where it can process around 1,000,000 match scenarios in less than a second.

What I learned

I learned a ton about how data science and software engineering actually intersect in the real world. I mastered GUI programming in Python, learned how to calculate advanced sports analytics metrics such as Elo ratings from raw data, and figured out how to optimize algorithms with NumPy vectorization when plain Python's performance was not cutting it.

What's next for WorldCupML

Next, I plan to expand the machine learning model to factor in individual player stats. After that, I plan to start integrating other major leagues and competitions such as the NBA, NFL, and Premier League.
