Inspiration

This project grew out of a desire to predict the performance of NASCAR drivers in the 2025 season with a data-driven, machine learning approach. Existing models and fan discussions often lack statistical rigor or do not draw on historical champion data to estimate win rates. By using a Random Forest Regressor, the project aims to build a more nuanced model of driver success and give fans and analysts a practical tool for season outlooks.

What it does

The project predicts the win rate of NASCAR drivers for the upcoming season in three steps. First, it aggregates historical data (years, wins, driver names, etc.) into a structured CSV dataset. Second, it trains a Random Forest Regressor (a machine learning model built from an ensemble of decision trees) to estimate win rates from this data. Third, it outputs an expected win rate for each driver, which can feed projections, analysis, and strategic decisions for the season.

How we built it

Data Collection: Historical NASCAR champion data is compiled into a CSV file, including columns for Driver, Year, Wins, and other season-related attributes.

Preprocessing: The project is written entirely in Python. Using pandas and scikit-learn, the data is cleaned, encoded, and prepared for model training.
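The cleaning and encoding step might look roughly like the sketch below. It is not the project's actual code: the `Races` column, the derived `WinRate` target, the `preprocess` helper, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the project's CSV; values are illustrative only.
raw = pd.DataFrame({
    "Driver": ["Driver A", "Driver B", "Driver A", "Driver C", None],
    "Year": [2021, 2021, 2022, 2022, 2022],
    "Wins": [5, 2, 3, None, 1],
    "Races": [36, 36, 36, 36, 36],  # assumed column, not confirmed by the source
})

def preprocess(df):
    """Drop incomplete rows, derive a win-rate target, and one-hot encode drivers."""
    clean = df.dropna(subset=["Driver", "Wins", "Races"]).copy()
    clean["Wins"] = clean["Wins"].astype(int)
    clean["WinRate"] = clean["Wins"] / clean["Races"]
    # One-hot encoding turns the driver name into numeric indicator columns.
    return pd.get_dummies(clean, columns=["Driver"], prefix="Driver")

features = preprocess(raw)
print(features)
```

One-hot encoding is only one option here; with many drivers, an ordinal or target encoding may keep the feature matrix smaller.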

Model Development: A Random Forest Regressor is trained on the features. This algorithm constructs many decision trees and averages their results for more accurate win-rate estimates.
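To illustrate the averaging behaviour described above, here is a minimal sketch using scikit-learn's `RandomForestRegressor` on synthetic data. The features `prev_wins` and `experience` and the generated target are assumptions for the example, not the project's real feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic, illustrative data: the feature names and values are assumptions.
rng = np.random.default_rng(0)
n = 200
prev_wins = rng.integers(0, 10, size=n)    # wins in the previous season
experience = rng.integers(1, 20, size=n)   # seasons raced so far
X = np.column_stack([prev_wins, experience])
y = np.clip(prev_wins / 36 + rng.normal(0, 0.02, size=n), 0, 1)  # win rate

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# The forest's prediction is the mean of its individual trees' predictions.
sample = X[:5]
forest_pred = model.predict(sample)
tree_mean = np.mean([tree.predict(sample) for tree in model.estimators_], axis=0)
print(np.allclose(forest_pred, tree_mean))
```

Averaging many trees, each trained on a bootstrap sample, reduces the variance of any single tree's estimate.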

Deployment/Usage: The code (NASCAR Predictor.py) can be run to retrain and test the model and produce win-rate predictions for the season. It can be rerun for new seasons as additional data becomes available.

Challenges we ran into

Data Quality: Historical racing data may be sparse, incomplete, or inconsistent across seasons, requiring careful cleaning and preprocessing.

Feature Engineering: Selecting impactful features (beyond just wins and years) is necessary for good predictions.

Model Interpretability: Random Forest models can behave as black boxes, so explaining why a specific driver receives a certain prediction is non-trivial.
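One common way to get a partial view into a forest is to inspect feature importances. The sketch below, on synthetic data with assumed feature names, compares scikit-learn's impurity-based `feature_importances_` with `permutation_importance`. It shows which inputs the model leans on overall, not why an individual driver got a given prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic, illustrative data: feature names are assumptions for this sketch.
feature_names = ["prev_wins", "experience"]
rng = np.random.default_rng(0)
n = 200
prev_wins = rng.integers(0, 10, size=n)
experience = rng.integers(1, 20, size=n)
X = np.column_stack([prev_wins, experience])
# The target depends on prev_wins only; experience is pure noise here.
y = np.clip(prev_wins / 36 + rng.normal(0, 0.02, size=n), 0, 1)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
impurity = dict(zip(feature_names, model.feature_importances_))

# Permutation importance: drop in score when one feature's values are shuffled.
perm = permutation_importance(model, X, y, n_repeats=5, random_state=0)
permuted = dict(zip(feature_names, perm.importances_mean))

top_feature = max(impurity, key=impurity.get)
print(impurity, permuted, top_feature)
```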

Generalization: The model must avoid overfitting to historical stars while still predicting the performance of emerging drivers accurately.
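One way to check generalization to drivers the model has never seen is to split by driver rather than by row. The sketch below uses scikit-learn's `GroupKFold` on synthetic data; the driver IDs, features, and target are assumptions, and this is a suggested check rather than something the project is known to do.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold

# Synthetic, illustrative data: 20 hypothetical drivers with 5 seasons each.
rng = np.random.default_rng(0)
n_drivers, seasons_each = 20, 5
driver_ids = np.repeat(np.arange(n_drivers), seasons_each)
prev_wins = rng.integers(0, 10, size=driver_ids.size)
experience = rng.integers(1, 20, size=driver_ids.size)
X = np.column_stack([prev_wins, experience])
y = np.clip(prev_wins / 36 + rng.normal(0, 0.02, size=driver_ids.size), 0, 1)

fold_scores = []
overlaps = []
# GroupKFold keeps every row of a given driver in the same fold.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=driver_ids):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # R² on unseen drivers
    overlaps.append(set(driver_ids[train_idx]) & set(driver_ids[test_idx]))

print(fold_scores)
```

If scores on unseen drivers are much lower than scores from a random row split, the model is likely memorizing individual drivers.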

Accomplishments that we're proud of

Functional Win-Rate Predictor: Successfully built an end-to-end pipeline that estimates NASCAR driver win rates using real historical data.

Use of ML: Leveraged Random Forest Regressor for robust, non-linear prediction rather than simplistic models.

Reusable Dataset/Coding: The workflow integrates Python and open-source tools, making it approachable for future seasons or other motor-sport datasets.

What we learned

Random Forest Strengths: This technique handles tabular sports data well and, because it averages many trees, is less prone to overfitting than a single decision tree when enough training examples are available.

Importance of Data Structure: Having clean, well-structured datasets (driver, year, wins, etc.) is crucial for ML applications in sports.

Model Evaluation: Statistical evaluation (R², out-of-sample tests) is necessary to validate ML predictions rather than taking model output at face value.
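A minimal sketch of such an out-of-sample test follows: train on earlier seasons, hold out the most recent one, and report R² and mean absolute error. The data is synthetic, and the feature names and the choice of 2024 as the held-out season are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic, illustrative data: 30 driver-seasons per year from 2015 to 2024.
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2015, 2025), 30)
prev_wins = rng.integers(0, 10, size=years.size)
experience = rng.integers(1, 20, size=years.size)
X = np.column_stack([prev_wins, experience])
y = np.clip(prev_wins / 36 + rng.normal(0, 0.02, size=years.size), 0, 1)

# Hold out the latest season so the test set is strictly out of sample.
test_mask = years == years.max()
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
print(f"R2={r2:.3f}  MAE={mae:.4f}")
```

Splitting by season rather than at random avoids leaking information from the future into training.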

What's next for NASCAR Win Rate Predictor

Expand Feature Set: Add more features (e.g., tracks, qualifying results, team changes) to refine prediction accuracy.

Deployment: Build an interactive dashboard or web app for fans/analysts to run predictions with current season data.

Integrate More ML Models: Compare Random Forest predictions with other regression models for benchmarking.
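Such a benchmark could be sketched with cross-validation over several scikit-learn regressors. The choice of `LinearRegression` and `GradientBoostingRegressor` as comparison models is an assumption, as are the synthetic data and feature names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, illustrative data: feature names are assumptions for this sketch.
rng = np.random.default_rng(0)
n = 200
prev_wins = rng.integers(0, 10, size=n)
experience = rng.integers(1, 20, size=n)
X = np.column_stack([prev_wins, experience])
y = np.clip(prev_wins / 36 + rng.normal(0, 0.02, size=n), 0, 1)

candidates = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "linear": LinearRegression(),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Use the same folds for every model so the comparison is like for like.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
mean_r2 = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in sorted(mean_r2.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R2 = {score:.3f}")
```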

Broaden Scope: Apply similar ML techniques to other racing series or sports for win-rate and performance projections.
