Inspiration
This project was inspired by a desire to predict the performance of NASCAR drivers in the 2025 season using a data-driven approach grounded in machine learning. Existing models and fan discussions often lack statistical rigor or fail to use historical champion datasets for win rate estimation. By using a Random Forest Regressor, the project aims to build a more detailed and nuanced model for predicting driver success, giving fans and analysts a robust tool for season outlooks.
What it does
The project predicts the win rate of NASCAR drivers for the upcoming season. It achieves this by:
Aggregating historical data (years, wins, driver names, etc.) into a structured CSV dataset.
Training a Random Forest Regressor (a machine learning model using an ensemble of decision trees) to estimate win rates based on this data.
Outputting expected win-rate predictions for each driver, which can be applied to projections, analysis, and strategic decisions for the season.
How we built it
Data Collection: Historical NASCAR champion data is compiled into a CSV file, including columns for Driver, Year, Wins, and other season-related attributes.
Preprocessing: The project is written entirely in Python. Using pandas and scikit-learn, the data is cleaned, encoded, and prepared for model training.
Model Development: A Random Forest Regressor is trained on the features. This algorithm constructs many decision trees and averages their results for more accurate win-rate estimates.
Deployment/Usage: Running the script (NASCAR Predictor.py) retrains and tests the model and produces win-rate predictions for the season. It can be rerun for new seasons as additional data is collected.
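The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in NASCAR Predictor.py: the rows are made-up placeholders, the Races column and the WinRate target (Wins divided by Races) are assumptions, and integer-coding the Driver column is just one possible encoding. The last lines check that the forest's prediction is the average of its individual trees' predictions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Placeholder rows standing in for the champion CSV (values are invented).
data = pd.DataFrame({
    "Driver": ["Driver A", "Driver B", "Driver A", "Driver C", "Driver B", "Driver C"],
    "Year":   [2019, 2020, 2021, 2022, 2023, 2024],
    "Wins":   [5, 9, 10, 5, 6, 3],
    "Races":  [36, 36, 36, 36, 36, 36],  # assumed column, not confirmed by the write-up
})

# Assumed target: share of races won in a season.
data["WinRate"] = data["Wins"] / data["Races"]

# Encode the driver name as an integer code so the model can consume it.
data["DriverCode"] = data["Driver"].astype("category").cat.codes

X = data[["DriverCode", "Year"]]
y = data["WinRate"]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# The forest's prediction is the mean of the per-tree predictions.
forest_pred = model.predict(X)
tree_preds = np.stack([tree.predict(X.to_numpy()) for tree in model.estimators_])
manual_avg = tree_preds.mean(axis=0)
```

With a real dataset, the DataFrame would come from pandas.read_csv on the project's CSV file, and predictions for an upcoming season would be made on rows with that season's Year value.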
Challenges we ran into
Data Quality: Historical racing data may be sparse, incomplete, or inconsistent across seasons, requiring careful cleaning and preprocessing.
Feature Engineering: Selecting impactful features (beyond just wins and years) is necessary for good predictions.
Model Interpretability: Random Forest models can behave as black boxes, so explaining why specific drivers receive certain predictions is non-trivial.
Generalization: The model must avoid overfitting to historical stars while still predicting the performance of emerging drivers accurately.
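One common way to get some insight into a fitted forest is to inspect feature importances. The sketch below is illustrative only and uses synthetic data: the feature names prior_win_rate and noise_feature are hypothetical and do not come from the project's dataset. It compares scikit-learn's impurity-based importances with permutation importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 200

# Synthetic, hypothetical features: one informative, one pure noise.
prior_win_rate = rng.uniform(0.0, 0.3, n)
noise_feature = rng.normal(size=n)
X = np.column_stack([prior_win_rate, noise_feature])
y = 0.8 * prior_win_rate + rng.normal(scale=0.01, size=n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
feature_names = ["prior_win_rate", "noise_feature"]

# Impurity-based importances (these sum to 1 across features).
impurity_importance = dict(zip(feature_names, model.feature_importances_))

# Permutation importance: drop in score when one feature's values are shuffled.
perm = permutation_importance(model, X, y, n_repeats=5, random_state=0)
permutation_scores = dict(zip(feature_names, perm.importances_mean))
```

In practice, permutation importance is usually more informative when computed on held-out data rather than on the training set used here.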
Accomplishments that we're proud of
Functional Win-Rate Predictor: Successfully built an end-to-end pipeline that estimates NASCAR driver win rates using real historical data.
Use of ML: Used a Random Forest Regressor to capture non-linear relationships in the data rather than relying on a simpler linear model.
Reusable Dataset/Coding: The workflow integrates Python and open-source tools, making it approachable for future seasons or other motor-sport datasets.
What we learned
Random Forest Strengths: This technique handles tabular sports data well, and averaging many trees makes it less prone to overfitting than a single decision tree, provided enough training examples are available.
Importance of Data Structure: Having clean, well-structured datasets (driver, year, wins, etc.) is crucial for ML applications in sports.
Model Evaluation: Statistical evaluation (R², out-of-sample tests) is necessary to validate ML predictions rather than taking model output at face value.
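As a rough illustration of the out-of-sample check mentioned above, the sketch below fits a forest on a training split and computes R² on both the training and held-out data. The data is synthetic and the three numeric features are hypothetical stand-ins, so the scores say nothing about the actual project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-in for driver-season features and a win-rate-like target.
X = rng.uniform(0.0, 1.0, size=(n, 3))
y = 0.2 * X[:, 0] + 0.1 * X[:, 1] ** 2 + rng.normal(scale=0.02, size=n)

# Hold out a quarter of the rows that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# R² on training data is typically optimistic; the held-out score is the one to report.
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
```

For season-based data, splitting by Year (training on earlier seasons, testing on later ones) may be a more realistic check than a random split, since it mirrors how the predictions would actually be used.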
What's next for NASCAR Win Rate Predictor
Expand Feature Set: Add more features (e.g., tracks, qualifying results, team changes) to refine prediction accuracy.
Deployment: Build an interactive dashboard or web app for fans/analysts to run predictions with current season data.
Integrate More ML Models: Compare Random Forest predictions with other regression models for benchmarking.
Broaden Scope: Apply similar ML techniques to other racing series or sports for win-rate and performance projections.
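The benchmarking idea above could start from something like the following sketch, which scores a Random Forest and a linear baseline with 5-fold cross-validation. The data is synthetic, and the choice of LinearRegression as the comparison model is an assumption for illustration; the write-up does not name specific alternatives.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Synthetic stand-in for a feature matrix and a win-rate-like target.
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 0.2 * X[:, 0] + 0.1 * X[:, 1] ** 2 + rng.normal(scale=0.02, size=200)

# Candidate models to compare under the same cross-validation protocol.
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "linear_regression": LinearRegression(),
}

# Mean R² across 5 folds for each candidate.
cv_r2 = {
    name: cross_val_score(estimator, X, y, cv=5, scoring="r2").mean()
    for name, estimator in candidates.items()
}
```

Which model scores higher depends on the data, so the comparison is only meaningful when run on the real dataset.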