Prospect Predictions - Google Cloud x MLB Hackathon
Inspiration
I wanted to predict and better understand the future career performance of Major League Baseball (MLB) prospects. The idea was to apply machine learning to historical performance data and estimate the likely trajectory of a player's career. As someone passionate about sports and data analysis, I saw this challenge as a chance to combine those interests and contribute to the future of player evaluation.
What I Learned
Throughout the development of this project, I learned the following:
- Data Preprocessing: I gained a deeper understanding of how to clean, transform, and process large sets of data to make them suitable for predictive modeling.
- Machine Learning Models: I learned how to apply various machine learning models (e.g., regression, decision trees, and ensemble models) to predict future performance based on historical statistics.
- Google Cloud Services: This project gave me hands-on experience with Google Cloud tools, especially BigQuery for data handling and Google AI for model building and deployment.
- Performance Evaluation: I learned how to evaluate model performance using metrics like accuracy, precision, recall, and F1 score to ensure reliable predictions.
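As an illustration of the metrics named above, here is a minimal sketch that computes them by hand for a binary outcome. It assumes the task is framed as classification (1 = the prospect succeeded, 0 = did not); the function name and the labels are made up for the example and are not taken from the project.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In practice, scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities; the hand-written version only makes the definitions explicit.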
How I Built the Project
- Data Collection: I gathered historical player data, including statistics like batting average, ERA, fielding percentage, and other relevant metrics.
- Data Preprocessing: The collected data was cleaned to remove outliers and missing values. I also normalized and scaled the data to ensure uniformity across different statistical measures.
- Feature Engineering: Key features like age, position, and prior minor league performance were used to build a comprehensive set of predictive features.
- Modeling: Using Google Cloud AI tools, I applied several machine learning algorithms to train models on historical data. I fine-tuned the models by adjusting hyperparameters and evaluating their performance.
- Prediction & Deployment: The final model was deployed on Google Cloud, allowing for real-time predictions on MLB prospect data.
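The preprocessing step above can be sketched roughly as follows. This is a standard-library illustration, not the project's actual pipeline: the column names, the sample records, and the choice of z-score standardization (rather than some other scaling) are all assumptions.

```python
from statistics import mean, pstdev


def drop_incomplete(rows):
    """Keep only records in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def standardize(rows, columns):
    """Return copies of the rows with the given columns z-score scaled."""
    scaled = [dict(r) for r in rows]
    for col in columns:
        values = [r[col] for r in rows]
        mu, sigma = mean(values), pstdev(values)
        for r in scaled:
            # A constant column carries no signal; map it to 0.0.
            r[col] = (r[col] - mu) / sigma if sigma else 0.0
    return scaled


# Made-up prospect records, for illustration only.
prospects = [
    {"age": 19, "batting_avg": 0.250},
    {"age": 21, "batting_avg": 0.300},
    {"age": 23, "batting_avg": None},   # missing statistic, will be dropped
    {"age": 23, "batting_avg": 0.350},
]

clean = drop_incomplete(prospects)
features = standardize(clean, ["age", "batting_avg"])
print(features)
```

After scaling, each column has mean 0 and unit standard deviation, so age and batting average contribute on a comparable scale. Dropping incomplete rows is the simplest option; imputing missing values is a common alternative when too much data would otherwise be lost.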
Challenges Faced
- Data Quality: One of the biggest challenges was dealing with incomplete or inconsistent data. Many player statistics were missing, requiring careful attention to ensure the dataset remained reliable for training.
- Model Overfitting: Initially, some models overfit the training data and made inaccurate predictions on unseen data. To address this, I applied regularization and used cross-validation to check that the models generalized.
- Scalability: As the project grew, managing large datasets and performing real-time predictions became increasingly complex. Leveraging Google Cloud’s scalable infrastructure helped resolve this issue.
- Interpretability: Understanding and explaining why a model made a specific prediction was challenging, particularly with complex models. I worked to incorporate model interpretability techniques to better explain outcomes.
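To make the overfitting remedy above concrete, here is a small sketch that combines k-fold cross-validation with L2 (ridge) regularization on a single feature. It is a toy under stated assumptions: the helper names, the data, and the candidate `alpha` values are hypothetical, and the project itself may have used different models and library routines.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def fit_ridge_1d(xs, ys, alpha):
    """Fit y = slope * x + intercept with an L2 penalty on the slope only."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope toward zero
    return slope, y_mean - slope * x_mean


def cross_val_mse(xs, ys, alpha, k=3):
    """Average held-out mean squared error across k folds."""
    errors = []
    for train, test in k_fold_indices(len(xs), k):
        slope, intercept = fit_ridge_1d(
            [xs[i] for i in train], [ys[i] for i in train], alpha)
        fold_mse = sum((ys[i] - (slope * xs[i] + intercept)) ** 2
                       for i in test) / len(test)
        errors.append(fold_mse)
    return sum(errors) / len(errors)


# Made-up, noise-free data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

scores = {alpha: cross_val_mse(xs, ys, alpha) for alpha in (0.0, 1.0, 10.0)}
best_alpha = min(scores, key=scores.get)
print(scores, best_alpha)
```

Because this toy data is perfectly linear, the unregularized fit has the lowest held-out error. On noisy real data with many features, a nonzero penalty often scores better, and that is what the cross-validated comparison is meant to reveal. With real data the rows should be shuffled before splitting into folds.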
Conclusion
This project was a great learning experience, allowing me to apply machine learning techniques in the sports analytics domain. It also highlighted the importance of data quality, model validation, and scalability when working with real-world data. I'm excited about the potential impact this project could have on how MLB teams scout and evaluate prospects in the future.
Built With
- bigquery
- containerization
- docker
- frameworks
- functions
- gcp
- git
- jupyter
- mlb
- notebooks
- numpy
- pandas
- platform
- python
- scikit-learn
- sql
- storage
- streamlit
- tensorflow