Inspiration
I came across this challenge and wanted to participate to gauge my competitiveness and gain more experience with deployments on GCP. Completing the Triple Ten Data Science program recently sharpened my Python skills to the point where I can compete at the next level. My start-up recently became a Google Cloud AI partner through the Partner Advantage Program and has applied to the Cloud First Accelerator with Google. I really hope to get into one soon, though I am still developing the technologies.
Below is my methodology:
1. Train the regression-based prediction model using the MLB statistics on each player from each team.
   * Test linear regression, LASSO, Ridge, and tree-based methods such as decision trees (DT) and random forests (RF)
2. Find an optimal set of player stats using available performance fields and evaluate model performance using regression metrics (MSE, RMSE, and MAE)
3. Use historical league data as additional layers of training to reinforce the best player rankings, potentially using a neural network, and re-evaluate the performance
   * Transfer learning
4. Run the players from other leagues through the model and obtain a ranking
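Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration rather than the project code: it assumes scikit-learn, uses synthetic data from make_regression in place of the MLB statistics, and the hyperparameters and the evaluate helper are placeholders chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the player-stat table (assumption: NOT the MLB data).
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The five regression methods named in the methodology.
# Hyperparameters are illustrative defaults, not tuned values.
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}


def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit each model and return MSE, RMSE, and MAE on the held-out split."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        mse = mean_squared_error(y_test, pred)
        results[name] = {
            "MSE": float(mse),
            "RMSE": float(np.sqrt(mse)),
            "MAE": float(mean_absolute_error(y_test, pred)),
        }
    return results


results = evaluate(models, X_train, y_train, X_test, y_test)

# Print the models from lowest to highest RMSE.
for name, m in sorted(results.items(), key=lambda kv: kv[1]["RMSE"]):
    print(f"{name:14s} MSE={m['MSE']:.2f} RMSE={m['RMSE']:.2f} MAE={m['MAE']:.2f}")
```

Keeping all models in one dictionary and scoring them in one loop makes it easy to add or drop a method without touching the evaluation logic.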
What it does
The pipeline merges multiple datasets, going from leagues to teams to rosters to players, by joining on shared index keys. Including the historical seasons (2013 to present) expands the data from roughly 10^4 to 10^7 data points. It then evaluates the five regression methods listed above and produces performance metrics.
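The chain of joins might look something like the sketch below. It assumes pandas, and every table, column name, and value here (league_id, team_id, player_id, and so on) is hypothetical and made up for illustration; the real API fields may differ.

```python
import pandas as pd

# Tiny hypothetical tables standing in for the API responses.
leagues = pd.DataFrame(
    {"league_id": [1, 2], "league_name": ["League A", "League B"]}
)
teams = pd.DataFrame(
    {
        "team_id": [10, 11, 20],
        "league_id": [1, 1, 2],
        "team_name": ["Team X", "Team Y", "Team Z"],
    }
)
rosters = pd.DataFrame(
    {
        "team_id": [10, 10, 11, 20],
        "player_id": [100, 101, 102, 103],
        "season": [2023, 2023, 2023, 2023],
    }
)
# Player 103 is deliberately missing to mimic players without stats.
players = pd.DataFrame(
    {"player_id": [100, 101, 102], "full_name": ["P One", "P Two", "P Three"]}
)

# Walk down the hierarchy: league -> team -> roster -> player.
# validate= raises an error if a key is unexpectedly duplicated,
# which helps catch joins that silently multiply rows.
merged = (
    leagues.merge(teams, on="league_id", how="inner", validate="one_to_many")
    .merge(rosters, on="team_id", how="inner", validate="one_to_many")
    .merge(players, on="player_id", how="left", validate="many_to_one")
)

print(merged[["league_name", "team_name", "player_id", "full_name"]])
print("players without details:", int(merged["full_name"].isna().sum()))
```

Using a left join for the last step keeps roster entries whose player details are missing, so the gaps can be counted instead of silently dropped.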
How we built it
I planned the build with schema and workflow diagrams. I also wrote modules that evaluate multiple models in a single block, which proved useful. I used AI in certain portions of the code to help with data processing, for example, to write better loops for iterating through the historical component and to handle arrays of mismatched lengths before they enter the model evaluation module.
Challenges we ran into
We ran into some computation challenges. I bought additional compute and went from CPU to a v2-8 TPU to a v5e-1 TPU, but the computation time remained high with the final dataset. I added additional preprocessing and cut it down by 10%, but it still took approximately 60 minutes to make the final join due to the number of players in the API. I was also expecting to find stats on each player, but this was not the case. Instead, I relied on the strength of the models from past assignments and experiments that I've done and was confident that we could get above 80% predictive ability with enough hyperparameter tuning and feature selection, considering we had 131 fields at one point.
The deployment to Vertex AI was the last roadblock that I hit. I was able to split the dev code into train.py and requirements.txt and track my IAM information, but I was unable to submit the job for a reason that I have not yet been able to discern. This is certainly something I plan to work on for my start-up and future hackathons. I was able to push the files from Colab and GitHub to a GCS bucket, which was a highlight and an incremental step in my development as a cloud developer. I feel that I am really close, but this is one of the most challenging parts of deployment. You can see my progress here: https://github.com/vicknentura/ML-in-SQL-BigQuery/blob/main/Hackathon%20IAMs.
Accomplishments that we're proud of
I am proud of sticking with the assignment despite so much data, as I am still learning TF/Keras implementations. I applied two new concepts: transfer learning and concurrent.futures, which made processing the historical datasets feasible. concurrent.futures cut the processing time from over 3 hours, even with the v5e-1 TPU, to minutes. I am also proud that the model had very high predictive capabilities.
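As a rough illustration of the concurrent.futures pattern, the sketch below fetches one season per task in a thread pool. The fetch_season function is a hypothetical stand-in that only simulates a slow API call, and the year range and worker count are assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_season(year):
    """Hypothetical stand-in for one season's API request (simulated delay)."""
    time.sleep(0.05)  # pretend to wait on the network
    return {"season": year, "rows": year % 100}


def fetch_all(years, max_workers=8):
    """Fetch every season concurrently and return results in input order."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its year so results can be reordered.
        futures = {pool.submit(fetch_season, y): y for y in years}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return [results[y] for y in years]


# Assumed range for illustration; the project covers 2013 to present.
years = list(range(2013, 2025))

start = time.perf_counter()
seasons = fetch_all(years)
elapsed = time.perf_counter() - start
print(f"fetched {len(seasons)} seasons in {elapsed:.2f}s")
```

Threads suit this kind of work because the time is spent waiting on requests rather than computing, so the waits overlap instead of adding up.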
What we learned
APIs are very tricky, and there is always more to them than meets the eye. I was also painfully reminded that overfitting can be one of the most challenging problems we encounter in data science.
What's next for Predict Prospecting using an Optimized ML Model
Testing the overfitting component and creating a visual report deployed through Google Cloud, shortly followed by an interactive user interface.
Built With
- ai
- api
- github
- google-cloud
- googlecolab
- json
- python