This data analytics project was built for a Kaggle contest whose goal was to predict product demand for Grupo Bimbo. It was written in Python using Jupyter notebooks, with the XGBoost library for training and prediction.

Several feature engineering techniques were used to enrich the data: text extraction with NLTK, lag columns, and averages computed over a large number of variables. Once the training table was built, XGBoost was trained to optimize the contest's scoring function, RMSLE (root mean squared logarithmic error). Features were selected using feature importance and correlation analysis, and hyperparameter tuning was then used to find the best XGBoost parameters. The final submission to Kaggle scored 0.48666, placing our team in the top 17% of the 2000 contestants.

The biggest challenges came from analyzing and training on a large data set. We addressed them by casting columns to smaller data types (unsigned integers, lower-precision floats, etc.), storing the data in HDF5 format, and running on a powerful Google Cloud Compute preemptible instance (with 208 GB RAM).

Further improvements would include running hyperparameter tuning across a wider range of training tables (with different features) and implementing a failsafe for running experiments on preemptible instances. Additionally, building several different models and averaging their predictions would likely have reduced overfitting and yielded better results.
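For readers unfamiliar with the contest metric, RMSLE is the square root of the mean squared difference between log(1 + prediction) and log(1 + actual value). The snippet below is a minimal, standard-library sketch of that definition, not the project's actual scoring code; the function name `rmsle` and the sample demand values are hypothetical and chosen only for illustration.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) computes log(1 + x), so zero demand is handled safely.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical weekly demand values, for illustration only.
actual = [3, 0, 12, 7]
predicted = [2.5, 0.4, 10.0, 9.0]
score = rmsle(actual, predicted)
print(f"RMSLE: {score:.5f}")
```

Because the metric works on log-transformed values, it penalizes relative errors rather than absolute ones. A common way to target it, assumed here rather than confirmed by this write-up, is to train a squared-error model on log(1 + demand) and invert the transform on the predictions.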
