Chevron-UH

Inspiration

We chose the Chevron Track because its questions were business focused and left room to explore different models and variable transformations in order to get the most out of the data. We began by looking at overall trends and realized that the dataset had no linear correlations; model after model, we were getting the same test RMSE (and a terrible one at that). We looked for external data and researched how these grants were awarded, but ultimately could not find any data more informative than what we already had.

How we built it

To build good models for our regression task, we needed to choose models suited to our dataset. However, the raw dataset was poorly organized and had a lot of redundancy, so we cleaned it as best we could and ran a range of models and experiments to understand it better. We found that the data is non-linear with many features, so tree-based models, XGBoost in particular, looked like a good option. XGBoost, a gradient-boosted ensemble of decision trees, is a robust and efficient ML algorithm that works well on many datasets. The challenge was that XGBoost has many hyperparameters, and we needed to find a good combination of them to build an accurate model. Using scikit-learn's RandomizedSearchCV, we selected the best XGBoost model among the combinations we sampled, which yields an acceptable RMSE for the project.
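A randomized hyperparameter search of the kind described above can be sketched roughly as follows. This is a minimal illustration, not our actual training script: the data is synthetic, the search space is a hypothetical example, and the fallback to scikit-learn's GradientBoostingRegressor is only there so the sketch runs when xgboost is not installed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

try:
    from xgboost import XGBRegressor as Booster
except ImportError:
    # Assumption: fall back to a comparable boosted-tree model if xgboost is missing.
    from sklearn.ensemble import GradientBoostingRegressor as Booster

# Synthetic stand-in data; the real project data is not reproduced here.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical search space; these parameter names exist in both estimators.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4, 6],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}

# Sample 8 random combinations and score each with 3-fold cross-validated RMSE.
search = RandomizedSearchCV(
    Booster(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

best_params = search.best_params_
# Evaluate the refit best model on the held-out test split.
test_rmse = float(np.sqrt(mean_squared_error(y_test, search.predict(X_test))))
print(best_params, round(test_rmse, 2))
```

A random search only evaluates the combinations it samples, so "best" here means best among those draws; raising n_iter trades run time for a wider search.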

Challenges we ran into

One of the first challenges we ran into was that the dataset could not be analyzed readily. The raw data was hard to work with in its original form and included some observations that were irrelevant to our research question. We filtered out these observations, checked that the data was consistent, and then converted it to a wide format. With this layout, each row holds the 30+ data points available for a given state in a given year, so a model sees them together and can produce more sensible predictions.
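The wide-format conversion can be illustrated with a small pandas sketch. The column names (state, year, metric, value) and the sample rows are hypothetical placeholders, not the actual dataset schema.

```python
import pandas as pd

# Hypothetical long-format rows: one measurement per (state, year, metric).
long_df = pd.DataFrame({
    "state": ["TX", "TX", "TX", "TX", "CA", "CA", "CA", "CA"],
    "year": [2015, 2015, 2016, 2016, 2015, 2015, 2016, 2016],
    "metric": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Pivot so each (state, year) pair becomes one row with one column per metric.
wide_df = long_df.pivot_table(
    index=["state", "year"], columns="metric", values="value", aggfunc="first"
).reset_index()
wide_df.columns.name = None  # drop the leftover "metric" label on the column axis

print(wide_df)
```

With one row per state and year, every metric for that pair is available to the model as a separate feature.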

The next problem we ran into was that nothing in the dataset was linearly correlated. We tried logarithmic and exponential transformations, but these did not help. We also pulled in some outside data, which proved fruitless as well. We explored the data through visualization and found that no single variable told us much about the response on its own, so we turned to derived metrics to help us. A standard linear regression model on all the data got a test RMSE of approximately 32000000, but with generated metrics on annual change fed into a Random Forest, we achieved an RMSE of about 24000000. We were still not satisfied with this, so we tried an XGBoost model. This more complex model, trained without the engineered features, outperformed the simpler models that relied on the additional feature engineering.

Accomplishments that we're proud of

We are proud of the RMSE we achieved and of the effort we put into trying different creative ways to predict what we wanted to know. We all got to practice things we have been learning in class, felt stretched in our knowledge, and discovered gaps we did not know we had.

What we learned

We learned that sometimes abstract models can beat even the most creative metrics, and that external datasets for niche state-level metrics are hard to find (and often awful to deal with). We also picked up new intricacies of the models we used and more of the quirks involved with them. Overall, we learned more about working as a team: not only contributing where we are strong, but also giving others the opportunity to learn something new and stretching ourselves in a datathon.