Inspiration
Our goal was to predict each state's total renewable energy investment in 2020 as accurately as possible, given a range of data about the energy sector in every state.
What it does
Our models take in the 30+ dimensional input data from 2015 to 2019 and attempt to fit meaningful relationships between that input and the reference variable, the amount of renewable energy assistance.
How we built it
We used pandas as the base framework to preprocess and organize our data. Most of the models we tried were implemented in scikit-learn, including linear regression, polynomial regression, KNN regression, random tree regression, and more.
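To make the comparison concrete, here is a minimal sketch of how several scikit-learn regressors can be scored side by side with cross-validation. It is not our actual pipeline: the data is synthetic (250 rows and 30 features, matching only the shape of our problem), the hyperparameters are arbitrary, and `RandomForestRegressor` is used as a stand-in for the tree-based model mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real table: 250 rows (50 states x 5 years), 30 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 30))
y = X[:, 0] + rng.normal(scale=2.0, size=250)  # weak signal buried in noise (assumption)

# Candidate models; scaling is included where the model is distance- or scale-sensitive.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "polynomial (degree 2)": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validated mean squared error for each model (lower is better).
cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

for name, mse in sorted(cv_mse.items(), key=lambda kv: kv[1]):
    print(f"{name:>22}: CV MSE = {mse:.3f}")
```

With real panel data, holding out the final year instead of using random folds would be closer to the actual prediction task.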
Challenges we ran into
The data was very high dimensional, but there were not many data points. Even if we trained one big model for all states, we only had 50 states * 5 years = 250 data points, each with more than 30 dimensions. Even if we cut down redundant or correlated variables, there would still be 15 dimensions. After running PCA, a dimensionality reduction algorithm, we also discovered that the individual states have very distinct data. This led us to believe that training one model per state could work as well, but in practice each model would then have only 5 data points to train on, which did not lead to good results. The curse of dimensionality was pulling us down on one end, and the lack of a generalizable pattern on a national scale was pulling us down on the other.
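The kind of check described above can be sketched as follows. This is an illustration on synthetic data, not our real dataset: each made-up state gets its own centre in feature space, PCA is done by hand with NumPy's SVD, and the ratio of between-state to within-state spread in the first two components is one possible way (our assumption here) to quantify how distinct the states are.

```python
import numpy as np

n_states, n_years, n_features = 50, 5, 30
rng = np.random.default_rng(0)

# Synthetic data: every state has its own centre, with small year-to-year variation.
state_centres = rng.normal(scale=3.0, size=(n_states, n_features))
noise = rng.normal(scale=0.5, size=(n_states * n_years, n_features))
X = np.repeat(state_centres, n_years, axis=0) + noise
state_ids = np.repeat(np.arange(n_states), n_years)

# PCA via SVD on standardised features.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)   # fraction of variance per component
scores = Z @ Vt[:2].T             # projection onto the first two components

# Spread of state centroids versus spread of each state's years around its centroid.
centroids = np.array([scores[state_ids == s].mean(axis=0) for s in range(n_states)])
between = centroids.var(axis=0).sum()
within = np.mean([scores[state_ids == s].var(axis=0).sum() for s in range(n_states)])

print(f"rows: {X.shape[0]}, features: {X.shape[1]}, rows per state: {n_years}")
print(f"variance explained by first two components: {explained[:2].sum():.1%}")
print(f"between-state / within-state spread: {between / within:.1f}")
```

A large ratio means each state's five yearly points sit close together and far from other states, which is the situation that made a single national model hard to fit while leaving only 5 points for any per-state model.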
Accomplishments that we're proud of
It isn't the flashiest of achievements, but we showed that every regression technique we tried is inferior to a very simple benchmark: predicting a state's renewable energy investment for the next year as the average of its previous 5 years. This suggests that, at least to our knowledge, there is no good information to be extracted from the data that would improve predictions. The real-world phenomena modeled by the predictor MSN features are just too noisy to be interpreted meaningfully.
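The benchmark itself is only a few lines. The sketch below uses made-up state names and investment figures purely for illustration; `mean_baseline` and `mean_squared_error` are hypothetical helper names, not functions from our code.

```python
from statistics import mean

# Made-up investment figures (arbitrary units) for 2015-2019, and made-up 2020 values.
history = {
    "State A": [10.0, 12.0, 11.0, 13.0, 14.0],
    "State B": [100.0, 90.0, 110.0, 95.0, 105.0],
    "State C": [5.0, 5.0, 6.0, 4.0, 5.0],
}
actual_2020 = {"State A": 13.0, "State B": 98.0, "State C": 6.0}


def mean_baseline(series_by_state):
    """Predict next year's value for each state as the mean of its past values."""
    return {state: mean(values) for state, values in series_by_state.items()}


def mean_squared_error(predicted, actual):
    """Average squared difference between predictions and actual values."""
    return mean((predicted[state] - actual[state]) ** 2 for state in actual)


baseline_2020 = mean_baseline(history)
baseline_mse = mean_squared_error(baseline_2020, actual_2020)

for state, prediction in baseline_2020.items():
    print(f"{state}: predicted {prediction:.1f}, actual {actual_2020[state]:.1f}")
print(f"baseline MSE: {baseline_mse:.2f}")
```

Any regression model has to beat this number on held-out data to justify its extra complexity.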
What we learned
Not all datasets have easily interpretable patterns and meaning, and the curse of dimensionality is very real. Machine learning is cool, but it takes a large amount of information-rich data to draw meaningful conclusions.