We took on the Goldman Sachs challenge to see how the environment and the stock market are tied together. In our interpretation of the challenge, we used part of the historical environmental data and all of the historical stock data. Specifically, we looked for market trends based on environmental indicators, the yearly average adjusted closing price per stock, and the yearly average volume per stock. For the environmental data, we averaged each indicator across all countries, because the market is not limited to one country and the economies of individual countries are tied together in a highly globalized world economy. However, we recognize that economic globalization has likely increased over time, which may hurt our models' prediction accuracy.

Data Processing

Our first step in processing the data was understanding what it consisted of and how it was structured. Most importantly, we wanted to see where data was missing, what to include, and what to exclude. This let us rule out 3 of the 5 provided tables. We determined that the daily stock history and the environmental data atlas contained the information we wanted. Next, we had to decide how to combine yearly environmental data per country with historical daily stock data. Our first idea was to take the yearly average adjusted closing price of each stock and append it as a feature for each year of environmental data. However, the feature matrix would have had an unmanageable number of features, because every company would add its own column. We then considered focusing on a single sector, but some companies were not classified in any sector, and we felt that a select few companies might have too much influence on a sector. We settled on analyzing a single company at a time and seeing how environmental changes affected its long-term market trends. In the process, we created three custom features: the yearly average adjusted closing price of a single company, the yearly average volume of a single company, and the aggregate change in environmental factors across all countries.
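The yearly averaging step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not our actual pipeline; the row layout `(ticker, date, adjusted close, volume)`, the sample values, and the function name `yearly_averages` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily rows: (ticker, ISO date, adjusted close, volume).
daily_rows = [
    ("AAPL", "2018-01-02", 41.0, 100),
    ("AAPL", "2018-06-01", 45.0, 300),
    ("AAPL", "2019-01-02", 38.0, 200),
    ("AAPL", "2019-06-03", 42.0, 400),
]

def yearly_averages(rows, ticker):
    """Return {year: (mean adjusted close, mean volume)} for one ticker."""
    by_year = defaultdict(lambda: ([], []))
    for symbol, date, adj_close, volume in rows:
        if symbol != ticker:
            continue
        year = int(date[:4])  # assumes dates start with a 4-digit year
        by_year[year][0].append(adj_close)
        by_year[year][1].append(volume)
    return {y: (mean(c), mean(v)) for y, (c, v) in sorted(by_year.items())}

aapl_yearly = yearly_averages(daily_rows, "AAPL")
```

Collapsing the daily data to one value per year puts the stock history on the same yearly grid as the environmental data, at the cost of discarding all within-year movement.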

Feature Selection

For our design matrix, we used all of the historical environmental data starting from the first year of a given company's listing on the stock market, along with that company's yearly average adjusted closing price from the dataset. For example, in our preliminary case we predicted the future stock value of AAPL. Since AAPL has only been on the stock market for 30 years, we had 30 years of environmental data and 30 data points of yearly average adjusted closing price. We were limited to a maximum year of 2019 because the environmental data ends in 2019; for example, AAPL has stock data through 2020, but the environmental data for that year is missing. Overall, we had 49 features: 47 environmental features, 1 year feature, and 1 historical stock feature. We chose the yearly average closing price because the environmental data provided only one observation per year, so yearly averages of the stock values seemed the most natural match.
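A rough sketch of how such a design matrix could be assembled is shown below. The record layout, indicator names, sample values, and the helper `build_design_matrix` are all hypothetical. The sketch assumes every indicator has at least one country value in every year it keeps, and it leaves open whether the price column is then used as a feature or as the target.

```python
from statistics import mean

# Hypothetical environmental records: (country, year, indicator, value).
env_records = [
    ("USA", 2018, "co2", 10.0), ("CHN", 2018, "co2", 20.0),
    ("USA", 2019, "co2", 12.0), ("CHN", 2019, "co2", 22.0),
    ("USA", 2018, "forest", 30.0), ("CHN", 2018, "forest", 50.0),
    ("USA", 2019, "forest", 28.0), ("CHN", 2019, "forest", 48.0),
]
# Hypothetical yearly average adjusted close for one company.
stock_yearly = {2018: 43.0, 2019: 40.0, 2020: 95.0}

def build_design_matrix(env, stock):
    """One row per year: [year, indicator means across countries..., price]."""
    values = {}  # (year, indicator) -> values reported by every country
    for _country, year, indicator, value in env:
        values.setdefault((year, indicator), []).append(value)
    indicators = sorted({ind for _, ind in values})
    env_years = {yr for yr, _ in values}
    rows = []
    # Keep only years covered by both sources (this drops 2020 here).
    for year in sorted(env_years & set(stock)):
        env_means = [mean(values[(year, ind)]) for ind in indicators]
        rows.append([year] + env_means + [stock[year]])
    return ["year"] + indicators + ["avg_adj_close"], rows

columns, matrix = build_design_matrix(env_records, stock_yearly)
```

Intersecting the two sets of years is what enforces the 2019 cutoff: a year with stock data but no environmental data never produces a row.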

Models and Results

At first, we tried the support vector regression (SVR) and linear regression models from scikit-learn. Both predicted the test data poorly. We also started on some other models that we did not finish. Then, we decided to use AutoGluon to fit our data. The models it trained included: ExtraTreesMSE, RandomForestMSE, NeuralNetMXNet, CatBoost, LightGBMLarge, WeightedEnsemble_L2, LightGBM, LightGBMXT, NeuralNetFastAI, XGBoost, KNeighborsDist, KNeighborsUnif. Scoring them by root mean squared error (RMSE), WeightedEnsemble_L2 and CatBoost performed best, at 0.212364 and 0.353104 respectively. Our worst performing models included LightGBM and the neural network.

We also wanted to run a simpler model, using least squares linear regression to predict future stock prices. For the linear model, we had to exclude the stock price from our features and use only the environmental features as predictors. We used forward model selection minimizing the AIC (Akaike information criterion) to determine which variables to include in the final model. We started with an intercept-only model, checked every one-variable model, and selected the one that minimized the AIC. Next, we checked all remaining variables to see which second variable minimized the AIC, and added that one. We repeated this process until the AIC stopped decreasing. In our case, there were enough environmental variables that by the time we had 27 variables the model fit the data completely and the AIC reached negative infinity, so we used the 27-variable model as our forward selection result. We attempted to predict the stock market value of AAPL, and the model was on average 53.19% off the true value for the first 5 years, which makes it a poor choice for prediction.
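The forward selection procedure can be sketched in plain Python as below. This is an illustration under stated assumptions, not the code we ran: it uses the common Gaussian least-squares form of the AIC, n·ln(RSS/n) + 2k, treats a residual sum of squares below a small tolerance as a perfect fit (AIC of negative infinity), and assumes the features are not perfectly collinear. The feature names and toy data are made up.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of a least squares fit with an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(r[j] * r[k] for r in design) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * y[i] for i, r in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)  # normal equations; assumes no collinearity
    return sum((y[i] - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
               for i, r in enumerate(design))

def aic(columns, y):
    """Gaussian AIC up to a constant: n*ln(RSS/n) + 2k (k counts the intercept)."""
    n, k = len(y), len(columns) + 1
    error = rss(columns, y)
    # A (numerically) perfect fit sends ln(RSS/n) to negative infinity.
    return -math.inf if error < 1e-12 else n * math.log(error / n) + 2 * k

def forward_select(features, y):
    """Greedily add the feature that lowers the AIC most; stop when none does."""
    chosen, best = [], aic([], y)  # start from the intercept-only model
    while len(chosen) < len(features):
        scores = {name: aic([features[c] for c in chosen + [name]], y)
                  for name in features if name not in chosen}
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break
        chosen.append(name)
        best = scores[name]
    return chosen, best

# Toy data: y depends strongly on "co2", weakly on "forest", not on "water".
features = {
    "co2":    [-1, -1, -1, -1, 1, 1, 1, 1],
    "forest": [-1, -1, 1, 1, -1, -1, 1, 1],
    "water":  [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [4.5, 5.5, 7.5, 6.5, 13.5, 12.5, 14.5, 15.5]
chosen, best_aic = forward_select(features, y)
```

On this toy data the search adds "co2", then "forest", and stops because adding "water" raises the AIC. With many predictors and few yearly data points, the residuals can instead shrink to zero, which is the situation in which the AIC heads to negative infinity and stops being a useful stopping rule.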

Conclusion

In the end, we concluded that it is hard to correlate environmental data with market trends using our selected features and models. In our modelling, the environmental data and the market trends were at best very loosely related. The majority of the models we tested did not predict our test set well. The models that did perform well, the weighted ensemble and CatBoost, are not among the most widely used models and come out of academic research. However, with more time, we believe that smarter feature selection and more robust custom features would give better results. We would also like to test other companies to see whether performance increases or decreases.
