Predicting S&P 500 stock prices based on commodities data

Who

mpark49, vhuang5, rzyngier

Intro

What problem are you trying to solve and why?

We use monthly spot prices for a variety of commodities as inputs, paired with the corresponding SPY (the ETF that tracks the S&P 500 index) prices as labels, to learn how the two are related and to predict the index from the commodities. We plan to try different sets of commodities to find those that yield the best training and validation results. Our aim is the best model we can build, since a strong stock market model has lucrative implications.

If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper.

We are not implementing an existing paper.

If you are doing something new, detail how you arrived at this topic and what motivated you.

We first considered a couple of papers that applied neural networks to market data, for instance one that trained on Indian stock market futures and then used the model to analyze the US stock market. We all have a passion for investing and find stock market analysis interesting, so a project in this area seemed exciting and a chance to build on the work of other papers. We also thought that the way commodities correlate with stock prices would be an interesting relationship to study.

What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc.

This is a supervised learning problem, specifically regression: we have input and label pairs ready for the model to train on, and each label is a continuous price.

Related Work

While we are not implementing an existing paper, we drew inspiration from a few papers that used deep learning architectures to predict aspects of the stock market. In general, the papers took historical data from a single market index or stock, or from a series of them, to train a model. In particular, Hiransha et al. (link) trained four different architectures to predict stock prices. The paper trained each model on one stock from the National Stock Exchange (NSE, India) and then tested the models on five different companies from both the NSE and the New York Stock Exchange (NYSE). The paper observed that the model trained on the NSE was able to reach reasonable predictions for the NYSE, as the two markets share similar dynamics. This observation led us to consider exploring the correlation between commodities and the overall NYSE, using the S&P 500 as a benchmark. When looking at other papers (link, link, link), we saw that many used LSTM models and cited their ability to model nonlinear time series data. Thus, we drew inspiration from those papers in electing to implement our project using an LSTM model.

Data

As of now, we plan to use link to get historical monthly commodity data extending back 30 years. We have to export the data to Excel files, which we can then join on date to build a combined dataset of historical commodity prices. This dataset is already relatively simple and clean; for preprocessing, we may just have to parse some values into a more readable string format and convert strings to floats. For the S&P 500 (and other general stock data), we plan to use a Yahoo Finance API (either yfinance or yahoo_fin). Most of the data will be relatively clean and the datasets are pretty small. The only thing we really need to worry about is formatting them into the correct shapes to be passed in as inputs and labels for our model.
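The join-on-date and shaping steps described above could look roughly like the sketch below. It uses only the Python standard library and works on CSV text rather than Excel files. The column names (`Date`, `Price`), the `YYYY-MM` date format, the helper names, and all sample values are made up for illustration. Pairing each window with the SPY price in the window's final month is also an assumption; shifting the label forward would turn this into a forecast.

```python
import csv
import io

def read_series(csv_text, value_column="Price"):
    """Parse a CSV export into {date_string: float_price}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Strip thousands separators so strings like "1,500.0" convert to floats.
    return {row["Date"]: float(row[value_column].replace(",", "")) for row in reader}

def join_on_date(series_by_name):
    """Inner-join several {date: price} dicts, keeping dates present in all of them."""
    names = list(series_by_name)
    common = set.intersection(*(set(s) for s in series_by_name.values()))
    # "YYYY-MM" strings sort chronologically as plain text.
    return [(d, [series_by_name[n][d] for n in names]) for d in sorted(common)]

def make_windows(rows, targets, window):
    """Build (inputs, labels) with inputs shaped [samples][window][features]."""
    inputs, labels = [], []
    for end in range(window, len(rows) + 1):
        chunk = rows[end - window:end]
        last_date = chunk[-1][0]
        if last_date not in targets:
            continue  # no label available for this month
        inputs.append([features for _, features in chunk])
        labels.append(targets[last_date])
    return inputs, labels

# Toy data (not real prices). Gold has one extra month that oil lacks.
GOLD_CSV = '''Date,Price
2020-01,"1,500.0"
2020-02,"1,510.0"
2020-03,"1,520.0"
2020-04,"1,530.0"
2020-05,"1,540.0"
'''
OIL_CSV = '''Date,Price
2020-01,50.0
2020-02,51.0
2020-03,52.0
2020-04,53.0
'''
spy = {"2020-01": 300.0, "2020-02": 305.0, "2020-03": 310.0, "2020-04": 315.0}

rows = join_on_date({"gold": read_series(GOLD_CSV), "oil": read_series(OIL_CSV)})
inputs, labels = make_windows(rows, spy, window=3)
```

The inner join drops any month that is missing from at least one series, which keeps every feature vector the same length at the cost of discarding some data.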

Methodology

What is the architecture of your model?

We are going to use an RNN (LSTM) with 2 LSTM layers, dropout layers to prevent overfitting (because we do not have much data), and 2-3 dense layers (including the softmax layer).
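A minimal sketch of such a network, assuming TensorFlow's Keras API, is shown below. The window length, feature count, layer widths, and dropout rate are placeholder values, not choices taken from the proposal. Note one deliberate difference: the final layer here is a single linear unit rather than a softmax layer, on the assumption that the output is one continuous price.

```python
import tensorflow as tf

def build_model(window, n_features, dropout_rate=0.2):
    """Two stacked LSTM layers with dropout, followed by two dense layers."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, n_features)),
        # return_sequences=True passes the full sequence to the second LSTM.
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(16, activation="relu"),
        # Assumption: a linear output for a continuous price, not softmax.
        tf.keras.layers.Dense(1),
    ])
    model.compile(
        optimizer="adam",
        loss="mean_squared_error",
        metrics=["mean_absolute_percentage_error"],
    )
    return model

# Hypothetical sizes: 12 months of history, 5 commodities.
model = build_model(window=12, n_features=5)
```

Calling `model.summary()` prints the layer stack and parameter counts, which is a quick way to check that the model is small enough for a few hundred monthly samples.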

How are you training the model?

We are training the model using an 80/20 split for training and testing: we will train on 80% of the historical data and test on the remaining 20%.
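One way to make the 80/20 split is sketched below. It assumes the samples are already in chronological order and cuts them without shuffling, so the test set is the most recent 20%. The proposal does not say whether the split is chronological or random, so that choice and the function name are assumptions.

```python
def chronological_split(samples, train_fraction=0.8):
    """Split ordered samples into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

# 30 years of monthly data gives 360 samples; integers stand in for them here.
months = list(range(360))
train, test = chronological_split(months)
```

Keeping the order intact means the model is never tested on months that precede its training data.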

If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.

If we do not find convincing results with the amount of data that we have, we plan to try different commodities. If we still do not find anything convincing, we are also considering a mixed commodity ETF, which has much more granular price data than we have available for pure commodities.

Metrics

What constitutes “success?”

We are planning to minimize mean absolute percentage error or L2 error.
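For reference, both metrics can be written in a few lines of plain Python. This sketch reads "L2 error" as mean squared error, which is an interpretation rather than something the proposal states. The function names and sample numbers are illustrative.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actual values must be nonzero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

def l2_error(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Toy prices, not real data.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
mape_value = mape(actual, predicted)
l2_value = l2_error(actual, predicted)
```

MAPE is scale-free, so it stays comparable across targets priced at different levels, while squared error penalizes large misses more heavily.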

What experiments do you plan to run?

We will initially train the model on the commodity data, with the model outputting prices corresponding to the S&P 500. As we develop the model, we may add or remove dense and dropout layers. We also anticipate needing to fine-tune our hyperparameters to optimize the model. If the model performs reasonably well, we would then modify it to pursue the target and stretch goals detailed below.

For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?

Accuracy doesn’t really apply in the traditional sense for this project, because the model outputs continuous prices rather than class labels. Minimizing loss is a more appropriate measure.

If you are doing something new, explain how you will assess your model’s performance.

We will assess the model’s performance on a relative basis using MSE. Because MSE may vary substantially between models, we expect to calculate MSE separately for each configuration, let's say for a given commodity, and compare the results we get.
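A relative comparison of this kind might be sketched as follows: compute MSE for each model's predictions against the same targets, then rank the models from lowest to highest error. The model names and all numbers are invented for illustration.

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rank_by_mse(actual, predictions_by_model):
    """Return (model_name, mse) pairs sorted from lowest to highest MSE."""
    scores = {name: mse(actual, preds) for name, preds in predictions_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Toy targets and predictions from two hypothetical single-commodity models.
actual = [100.0, 102.0, 104.0]
predictions = {
    "oil_only": [90.0, 110.0, 100.0],
    "gold_only": [101.0, 101.0, 105.0],
}
ranking = rank_by_mse(actual, predictions)
```

Because every model is scored against the same test targets, the MSE values are directly comparable even though the absolute numbers depend on the price scale.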

What are your base, target, and stretch goals?

Our base goal is for the model to detect a correlation between the commodities and the S&P 500 during training and to predict the S&P 500 reasonably well. Our target goal is to extend the model to sector performance: rather than predicting the S&P 500 price, we would hope that it can be trained on, and perform well on, a sector-specific ETF. Lastly, our stretch goal is for the model to be trained on a specific stock.

Ethics

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

The major stakeholders are essentially the holders of any assets in traded markets. The potential consequence of relying on a model that attempts to predict the market is financial loss for those stakeholders. If individuals trusted the model and invested accordingly, they would stand to suffer financial consequences when it is wrong.

How are you planning to quantify or measure error or success? What implications does your quantification have?

Since stock prices and trends are continuous, we would prioritize minimizing loss as our measure of success or error. The accuracy of our predictions is also an important metric, but we would defer to the loss as the more telling indicator of success. Our project does not carry very heavy ethical implications. However, if, for example, our model predicted S&P 500 prices from commodity prices extremely accurately and became well known, there may be some room for market manipulation to directly affect market prices.

Division of labor

We intend to evenly distribute the workload for the project.
