This project is a collaboration between my dad and me. We were inspired by Blei, Ng, and Jordan's paper "Latent Dirichlet Allocation," which describes an expectation-maximization algorithm for identifying the terms that make up topical clusters in a corpus.

What it does

Our model finds the top 20 clusters of 10 words each using summaries from the US Federal Reserve for a given time frame.

How we built it

We built our model using scikit-learn's TfidfVectorizer and LatentDirichletAllocation implementations, along with other scikit-learn, NumPy, pandas, and seaborn functionality. We used BeautifulSoup to read XML files from the Federal Reserve.

Challenges we ran into

We spent a while thinking of metrics to show that two vectors are dissimilar. This is important when comparing the top policies for different time frames. The metric we chose was 1 - cosine similarity, because it generalizes to R^10,000, the space in which each term's sentiment vector lies, and is easy to understand.
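The metric itself is short enough to write out. The sketch below uses only the standard library; the function name `cosine_distance` and the choice to raise an error on zero vectors are assumptions made for the example, not a description of our code.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle between a zero vector and anything else is undefined.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

The result is 0 for vectors pointing in the same direction, 1 for orthogonal vectors, and 2 for opposite vectors. Because it depends only on the angle between the vectors, it works the same way in 2 dimensions as in 10,000.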

Accomplishments that we're proud of

This is our first hackathon and datathon. We are proud that our solution can answer all of the challenge's questions and generalizes to recent data as well.

What we learned

We learned about LDA and project flow for datathons.

Judging Criteria

What were the common regulatory priorities of these agencies from 2001 through 2006?

From 2001 to 2006, federal agencies were focused on increasing the information that public companies disclose to investors. The phrases internal reporting, control, and disclosure were among the top 20 most discussed, as two big cases of hidden practices, WorldCom and Enron, led to significant volatility in the stock market. The period was also marked by discussion of legislation aimed at forcing publicly traded corporations to disclose financial documents. The Securities Exchange Act of 1934, which mandated that companies disclose any information pertinent to investors, was discussed alongside the newly drafted Sarbanes-Oxley Act. The Sarbanes-Oxley Act responded to fraudulent companies' lack of disclosure by imposing heavy penalties on noncompliant firms.

What new topics and issues emerged from 2007 through 2012? (And what topics went away?)

Security-based swaps and risk swaps emerged as new topics after 2007. The trading of credit for mortgages and loans created volatility in the previously stable housing market. As banks and firms were unable to manage swaps and mortgages, the Great Recession began. The Great Recession's impact on the lives of everyday Americans prompted the agencies to analyze credit card fees.

How significantly do topics shift between administrations? (Bush-Obama in 2008-09, Obama-Trump in 2016-17)

Federal policies across the Bush-to-Obama transition were mostly distinct. The Bush presidency aligned with regulations on fraudulent corporations, whereas the early Obama presidency was spent solving issues related to the Great Recession. The topics federal agencies discussed across the Obama-to-Trump transition were more similar. Both administrations had a cluster of documents detailing how the Consumer Financial Protection Bureau can better regulate the market. Additionally, a cluster with trading, futures, and commissions is shared between the two presidents. The administrations differ in their views on the Dodd-Frank Act and the Securities Exchange Act. Obama's office discussed rules, requirements, and reform, in contrast with Trump's office, which pushed reform, amendments, and adjustments.

How many times, on average, would a topic need to be discussed before it is flagged as an emerging topic?

This project uses a term frequency-inverse document frequency (TF-IDF) vectorization algorithm. This scheme weighs words that appear frequently in a few documents more heavily than both words that appear infrequently and words that appear in many documents (like stopwords). For this analysis, each word in the top 20 topics is discussed approximately 50.8 times on average, which serves as an experimental upper bound for the desired statistic.
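To make the weighting behavior concrete, here is a minimal sketch of the textbook TF-IDF formula, weight = count × log(N / document frequency). This is an assumption chosen for clarity: scikit-learn's TfidfVectorizer uses a smoothed variant with normalization by default, so its numbers will differ, and the helper name `tf_idf` and whitespace tokenization are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using count * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weights
```

With the toy corpus `["the swap swap swap risk", "the rule", "the fee"]`, the word "the" appears in every document and gets weight 0 in this variant, "risk" appears once in one document and gets a small weight, and "swap" appears three times in one document and gets the largest weight.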

Events for testing methods and approaches include the housing crisis of 2008 (and prior/subsequent rulemaking leading up to it) and the still ongoing COVID-19 crisis. A good unsupervised approach would identify both events as new clusters.

This project utilizes an unsupervised Latent Dirichlet Allocation model to identify the top 20 topics in a time frame as well as the 10 words or phrases that most contribute to each topic. The LDA was able to identify key indicators of the 2008 recession, such as an increase in the term risk alongside swaps, loans, mortgages, card fees, and other financial institutions associated with the event. Additionally, the model was able to group the terms covid and 19 in a topic about interim rules and agencies brought on by the extraordinary circumstances.
