TAMU Datathon 2021 Bloomberg Company Challenge

(1) What were the common regulatory priorities of these agencies from 2001 through 2006?

Common regulatory priorities of these agencies included banks and banking, reporting and record keeping, and securities. These were by far the most common topics throughout these years, appearing in nearly half (48%) of all publications.

(2) What new topics and issues emerged from 2007 through 2012? (And what topics went away?)

New topics that emerged include consumer protection and savings associations. Otherwise, the frequency of topics stayed about the same. No previously popular topic experienced a significant relative drop in frequency.

(3) How significantly do topics shift between administrations? (Bush-Obama in 2008-09, Obama-Trump in 2016-17)

Topics tend to stay fairly consistent between administrations. The three most common topics remain the same from one administration to the next, and the frequencies of all other topics stay close to their previous levels.

Sample cluster data for (2) is accessible here (https://docs.google.com/spreadsheets/d/1PdiNlAEgDvV7JmU2ybpjHCQnsbp3XmZD/edit?usp=sharing&ouid=111870485620435403352&rtpof=true&sd=true); full cluster data can be obtained via GitHub.

About the Project & Challenges:

Our idea in this project was to group the documents into clusters so that we could learn about the hot topics during a given time period. We started by using the Federal Register API to acquire the data. We felt that in order to properly classify a record we needed its abstract, title, and corresponding topics. We used the urllib library in Python to request the data in JSON format, which let us easily parse out the fields we wanted and save them all to a CSV file.

Our next challenge was to figure out how to cluster the documents. Clustering requires representing each document as a vector of numbers and measuring similarities or distances between documents. A system that could give us the most similar documents would in turn show us the most popular topics that were grouped together or appeared in a certain time period. We decided to represent the documents as vectors using the representations learned by a feed-forward neural network (FFN).

We first converted the text into fixed-size inputs, since FFNs traditionally don't accept variable-length input. For the topics, we found that there were 54 unique topics, so each topics vector was 54-dimensional, with a 1 in the i-th position indicating the presence of the i-th topic. For the text input, we created a bag-of-words model using the n most popular words in our dataset. Each input vector was then n-dimensional, with the i-th element holding the frequency of the i-th word in a piece of text. The model is then trained to predict the topics vector from the text vector. To create the text vector, we used either the abstract or the title. We found that the abstract resulted in ~75% accuracy (n = 25) and the title resulted in ~65% accuracy (n = 15) after training for 100 epochs. As a result, we chose to use the abstract as the model input to generate our vectors.
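To make the vectorization step concrete, here is a minimal sketch of how the bag-of-words text vector and the multi-hot topics vector could be built. It is not our actual preprocessing code: the helper names (tokenize, build_vocabulary, text_vector, topic_vector), the simple regex tokenizer, and the two sample abstracts are all assumptions made for illustration, and it uses only the Python standard library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts, n):
    """Return the n most frequent words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


def text_vector(text, vocabulary):
    """n-dimensional vector: element i is how often vocabulary[i] occurs."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def topic_vector(topics, all_topics):
    """Multi-hot vector: element i is 1 if all_topics[i] is attached."""
    present = set(topics)
    return [1 if topic in present else 0 for topic in all_topics]


# Hypothetical sample records (not taken from the real dataset).
abstracts = [
    "Banks must report capital ratios to the agency.",
    "The agency proposes new reporting rules for banks and securities dealers.",
]
doc_topics = [["Banks, banking"], ["Banks, banking", "Securities"]]

vocabulary = build_vocabulary(abstracts, n=5)
all_topics = sorted({t for topics in doc_topics for t in topics})

X = [text_vector(a, vocabulary) for a in abstracts]    # model inputs
Y = [topic_vector(t, all_topics) for t in doc_topics]  # prediction targets
```

With the real data, all_topics would hold the 54 unique topics and n would be the vocabulary size being tested, so every document yields an n-dimensional input and a 54-dimensional target regardless of its length.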
We then used an intermediate layer in the model to generate “representations” of our text data. The hope was that, while learning to predict the topics, the model would learn better ways to represent the text in its intermediate layers than we could get without an FFN. Once we had a representation vector for each document, we ran K-means with the number of clusters set to 5 to get the resulting clusters. Based on these clusters, we were able to analyze trends over time by restricting the documents we considered to a given time period.
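The clustering step can be sketched roughly as follows. This is an illustration under stated assumptions rather than our exact pipeline: the representation vectors are random stand-ins for the intermediate-layer outputs, the 16-dimensional size and the years array are made up, and the cluster_counts helper is hypothetical. Only the choice of K-means with 5 clusters comes from the project itself.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for the intermediate-layer representations (assumption):
# 200 documents, each represented by a 16-dimensional vector.
representations = rng.normal(size=(200, 16))

# Hypothetical publication year for each document, 2001 through 2012.
years = rng.integers(2001, 2013, size=200)

# K-means with 5 clusters, as in the project.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(representations)


def cluster_counts(labels, years, start, end, n_clusters=5):
    """Count documents per cluster among those published in [start, end]."""
    mask = (years >= start) & (years <= end)
    return np.bincount(labels[mask], minlength=n_clusters)


early = cluster_counts(labels, years, 2001, 2006)
late = cluster_counts(labels, years, 2007, 2012)
print("2001-2006:", early)
print("2007-2012:", late)
```

Comparing the per-cluster counts for two time windows is one simple way to see which clusters grow or shrink between periods. Fitting a separate K-means model on each window's documents would be an alternative reading of the same idea.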

We faced some challenges in making this project, which made the programming take significantly longer. The first challenge was that we had to preprocess all of the text data from scratch. This was difficult because preprocessing needed many features, such as padding, checking for valid words, and support for a configurable number of most popular words. The preprocessing code also ran into many bugs, since text data is difficult to work with. Training the neural network wasn't an easy task either, because we had to tune the network and test which data type (abstract or title) worked best with various hyperparameters. This was difficult because our time was limited and we needed to get the other components working on time. Given more time to work on this, we believe we could greatly improve the resulting clusters with a stronger neural network and a more extensive testing phase for clustering algorithms.

Built With

federal-register-api
json
jupyter
numpy
pandas
python
sklearn
tensorflow

Updates

Vinesh Ravuri started this project — Oct 17, 2021 03:58 AM EDT
