Here Comes the Bloom(berg Industry Challenge)

Inspiration

We chose this project because the objective struck a good balance between structure and open-endedness. We had little experience with text data or natural language processing (NLP) techniques, so we wanted to learn more about them. We also had little experience with unlabeled data, so we knew that unsupervised learning would be a fun challenge, and we wanted to see how far we could take it in a short time.

What it does

Our project parses all of the federal report documents we were given and identifies the largest paragraphs/sections of each report, since that tends to be where the bulk of the content resides. We then preprocess the data: we throw out short, insignificant parts (such as citations), remove stop words, and stem/lemmatize the remaining text. Finally, we apply an unsupervised clustering method and visualize the results.
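As a rough illustration of the preprocessing idea (not our exact pipeline), here is a minimal standard-library sketch of stop-word removal followed by a bag-of-words count. The stop-word list, the `tokenize` and `bag_of_words` helpers, and the sample sentence are all made up for this example, and stemming/lemmatization is left out to keep it short.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are",
              "for", "on", "that", "this", "with"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 3+ letters that are not stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= 3 and w not in STOP_WORDS]

def bag_of_words(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)

# Hypothetical sample sentence, not taken from the challenge data.
sample = "The agency is reviewing the budget, and the budget review is ongoing."
tokens = tokenize(sample)
bow = bag_of_words(tokens)
```

Without stemming, "reviewing" and "review" stay separate tokens here, which is exactly the kind of split that stemming/lemmatizing is meant to merge before clustering.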

How we built it

We used BeautifulSoup to parse the XML documents. We chose to focus on the 'p' tags because they held the main information we needed. We needed a way to process the data consistently, and we noticed that our parsing picked up a lot of short fragments, like citations, that had nothing to do with the overall content. Using a histogram, we plotted the frequency of different paragraph word counts, chose a threshold of 300 words, and filtered out tags with fewer words than that. Focusing only on the 'meat' of our data, we used the Gensim Python package to clean the data, and then we used its built-in Latent Dirichlet Allocation (LDA) model to find patterns in the documents (using a bag-of-words representation). Finally, we used the pyLDAvis Python package to visualize our results.
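The filtering step can be sketched roughly as follows. To keep the example dependency-free, it swaps in Python's standard-library `xml.etree.ElementTree` in place of BeautifulSoup; the `long_paragraphs` helper and the miniature document are hypothetical, and only the 300-word threshold comes from our actual project.

```python
import xml.etree.ElementTree as ET

MIN_WORDS = 300  # threshold we picked from the word-count histogram

def long_paragraphs(xml_text, min_words=MIN_WORDS):
    """Return the text of every <p> element that has at least min_words words."""
    root = ET.fromstring(xml_text)
    kept = []
    # Note: iter("p") only matches un-namespaced <p> tags.
    for p in root.iter("p"):
        # itertext() also collects text nested inside child tags of the paragraph.
        text = " ".join("".join(p.itertext()).split())
        if len(text.split()) >= min_words:
            kept.append(text)
    return kept

# Hypothetical miniature document: one citation-sized paragraph and one long one.
doc = "<report><p>See 12 U.S.C. 1841.</p><p>" + "policy " * 350 + "</p></report>"
paragraphs = long_paragraphs(doc)
```

Here only the 350-word paragraph is kept; the short citation-style paragraph falls below the threshold and is dropped.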

Challenges we ran into

The primary challenge we ran into was that our group had no experience with natural language processing. We had to learn a lot of background about data preprocessing for natural language and why it matters, which was a lot of overhead at first. We also had a lot of documentation to read for pyLDAvis and Gensim to understand the libraries. Allocating tasks was pretty difficult, because in SWE it is common to work on different files/features separately and put them together at the end. On this data-heavy project, we had trouble using our time efficiently, since some parts of the project relied on data that we hadn't yet been able to preprocess. Luckily, we dedicated some time to figuring out a plan so that we could all be productive simultaneously, and that helped us going forward.

Accomplishments that we're proud of

We are proud of our final product for sure. I'm not suggesting that it's the "best" project out there, but we made a plan, stuck to it throughout the datathon, adjusted when needed, and delivered a working project with a visualization. For me personally, this is the first hackathon in which I submitted a project (I haven't had the chance to go in the past year and a half, since HowdyHack as a freshman), so I'm definitely proud of that.

What we learned

We learned a lot. Mainly, I'd say the NLP packages were super interesting to learn about. We all had familiarity with data analytics, but we didn't have much experience with text data and NLP. Outside of our project, I liked the workshops hosted by the companies, where we learned about tools like Dremio and why they are important. While they aren't tools that I work with daily, it is helpful to know about the cutting-edge tools the industry is using to improve its operations.

What's next for Bloomberg Industry Challenge

We wish we had had more time to improve the performance. Some of the clusters were close together, so more time to tune hyperparameters (like the number of topics) could have helped us deliver a better project. Further, we would like to add extra features like topic emergence/topic prediction. Currently, our project only clusters the existing data for a time frame we specify (e.g., 2008-2009 for the Obama administration), and we have to interpret the clusters ourselves based on the common words that characterize them. With more time, predicting topic emergence is what we would be most interested in working on.