What it does

It uses TF-IDF word-count vectors, filtered through WordCloud's stopword list, to measure the frequency of each keyword. These keywords serve as labels for a k-Nearest-Neighbors lookup. For each input embedding, it returns

  • its 8 nearest neighbors, and
  • each neighbor's top 12 keywords, giving 8 × 12 = 96 clue words per embedding with which to guess the article's contents.
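The pipeline above can be sketched as follows. This is a minimal, self-contained illustration, not the project's actual code: the function names (`clue_words`, `top_keywords`) are hypothetical, the stopword set is a tiny stand-in for WordCloud's `STOPWORDS`, and a plain cosine similarity over sparse TF-IDF dicts stands in for whatever embedding distance the project uses.

```python
import math
from collections import Counter

# Stand-in for WordCloud's STOPWORDS list (assumption, not the real list)
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def tfidf(docs):
    """TF-IDF weights per document, with stopwords filtered out."""
    counts = [Counter(w for w in d.lower().split() if w not in STOPWORDS)
              for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())            # document frequency of each term
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in c.items()}
            for c in counts]

def top_keywords(vec, k=12):
    """The k highest-weighted terms of one TF-IDF vector."""
    return [w for w, _ in sorted(vec.items(), key=lambda x: -x[1])[:k]]

def cosine(a, b):
    """Cosine similarity between two sparse term->weight dicts."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def clue_words(docs, query_idx, n_neighbors=8, k=12):
    """Top-k keywords from each of the query's n nearest neighbors."""
    vecs = tfidf(docs)
    q = vecs[query_idx]
    sims = sorted(((cosine(q, v), i) for i, v in enumerate(vecs)
                   if i != query_idx), reverse=True)
    clues = []
    for _, i in sims[:n_neighbors]:
        clues.extend(top_keywords(vecs[i], k))
    return clues
```

With the defaults (`n_neighbors=8`, `k=12`) this yields up to 96 clue words per query, matching the counts described above.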

The overall classifier is a custom estimator based on Bayesian classification, but using Gaussian mixtures under the hood. This allows more flexible shaping of the clusters' boundaries, as well as a more graceful treatment of "gray area" embeddings that sit between classes.
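One way to realize such an estimator is sketched below, assuming scikit-learn is available: fit one `GaussianMixture` per class as the class-conditional density p(x|y), then apply Bayes' rule, p(y|x) ∝ p(x|y)·p(y). The class name and hyperparameters are illustrative, not the project's actual implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMBayesClassifier:
    """Bayes classifier with one Gaussian mixture per class.

    p(y|x) is proportional to p(x|y) * p(y), where p(x|y) is modeled by a
    per-class GaussianMixture. Soft posteriors handle "gray area" points.
    """

    def __init__(self, n_components=3, random_state=0):
        self.n_components = n_components
        self.random_state = random_state

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Class priors p(y) estimated from label frequencies
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        # One mixture model per class, fit on that class's samples only
        self.models_ = [
            GaussianMixture(n_components=self.n_components,
                            covariance_type="full",
                            random_state=self.random_state).fit(X[y == c])
            for c in self.classes_
        ]
        return self

    def predict_proba(self, X):
        # log p(x|y) + log p(y), then normalize over classes (log-sum-exp)
        log_joint = np.column_stack([
            m.score_samples(X) + np.log(p)
            for m, p in zip(self.models_, self.priors_)
        ])
        log_joint -= log_joint.max(axis=1, keepdims=True)
        probs = np.exp(log_joint)
        return probs / probs.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```

The appeal of this design is that `predict_proba` gives calibrated-looking soft posteriors: an embedding lying between two clusters gets probability mass split across both classes instead of a brittle hard assignment.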

Built With

Updates