What it does
It uses a TF-IDF word-count vector, combined with the WordCloud stopword filter, to measure the frequency of each keyword. These keywords serve as labels for the k-Nearest-Neighbors algorithm. For each input embedding, it returns
- its 8 nearest neighbors, and
- each neighbor's top 12 keywords.

This yields 96 clue words per embedding for guessing the article's contents.
The general classifier is a custom estimator based on Bayesian classification but using Gaussian Mixtures under the hood. This allows more flexible shaping of the cluster boundaries, as well as a more elegant handling of "gray area" embeddings.
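One common way to realize such an estimator, and a plausible reading of the description above, is to fit one Gaussian Mixture per class and classify by the Bayes rule (prior times class-conditional likelihood), rejecting low-posterior points as the "gray area". This is a hedged sketch of that idea, not the project's actual code; the class name and the `reject_below` threshold are hypothetical:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMBayesClassifier:
    """Bayes-style classifier: one GaussianMixture per class.

    Predicts argmax_c [ log p(x | c) + log p(c) ]; optionally labels
    points whose max posterior falls below `reject_below` as -1
    ("gray area" rejects). Illustrative sketch only.
    """

    def __init__(self, n_components=2, reject_below=None):
        self.n_components = n_components
        self.reject_below = reject_below

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        self.models_ = [
            GaussianMixture(self.n_components, random_state=0).fit(X[y == c])
            for c in self.classes_
        ]
        return self

    def predict(self, X):
        # log-posterior up to a constant: log p(x|c) + log p(c)
        log_post = np.column_stack(
            [m.score_samples(X) + np.log(p)
             for m, p in zip(self.models_, self.priors_)]
        )
        labels = self.classes_[log_post.argmax(axis=1)]
        if self.reject_below is not None:
            # normalize to proper posteriors, reject uncertain points
            post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
            post /= post.sum(axis=1, keepdims=True)
            labels = np.where(post.max(axis=1) >= self.reject_below, labels, -1)
        return labels
```

Compared with a single hard decision boundary, the per-class mixtures let each cluster take a multi-modal, non-spherical shape, and the posterior threshold gives a principled rule for deferring on ambiguous embeddings.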