What it does

It uses TF-IDF word-count vectors, filtered through WordCloud's stopword list, to measure the frequency of each keyword. These keywords serve as labels for a k-Nearest-Neighbors lookup. For each input embedding, it returns

  • its 8 nearest neighbors, and
  • each neighbor's top 12 keywords, giving 8 × 12 = 96 clue words per embedding with which to guess the article's contents.
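The pipeline above can be sketched as follows. This is a minimal, self-contained illustration, not the project's actual code: the function names (`clue_words`, `top_keywords`) are hypothetical, the stopword set is a tiny stand-in for WordCloud's `STOPWORDS`, and a plain cosine similarity over sparse TF-IDF dicts stands in for whatever embedding distance the project uses.

```python
import math
from collections import Counter

# Stand-in for WordCloud's STOPWORDS list (assumption, not the real list)
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def tfidf(docs):
    """TF-IDF weights per document, with stopwords filtered out."""
    counts = [Counter(w for w in d.lower().split() if w not in STOPWORDS)
              for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())            # document frequency of each term
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in c.items()}
            for c in counts]

def top_keywords(vec, k=12):
    """The k highest-weighted terms of one TF-IDF vector."""
    return [w for w, _ in sorted(vec.items(), key=lambda x: -x[1])[:k]]

def cosine(a, b):
    """Cosine similarity between two sparse term->weight dicts."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def clue_words(docs, query_idx, n_neighbors=8, k=12):
    """Top-k keywords from each of the query's n nearest neighbors."""
    vecs = tfidf(docs)
    q = vecs[query_idx]
    sims = sorted(((cosine(q, v), i) for i, v in enumerate(vecs)
                   if i != query_idx), reverse=True)
    clues = []
    for _, i in sims[:n_neighbors]:
        clues.extend(top_keywords(vecs[i], k))
    return clues
```

With the defaults (`n_neighbors=8`, `k=12`) this yields up to 96 clue words per query, matching the counts described above.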

The overall classifier is a custom estimator based on Bayesian classification, but using Gaussian mixtures under the hood. This allows more flexible shaping of the clusters' boundaries, as well as a more graceful treatment of "gray area" embeddings that sit between classes.
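One way to realize such an estimator is sketched below, assuming scikit-learn is available: fit one `GaussianMixture` per class as the class-conditional density p(x|y), then apply Bayes' rule, p(y|x) ∝ p(x|y)·p(y). The class name and hyperparameters are illustrative, not the project's actual implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMBayesClassifier:
    """Bayes classifier with one Gaussian mixture per class.

    p(y|x) is proportional to p(x|y) * p(y), where p(x|y) is modeled by a
    per-class GaussianMixture. Soft posteriors handle "gray area" points.
    """

    def __init__(self, n_components=3, random_state=0):
        self.n_components = n_components
        self.random_state = random_state

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Class priors p(y) estimated from label frequencies
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        # One mixture model per class, fit on that class's samples only
        self.models_ = [
            GaussianMixture(n_components=self.n_components,
                            covariance_type="full",
                            random_state=self.random_state).fit(X[y == c])
            for c in self.classes_
        ]
        return self

    def predict_proba(self, X):
        # log p(x|y) + log p(y), then normalize over classes (log-sum-exp)
        log_joint = np.column_stack([
            m.score_samples(X) + np.log(p)
            for m, p in zip(self.models_, self.priors_)
        ])
        log_joint -= log_joint.max(axis=1, keepdims=True)
        probs = np.exp(log_joint)
        return probs / probs.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```

The appeal of this design is that `predict_proba` gives calibrated-looking soft posteriors: an embedding lying between two clusters gets probability mass split across both classes instead of a brittle hard assignment.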

Built With

Updates