What it does
We first used K-Means clustering to identify the optimal number of clusters. The program then applies Non-negative Matrix Factorization (NMF) to get the latent document-topic and term-topic probability matrices. Using the document-topic matrix, we selected the 10 articles, drawn from either source, that best represented each topic and saved a weighted average of their embeddings. Finally, we compared each topic's weighted average to the 5 challenge embeddings to determine which topic's embedding most closely matched each challenge embedding.
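The NMF-and-matching steps above can be sketched roughly as follows. This is a minimal illustration, not our notebook code: the function names, the `doc_term` and `doc_embeddings` inputs, and the choice to weight each article by its document-topic value are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity


def topic_embeddings(doc_term, doc_embeddings, n_topics, top_n=10, seed=0):
    """Return one weighted-average embedding per topic (hypothetical helper).

    doc_term:       (n_docs, n_terms) non-negative matrix, e.g. TF-IDF counts.
    doc_embeddings: (n_docs, dim) array with one embedding per article.
    """
    model = NMF(n_components=n_topics, init="nndsvda",
                random_state=seed, max_iter=500)
    doc_topic = model.fit_transform(doc_term)  # shape (n_docs, n_topics)

    averages = []
    for t in range(n_topics):
        weights = doc_topic[:, t]
        # Indices of the top_n articles with the largest weight on topic t.
        top = np.argsort(weights)[::-1][:top_n]
        w = weights[top]
        if w.sum() == 0:
            # Degenerate topic: fall back to a plain (unweighted) mean.
            w = np.ones_like(w)
        averages.append(np.average(doc_embeddings[top], axis=0, weights=w))
    return np.vstack(averages)


def match_challenges(challenge_embeddings, topic_avgs):
    """For each challenge embedding, return the index of the closest topic."""
    sims = cosine_similarity(challenge_embeddings, topic_avgs)
    return sims.argmax(axis=1)
```

Calling `match_challenges(challenges, topic_embeddings(X, E, n_topics=k))` would then give one topic index per challenge embedding, where `k` is the cluster count chosen in the K-Means step.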
How we built it
We developed in Jupyter Notebooks (.ipynb). The pandas package was helpful for reading and cleaning the data, and scikit-learn was extremely useful because of its built-in NMF and cosine similarity implementations.
Challenges we ran into
We first tried simple K-Means clustering with 11 clusters but had trouble with boundary articles influencing the results. We also tried LDA, a winner from last year's challenge, and found that it struggled with terms repeating across topics on this year's corpora. We wanted to measure how close each embedding was to every topic's average embedding using a modified Gram-Schmidt algorithm, but we did not have enough time to debug our implementation.
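For reference, here is a sketch of the textbook modified Gram-Schmidt procedure applied to topic average embeddings. It is not our unfinished implementation: the function names and the idea of reading off a challenge embedding's coordinates in the orthonormalized topic basis are assumptions about one way the idea could work.

```python
import numpy as np


def modified_gram_schmidt(vectors, tol=1e-10):
    """Orthonormalize the rows of `vectors` with modified Gram-Schmidt.

    Rows that are (numerically) linear combinations of earlier rows
    are dropped, so the result may have fewer rows than the input.
    """
    basis = []
    for v in np.asarray(vectors, dtype=float):
        w = v.copy()
        for q in basis:
            # Project out q from the *updated* w; this is what
            # distinguishes the modified variant from the classical one.
            w -= np.dot(q, w) * q
        norm = np.linalg.norm(w)
        if norm > tol:
            basis.append(w / norm)
    return np.array(basis)


def explained_by_topics(embedding, basis):
    """Return the embedding's coordinates in the basis and the fraction
    of its norm that lies in the span of the topic vectors."""
    coords = basis @ embedding
    return coords, np.linalg.norm(coords) / np.linalg.norm(embedding)
```

One caveat with this approach: the coordinates depend on the order in which topic vectors are orthonormalized, so they are not a direct per-topic similarity score.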
Accomplishments that we're proud of
We are proud that, after creating a train-test split and rerunning the same process, the results on the test set checked out.
What we learned
We learned about Non-negative Matrix Factorization and how to statistically interpret matrix factorizations.
What's next for EmbedToText
I want to keep hacking on the modified Gram-Schmidt approach and see if I can find multiple topics that each embedding relates to.