We created a tool that allows users to explore the best ways to tag their German documents.
What we do:
- We read and pre-process text from many document formats (including .pdf and .doc).
- We run keyphrase extraction with one of several methods, selectable by the user in our CLI tool: RAKE (Rapid Automatic Keyword Extraction), KEA (Keyphrase Extraction Algorithm), gensim's keywords method (based on lemma TextRank), and a modified TextRank that uses word2vec word embeddings.
- We developed and made available an evaluation metric that compares generated tags against gold-standard (e.g. human-annotated) tags: using the word2vec vector space model and cosine similarity, it measures how semantically close the predicted tags are to those chosen by people.
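
The evaluation idea in the last point can be illustrated with a minimal sketch. It is not the project's actual code: the function names, the 0.7 threshold, the averaging of multi-word tags, and the tiny hand-made vectors are all assumptions made for illustration. A predicted tag counts as a hit when its vector is close enough, by cosine similarity, to at least one gold tag.

```python
import math

# Toy 3-dimensional "embeddings" (made up for illustration only).
# A real run would look words up in a trained word2vec model instead.
TOY_EMBEDDINGS = {
    "auto": [0.9, 0.1, 0.0],
    "fahrzeug": [0.8, 0.2, 0.1],
    "politik": [0.0, 0.1, 0.9],
    "wahl": [0.1, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def tag_vector(tag, embeddings):
    """Average the vectors of the in-vocabulary words of a tag.

    Returns None when no word of the tag is known to the embedding table.
    """
    vectors = [embeddings[w] for w in tag.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def soft_precision(predicted, gold, embeddings, threshold=0.7):
    """Share of predicted tags that are similar to at least one gold tag.

    The threshold is an assumed value, not one taken from the project.
    """
    gold_vecs = [v for v in (tag_vector(t, embeddings) for t in gold) if v is not None]
    if not predicted or not gold_vecs:
        return 0.0
    hits = 0
    for tag in predicted:
        vec = tag_vector(tag, embeddings)
        if vec is None:
            continue  # out-of-vocabulary tags count as misses
        if max(cosine(vec, g) for g in gold_vecs) >= threshold:
            hits += 1
    return hits / len(predicted)


if __name__ == "__main__":
    print(soft_precision(["Auto", "Politik"], ["Fahrzeug"], TOY_EMBEDDINGS))
```

With the toy vectors, "Auto" lands close to the gold tag "Fahrzeug" while "Politik" does not, so half of the predicted tags are counted as matches. In practice the lookup table would be backed by trained word2vec vectors (for example loaded through gensim), and swapping the roles of predicted and gold tags would give a recall-style score.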
Built With
- gensim
- kea
- machine-learning
- python
- rake
- statistics
- word2vec

