We created a tool that allows users to explore the best ways to tag their German documents.

What we do:

  • We read and pre-process text from a range of document formats (including .pdf, .doc, etc.).
  • We run key phrase extraction with several methods, selectable by the user in our CLI tool: RAKE (Rapid Automatic Keyword Extraction), KEA (Keyphrase Extraction Algorithm), gensim's keywords method (based on lemma TextRank), and a modified TextRank that uses word2vec word embeddings.
  • We developed and published an evaluation metric that compares generated tags against gold-standard (e.g. human-annotated) tags: we embed the tags in the word2vec vector space and use cosine similarity to judge how close the predicted tags are to those chosen by people.
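
To illustrate the idea behind the evaluation metric, here is a minimal sketch in Python. It is not the project's actual implementation: the function names, the 0.7 threshold, the scoring rule (share of predicted tags whose best match among the gold tags is similar enough) and the tiny hand-written vectors are all assumptions made for the example. In practice the vectors would come from a trained word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tag_match_score(predicted, gold, vectors, threshold=0.7):
    """Fraction of predicted tags that are close to at least one gold tag.

    `vectors` maps a tag to its embedding. Tags without an embedding are
    treated as misses. The threshold is an assumed value, not a tuned one.
    """
    known_gold = [vectors[t] for t in gold if t in vectors]
    if not predicted or not known_gold:
        return 0.0
    hits = 0
    for tag in predicted:
        if tag not in vectors:
            continue
        best = max(cosine(vectors[tag], g) for g in known_gold)
        if best >= threshold:
            hits += 1
    return hits / len(predicted)


# Hypothetical 3-dimensional embeddings, standing in for word2vec vectors.
TOY_VECTORS = {
    "auto": [0.9, 0.1, 0.0],
    "fahrzeug": [0.8, 0.2, 0.1],
    "banane": [0.0, 0.1, 0.9],
}

score = tag_match_score(["auto", "banane"], ["fahrzeug"], TOY_VECTORS)
print(score)
```

With these toy vectors, "auto" lies close to the gold tag "fahrzeug" while "banane" does not, so half of the predicted tags count as matches. A threshold-based match is only one way to turn pairwise similarities into a score; averaging the best similarities would be another.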
