DEMO: YouTube Playlist

Inspiration and What it does

Capturing metadata (tags) when electronic documents are stored makes it possible to mine the document store intelligently when information is needed. Search results become consistent and relevant, which saves time and generates useful business knowledge. Document management software therefore needs a powerful, automated way to capture metadata.

The challenge from Dracoon gave us a way to generalize this problem. Dracoon, a data security company, faces the problem that finding a specific document, even in a well-organized file store, can still be tedious and time-consuming for employees. Our service aims to provide a handy way to generate these document tags.

While building the service, we observed that the core idea of comparing pieces of information and computing a relevance coefficient can be applied in other areas. We demonstrated this by building two microservices:

A. Given the URL of a news article, generate a relevancy score. This lets us rank news articles by how well the article body matches its title. AKA: clickbait detector!

B. Curated news. Once we had figured out how "good" an article was for the user, we compiled a dynamic billboard of local news articles for the individual.

How we built it

The lack of sufficient training data for deep learning makes most learning methods that do not reduce dimensionality useless. In addition, supervised learning on pre-assigned "manual" tags relies heavily on those tags being very accurate and written in one particular style; otherwise the loss values will not be accurate and will hamper learning. The style matters because deep learning tries to imitate the thinking of the person who assigned the tags. Since different people have different styles and ways of thinking, the network might not learn well enough.

Key phrase extraction techniques rely heavily on the available corpus, and their results change when the corpus changes. If the corpus is too small, the keywords may also be poor. So we implemented bagging techniques to improve our keyword extraction.

Identify Tags

  • The new document's class is identified

    • Using cosine similarity against predefined classes
    • Made robust by bagging over multiple documents of a class
  • Important words in the document are identified

    • TF-IDF is used to identify the important words in the new document that resemble other documents in the same class
  • Semantic annotation of the new document

    • A gazetteer is used to identify important sentence constructs
    • GATE-based semantic segmentation of the document, using the important words identified by TF-IDF and the gazetteer
  • A combination of the TF-IDF words and the semantic segmentation words is used to assign tags
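
The first two steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not our actual service code: the function names (`identify_class`, `important_words`), the smoothed IDF formula, the number of bagging rounds and the majority vote are all assumptions made for the example, and the gazetteer and GATE steps are left out.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    # Keep runs of Unicode letters so accented European characters survive.
    return re.findall(r"[^\W\d_]+", text.lower())


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify_class(new_text, class_docs, rounds=25, seed=0):
    """Pick a class by majority vote over bootstrap samples of each class.

    class_docs maps a class label to a list of example texts (hypothetical input).
    """
    rng = random.Random(seed)
    new_tokens = tokenize(new_text)
    votes = Counter()
    for _ in range(rounds):
        labels, bags = [], []
        for label, texts in class_docs.items():
            # Bootstrap: draw len(texts) documents with replacement.
            sample = [rng.choice(texts) for _ in texts]
            labels.append(label)
            bags.append([tok for text in sample for tok in tokenize(text)])
        vectors = tf_idf_vectors([new_tokens] + bags)
        scores = [cosine(vectors[0], v) for v in vectors[1:]]
        votes[labels[scores.index(max(scores))]] += 1
    return votes.most_common(1)[0][0]


def important_words(new_text, same_class_texts, k=5):
    """Top-k TF-IDF terms of the new document, weighted against its class."""
    docs = [tokenize(new_text)] + [tokenize(t) for t in same_class_texts]
    weights = tf_idf_vectors(docs)[0]
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

With a hypothetical `class_docs` dictionary such as `{"finance": [...], "legal": [...]}`, `identify_class(text, class_docs)` returns the winning label, and `important_words(text, class_docs[label])` returns candidate tag words for that document.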

Article Relevancy

  • The relevancy score of an article is identified
  • Cosine similarity of the article with the heading

    • Whether the article discusses the heading
  • Reputability of the article's website

    • Based on the history of how relevant all past articles on the website were
  • The final article relevancy percentage is based on the combination of the two steps above
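
A minimal sketch of how the two signals above could be combined is shown below. It is an assumption-laden example rather than our production code: the plain term-count vectors, the `heading_weight` of 0.7, the default reputability of 0.5 for unknown websites and all function names are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    # Bag-of-words counts over runs of Unicode letters.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def site_reputability(past_scores, default=0.5):
    """Mean similarity (0..1) of past articles from the same website."""
    return sum(past_scores) / len(past_scores) if past_scores else default


def relevancy_percent(heading, body, past_scores, heading_weight=0.7):
    """Weighted mix of heading/body similarity and website reputability."""
    similarity = cosine(term_vector(heading), term_vector(body))
    reputation = site_reputability(past_scores)
    return 100.0 * (heading_weight * similarity + (1 - heading_weight) * reputation)
```

An article whose body never mentions the words of its heading gets a similarity of zero, so its score is carried only by the website's history, which is what makes the same score usable as a clickbait signal.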

Challenges we ran into

  • Going from idea to product in 2 days, for multiple languages, was challenging enough :P
  • Handling special characters in European languages was problematic and caused trouble when dealing with different formats.

Accomplishments that we're proud of

  • A non-machine-learning-based approach
  • Supports all major languages
  • Does not require a very large data set to build our model
  • A relevancy score custom-made for each user

What we learned

  • This was our first real data-based, multi-language Natural Language Processing project.
