Articles are often boring and lengthy, and documents have not evolved for the 21st-century digital age.
How it works
Given any document or URL, we extract the content of the article, classify each section into predefined categories (IPTC News Codes), and present the result in an easy-to-navigate graph format. We also run further analytics, such as highlighting important sentences and context analysis (#hashtags), to make the documents more interesting.
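The classify-then-graph step can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the keyword lists stand in for a trained classifier, the category labels are illustrative examples rather than the full IPTC taxonomy, and the names `CATEGORY_KEYWORDS`, `classify_section`, and `build_graph` are hypothetical.

```python
from collections import defaultdict

# Hypothetical keyword lists standing in for a trained classifier.
# The category labels are illustrative, not the full IPTC taxonomy.
CATEGORY_KEYWORDS = {
    "sport": {"match", "team", "league", "goal"},
    "politics": {"election", "parliament", "minister", "vote"},
    "science and technology": {"software", "research", "algorithm", "data"},
}


def classify_section(text):
    """Return the category whose keywords occur most often, or 'unclassified'."""
    words = [w.strip(".,;:!?()").lower() for w in text.split()]
    scores = {
        category: sum(word in keywords for word in words)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"


def build_graph(title, sections):
    """Build an adjacency map: document title -> categories -> section indices."""
    graph = defaultdict(list)
    for index, text in enumerate(sections):
        category = classify_section(text)
        if category not in graph[title]:
            graph[title].append(category)
        graph[category].append(index)
    return dict(graph)
```

Calling `build_graph("My article", sections)` with a list of section strings returns a plain dictionary in which the document node points to its categories and each category points to the indices of the sections assigned to it. A front end could render that dictionary as a navigable graph.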
Challenges I ran into
Extracting content from web articles is very challenging because developers follow no standard practices when structuring pages. We came up with a generic approach that can be applied to many different kinds of articles, but it is still far from perfect. Topic modelling is also challenging because proper datasets for classifying web articles are lacking. We chose to base our modelling on IPTC News Codes.
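To give a feel for what a generic extraction heuristic can look like, here is a small sketch using only Python's standard library. It is an assumption made for illustration, not the approach we actually use: it keeps the text of `<p>` elements, ignores anything inside tags that usually hold page chrome, and drops very short paragraphs. The `ParagraphExtractor` class, the `extract_paragraphs` helper, and the `min_words` threshold are all hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect <p> text while skipping tags that rarely hold article content."""

    SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self, min_words=5):
        super().__init__()
        self.min_words = min_words  # hypothetical threshold for "real" paragraphs
        self.paragraphs = []
        self._skip_depth = 0
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace before counting words.
            text = " ".join("".join(self._buffer).split())
            if len(text.split()) >= self.min_words:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and self._skip_depth == 0:
            self._buffer.append(data)


def extract_paragraphs(html, min_words=5):
    """Return the article-like paragraphs found in an HTML string."""
    parser = ParagraphExtractor(min_words)
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

A heuristic like this fails on pages that put their body text in `<div>` elements or render it with JavaScript, which is exactly the kind of inconsistency that makes extraction hard.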
Accomplishments that I'm proud of
We propose a new way of visualizing documents that could be game-changing. The project also incorporates many machine learning and natural language processing features.
What I learned
What's next for Docs2Graphs
We intend to release a beta version soon with better scraping and classification performance. After that, we will try to generalize this approach to more kinds of articles, books, and other documents.