Inspiration

A lot of unstructured content exists on the Web in the form of news, blogs, and other document types. In addition, a single article page may contain ads, snippets, and other irrelevant content. What's needed is a way to extract the main article's content and, from that content, extract entities and the relations between them.

How it works

Please see documentation here. The REST service endpoint is http://realitywarp-vocifery.rhcloud.com/api/v0/extract/triplets. It takes a single "url" parameter, which should be URL-encoded and can point to a news article, blog entry, or any other document with text content. Perform a GET or POST operation on the endpoint with the url parameter, and set the client timeout to 60 seconds or more. Be patient, as the service performs expensive text processing operations. The response is a JSON message containing entities, with the relations between them encoded as {subject, predicate, object} triplets.
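As a rough sketch of a client call, the snippet below builds the GET request with an encoded url parameter and a 60-second timeout, using only the Python standard library. The function names build_request_url and fetch_triplets are made up for this example, and the exact layout of the JSON response is not described here, so the parsed message is returned as-is rather than unpacked.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as given in the text above.
ENDPOINT = "http://realitywarp-vocifery.rhcloud.com/api/v0/extract/triplets"


def build_request_url(article_url):
    """Return the endpoint URL with the article link percent-encoded
    into the single "url" query parameter."""
    return ENDPOINT + "?" + urlencode({"url": article_url})


def fetch_triplets(article_url, timeout=60):
    """Perform the GET request and return the parsed JSON message.

    Hypothetical helper: the response schema is an unknown here, so the
    decoded JSON is returned without further interpretation.
    """
    with urlopen(build_request_url(article_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_triplets("http://example.com/news?id=1") would send the request; the generous default timeout reflects the advice above, since the service may take a while to respond.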

Challenges I ran into

Working with natural language text is inherently challenging because of errors, bad characters, ambiguities, and other issues in the input.

Accomplishments that I'm proud of

Being able to expose this service for other developers to use.

What I learned

I learned about the knowledge graph and how companies like Google use it to answer natural language questions like "how old is the president of India?". Facts are encoded as triplets that are linked together to form a graph. The answer to a question can then be found by traversing the graph.
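To make the traversal idea concrete, here is a minimal sketch that stores a few triplets in an index and answers a two-hop question by following predicates from a starting entity. The facts, the predicate names (has_president, has_age), and the helper functions are all invented placeholders for illustration; this is not how any particular knowledge graph is implemented.

```python
from collections import defaultdict

# Made-up facts for illustration only; names and values are placeholders.
TRIPLETS = [
    ("India", "has_president", "Person A"),
    ("Person A", "has_age", "70"),
    ("Person A", "born_in", "City B"),
]


def build_index(triplets):
    """Map each (subject, predicate) pair to the list of its objects."""
    index = defaultdict(list)
    for subject, predicate, obj in triplets:
        index[(subject, predicate)].append(obj)
    return index


def follow(index, start, predicates):
    """Walk the graph from `start`, following one predicate per hop,
    and return every entity reached at the end of the path."""
    current = [start]
    for predicate in predicates:
        next_nodes = []
        for node in current:
            next_nodes.extend(index.get((node, predicate), []))
        current = next_nodes
    return current


INDEX = build_index(TRIPLETS)

# "How old is the president of India?" becomes a two-hop path.
answer = follow(INDEX, "India", ["has_president", "has_age"])
print(answer)
```

The question is reduced to a chain of predicates, so each hop is a plain dictionary lookup; a path with no matching facts simply yields an empty list.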

What's next for Triplet Extraction Service

Support more relation types. Reduce errors and improve precision and recall. Improve system performance. Feel free to suggest more features.
