GText

The inspiration came from the search for women's maiden names. One of the most underused sources of that information is the obituary of a woman's parent. The software was first developed to gather that information in a normalized form that full-text search cannot match.

The engine is now robust and functional. The remaining development work is to run more data through it and refine the rule sets and data dictionaries so that extraction becomes steadily more accurate.

The most impressive feature is that the system can understand and extract relationships between people. For example, it can maintain relationships between husband and wife, father and children, and so on. It can even maintain the relationship between a person and the boat they arrived on when extracting immigration records.
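
To make the idea of maintained relationships concrete, here is a minimal sketch of one way such links could be stored. It is not GText's actual data model: the `Entity` and `RelationshipGraph` classes, the relation labels, and the sample names are all hypothetical, chosen only to illustrate person-to-person and person-to-ship links.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A person or other entity found in the text (hypothetical model)."""
    ident: str
    kind: str   # e.g. "person" or "ship"
    name: str


@dataclass
class RelationshipGraph:
    """Stores typed, directed links between entities (illustrative only)."""
    entities: dict = field(default_factory=dict)
    links: list = field(default_factory=list)   # (source_id, relation, target_id)

    def add(self, entity):
        """Register an entity and return it for convenient chaining."""
        self.entities[entity.ident] = entity
        return entity

    def relate(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        self.links.append((source.ident, relation, target.ident))

    def related(self, source, relation):
        """Return the names of entities linked from `source` by `relation`."""
        return [self.entities[t].name
                for s, r, t in self.links
                if s == source.ident and r == relation]


def build_example():
    """Build a tiny graph with invented sample data."""
    graph = RelationshipGraph()
    john = graph.add(Entity("p1", "person", "John Smith"))
    mary = graph.add(Entity("p2", "person", "Mary Smith"))
    anna = graph.add(Entity("p3", "person", "Anna Smith"))
    ship = graph.add(Entity("s1", "ship", "SS Example"))
    graph.relate(john, "spouse", mary)
    graph.relate(john, "child", anna)
    graph.relate(john, "arrived_on", ship)
    return graph, john
```

Storing a ship as just another entity kind means the same link mechanism covers both family relationships and the person-to-boat case from immigration records.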

By changing the dictionaries and rule sets, the engine can be used to extract information from text in other languages or in other industries, such as legal and medical.
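
The following toy sketch shows why swapping a dictionary can be enough to retarget an extractor. It assumes a single regular-expression rule and two tiny, invented phrase dictionaries (`ENGLISH` and `GERMAN`); the real GText rule sets and dictionaries are not described here and are certainly richer than this.

```python
import re

# Hypothetical dictionaries: surface phrase -> normalized relation label.
ENGLISH = {"daughter of": "child_of", "son of": "child_of", "wife of": "spouse_of"}
GERMAN = {"Tochter von": "child_of", "Sohn von": "child_of", "Ehefrau von": "spouse_of"}

# Simplistic stand-in for name recognition: one or more capitalized words.
NAME = r"[A-ZÄÖÜ][\w'-]+(?: [A-ZÄÖÜ][\w'-]+)*"


def build_rule(dictionary):
    """Compile one 'NAME, <phrase> NAME' rule from a phrase dictionary."""
    # Longest phrases first, so a longer phrase wins over a shorter prefix.
    phrases = "|".join(re.escape(p)
                       for p in sorted(dictionary, key=len, reverse=True))
    return re.compile(
        rf"(?P<subject>{NAME}),? (?P<phrase>{phrases}) (?P<object>{NAME})")


def extract(text, dictionary):
    """Return normalized relationship records found in `text`."""
    rule = build_rule(dictionary)
    return [{"subject": m["subject"],
             "relation": dictionary[m["phrase"]],
             "object": m["object"]}
            for m in rule.finditer(text)]
```

Calling `extract("Anna Miller, daughter of John Miller, died Tuesday.", ENGLISH)` and `extract("Anna Müller, Tochter von Hans Müller, starb am Dienstag.", GERMAN)` both yield a `child_of` record, even though only the dictionary changed between the two calls.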

The target market is mainly businesses that gather and sell data to customers. It may also be released as a service in which a person could make a PDF of a book and submit it online for OCR and name extraction, at a cost of pennies per page. Eventually it would make sense to approach law firms to help automate data extraction from documents during legal proceedings.

Submitted to

RootsTech Innovator Challenge

Created by

The core of the system is the GText library, which provides an engine for processing English text containing information of genealogical interest. The engine recognizes names of persons and pronoun-based references to persons; infers persons from those names; infers relationships between persons; infers events that occurred in the lives of those persons; and infers the dates and places where those events occurred. The engine also recognizes many other entities, such as churches, schools, companies, and institutions. The core library is now used in three applications. The first is the test harness, a Mac OS X app used to evaluate the system and test new features. The second is a client/server configuration in which clients interact with a server that does the processing: through a simple interface, clients provide text to be analyzed, and the server analyzes it and returns the results as a JSON string. The third is a custom application that processes obituaries encoded in a specific XML format.

Tom Wetmore
Earl Mott

Updates

Earl Mott started this project — Jan 16, 2015 12:41 AM EST
