In recent years, knowledge bases such as Wikidata and Freebase have grown rapidly. These knowledge bases are valuable data sources for many purposes, one of which is to serve as training data for machine learning, especially in rule inference. Well-known approaches include rule inference based on Markov logic networks and inductive logic programming. However, these methods all operate under the closed-world assumption, whereas our knowledge base naturally follows the open-world assumption: facts that are not stated are not necessarily false. Another challenge is that present knowledge bases are extremely large. We want to find a way of mining rules that reveals the relationships among a given set of keywords.
What it does
The input is a set of keywords, and our wiki-rule-miner returns frequent chains of rule patterns from the Wikidata knowledge base. These chains of rules connect the given keywords and reveal their underlying relationships. However, due to limited time and hardware capacity, we have implemented only a prototype, which can mine a small part of the Wikidata graph.
How we built it
We use the Wikidata Toolkit to extract all nodes and edges from the Wikidata graph. The graph data is then preprocessed in Python.
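As a rough illustration of the preprocessing step, the sketch below parses JSON lines shaped like a simplified Wikidata dump into edge triples and a node-to-type map. The dump lines, property IDs, and the `parse_dump` helper are hypothetical stand-ins; the real pipeline uses the Wikidata Toolkit's output.

```python
import json

# Hypothetical, simplified dump lines mimicking Wikidata entities.
# P31 is Wikidata's "instance of" property; other claims become edges.
dump_lines = [
    '{"id": "Q42", "claims": {"P31": ["Q5"], "P69": ["Q691283"]}}',
    '{"id": "Q691283", "claims": {"P31": ["Q3918"]}}',
]

def parse_dump(lines):
    """Turn entity records into (subject, property, object) edges
    and a subject -> type map based on the P31 claim."""
    edges, node_type = [], {}
    for line in lines:
        entity = json.loads(line)
        subject = entity["id"]
        for prop, objects in entity.get("claims", {}).items():
            for obj in objects:
                edges.append((subject, prop, obj))
                if prop == "P31":  # "instance of" gives the node its type label
                    node_type[subject] = obj
    return edges, node_type

edges, node_type = parse_dump(dump_lines)
print(len(edges), node_type["Q42"])  # 3 edges; Q42 is typed Q5 (human)
```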
Subsequently, we use the GPU to count frequent rule patterns. These rules are all Horn clauses, and we set a language bias such that the length of a rule never exceeds 3, following the principle of Occam's razor.
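To make the language bias concrete, here is a small sketch of enumerating candidate rule bodies as property chains of length at most 3. This interpretation of "rule length" (number of chained properties) and the property IDs are assumptions for illustration.

```python
from itertools import product

# Assumed toy property vocabulary; real patterns come from Wikidata properties.
properties = ["P31", "P69", "P19"]

def candidate_patterns(props, max_len=3):
    """Enumerate all property chains of length 1..max_len,
    implementing the length-3 language bias."""
    patterns = []
    for length in range(1, max_len + 1):
        patterns.extend(product(props, repeat=length))
    return patterns

pats = candidate_patterns(properties)
print(len(pats))  # 3 + 9 + 27 = 39 candidate chains
```

Bounding the chain length keeps the candidate space polynomial in the number of properties rather than letting it grow without limit.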
Challenges we ran into
Our major struggles were preprocessing the large dataset, hard text processing, and devising a parallel model to count frequent rule patterns on the GPU. For the GPU part, we had to keep memory accesses aligned and make sure every thread was doing the right thing while traversing the graph.
Accomplishments that we're proud of
In 24 hours, we implemented, from the ground up, the extraction and preprocessing part on the CPU and the traversal and counting part on the GPU.
What we learned
We learned that text processing and GPU programming are hard, especially in the debugging phase. These tasks are difficult because the data volume is so large, and unexpected text errors keep appearing endlessly.
For the GPU part, we devised a map-reduce method to traverse the graph and count frequent patterns. Each node in the graph is an entity with a label or type; for instance, a person has the type attribute human, so the pattern for that node is human, and many entity nodes can share the type human. The traversal expands iteratively: in the map step, each pattern is mapped to new patterns by extending the paths that match it; in the reduce step, paths sharing the same pattern are merged into one pattern, and we count the occurrences of that pattern, which equals the number of concrete paths it covers. We keep mapping and reducing until all keywords are connected. Coming up with this algorithm was a huge challenge.
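The map and reduce steps above can be sketched on the CPU as follows (the real implementation runs on the GPU). The toy graph, type names, and helper functions are illustrative assumptions, not the actual data.

```python
from collections import Counter

# Toy typed graph: two humans educated at one university, located in one city.
node_type = {"a1": "human", "a2": "human", "u1": "university", "c1": "city"}
edges = [("a1", "educated_at", "u1"), ("a2", "educated_at", "u1"),
         ("u1", "located_in", "c1")]

def map_extend(paths):
    """Map step: extend each concrete path by one outgoing edge and emit
    its type-level pattern, e.g. ('human', 'educated_at', 'university')."""
    out = []
    for nodes, pattern in paths:
        tail = nodes[-1]
        for src, rel, dst in edges:
            if src == tail:
                out.append((nodes + (dst,), pattern + (rel, node_type[dst])))
    return out

def reduce_count(paths):
    """Reduce step: collapse concrete paths sharing a pattern and count them;
    the count equals the number of concrete paths the pattern covers."""
    return Counter(pattern for _, pattern in paths)

seeds = [((n,), (t,)) for n, t in node_type.items()]
counts = reduce_count(map_extend(seeds))
print(counts[("human", "educated_at", "university")])  # 2 concrete paths
```

Each map/reduce round extends patterns by one hop, so repeating it builds longer chains until the keywords are connected.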
What's next for 107 - GPU Wiki Rules Miner
We will optimize our wiki-rule-miner prototype to make it more scalable and efficient. If we can find enough computing resources, we may also try to distribute the computation across multiple GPUs.