Knowledge-base Extractor
This is a Node.js application that extracts the knowledge represented in Google infoboxes (also known as Google Knowledge Graph panels).
The implemented algorithm is as follows:
- Query DBpedia for all concepts (types) for which there is at least one instance that has a link to a Freebase ID
- For each of these concepts pick (n) instances randomly
- For each instance, issue a Google Search query:
  - if an infobox is available, scrape it to extract the properties
  - if no infobox is available, check whether Google suggests "do you mean ... ?" and, if so, follow the link and look for an infobox
  - if neither an infobox nor a correction is available, disambiguate the concept (type) used in the search query and check whether an infobox is returned
  - if Google suggests a disambiguation in an infobox, parse all the links in it; ideally, find which suggestion maps to the concept (type) currently in use by checking the Freebase-DBpedia mappings
- Cluster properties for each concept
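The first two steps above can be sketched as follows. This is a minimal illustration, not the code in KBE.js: the function names `buildConceptQuery` and `sampleInstances` are hypothetical, and the query assumes that a Freebase link is expressed as an `owl:sameAs` triple pointing at a `http://rdf.freebase.com/ns/` URI, which may differ from the query the project actually issues.

```javascript
// Hypothetical helpers; names and the query shape are assumptions, not taken from KBE.js.

// Build a SPARQL query for DBpedia classes that have at least one instance
// linked to Freebase (assumed here to be an owl:sameAs link).
function buildConceptQuery(limit) {
  const lines = [
    'PREFIX owl: <http://www.w3.org/2002/07/owl#>',
    'SELECT DISTINCT ?concept WHERE {',
    '  ?instance a ?concept ;',
    '            owl:sameAs ?freebase .',
    '  FILTER(STRSTARTS(STR(?concept), "http://dbpedia.org/ontology/"))',
    '  FILTER(STRSTARTS(STR(?freebase), "http://rdf.freebase.com/ns/"))',
    '}',
  ];
  // Only append a LIMIT clause when a positive integer is given.
  if (Number.isInteger(limit) && limit > 0) lines.push('LIMIT ' + limit);
  return lines.join('\n');
}

// Pick up to n items uniformly at random, without replacement,
// using a partial Fisher-Yates shuffle on a copy of the input.
function sampleInstances(instances, n, random = Math.random) {
  const pool = instances.slice();
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

console.log(buildConceptQuery(10));
console.log(sampleInstances(['Battlelore', 'Anti-Pasti', 'Ben Folds Five'], 2));
```

Sampling on a copy keeps the cached instance list intact, so the same cache can be reused for a different value of n.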
Notes
- The result of our experiment is in the results folder: results/dbpedia.json
- For a more detailed view of each DBpedia class, see the files in results/dbpedia
How to run?
- Clone the repo to your local machine
- Run npm install in the root of the local project directory
- You have to create one folder for caching ...
  - The main folder, called cache, in the root of the project
- The application will automatically create these child folders inside the cache folder:
  - GKB: holds the aggregated Google Knowledge boxes extracted for a DBpedia concept (type)
  - instances_GKB: holds the Google Knowledge box for a single instance
  - instances: holds the DBpedia instances for each concept (type)
  - instance_properties: holds the distinct list of properties for all the instances of a given concept
- Run node server.js
- The application runs in the console, and the output will be available in results/result.json
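The cache layout described above can be sketched as a small helper. The name `ensureCacheFolders` is hypothetical and is not part of the project. For convenience the sketch also creates the top-level cache folder, which the steps above ask you to create by hand.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Child folders named in the steps above.
const CACHE_CHILDREN = ['GKB', 'instances_GKB', 'instances', 'instance_properties'];

// Create <projectRoot>/cache and its child folders if they are missing.
// Returns the list of child folder paths.
function ensureCacheFolders(projectRoot) {
  const cacheRoot = path.join(projectRoot, 'cache');
  return CACHE_CHILDREN.map((name) => {
    const dir = path.join(cacheRoot, name);
    fs.mkdirSync(dir, { recursive: true }); // no error if the folder already exists
    return dir;
  });
}

// Usage: try it in a throwaway directory rather than in the real project.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'kbe-'));
console.log(ensureCacheFolders(demoRoot));
```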
Crawling Configuration
These are the default options, which can be found in the file KBE.js:
cache_dbpedia_concepts : true,
limit_dbpedia_concepts : true,
limit_dbpedia_instances : true,
limit_dbpedia_concepts_value : 10,
limit_dbpedia_instances_value: 10,
proxy : null
- cache_dbpedia_concepts: cache the concepts retrieved from DBpedia
- limit_dbpedia_concepts: limit the number of concepts retrieved from DBpedia; false retrieves all the concepts
- limit_dbpedia_instances: limit the number of instances retrieved for each concept; false retrieves all the instances
- limit_dbpedia_concepts_value: the number of concepts you wish to retrieve
- limit_dbpedia_instances_value: the number of instances you wish to retrieve for each concept
- proxy: the proxy address string, including the port, e.g. http://proxy:8080
For our experiment the parameters are:
cache_dbpedia_concepts : true,
limit_dbpedia_concepts : false,
limit_dbpedia_instances : true,
limit_dbpedia_concepts_value : null,
limit_dbpedia_instances_value: 100,
proxy : null
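One way the limit flags and their values might interact is sketched below. The default values are copied from the list above, but `resolveOptions` and `effectiveLimit` are hypothetical helpers written for illustration; how KBE.js actually merges and interprets its options is not shown in this document.

```javascript
// Defaults copied from the list above; the helpers themselves are a hypothetical sketch.
const DEFAULT_OPTIONS = {
  cache_dbpedia_concepts: true,
  limit_dbpedia_concepts: true,
  limit_dbpedia_instances: true,
  limit_dbpedia_concepts_value: 10,
  limit_dbpedia_instances_value: 10,
  proxy: null,
};

// Overlay user-supplied options on the defaults without mutating either object.
function resolveOptions(overrides = {}) {
  return Object.assign({}, DEFAULT_OPTIONS, overrides);
}

// Turn a flag/value pair into an effective limit; null means "retrieve everything".
function effectiveLimit(enabled, value) {
  return enabled && Number.isInteger(value) && value > 0 ? value : null;
}

// The experiment parameters listed above, expressed as overrides.
const experiment = resolveOptions({
  limit_dbpedia_concepts: false,
  limit_dbpedia_concepts_value: null,
  limit_dbpedia_instances_value: 100,
});

console.log(effectiveLimit(experiment.limit_dbpedia_concepts, experiment.limit_dbpedia_concepts_value)); // null
console.log(effectiveLimit(experiment.limit_dbpedia_instances, experiment.limit_dbpedia_instances_value)); // 100
```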
Updates
- Properties now have direct links to the DBpedia ontology
- Property scores are normalized
Sample Result
"Band": {
  "summary": {
    "label": {
      "uri": "http://dbpedia.org/property/label",
      "count": 100
    },
    "description": {
      "uri": "http://purl.org/dc/elements/1.1/description",
      "count": 100
    },
    "type": {
      "uri": "http://dbpedia.org/property/type",
      "count": 100
    },
    "origin": {
      "uri": "http://dbpedia.org/property/origin",
      "count": 88.17204301075269
    },
    "members": {
      "uri": "http://dbpedia.org/property/members",
      "count": 88.17204301075269
    },
    "albums": {
      "uri": "http://dbpedia.org/property/albums",
      "count": 87.09677419354838
    },
    "leadSingers": {
      "uri": "http://dbpedia.org/property/leadSingers",
      "count": 6.451612903225806
    },
    "recordLabel": {
      "uri": "http://dbpedia.org/property/recordLabel",
      "count": 12.903225806451612
    },
    "awards": {
      "uri": "http://dbpedia.org/property/awards",
      "count": 13.978494623655912
    },
    "nominations": {
      "uri": "http://dbpedia.org/property/nominations",
      "count": 7.526881720430108
    },
    "born": {
      "uri": "http://dbpedia.org/property/born",
      "count": 2.1505376344086025
    },
    "nationality": {
      "uri": "http://dbpedia.org/property/nationality",
      "count": 2.1505376344086025
    },
    "height": {
      "uri": "http://dbpedia.org/property/height",
      "count": 1.0752688172043012
    }
  },
  "infoboxless": [
    "!Action Pact!",
    "Allele (band)",
    "Anti-Pasti",
    "Armageddon (A&M band)",
    "Banket (band)",
    "Battlelore",
    "Ben Folds Five"
  ],
  "Unmapped_Properties": {
    "leadSinger": 1,
    "recordLabels": 1,
    "songs": 1,
    "upcomingEvents": 1,
    "peopleAlsoSearchFor": 1,
    "activeFrom": 1,
    "filmMusicCredits": 1,
    "activeUntil": 1,
    "moviesAndTvShows": 1
  }
}
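The normalization formula is not stated here, but the non-integer counts in the sample are consistent with each raw count being divided by the number of instances that returned an infobox and then multiplied by 100. With 100 sampled instances and 7 listed under infoboxless, that leaves 93, and 1/93 × 100 ≈ 1.0753, which matches the count for height. The sketch below illustrates that assumed scheme; `summarizeProperties` and its input shape are hypothetical.

```javascript
// Hypothetical aggregation. Each element of `instanceBoxes` is the list of
// property names scraped for one instance; an empty list stands for "no infobox".
function summarizeProperties(instanceBoxes) {
  const withInfobox = instanceBoxes.filter((props) => props.length > 0);

  // Count, for every property, the number of instances whose infobox contains it.
  const hits = new Map();
  for (const props of withInfobox) {
    for (const name of new Set(props)) { // count a property once per instance
      hits.set(name, (hits.get(name) || 0) + 1);
    }
  }

  // Normalize to a percentage of the instances that had an infobox (assumed scheme).
  const summary = {};
  for (const [name, n] of hits) {
    summary[name] = { count: (n / withInfobox.length) * 100 };
  }
  return summary;
}

const demo = summarizeProperties([
  ['label', 'origin'],
  ['label'],
  [], // no infobox, excluded from the denominator
  ['label', 'origin', 'albums'],
]);
console.log(demo.label.count); // 100
console.log(demo);
```

Excluding infoboxless instances from the denominator means a property present in every returned infobox scores 100 regardless of how many searches came back empty.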