Knowledge-base Extractor

This is a node.js application that aims at extracting the knowledge represented in the Google infoboxes (aka Google Knowlege Graph Panel).

The Algorithm implemented is the following:

  • Query DBpedia for all concepts (types) for which there is at least one instance that has a link to a Freebase ID
  • For each of these concepts pick (n) instances randomly
  • For each instance, issue a Google Search query:
    • if an infobox is available -> scrap the infobox to extract the properties
    • if no infoxbox is available, check if Google suggests "do you mean ... ?" and if so, traverse the link and look for an infobox
    • if no infobox or correction is available, disambiguate the concept (type) used in the search query and check if an infobox is returned
    • if Google suggests disambiguation in an infobox parse all the links in it -> it is best to find which suggestion maps to the current data-type we are using -> check the Freebase - DBpedia mappings
  • Cluster properties for each concept

Notes

  • The result of our expirement is in the results folder results/dbpedia.json
  • For a more detailed view for each DBpedia class, one can check the files in results/dbpedia

How to run?

  • Clone the repo to your local machine
  • run npm install on the root of the local project directory

You have to create one folder for caching ...

  • Main folder called cache in the root of the project.

We Will automatically create these child folders:

  • folder called GKB inside the cache folder: This will hold the aggregated Google Knowledge boxes extracted for a DBpedia concept (type)
  • folder called instances_GKB inside the cache folder: This will hold the Google Knowledge box for a single instance
  • folder called instances inside the cache folder: This will hold the DBpedia instances for each concept (type)
  • folder called instance_properties inside the cache folder: Thiw ill hold the distinct list of properties for all the instances of a certain concept

    • run node server.js
    • The application is run in the console and the output will be available in results/result.json

Crawling Configuration

These are the default options that can be found in file KBE.js

cache_dbpedia_concepts       : true,
limit_dbpedia_concepts       : true,
limit_dbpedia_instances      : true,
limit_dbpedia_concepts_value : 10,
limit_dbpedia_instances_value: 10,
proxy                        : null
  • cache_dbpedia_concepts cache the concepts retrieved from DBpedia.
  • limit_dbpedia_concepts limit the number of concepts retrieved by DBpedia, false will retrieve all the concepts
  • limit_dbpedia_instances limit the number of instances retrieved for each concept, false will retrieve all the instances
  • limit_dbpedia_concepts_value the number of concepts you wish to retrieve
  • limit_dbpedia_instances_value the number of instances you wish to retrieve for each concept
  • proxy the proxy address string containing ports i.e http:\\proxy:8080

For our experiment the parameters are:

cache_dbpedia_concepts       : true,
limit_dbpedia_concepts       : false,
limit_dbpedia_instances      : true,
limit_dbpedia_concepts_value : null,
limit_dbpedia_instances_value: 100,
proxy                        : null

Updates

  • Properties now have the direct links to DBpedia ontology
  • Properties scores are normalized

Sample Result

  "Band": {
    "summary": {
        "label": {
            "uri": "http://dbpedia.org/property/label",
            "count": 100
        },
        "description": {
            "uri": "http://purl.org/dc/elements/1.1/description",
            "count": 100
        },
        "type": {
            "uri": "http://dbpedia.org/property/type",
            "count": 100
        },
        "origin": {
            "uri": "http://dbpedia.org/property/origin",
            "count": 88.17204301075269
        },
        "members": {
            "uri": "http://dbpedia.org/property/members",
            "count": 88.17204301075269
        },
        "albums": {
            "uri": "http://dbpedia.org/property/albums",
            "count": 87.09677419354838
        },
        "leadSingers": {
            "uri": "http://dbpedia.org/property/leadSingers",
            "count": 6.451612903225806
        },
        "recordLabel": {
            "uri": "http://dbpedia.org/property/recordLabel",
            "count": 12.903225806451612
        },
        "awards": {
            "uri": "http://dbpedia.org/property/awards",
            "count": 13.978494623655912
        },
        "nominations": {
            "uri": "http://dbpedia.org/property/nominations",
            "count": 7.526881720430108
        },
        "born": {
            "uri": "http://dbpedia.org/property/born",
            "count": 2.1505376344086025
        },
        "nationality": {
            "uri": "http://dbpedia.org/property/nationality",
            "count": 2.1505376344086025
        },
        "height": {
            "uri": "http://dbpedia.org/property/height",
            "count": 1.0752688172043012
        }
    },
    "infoboxless": [
        "!Action Pact!",
        "Allele (band)",
        "Anti-Pasti",
        "Armageddon (A&M band)",
        "Banket (band)",
        "Battlelore",
        "Ben Folds Five"
    ],
    "Unmapped_Properties": {
        "leadSinger": 1,
        "recordLabels": 1,
        "songs": 1,
        "upcomingEvents": 1,
        "peopleAlsoSearchFor": 1,
        "activeFrom": 1,
        "filmMusicCredits": 1,
        "activeUntil": 1,
        "moviesAndTvShows": 1
    }
  }
Share this project:

Updates