Home
The Linked Dataset Profiling (LDP) tool is an implementation of the approach proposed in [1]. Its main purpose is to generate structured profiles of Linked Datasets. A profile in this case represents a graph consisting of linked datasets, resource instances and topics. The topics are DBpedia categories, extracted through a Named Entity Disambiguation (NED) process that analyses textual literals from resources. The main steps executed by the profiling tool are:
- Dataset metadata extraction from DataHub
- Resource instance extraction
- Entity and topic extraction from the extracted resources through NED, using tools such as DBpedia Spotlight or TagMe!
- Profile graph construction and topic ranking
- Export of the profiles in JSON format.
The individual steps are explained in detail in [1]; here we give a brief overview of the output of each step.

In step (1) the input required by the tool is a DataHub dataset id, e.g. lak-dataset (http://datahub.io/dataset/lak-dataset), or a group id of datasets, e.g. lodcloud (http://datahub.io/organizations/lodcloud). As output, the LDP tool extracts the metadata of the datasets, such as the SPARQL endpoint, name, maintainer etc., and stores it in a directory given by the user.

In step (2) LDP extracts resource instances from the datasets obtained in (1). It offers the option to sample the extracted resources based on three sampling strategies: random, weighted and centrality (see [1]). Furthermore, the user can define what percentage of resources to extract, i.e. 5, 10, ..., 95% of the resources.

In step (3) the tool performs the NED process on the extracted resources by analysing their textual literals. Here one can define which datatype properties are of interest for the NED process; these are fed into the tool during the process. In this step LDP extracts DBpedia entities, and the topics of those entities through the datatype property dcterms:subject.

In step (4), the last step, the dataset topic graph, i.e. the profile, is built from the extracted datasets, their sampled resources, and the entities and topics extracted in step (3). The topics are ranked for their relevance to the respective datasets by different graphical models that can be selected by the user: prank, kstep and hits, for PageRank with Priors, K-Step Markov and HITS, respectively. Finally, after ranking the topics, the LDP tool can export the profiles into JSON format, such that they can be further analysed or converted into RDF or other formats. For RDF we provide a tool which exposes the profiles as RDF using the VoID and VoL schemas.
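To make step (3) more concrete, the query below is a minimal sketch of how the topics (DBpedia categories) of an extracted entity can be looked up via dcterms:subject against the configured DBpedia endpoint (see dbpedia_url below). The entity dbr:Learning_analytics is only an illustrative example, not necessarily one produced by the tool.

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>

# Retrieve the topics (DBpedia categories) of an extracted entity.
# The entity used here is purely illustrative.
SELECT ?topic WHERE {
  <http://dbpedia.org/resource/Learning_analytics> dcterms:subject ?topic .
}
```

Such a query can be run against http://dbpedia.org/sparql, the endpoint used by the tool for extracting entity categories.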
In order to run the LDP tool, a few variables need to be added to its config file. Below we show the possible input values for the different variables (where "|" separates all values accepted and recognised by the tool), whereas for others we provide a short textual description. A sample config file is given after the list. The defined variables and values should be stored in a separate file, which is passed as a command-line argument to the LDP tool, e.g. java -jar ldp.jar config.ini
- loadcase=0|1|2|3|4 (provide only one value at a time; 0 is for step (1), 1 for step (2), and so on).
- datasetpath=directory location (provide an existing directory where the extracted datasets and resources will be stored)
- normalised_topic_score=file location (provide the path and the name of the file which will hold the computed values for the normalised topic relevance score computed as in [1])
- annotationindex=file location (provide the path and the name of the file which will hold the extracted entities and topics from DBpedia)
- sample_size=1|2|...|95 (the sample size, which defines the ratio of extracted resources for a dataset. Be aware here that Step (3) for large sample sizes takes a long time, and as shown in [1] a sample size of 10% is representative)
- sampling_type=random|weighted|centrality (the sampling strategy to extract the resources, 'centrality' performs best in terms of profiling accuracy)
- outdir=directory location (provide the path to an existing directory for the output directory location)
- topic_ranking_objects=directory location (provide the path to an existing directory for the output generated by the different topic ranking approaches)
- query_str=datahub_dataset_id|datahub_group_id (provide the dataset id or group id from datahub for which you want to perform the profiling)
- is_dataset_group_search=true|false (set to false if query_str is a dataset id, and to true if it is a group id)
- topic_ranking_strategies=prank|kstep|hits (choose one of the topic ranking approaches in Step (4), which determines the relevance of topics for a dataset)
- property_lookup=file location (provide the path to a file containing the datatype properties of interest for the NED process, one property per line; their object values should be textual literals. An example file is shown after this list)
- raw_graph_dir=directory location (the location where to store the generated dataset topic graphs)
- ned_operation=tagme|spotlight (here one can define which NED tool to use. Spotlight does not require any further changes, while for TagMe! one needs to obtain API credentials (contact at http://tagme.di.unipi.it/) and provide them under tagme_api_key)
- tagme_api_key= (provide the API key for TagMe! in case TagMe! is used as the NED tool).
- dbpedia_endpoint=en http://dbpedia.org/sparql,de http://de.dbpedia.org/live/sparql (the DBpedia sparql endpoints in different languages)
- load_entity_categories=true (this has to be set to true as it checks for the extracted entities whether their corresponding topics (categories) are extracted)
- dbpedia_url=http://dbpedia.org/sparql (the URL of the English DBpedia endpoint used for the extraction of entity categories)
- timeout=10000 (the timeout when extracting resources from the datasets)
- includeEntities=false (here one can define whether the entities should be included in the profiles or should be left out of the ranking process)
- dataset_topic_graph=raw_graph/dataset_topic_graph.obj (the file location where the dataset_topic_graph is stored)
- alpha=0.1 (value used to initialise the K-Step Markov and PageRank with Priors models)
- k_steps=3 (the K value for K-Step Markov)
- ranking_iterations=10 (the number of iterations used for the ranking of topics with K-Step Markov and PageRank with Priors)
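The sample config file referenced above might look as follows; the directory and file paths, the chosen sample size, and the sampling and ranking strategies are illustrative placeholders and should be adapted to your setup. The values shown would profile the lak-dataset example from DataHub using Spotlight as the NED tool (hence the empty tagme_api_key).

```ini
loadcase=0
datasetpath=/data/ldp/datasets
normalised_topic_score=/data/ldp/normalised_topic_score.txt
annotationindex=/data/ldp/annotation_index.obj
sample_size=10
sampling_type=centrality
outdir=/data/ldp/out
topic_ranking_objects=/data/ldp/ranking
query_str=lak-dataset
is_dataset_group_search=false
topic_ranking_strategies=prank
property_lookup=/data/ldp/properties.txt
raw_graph_dir=/data/ldp/raw_graph
ned_operation=spotlight
tagme_api_key=
dbpedia_endpoint=en http://dbpedia.org/sparql,de http://de.dbpedia.org/live/sparql
load_entity_categories=true
dbpedia_url=http://dbpedia.org/sparql
timeout=10000
includeEntities=false
dataset_topic_graph=raw_graph/dataset_topic_graph.obj
alpha=0.1
k_steps=3
ranking_iterations=10
```

The property_lookup file referenced in the configuration lists one datatype property URI per line. The properties below are only common examples of properties with textual literal values; replace them with whatever properties are relevant for your datasets.

```text
http://purl.org/dc/terms/title
http://purl.org/dc/terms/description
http://www.w3.org/2000/01/rdf-schema#label
```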
The code and the tool are provided under the Creative Commons (CC) licence. When using the LDP tool, please cite the paper in [1]. For additional information, refer to the website: http://data-observatory.org/lod-profiles/about.html.
[1] Besnik Fetahu, Stefan Dietze, Bernardo Pereira Nunes, Marco Antonio Casanova, Davide Taibi, Wolfgang Nejdl: A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles. ESWC 2014: 519-534