
Link Guess Workflow and Project


The Link Guesser is a parallel data link recommender that runs on High Performance Computers like the TWC Hercules machine and the CCNI. It reads in semantic datasets and searches for possible predicates that can be linked to Instance Hub. It currently looks for US States and wgs:lat and wgs:long information.

This page describes the current and planned workflow of the Link Guesser and how it will fit into the overall TWC LOGD data conversion workflow.

1) Select datasets to analyze

The first step is choosing the datasets to analyze with the Link Guesser. These are listed in the link-guesses retrieve.sh script, which contains the list of dataset URIs and is version controlled in the escience svn.

We will use the following URI as an example throughout this workflow discussion. This would be listed in the retrieve.sh script.

http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30

2) Retrieve dataset Turtle dumps

The second step is to obtain the turtle dump files. When invoked, the retrieve.sh script will find the (potentially compressed) turtle dump files by dereferencing the dataset URIs and following the void:dataDump predicate. These files go into the source/ directory of a new version directory, according to the csv2rdf4lod-automation directory conventions.

The following directory structure and file result when this step is done:

version/2012-Feb-20/source/data-gov-1000-2010-Aug-30.ttl.gz
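
For illustration, here is a rough Python sketch of what retrieve.sh does (the real script is a shell script in the escience svn; rdflib, the hard-coded list, and the local paths below are stand-ins, not the actual implementation):

# Sketch only: dereference each dataset URI, follow void:dataDump, and save
# the (possibly gzipped) dump into source/ per the csv2rdf4lod conventions.
import os
from rdflib import Graph, URIRef, Namespace
try:                                        # Python 2 / 3 download helper
    from urllib import urlretrieve
except ImportError:
    from urllib.request import urlretrieve

VOID = Namespace("http://rdfs.org/ns/void#")

datasets = [                                # the real list lives in retrieve.sh (step 1)
    "http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30",
]

source_dir = "version/2012-Feb-20/source"
if not os.path.isdir(source_dir):
    os.makedirs(source_dir)

for dataset_uri in datasets:
    meta = Graph()
    meta.parse(dataset_uri)                 # dereference the dataset URI
    for dump in meta.objects(URIRef(dataset_uri), VOID.dataDump):
        local = os.path.join(source_dir, os.path.basename(str(dump)))
        urlretrieve(str(dump), local)       # e.g. data-gov-1000-2010-Aug-30.ttl.gz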

3) Convert to N-Triples and specify the dataset URI with .sd_name

The link guesser requires N-Triples format, but the data dumps are not hosted as N-Triples (N-Triples is too verbose). So the third step is to uncompress any compressed data dumps and convert them to N-Triples, storing the results in manual/. The Link Guesser also needs the dataset URI, so we'll include a sibling file with the extension .sd_name whose contents are just the dataset URI string.

version/2012-Feb-20/manual/data-gov-1000-2010-Aug-30.ttl.nt
version/2012-Feb-20/manual/data-gov-1000-2010-Aug-30.ttl.nt.sd_name
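
A minimal sketch of this step, assuming rdflib does the conversion (a command-line converter would work just as well); the paths are the example files from above:

# Sketch only: uncompress the Turtle dump, convert it to N-Triples in manual/,
# and write the sibling .sd_name file that holds the dataset URI.
import gzip
import os
from rdflib import Graph

dump        = "version/2012-Feb-20/source/data-gov-1000-2010-Aug-30.ttl.gz"
nt          = "version/2012-Feb-20/manual/data-gov-1000-2010-Aug-30.ttl.nt"
dataset_uri = "http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30"

if not os.path.isdir(os.path.dirname(nt)):
    os.makedirs(os.path.dirname(nt))

g = Graph()
g.parse(gzip.open(dump), format="turtle")   # read the compressed Turtle dump
g.serialize(destination=nt, format="nt")    # write N-Triples into manual/

with open(nt + ".sd_name", "w") as f:       # sibling file: just the dataset URI string
    f.write(dataset_uri)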

4) Analyze the datasets

This step has the Link Guesser analyze the datasets. This is done by feeding the Guesser three inputs:

  • The data to analyze, as N-Triples (i.e. the step-3 conversion of http://logd.tw.rpi.edu/source/data-gov/file/1000/version/2010-Aug-30/conversion/data-gov-1000-2010-Aug-30.ttl.gz)
  • The named graph of that dataset (e.g. http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30)
  • The void.ttl file of that dataset, which comes from the CSV2RDF4LOD converter

The Guesser writes its output to:

automatic/data-gov-1000-2010-Aug-30.ttl.nt.void.ttl

The Guesser reports a list of predicates that match as either a US State or a Lat/Long, and provides a score of how well it believes each predicate can be linked to Instance Hub. The Guesser expresses its guesses in RDF similar to the following:

<http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30>
   # This is the dataset that we are analyzing for links.
   a void:Dataset;
   void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses>;
   # Our analysis will become a subset of the collection of link-guesses about data-gov/dataset/1000/version/2010-Aug-30
.
<http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses>
   a void:Dataset;
   void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20>;
   # This ------/\ is the dataset of link guesses that we just created for data-gov/dataset/1000/version/2010-Aug-30
.
<http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20>
   a void:Dataset, conversion:LinkGuessDataset, conversion:MetaDataset;
   dcterms:modified "2012-02-20T20:40:26-05:00"^^xsd:dateTime;
   void:dataDump <http://logd.tw.rpi.edu/source/twc-rpi-edu/provenance_file/link-guesses/version/2012-Feb-20/automatic/data-gov-1000-2010-Aug-30.ttl.nt.void.ttl> 
   # This ---------/\ is the data file created by dom's super computer link guesser 2000.
.
<http://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/link-guesses/version/2012-Feb-20>
   # This is the dataset of link guesses that we performed for _all_ datasets on 2012-Feb-20.
   a void:Dataset;
   void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20>;
   # Our collection of _all_ link guesses on 2012-Feb-20 includes the same dataset that we put under 
   # _each_ of the datasets that we analyzed for links.
.
<http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/vocab/raw/state> 
  # We are adding a description directly to the predicate used in the dataset, so that it is easy to find guesses from it.
  :hasLinkGuess <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20/guess/1>;
.
<http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20/guess/1>
   # We are naming our guesses within the scope of the original datasets (2012-Feb-20 is the version of our link guesses) 
   a :LinkGuess;
   void:inDataset <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20>;
   # This --------/\ is the void:Dataset of guesses that is a void:subset of the original and our guess collections.
   dcterms:dateTime "2012-02-04T23:00:00Z";
   :dataset <http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/version/2011-Dec-16>;
   :link_concept <http://dbpedia.org/class/yago/StatesOfTheUnitedStates>;
   :confidence 85;
   # ^--- these three properties are in the vocabulary of the link guesser.
   prov:wasAttributedTo :link_guesser_2000;
.

:link_guesser_2000
   a doap:Software;
   dcterms:creator "Jesse";
   dcterms:contributor "Greg";
   dcterms:contributor "Dominic";
.

In this example, for the EPA dataset "Toxin Release Into The Atmosphere", the Guesser has identified the predicate http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/vocab/raw/state as possibly being able to link to the Instance Hub category of US States with a confidence of 85 out of 100.

Dominic checks out a working copy of the escience svn onto hercules, runs retrieve.sh, then runs the link guessing algorithm, which puts files into the automatic/ directory of a new version.

5) Commit link guesses to escience SVN.

Dominic svn commits version/2012-Feb-20/automatic/*.

6) Publish guesses

This void.ttl can now be loaded into the void graph of the http://logd.tw.rpi.edu/sparql endpoint.

We use the normal csv2rdf4lod-automation process to publish the guesses.

To publish it, someone on gemini runs svn update to get the new guesses, then:

root@gemini:/mnt/raid/srv/logd/data/source/twc-rpi-edu/link-guesses/version/2012-Feb-20# cr-publish-cockpit.sh -w

This creates publish/bin/virtuoso-load-twc-rpi-edu-link-guesses-2012-Feb-20.sh and hosts http://logd.tw.rpi.edu/source/twc-rpi-edu/file/link-guesses/version/2012-Feb-20/conversion/twc-rpi-edu-link-guesses-2012-Feb-20.void.ttl

root@gemini:/mnt/raid/srv/logd/data/source/twc-rpi-edu/link-guesses/version/2012-Feb-20# publish/bin/virtuoso-load-twc-rpi-edu-link-guesses-2012-Feb-20.sh --meta

We can verify that the link guess metadata makes it into the graph <http://logd.tw.rpi.edu/vocab/Dataset> by grabbing a guess URI from the file and getting its descriptions:

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?g ?p ?o
WHERE {
  GRAPH ?g  {
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20/guess/10>
       ?p ?o
  }
} 

7) Query for guesses

Now that this information is loaded into the endpoint, we can query for all datasets that have been analyzed by the Link Guesser, check if they should be linked, and then actually make the link in the dataset.
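
For example, a query along these lines should list every dataset that has link guesses attached, following the VoID pattern from step 4 (this is a sketch: SPARQLWrapper is just one way to hit the endpoint, and the pattern may change once the guess vocabulary is finalized):

# Sketch only: find all datasets that have a link-guesses meta dataset.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://logd.tw.rpi.edu/sparql")
sparql.setQuery("""
PREFIX void:       <http://rdfs.org/ns/void#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT DISTINCT ?dataset
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?dataset void:subset ?guesses .
    ?guesses void:subset ?run .
    ?run a conversion:LinkGuessDataset .
  }
}
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"])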

8) Review guesses and make the links

This second part will be handled by a simple PHP script that queries the endpoint for all datasets that have been analyzed but have not yet been linked (the record of which enhancements have been made to a dataset lives in the same void graph).

select ?todo 
where {
  TODO: make query
}

The script will display to the user all of the datasets that fit this description. The user can then choose a dataset, view the predicate(s) that have link potential and see a sample of the values for that predicate (if sample data for this dataset is loaded into the endpoint).

After a review of the information, the user can decide if this link should be made. The script will then modify the enhancement parameters for that dataset by adding a LinksVia using Instance Hub Category US States.

TODO: what code is handling this, where will it live, and how will it find out about the eparams to modify, and where will the modified eparams go?

The user interface and server-side component will be a PHP script on gemini. The script will first query for all datasets that have link guesses and have not been linked to Instance Hub states. Once a user selects a dataset, the script will query for the link guesses with confidence over 75% and display those predicates to the user. While doing this it will also query for the source, dataset name, and version of the dataset; with this information it is possible to know exactly where on gemini (under the /srv/logd/data/source directory) the dataset actually lives, and where its param.ttl is.
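
A rough sketch of that confidence query, based on the draft guess RDF from step 4 (the lg: namespace is a placeholder until the vocabulary is agreed upon, and SPARQLWrapper here just stands in for whatever HTTP client the PHP script ends up using):

# Sketch only: for the dataset the user picked, find predicates whose link
# guesses have confidence over 75.
from SPARQLWrapper import SPARQLWrapper, JSON

dataset = "http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/version/2011-Dec-16"

sparql = SPARQLWrapper("http://logd.tw.rpi.edu/sparql")
sparql.setQuery("""
PREFIX lg: <http://example.org/link-guess#>   # placeholder namespace
SELECT ?predicate ?concept ?confidence
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    ?predicate lg:hasLinkGuess ?guess .
    ?guess lg:dataset <%s> ;
           lg:link_concept ?concept ;
           lg:confidence ?confidence .
    FILTER(?confidence > 75)
  }
}
""" % dataset)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["predicate"]["value"] + "  " + row["confidence"]["value"])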

Once a user has selected a predicate to link to Instance Hub, the PHP script sends the predicate URI and the location of the param.ttl to a Python script that uses rdflib. This Python script loads the param.ttl, modifies the RDF to add the new information about what to link to Instance Hub, and then writes the RDF back out as Turtle in the same place as the old param.ttl. The new param.ttl then has all the link information we need, and the PHP script can pull the conversion trigger again to do the actual linking.
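
A sketch of the rdflib piece follows. The property and object URIs used for the LinksVia enhancement are placeholders (the real triples depend on the csv2rdf4lod enhancement vocabulary), and the script name is hypothetical; the PHP side would call something like python add_link_guess.py <predicate-uri> <path/to/param.ttl> before pulling the conversion trigger.

# Sketch only: load param.ttl, record that the chosen predicate should be
# linked via the Instance Hub category of US States, write it back as Turtle.
import sys
from rdflib import Graph, Namespace, URIRef

# Placeholder namespace standing in for the real csv2rdf4lod enhancement vocabulary.
ENH = Namespace("http://example.org/enhancement#")

predicate_uri, param_ttl = sys.argv[1], sys.argv[2]   # handed over by the PHP script

g = Graph()
g.parse(param_ttl, format="turtle")                   # load the existing param.ttl

# Placeholder triple: "link this predicate via Instance Hub US States".
g.add((URIRef(predicate_uri),
       ENH.links_via,
       URIRef("http://example.org/instance-hub/us/states")))

g.serialize(destination=param_ttl, format="turtle")   # overwrite param.ttl in place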

Current problems:

When the user picks a dataset to enhance by linking to Instance Hub, the PHP script must know the latest enhancement number for this versioned dataset. For example, for http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30 we need a way to know whether the current version is raw, e1, e2, etc. We need to know this so we can find the param.ttl for the dataset and create a new enhancement layer for it, and we need to do this automatically.

In param.ttl the prefix TODO is bound to the same namespace as RDFS. When rdflib sees this, it uses only the TODO prefix in its output RDF, not the RDFS prefix. This is a problem because the converter will not actually run again unless the TODO prefix has been changed back to RDFS, and this does not seem possible to do using rdflib.

Outstanding tasks

What currently needs to be decided/built:

  • The vocabulary for expressing this link potential from the Link Guesser into the void.ttl needs to be defined and agreed upon.
  • The PHP script that displays and modifies the dataset needs to be written.

Outstanding backburner issues

We're setting these aside:

  • predicates collide over multiple tables
  • global is used instead of local (but we'd want it to be local because we just specified e3 with LinksVia) - implement CSV2RDF4LOD_PUBLISH_GLOBAL_ENHANCEMENTS_STANCE="top-down" or "bottom-up" (we are currently top-down)