
Loading additional data

Stian Soiland-Reyes edited this page Jul 25, 2016 · 33 revisions

Tutorial: Loading additional data in Open PHACTS

Q: How can I update RDF datasets in Open PHACTS' RDF store?

We'll assume you have already installed Open PHACTS using Docker, and that you have Virtuoso running as the Docker container ops-virtuoso, exposed on http://localhost:3003/sparql (or equivalent hostname).

The RDF data loaded in Open PHACTS come from different sources - some of which require manual download or transformation.

Each dataset is kept in a different named graph in Virtuoso, meaning we can query them separately. For instance, on http://localhost:3003/sparql try this query to count the triples in the Uniprot graph:

SELECT COUNT(*) WHERE { 
    GRAPH <http://purl.uniprot.org> {
      ?s ?p ?o.
    }
}

The query might take some seconds to execute the first time if the Virtuoso server has recently been restarted.

count?
1131186434
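
The same count can also be run from the command line, which is handy for comparing a graph before and after a reload. The following is a minimal sketch, assuming curl is installed and that the endpoint accepts the query as a URL-encoded query parameter; the helper name count_query is hypothetical and not part of Open PHACTS.

```shell
#!/bin/sh
# Print a SPARQL query that counts the triples in one named graph.
# count_query is a hypothetical helper name, not part of Open PHACTS.
count_query() {
    printf 'SELECT COUNT(*) WHERE { GRAPH <%s> { ?s ?p ?o . } }\n' "$1"
}

# Usage (needs a running endpoint, so it is left commented out):
# curl -G http://localhost:3003/sparql \
#      -H 'Accept: text/csv' \
#      --data-urlencode "query=$(count_query http://purl.uniprot.org)"
```

Passing the graph name as an argument makes it easy to run the same check against any of the other named graphs.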

Overview

In short, updating a data source consists of:

  1. Download/generate the updated RDF data
  2. Drop the old GRAPH (if replacing)
  3. Load RDF into Virtuoso (NOTE: this could take several hours or even days)
  4. Download/generate any updated linksets
  5. Load updated linksets to Identity Mapping Service
  6. If the RDF has changed in structure (e.g. vocabulary changes), update affected queries in the API
  7. Test that API calls that worked before still work as expected (allowing for actual data change)

As a running example this page uses Uniprot, but the process will be similar for each of the data sources.

What data source?

So let's say we want to update Uniprot, which the Open PHACTS 2.0 data sources list as version 2015_11, whereas the latest version is 2016_07.

Note that Uniprot is a special case for Open PHACTS: although you can download all of the Uniprot RDF, that is 128 GB compressed as rdf.xz, which means it would require something like 2 TB just for loading. Luckily, the queries used by Open PHACTS only require a much smaller subset of this, about 7.9 GB as rdf.gz.

Still, just parsing may take many hours, so for demonstration purposes this tutorial downloads a small subset.

Download data

The Uniprot data source is described as:

curl -d 'query=reviewed%3ayes&force=yes&format=rdf' http://www.uniprot.org/uniprot/ > swissprot_20151116.rdf.xml
curl -d 'query=reviewed%3ayes&force=yes&format=rdf' http://www.uniprot.org/uniparc/ > uniparc_20151116.rdf.xml
curl -d 'sort=&desc=&query=reviewed%3ayes&fil=&format=rdf&force=yes' http://www.uniprot.org/uniref/ > uniref_20151116.rdf.xml

That is three separate queries against the Uniprot API, each of which returns its result as RDF based on the latest release. So if we run those queries again today we should get the latest Uniprot data.
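
When re-running these downloads, it helps to stamp each output file with the download date, as the filenames above do. The following is a minimal sketch; the helper name snapshot_name is hypothetical, and the commented usage assumes the query URL still behaves as described above.

```shell
#!/bin/sh
# Build a dated output filename such as swissprot_20151116.rdf.xml.
# snapshot_name is a hypothetical helper; the date defaults to today.
snapshot_name() {
    printf '%s_%s.rdf.xml\n' "$1" "${2:-$(date +%Y%m%d)}"
}

# Usage, mirroring the first command above (left commented out because it
# needs network access and can run for hours):
# curl -d 'query=reviewed%3ayes&force=yes&format=rdf' \
#      http://www.uniprot.org/uniprot/ > "$(snapshot_name swissprot)"
```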

Note that running the downloads above can take multiple hours, so in this tutorial we will cheat a bit and make a small subset. The Uniprot query can also be opened in the browser for each of those three sources.

You may notice that all three queries crucially filter by reviewed%3ayes - that is reviewed:yes in the search box.

This filter includes only the manually reviewed Swiss-Prot entries of Uniprot, which as you see drastically reduces the result size:

Data source    Reviewed    Unreviewed     Total
uniprot        551,705     65,378,749     65,930,454
uniparc        503,157     123,735,183    124,238,340
uniref         952,465     143,351,546    144,304,011
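
To put the reduction in numbers, the reviewed share can be computed from the counts in the table. This is a small worked example using awk; the helper name reviewed_share is hypothetical. For all three sources the reviewed subset comes out at under 1% of the total.

```shell
#!/bin/sh
# Percentage of entries that are reviewed, from the counts in the table above.
# reviewed_share is a hypothetical helper name; arguments: reviewed, total.
reviewed_share() {
    awk -v r="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * r / t }'
}

reviewed_share 551705 65930454     # uniprot
reviewed_share 503157 124238340    # uniparc
reviewed_share 952465 144304011    # uniref
```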

For each of these queries you will see there's a Download button, from where you select:

  • Download all (551705)
  • Format: RDF/XML
  • Compressed

For the purpose of this tutorial we will modify the query to download only the entries created in the last month. You can similarly modify the queries to only load a particular species, etc.

(Note: uniref used the column published rather than created, but as that still gives 267,624 hits, for demonstration purposes we artificially limit this to also filter on 100% identity)

Now you should have three files:

-rw-rw-r--  1 stain stain  18K Jul 25 14:12 uniparc-reviewed%3Ayes+AND+created%3A%5B20160601+TO+-%5D.rdf.gz
-rw-rw-r--  1 stain stain 3.1M Jul 25 14:12 uniprot-reviewed%3Ayes+AND+created%3A%5B20160601+TO+-%5D.rdf.gz
-rw-rw-r--  1 stain stain  47M Jul 25 14:17 uniref-reviewed%3Ayes+AND+published%3A%5B20160706+TO+-%5D.rdf.gz
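
The downloaded filenames embed the URL-encoded search query, so decoding one shows which filter produced it. The following is a minimal sketch; the helper name decode_query is hypothetical and handles only the escapes that occur in these filenames.

```shell
#!/bin/sh
# Decode the few URL escapes that appear in the download filenames above.
# decode_query is a hypothetical helper and handles only these escapes
# (+, %3A, %5B, %5D); it is not a general URL decoder.
decode_query() {
    printf '%s\n' "$1" | sed -e 's/+/ /g' -e 's/%3A/:/g' -e 's/%5B/[/g' -e 's/%5D/]/g'
}

decode_query 'reviewed%3Ayes+AND+created%3A%5B20160601+TO+-%5D'
```

This makes it easier to adapt the query, for example by changing the start date in the created range.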

Note that we used this filtering only for demonstration purposes. For instance, the files above do not include pre-existing entries that have since been modified, and we only go one month back. The full update with the queries from the wiki page should eventually end up with something like:

-rw-rw-r-- 4 stain stain 2.2G Nov 16  2015 swissprot_20151116.rdf.xml.gz
-rw-rw-r-- 4 stain stain 4.5G Nov 16  2015 uniparc_20151116.rdf.xml.gz
-rw-rw-r-- 4 stain stain 1.3G Nov 16  2015 uniref_20151116.rdf.xml.gz

Dropping the old graph

If we are replacing an existing graph, we would generally want to DROP GRAPH to remove the old named graph while keeping the other graphs that are not being replaced, as a full reload of all the sources from RDF can be quite time consuming.

Note that for large graphs as in Open PHACTS, dropping a graph in Virtuoso can also be time consuming (sometimes slower than loading the new graph!)
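
As an illustration, the statements for dropping the Uniprot graph could be generated as below. This is a sketch under stated assumptions rather than the documented Open PHACTS procedure: the helper name drop_graph_sql is hypothetical, and the commented usage assumes the container ships an isql client named isql, that Virtuoso listens on port 1111, and that the dba credentials are unchanged.

```shell
#!/bin/sh
# Print Virtuoso isql statements that drop one named graph.
# drop_graph_sql is a hypothetical helper name.
drop_graph_sql() {
    # log_enable(3,1) switches to row-by-row autocommit, which is commonly
    # suggested for very large deletes to avoid one huge transaction.
    printf 'log_enable(3,1);\n'
    printf 'SPARQL DROP SILENT GRAPH <%s>;\n' "$1"
}

# Usage (assumptions: isql client name, port 1111 and dba credentials):
# drop_graph_sql http://purl.uniprot.org | docker exec -i ops-virtuoso isql 1111 dba dba
```

SILENT keeps the statement from failing if the graph does not exist, which is convenient when the same script is used for first-time loads.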

Staging the data

Now we need to move the data to be available from within Virtuoso's Docker container. If you have downloaded the files on a different computer, you need to first transfer them to your Open PHACTS server:

cd ~/Downloads
mkdir uniprot

ls -al uni*gz # Should just be 3 files made just now
mv uni*gz uniprot/ # Move the downloads into the new folder

ssh heater.cs.man.ac.uk mkdir -p data
scp -r uniprot/ heater.cs.man.ac.uk:data/
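
After the transfer it can be worth checking that none of the archives were truncated, since a damaged file may otherwise only surface hours into the load. The following is a minimal sketch using gzip -t; the helper name check_gz is hypothetical.

```shell
#!/bin/sh
# Report any .gz file in a directory that fails gzip's integrity test.
# check_gz is a hypothetical helper; it returns non-zero if a file is damaged.
check_gz() {
    status=0
    for f in "$1"/*.gz; do
        [ -e "$f" ] || continue          # directory had no .gz files
        if ! gzip -t "$f" 2>/dev/null; then
            echo "corrupt: $f"
            status=1
        fi
    done
    return "$status"
}

# Usage on the server after the transfer:
# check_gz ~/data/uniprot && echo "all archives OK"
```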

The easiest way to load data into Virtuoso with Docker is to use the staging script, which automates Virtuoso's RDF bulk loading feature (ld_dir and rdf_loader_run).

This script assumes the data to be loaded is in the Docker volume /staging (mapped from the local file system), and that the Virtuoso store is in /virtuoso. A file staging.sql should be in the top-level of the mapped /staging folder.

So in our case the data to load is already in /home/stain/data in the uniprot subfolder:

stain@heater:~/data$ ls -al uniprot/
total 51204
drwxrwxr-x 2 stain stain     4096 Jul 25 14:36 .
drwxrwxr-x 3 stain stain     4096 Jul 25 14:37 ..
-rw-rw-r-- 1 stain stain    17547 Jul 25 14:36 uniparc-reviewed%3Ayes+AND+created%3A%5B20160601+TO+-%5D.rdf.gz
-rw-rw-r-- 1 stain stain  3156208 Jul 25 14:36 uniprot-reviewed%3Ayes+AND+created%3A%5B20160601+TO+-%5D.rdf.gz
-rw-rw-r-- 1 stain stain 49242276 Jul 25 14:36 uniref-reviewed%3Ayes+AND+published%3A%5B20160706+TO+-%5D.rdf.gz

We'll create our data/staging.sql file by modifying the relevant lines from ops-docker's staging.sql:

stain@heater:~/data$ vi staging.sql 
ld_dir('/staging/uniprot' , '*.rdf.gz' , 'http://purl.uniprot.org' );

Note that we changed the filename pattern '*.nq.gz' (which is used within the https://data.openphacts.org/ downloads) to '*.rdf.gz' to match our filenames. Remember that the first parameter is the path within the Docker container, and must start with /staging, as our /home/stain/data folder will appear as /staging within the container.

Tip: Be careful about filename case as Linux is case-sensitive - keeping it lowercase is easiest.

Note: Most of the graph names used in Open PHACTS do NOT include the trailing /.
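
If several folders need loading, the ld_dir lines can be generated rather than typed by hand. The following is a minimal sketch; the helper name ld_dir_line is hypothetical. It always prefixes the folder with /staging and prints a warning when the graph name ends with /, since that is usually a mistake here.

```shell
#!/bin/sh
# Print one ld_dir() line for staging.sql.
# ld_dir_line is a hypothetical helper; arguments: folder under /staging,
# filename pattern, graph name.
ld_dir_line() {
    case $3 in
        */) echo "warning: graph name '$3' ends with /" >&2 ;;
    esac
    printf "ld_dir('/staging/%s', '%s', '%s');\n" "$1" "$2" "$3"
}

# Usage: write a one-line staging.sql for the Uniprot subset.
# ld_dir_line uniprot '*.rdf.gz' http://purl.uniprot.org > staging.sql
```

The warning goes to standard error, so it does not end up inside the generated staging.sql.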

Parsing the RDF

Now that our data is ready to be loaded by Virtuoso, we are going to:

  • Shut down ops-virtuoso
  • Run the bulk loading script from our mapped /staging
  • Restart ops-virtuoso
  • Inspect with SPARQL to verify the load was complete
  • Verify the API still works

Important: You must shut down the running ops-virtuoso instance before running the staging script. The reason is that the script runs in a separate Docker container and needs exclusive access to its /virtuoso.

stain@heater:~/data$ sudo docker stop ops-virtuoso
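
Because the loader needs exclusive access, a script may want to confirm that the container has really stopped before continuing. The following is a minimal sketch built on docker inspect; the helper name container_running is hypothetical.

```shell
#!/bin/sh
# Succeed if the named Docker container is currently running.
# container_running is a hypothetical helper built on "docker inspect".
container_running() {
    [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}

# Usage: refuse to continue while ops-virtuoso is still up.
# if container_running ops-virtuoso; then
#     echo "ops-virtuoso is still running; stop it first" >&2
#     exit 1
# fi
```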

Next we'll start the bulk loading staging script, using the ops-virtuosodata data volume, in addition to the volume /home/stain/data mapped as /staging (read-only).

stain@heater:~/data$ docker run -v /home/stain/data:/staging:ro --volumes-from ops-virtuosodata -it stain/virtuoso staging.sh
 * Starting Virtuoso Open Source Edition 7.2  virtuoso-opensource-7

Note that Virtuoso might take some time to start before you see:

Configuring SPARQL
Populating from /staging/staging.sql
Starting 7 rdf_loader_runs
Starting RDF loader 1
Starting RDF loader 2
Starting RDF loader 3
Starting RDF loader 4
Starting RDF loader 5
Starting RDF loader 6
Starting RDF loader 7
Starting RDF loader 8

The number of loader threads depends on the number of CPU cores. Running several loaders generally speeds up loading significantly, but in this case there are only 3 files, so in effect only three loader threads will be active. If you are preparing another RDF data source for loading with Virtuoso, it's generally advisable to have many smaller RDF files (e.g. 10 MB each) rather than a single large file.
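
For a line-based format such as N-Triples, one large file can be cut into smaller ones with standard tools. The following is a minimal sketch; the helper name split_nt and the chunk- file prefix are hypothetical, the output directory is assumed to be empty or new, and the approach does not apply to RDF/XML, which cannot be cut at arbitrary line boundaries.

```shell
#!/bin/sh
# Split an N-Triples file into gzipped chunks of a fixed number of lines.
# split_nt is a hypothetical helper; arguments: input file, lines per chunk,
# output directory (assumed empty or not yet existing).
split_nt() {
    mkdir -p "$3" || return 1
    split -l "$2" "$1" "$3/chunk-" || return 1
    for f in "$3"/chunk-*; do
        mv "$f" "$f.nt" && gzip "$f.nt" || return 1
    done
}

# Usage: 100000 lines per chunk is an arbitrary starting point; tune it so
# each chunk lands near the size you want.
# split_nt dataset.nt 100000 /home/stain/data/mydataset
```

The resulting files end in .nt.gz, so the filename pattern in the matching ld_dir line would need to be '*.nt.gz'.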
