Loading additional data
Q: How can I update RDF datasets in Open PHACTS' RDF store?
We'll assume you have already installed Open PHACTS using Docker, and that you have Virtuoso running as the Docker container ops-virtuoso, exposed on http://localhost:3003/sparql (or equivalent hostname).
The RDF data loaded in Open PHACTS come from different sources - some of which require manual download or transformation.
Each dataset is kept in a different named graph in Virtuoso, meaning we can query them separately. For instance, on http://localhost:3003/sparql try this query to count triples in the Uniprot graph:
SELECT COUNT(*) WHERE {
GRAPH <http://purl.uniprot.org> {
?s ?p ?o.
}
}
The query might take some seconds to execute the first time if the Virtuoso server has recently been restarted.
count? |
---|
1131186434 |
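The same count can also be requested from the command line. The following is a minimal sketch, assuming the endpoint address given above and that `curl` is installed; the function names `count_query` and `count_triples` are made up for this illustration. It uses the standard SPARQL 1.1 form `(COUNT(*) AS ?count)` rather than the shorthand shown above, and asks for CSV through the `Accept` header.

```shell
#!/bin/sh
# Endpoint taken from the text above; override if your hostname differs.
ENDPOINT="${ENDPOINT:-http://localhost:3003/sparql}"

# Print a SPARQL query that counts the triples in one named graph.
# (Hypothetical helper name, not part of Open PHACTS.)
count_query() {
    printf 'SELECT (COUNT(*) AS ?count) WHERE { GRAPH <%s> { ?s ?p ?o } }\n' "$1"
}

# POST the query to the endpoint and ask for the result as CSV.
count_triples() {
    curl --silent --fail \
         --header 'Accept: text/csv' \
         --data-urlencode "query=$(count_query "$1")" \
         "$ENDPOINT"
}

# Show the query text; running it needs the ops-virtuoso container:
#   count_triples http://purl.uniprot.org
count_query http://purl.uniprot.org
```

Noting the count before and after an update gives a quick way to see whether a load changed the graph at all.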
In short, updating a data source consists of:
- Download/generate the updated RDF data
- Drop the old GRAPH (if replacing)
- Load RDF into Virtuoso (NOTE: this could take several hours or even days)
- Download/generate any updated linksets
- Load updated linksets to Identity Mapping Service
- If the RDF has changed in structure (e.g. vocabulary changes), update affected queries in the API
- Test that API calls that worked before still work as expected (allowing for actual data change)
As a running example this page uses Uniprot, but the process will be similar for each of the data sources.
So let's say we want to update Uniprot, which the Open PHACTS 2.0 data sources list as version 2015_11, whereas the latest version is 2016_07.
Note that Uniprot is a special case for Open PHACTS: although you can download all of the Uniprot RDF, that is 128 GB compressed as rdf.xz, which means it would require something like 2 TB just for loading. Luckily, the queries used by Open PHACTS only require a much smaller subset of this, about 7.9 GB as rdf.gz.
Still, just parsing may take many hours, so for demonstration purposes this tutorial downloads a small subset.
The Uniprot data source is described as:
curl -d 'query=reviewed%3ayes&force=yes&format=rdf' http://www.uniprot.org/uniprot/ > swissprot_20151116.rdf.xml
curl -d 'query=reviewed%3ayes&force=yes&format=rdf' http://www.uniprot.org/uniparc/ > uniparc_20151116.rdf.xml
curl -d 'sort=&desc=&query=reviewed%3ayes&fil=&format=rdf&force=yes' http://www.uniprot.org/uniref/ > uniref_20151116.rdf.xml
These are three separate queries against the Uniprot API, which return the result as RDF based on the latest release. So if we run those queries again today we should get the latest Uniprot data.
Note that running the above downloads can take multiple hours, so in this tutorial we will cheat a bit and make a small subset. The Uniprot query can be accessed in the browser for each of those three sources:
- http://www.uniprot.org/uniprot/?query=reviewed%3Ayes&sort=score
- http://www.uniprot.org/uniparc/?query=reviewed%3Ayes&sort=score
- http://www.uniprot.org/uniref/?query=reviewed%3Ayes&sort=score
You may notice that all three queries crucially filter by reviewed%3ayes - that is, reviewed:yes in the search box.
This filters to only include the Swiss-Prot manually reviewed entries of Uniprot, which as you can see drastically reduces the result size:
Data source | Reviewed | Unreviewed | Total |
---|---|---|---|
uniprot | 551,705 | 65,378,749 | 65,930,454 |
uniparc | 503,157 | 123,735,183 | 124,238,340 |
uniref | 952,465 | 143,351,546 | 144,304,011 |
For each of these queries you will see there's a Download button, from where you select:
- Download all (551705)
- Format: RDF/XML
- Compressed
For the purpose of this tutorial we will modify the query to download only the entries created in the last month. You can similarly modify the queries to only load a particular species, etc.
- http://www.uniprot.org/uniprot/?query=reviewed%3Ayes+AND+created%3A%5B20160601+TO+*%5D&sort=score
- http://www.uniprot.org/uniparc/?query=reviewed%3Ayes+AND+created%3A%5B20160601+TO+*%5D&sort=score
- http://www.uniprot.org/uniref/?query=reviewed%3Ayes+AND+published%3A%5B20160601+TO+*%5D&sort=score
(Note: uniref uses the column published rather than created, but as that still gives 267,624 hits, for demonstration purposes we artificially limit this to also filter on 100% identity.)
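If you want to change the cut-off date, the three subset URLs can be generated rather than edited by hand. This is only a sketch: `subset_url` is a hypothetical helper, and it simply reproduces the URL pattern shown above with the dataset, date column and date filled in. It does not add the extra identity filter mentioned in the note.

```shell
#!/bin/sh
# Cut-off date in YYYYMMDD form; 20160601 is the value used on this page.
SINCE="${SINCE:-20160601}"

# Build a browser URL for the reviewed entries added since a given date.
# $1 = uniprot | uniparc | uniref
# $2 = date column (created or published)
# $3 = cut-off date, YYYYMMDD
# %% in the format string prints a literal % for the URL escapes.
subset_url() {
    printf 'http://www.uniprot.org/%s/?query=reviewed%%3Ayes+AND+%s%%3A%%5B%s+TO+*%%5D&sort=score\n' \
        "$1" "$2" "$3"
}

subset_url uniprot created   "$SINCE"
subset_url uniparc created   "$SINCE"
subset_url uniref  published "$SINCE"
```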
Now you should have three files:
-rw-rw-r-- 1 stain stain 18K Jul 25 14:12 uniparc-reviewed%3Ayes+AND+created%3A%5B20160601+TO+-%5D.rdf.gz
-rw-rw-r-- 1 stain stain 3.1M Jul 25 14:12 uniprot-reviewed%3Ayes+AND+created%3A%5B20160601+TO+-%5D.rdf.gz
-rw-rw-r-- 1 stain stain 47M Jul 25 14:17 uniref-reviewed%3Ayes+AND+published%3A%5B20160706+TO+-%5D.rdf.gz
Note that we used the filtering only for demonstration purposes. For instance, the above files do not include pre-existing entries that have since been modified, and we only go one month back. The full update with the queries from the wiki page should eventually end up with something like:
-rw-rw-r-- 4 stain stain 2.2G Nov 16 2015 swissprot_20151116.rdf.xml.gz
-rw-rw-r-- 4 stain stain 4.5G Nov 16 2015 uniparc_20151116.rdf.xml.gz
-rw-rw-r-- 4 stain stain 1.3G Nov 16 2015 uniref_20151116.rdf.xml.gz
If we are replacing an existing graph, then we would generally want to DROP GRAPH to remove the old named graph and keep the other graphs that are not being replaced, as doing a full reload of all the sources from RDF can be quite time consuming.
Note that for large graphs as in Open PHACTS, dropping a graph in Virtuoso can also be time consuming (sometimes slower than loading the new graph!).
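The drop statement itself is not shown on this page. As a hedged sketch, the helper below (a hypothetical name, `drop_graph_sql`) prints a SPARQL 1.1 `DROP SILENT GRAPH` statement in the form that Virtuoso's SQL console accepts, that is, prefixed with the `SPARQL` keyword and ended with a semicolon. The graph name is the Uniprot graph queried earlier; how you open a SQL console in your own installation is an assumption left to you.

```shell
#!/bin/sh
# Print a statement that removes one named graph and leaves all others alone.
# SILENT means no error is raised if the graph does not exist.
# (Hypothetical helper; it only prints text and does not contact Virtuoso.)
drop_graph_sql() {
    printf 'SPARQL DROP SILENT GRAPH <%s>;\n' "$1"
}

# Graph name as used in the count query on this page (no trailing slash).
drop_graph_sql http://purl.uniprot.org
```

Printing the statement first, instead of running it directly, gives you a chance to double-check the graph name before anything is deleted.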
Now we need to move the data to be available from within Virtuoso's Docker container. If you have downloaded the files on a different computer, you need to first transfer them to your Open PHACTS server:
cd ~/Downloads
mkdir uniprot
ls -al uni*gz # Should just be 3 files made just now
mv uni*gz uniprot/ # Move the downloads into the folder we are about to copy
ssh heater.cs.man.ac.uk mkdir -p data
scp -r uniprot/ heater.cs.man.ac.uk:data/
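Since a truncated download would only show up as an error much later, during loading, it can be worth checking the files before or after the transfer. A minimal sketch, assuming `gzip` is available; `check_gz_dir` is a hypothetical helper name.

```shell
#!/bin/sh
# Check that every .gz file in a directory is a complete gzip stream.
# Prints one line per file; returns non-zero if any file is damaged
# or if the directory contains no .gz files at all.
check_gz_dir() {
    rc=0
    for f in "$1"/*.gz; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -e "$f" ] || { echo "no .gz files in $1" >&2; return 1; }
        if gzip -t "$f" 2>/dev/null; then
            echo "ok      $f"
        else
            echo "BROKEN  $f"
            rc=1
        fi
    done
    return $rc
}

# Usage, with the folder from this page:
#   check_gz_dir ~/Downloads/uniprot
```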
The easiest way to load data into Virtuoso with Docker is to use the staging script, which automates the use of Virtuoso's RDF bulk loading feature (ld_dir and rdf_loader_run).
This script assumes the data to be loaded is in the Docker volume /staging (mapped from the local file system), and that the Virtuoso store is in /virtuoso. A file staging.sql should be in the top level of the mapped /staging folder.
So in our case the data to load is already in /home/stain/data in the uniprot subfolder:
stain@heater:~/data$ ls -al uniprot/
total 51204
drwxrwxr-x 2 stain stain 4096 Jul 25 14:36 .
drwxrwxr-x 3 stain stain 4096 Jul 25 14:37 ..
-rw-rw-r-- 1 stain stain 17547 Jul 25 14:36 uniparc-reviewed%3Ayes+AND+created%3A%5B20160601+TO+-%5D.rdf.gz
-rw-rw-r-- 1 stain stain 3156208 Jul 25 14:36 uniprot-reviewed%3Ayes+AND+created%3A%5B20160601+TO+-%5D.rdf.gz
-rw-rw-r-- 1 stain stain 49242276 Jul 25 14:36 uniref-reviewed%3Ayes+AND+published%3A%5B20160706+TO+-%5D.rdf.gz
We'll create our data/staging.sql file by modifying the relevant lines from ops-docker's staging.sql:
stain@heater:~/data$ vi staging.sql
ld_dir('/staging/uniprot' , '*.rdf.gz' , 'http://purl.uniprot.org' );
Note that we changed the filename pattern '*.nq.gz' (which is used within the https://data.openphacts.org/ downloads) to '*.rdf.gz' to match our filenames. Remember that the first parameter is the path within the Docker container, and must start with /staging, as our /home/stain/data folder will appear as /staging within the container.
Tip: Be careful about filename case as Linux is case-sensitive - keeping it lowercase is easiest.
Note: most of the graph names used in Open PHACTS do NOT include the trailing /.
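Instead of editing the file by hand, staging.sql can be generated. The following sketch assumes the same folder layout and filename pattern as above; `write_staging_sql` is a hypothetical helper, and it prints a warning if the graph name ends with a slash, since that is an easy mistake to make.

```shell
#!/bin/sh
# Filename pattern for our downloads, as discussed above.
PATTERN='*.rdf.gz'

# Write a one-line staging.sql into the directory that is mapped as /staging.
# $1 = host directory (e.g. /home/stain/data)
# $2 = subfolder holding the RDF files (e.g. uniprot)
# $3 = graph name (e.g. http://purl.uniprot.org)
write_staging_sql() {
    case "$3" in
        */) echo "warning: graph name $3 has a trailing /" >&2 ;;
    esac
    # The path is the one seen inside the container, so it starts with /staging.
    printf "ld_dir('/staging/%s', '%s', '%s');\n" "$2" "$PATTERN" "$3" \
        > "$1/staging.sql"
}

# Usage, with the values from this page:
#   write_staging_sql /home/stain/data uniprot http://purl.uniprot.org
```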
Now that our data is ready to be loaded by Virtuoso, we are going to:
- Shut down ops-virtuoso
- Run the bulk loading script from our mapped /staging
- Restart ops-virtuoso
- Inspect by SPARQL to verify the load was complete
- Verify the API still works
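The first three steps can be chained in a small script so that a failure stops the sequence. This is a sketch under assumptions: the stop and load commands are the ones used on this page, while `docker start ops-virtuoso` is assumed here as the standard counterpart of `docker stop`. The helper names `run` and `reload` are made up, and by default the script only prints the commands (a dry run).

```shell
#!/bin/sh
# Host folder mapped as /staging; the value is the one used on this page.
DATA_DIR="${DATA_DIR:-/home/stain/data}"
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop Virtuoso, run the bulk loader, start Virtuoso again.
# The && chain stops at the first step that fails.
reload() {
    run docker stop ops-virtuoso &&
    run docker run -v "$DATA_DIR:/staging:ro" --volumes-from ops-virtuosodata \
        -it stain/virtuoso staging.sh &&
    run docker start ops-virtuoso   # assumed restart command
}

reload
```

The `-it` flags need an interactive terminal, so this sketch is meant to be run by hand rather than from a scheduler. Add `sudo` in front of `docker` if your installation needs it.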
Important: You must shut down the running ops-virtuoso instance before running the staging script. The reason is that the script runs in a separate Docker container and needs full exclusive access to its /virtuoso.
stain@heater:~/data$ sudo docker stop ops-virtuoso
Next we'll start the bulk loading staging script, using the ops-virtuosodata data volume, in addition to the volume /home/stain/data mapped as /staging (read-only).
stain@heater:~/data$ docker run -v /home/stain/data:/staging:ro --volumes-from ops-virtuosodata -it stain/virtuoso staging.sh
* Starting Virtuoso Open Source Edition 7.2 virtuoso-opensource-7
Note that Virtuoso might take some time to start before you see:
Configuring SPARQL
Populating from /staging/staging.sql
Starting 7 rdf_loader_runs
Starting RDF loader 1
Starting RDF loader 2
Starting RDF loader 3
Starting RDF loader 4
Starting RDF loader 5
Starting RDF loader 6
Starting RDF loader 7
Starting RDF loader 8
The number of loader threads depends on the number of CPU cores. More threads generally speed up loading significantly, but in this case there are only 3 files, so in effect only three loader threads will be active. If you are preparing another RDF data source for loading with Virtuoso, then it's generally advisable to have many smaller RDF files (e.g. 10 MB each) rather than a single large file.
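For line-based RDF formats, where every line is one complete statement (N-Triples and N-Quads), a large file can be cut into smaller pieces with the standard `split` tool. This does not apply to the RDF/XML files used in this tutorial, which cannot be split by line. The sketch below uses a hypothetical helper `split_rdf_lines`, and the chunk size is given in lines rather than megabytes.

```shell
#!/bin/sh
# Split a line-based RDF file (N-Triples or N-Quads) into gzipped chunks.
# $1 = uncompressed input file ending in .nt or .nq
# $2 = number of lines (statements) per chunk
# $3 = output directory, ideally new or empty
split_rdf_lines() {
    in=$1 lines=$2 outdir=$3
    ext=${in##*.}                        # keep the input's extension: nt or nq
    mkdir -p "$outdir" || return 1
    split -l "$lines" "$in" "$outdir/part-" || return 1
    for part in "$outdir"/part-*; do
        case "$part" in *.gz) continue ;; esac   # skip chunks from an earlier run
        mv "$part" "$part.$ext" || return 1
        gzip "$part.$ext" || return 1
    done
}

# Usage (hypothetical file name); the matching ld_dir pattern would be '*.nq.gz':
#   split_rdf_lines big.nq 100000 /home/stain/data/bigdataset
```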
This wiki is licensed under a Creative Commons Attribution 4.0 International License.