Canonical form of SPARQL Patterns #1536

joernhees · 2015-04-30T13:37:20Z

joernhees
Apr 30, 2015
Maintainer

I'm currently running well over 1M SPARQL queries as part of a machine learning algorithm. As this takes a while, I thought about caching the query results. The problem is that different SPARQL queries can use different variable names but be isomorphic otherwise. Example:

select * where { ?s foo:bar foo:bla }

is isomorphic to

select * where { ?s2 foo:bar foo:bla }

For quick checking in a cache it would be cool to have a canonical form of a SPARQL Pattern, very much like #441 (rdflib.compare.to_canonical_graph(g1)) for rdflib.Graph.
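
To make "canonical form" concrete, here is a minimal brute-force sketch in plain Python. It is not RGDA1 and not part of rdflib: it assumes a BGP is a list of string triples in which variables start with "?", and the names `is_var` and `canonical_bgp` are made up for illustration. It tries every renaming of the variables and keeps the smallest result, so it is only usable for patterns with a handful of variables.

```python
from itertools import permutations


def is_var(term):
    # Assumption for this sketch: variables are strings starting with "?".
    return term.startswith("?")


def canonical_bgp(triples):
    """Return a canonical, order-independent form of a small BGP.

    Every bijection from the pattern's variables onto ?v000, ?v001, ...
    is tried; the lexicographically smallest sorted pattern wins.
    The cost grows factorially with the number of variables.
    """
    variables = sorted({t for triple in triples for t in triple if is_var(t)})
    names = ["?v%03d" % i for i in range(len(variables))]
    best = None
    for perm in permutations(names):
        mapping = dict(zip(variables, perm))
        candidate = tuple(sorted(set(
            tuple(mapping.get(term, term) for term in triple)
            for triple in triples
        )))
        if best is None or candidate < best:
            best = candidate
    return best
```

Two patterns are isomorphic exactly when this function returns the same tuple for both, which is the property a cache key needs. A real implementation would replace the exhaustive search with a proper graph canonicalization algorithm.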

A SPARQL query's pattern part can be represented as an rdflib.Graph that contains Variables. By replacing Variables with BNodes (using the variable name as the BNode id) one gets pretty close to a graph that the to_canonical_graph algorithm could be used on, with one exception: BNodes can't be used as predicates (RDF Concepts, http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#section-triples).

As this is out of spec, I guess it's OK that this fails:

In [1]: from rdflib import *
INFO:rdflib:RDFLib Version: 4.2.1-dev

In [2]: from rdflib.compare import *

In [3]: g = Graph()

In [4]: g.add((BNode('v1'), BNode('v2'), URIRef('foo')))

In [5]: to_canonical_graph(g)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
[...]
/usr/local/lib/python2.7/site-packages/rdflib/compare.pyc in _canonicalize_bnodes(self, triple, labels)
    456         for term in triple:
    457             if isinstance(term, BNode):
--> 458                 yield BNode(value="cb%s" % labels[term])
    459             else:
    460                 yield term

KeyError: rdflib.term.BNode('v2')

Nevertheless, as this is quite close to a cool feature and graph canonicalization isn't exactly the easiest problem to think about: is it maybe possible to slightly adapt the RGDA1 algorithm to support BNodes in the predicate position as well and thereby also making it fit for SPARQL Patterns? Maybe @jimmccusker has an idea on this?

jpmccu · 2015-04-30T19:25:38Z

jpmccu
Apr 30, 2015

I think that RDF and SPARQL APIs distinguish between variables and BNodes.
Anything beyond BGP would need something more complex than you're
suggesting. However, it looks like SPIN has a mapping of SPARQL in RDF [1],
so you could compute the hash of that.

Jim

[1] http://spinrdf.org/


joernhees · 2015-05-01T12:09:07Z

joernhees
May 1, 2015
Maintainer Author

Thanks for the reply. At the moment I'm just after BGPs, but you're right, it could get a lot more complicated.

Using a SPIN-like approach to map SPARQL to RDF and then run your RGDA1 algorithm on it is a very interesting idea. They're not transforming the Variables into BNodes, but I could. Another problem I see for now is the order-preserving Turtle sequence they're using (http://spinrdf.org/sp.html#overview) for the BGP, as I want a statement-order-independent mapping. So I guess I'll go for a more direct mapping into an RDF graph, using BNodes for the reification of each BGP statement and BNodes for all Variables.


kasei · 2015-05-01T19:01:34Z

kasei
May 1, 2015

FYI, we added support for this (https://github.com/kasei/attean/blob/master/bin/canonicalize_bgp.pl) in the Attean Perl library (https://github.com/kasei/attean) to help with caching. It was a fairly simple extension of existing graph canonicalization code to handle any set of patterns (triples, triple patterns, quads, quad patterns, SPARQL Results...).

% cat test.rq
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
    ?p a foaf:Person ; foaf:name ?name
}

% perl bin/canonicalize_bgp.pl test.rq
# Hash key: f3e47550cdcf8c0bd1c93f55f01f9b1a73db9efd
SELECT (?v001 AS ?p) (?v002 AS ?name) WHERE {
    ?v001 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
    ?v001 <http://xmlns.com/foaf/0.1/name> ?v002 .
}
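
For readers who want something similar in Python, a hash key like the one above can be derived from an already canonicalized pattern roughly as follows. This is only a sketch: the serialization (one space-separated line per triple, sorted) is an assumption of this example, and SHA-1 is merely guessed from the 40 hex digits of the key shown above, so the digests will not match Attean's. The function name `bgp_hash_key` is hypothetical.

```python
import hashlib


def bgp_hash_key(canonical_triples):
    """Hash a canonicalized BGP given as an iterable of string triples.

    The triples must already use canonical variable names; sorting the
    serialized lines only removes the dependence on statement order.
    """
    lines = sorted(" ".join(triple) + " ." for triple in canonical_triples)
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()
```

The `(?v001 AS ?p)` projection in the output above serves a second purpose: it records how canonical names map back to the names used in the original query, which a cache needs in order to return results under the caller's variable names.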


chimezie · 2015-05-01T19:20:35Z

chimezie
May 1, 2015
Collaborator

This might not be too immediately practical, since the code I'm pointing
you to is very out of sync with the latest rdflib, but I had to create just
such a solution to implement Fuxi's RETE network.

See HashablePatternList:

https://code.google.com/p/fuxi/source/browse/lib/Rete/Network.py#50

Perhaps there is a more recent port of this ..

However, this only covers Basic Graph Patterns ...


joernhees · 2015-05-04T14:28:23Z

joernhees
May 4, 2015
Maintainer Author

First of all: thanks for all the cool feedback, I guess I would've chewed on this for quite a while without you guys ;)

I ended up using the following reification-based approach, mostly because it's quite short (most of the below is doctest) and uses the RGDA1 canonicalization that we already have in rdflib. Below I first convert each triple into a reified statement in a new graph, also converting Variables into BNodes. Then I run the RGDA1 canonicalization on that graph and afterwards re-extract the triples, transforming the renamed BNodes into renamed Variables. The result is a canonicalization of the BGP that is independent of variable names and statement order:

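# Imports needed by the function below (not shown in the original snippet).
from collections.abc import Iterable  # on Python 2: from collections import Iterable

import rdflib.compare
from rdflib import BNode, Graph, RDF, URIRef, Variable


def canonicalize_sparql_bgp(gp):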
    """Returns a canonical basic graph pattern (BGP) with canonical var names.

    :param gp: a GraphPattern in form of a list of triples with Variables
    :return: A canonical GraphPattern with Variables renamed.

    >>> U = URIRef
    >>> V = Variable
    >>> gp1 = [
    ...     (V('blub'), V('bar'), U('blae')),
    ...     (V('foo'), V('bar'), U('bla')),
    ...     (V('foo'), U('poo'), U('blub')),
    ... ]
    >>> cgp = canonicalize_sparql_bgp(gp1)
    >>> v_blub = V('cb0')
    >>> v_bar = V(
    ...  'cb3d1b27f6269e23775a8da8d966dd669aa8262176ae6b938cccd653316791c42269')
    >>> v_foo = V(
    ...  'cb3b2718590899b3875a33cdc4aad060832711a614ee9c0ac83323f2e961bcc3f2db')
    >>> expected = [
    ...     (v_blub, v_bar, U('blae')),
    ...     (v_foo, v_bar, U('bla')),
    ...     (v_foo, U('poo'), U('blub'))
    ... ]
    >>> cgp == expected
    True

    To show that this is variable name and order independent we shuffle gp1 and
    rename its vars:
    >>> gp2 = [
    ...     (V('foonkyname'), V('baaar'), U('bla')),
    ...     (V('foonkyname'), U('poo'), U('blub')),
    ...     (V('funkyname'), V('baaar'), U('blae')),
    ... ]
    >>> cgp == canonicalize_sparql_bgp(gp2)
    True

    """
    assert isinstance(gp, Iterable)
    g = Graph()
    for t in gp:
        triple_bnode = BNode()
        s, p, o = [BNode(i) if isinstance(i, Variable) else i for i in t]
        g.add((triple_bnode, RDF['type'], RDF['Statement']))
        g.add((triple_bnode, RDF['subject'], s))
        g.add((triple_bnode, RDF['predicate'], p))
        g.add((triple_bnode, RDF['object'], o))
    cg = rdflib.compare.to_canonical_graph(g)
    cgp = []
    for triple_bnode in cg.subjects(RDF['type'], RDF['Statement']):
        t = [
            cg.value(triple_bnode, p)
            for p in [RDF['subject'], RDF['predicate'], RDF['object']]
        ]
        t = tuple([Variable(i) if isinstance(i, BNode) else i for i in t])
        cgp.append(t)
    return sorted(cgp)

The remaining question is: do we want this in rdflib somewhere?
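
For the caching use case from the original post, one detail matters: results stored under a canonical pattern carry canonical variable names, so they have to be renamed back for each caller. The function above returns only the canonical pattern, not the renaming, so the sketch below assumes a hypothetical `canonicalize` callable that also returns a mapping from canonical to original variable names, plus a hypothetical `run_query` callable that executes the canonical pattern. Neither exists in rdflib; both are injected to keep the example self-contained.

```python
class PatternCache:
    """Cache query results under a canonical graph pattern (sketch)."""

    def __init__(self, canonicalize, run_query):
        # canonicalize(bgp) -> (canonical_bgp, {canonical_var: original_var})
        # run_query(canonical_bgp) -> list of {canonical_var: value} rows
        self._canonicalize = canonicalize
        self._run_query = run_query
        self._results = {}
        self.misses = 0

    def query(self, bgp):
        canonical, to_original = self._canonicalize(bgp)
        key = tuple(canonical)
        if key not in self._results:
            # First time this pattern (up to renaming) is seen: run it.
            self.misses += 1
            self._results[key] = self._run_query(canonical)
        # Rename the cached bindings back to the caller's variable names.
        return [
            {to_original[var]: value for var, value in row.items()}
            for row in self._results[key]
        ]
```

With such a wrapper, the two example queries from the first post would trigger a single endpoint request, and each caller would still see bindings for its own ?s or ?s2.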


jpmccu · 2015-05-04T15:10:59Z

jpmccu
May 4, 2015

Hmm, SPIN's approach is not what I would have imagined. They seem to have a special sp:_ URI space for variables, and like Jorn said, they use a list to order the BGP patterns, both of which seem odd to me, but are probably needed if they want to do complete 1:1 reconstructions.


uholzer · 2015-05-21T19:18:08Z

uholzer
May 21, 2015
Collaborator

I'd like to draw your attention to a related discussion flaming up on the semantic web mailing list:
Re: deterministic naming of blank nodes
They are mainly discussing naming of blank nodes and canonicalization.


uholzer · 2015-05-21T19:30:02Z

uholzer
May 21, 2015
Collaborator

@jimmccusker: Would it be difficult to extend RGDA1 to N3?

Surely, you would first have to extend it to Generalized RDF Graphs.

The next problem is formulas: in order to run RGDA1 on an N3 formula, it could first determine a hash for each subformula by invoking itself recursively. Then it could use these hashes in place of the original formulas.

But then there are also the variable bindings ...

Finally, having support for N3 could make it easier to canonicalize SPARQL?
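
The recursive idea for formulas can be sketched in plain Python as a Merkle-style hash. This is only an illustration of the recursion, not RGDA1: it assumes a formula is a frozenset of (s, p, o) tuples whose terms are either plain strings without spaces or nested formulas, and it ignores blank nodes and variable bindings entirely. The name `formula_hash` is hypothetical.

```python
import hashlib


def formula_hash(formula):
    """Hash a nested formula given as a frozenset of (s, p, o) tuples.

    A term is either a string or another frozenset (a subformula).
    Subformulas are hashed first and their digest is used in place of
    the subformula when the enclosing triple is serialized.
    """
    lines = []
    for triple in formula:
        terms = []
        for term in triple:
            if isinstance(term, frozenset):
                # Recurse: stand in for the subformula with its hash.
                terms.append("{" + formula_hash(term) + "}")
            else:
                terms.append(term)
        lines.append(" ".join(terms))
    # Sorting makes the digest independent of statement order.
    payload = "\n".join(sorted(lines)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Handling blank nodes and quantified variables inside the subformulas is the hard part that this sketch leaves out.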


jpmccu · 2015-05-21T21:23:52Z

jpmccu
May 21, 2015

Some small changes would need to be made to support Generalized RDF graphs
(mostly to support canonicalizing bnodes that are used as predicates). I
don't know enough about all the extra stuff that's in N3 that isn't in RDF
to go beyond that though.

I think a better general use approach for canonicalizing SPARQL will be
something more like SPIN, although it doesn't fully support Jorn's use case
here, since it provides URIs for variables.

Is full N3 used commonly anymore? I haven't seen much of it.

Jim


uholzer · 2015-05-22T19:20:06Z

uholzer
May 22, 2015
Collaborator

Well, I am using N3 extensively. (Although I am usually not representative.) Also, Jos De Roo is still actively developing EYE.

But okay, point about SPARQL taken.


jpmccu · 2017-05-10T23:39:06Z

jpmccu
May 10, 2017

FYI I think I'm now canonicalizing blank nodes in predicates in RGDA1, but I haven't tested it explicitly.
