RGDA1 for RDFlib #441

jpmccu · 2014-12-02T04:39:25Z

RGDA1 (RDF Graph Digest Algorithm 1) combines the traces graph labeling algorithm with the Sayers and Karp RDF graph digest algorithm. This algorithm will produce canonical cryptographically unique identifiers for any RDF graph.

joernhees · 2014-12-02T07:28:08Z

related to #385

joernhees · 2014-12-02T07:34:41Z

@jimmccusker could you have a look at the failing tests? https://travis-ci.org/RDFLib/rdflib/jobs/42700591
(Just in case you don't know: You can update this PR by pushing more commits to the branch jimmccusker:canonicalization. Travis will then automatically re-run the tests.)

joernhees · 2014-12-02T07:39:55Z

rdflib/compare.py

@@ -112,65 +146,292 @@ def __eq__(self, other):
            return False
        elif list(self) == list(other):
            return True  # TODO: really generally cheaper?
-        return self.internal_hash() == other.internal_hash()
+        return self.internal_hash() == other.graph_digest()


i guess if you want to rename internal_hash() to graph_digest() (as below) this line should be:
return self.graph_digest() == other.graph_digest()

jpmccu · 2014-12-07T07:23:20Z

I've aliased graph_digest to internal_hash and addressed the failing tests. Hopefully we are ready to go now. One of the builds errored, but it seems to be a timeout: https://travis-ci.org/RDFLib/rdflib/builds/43242370
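For readers following the naming discussion, here is a minimal sketch of what aliasing the two method names can look like. `TinyGraph` is a hypothetical stand-in, not rdflib's `IsomorphicGraph`, and it assumes graphs without blank nodes so that sorting the triples is enough to get a stable digest.

```python
import hashlib


class TinyGraph:
    """Hypothetical stand-in for a graph class; not rdflib's IsomorphicGraph."""

    def __init__(self, triples):
        self.triples = frozenset(triples)

    def internal_hash(self):
        # Assumption: no blank nodes, so sorted triples give a stable order.
        digest = hashlib.sha256()
        for triple in sorted(self.triples):
            digest.update(" ".join(triple).encode("utf-8"))
            digest.update(b"\n")
        return int(digest.hexdigest(), 16)

    def graph_digest(self):
        # Alias: both names return the same value.
        return self.internal_hash()

    def __eq__(self, other):
        if not isinstance(other, TinyGraph):
            return NotImplemented
        if len(self.triples) != len(other.triples):
            return False
        # Same method on both sides, as suggested in the review.
        return self.graph_digest() == other.graph_digest()


a = TinyGraph([("ex:s", "ex:p", "ex:o"), ("ex:s", "ex:q", "ex:o2")])
b = TinyGraph([("ex:s", "ex:q", "ex:o2"), ("ex:s", "ex:p", "ex:o")])
```

With the alias in place, callers of either name keep working, and `a == b` holds regardless of the order in which the triples were supplied.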

uholzer · 2014-12-07T20:13:41Z

rdflib/compare.py

+        self.hash_cache = {}
+
+    def key(self):
+        return (len(self.nodes),self.hash_color())


self.hash_cache and self._hash_color seem to be unused. And you are using a global variable _hash_cache. It seems that it is never cleared and is therefore a potential memory leak.
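To illustrate the concern, the following sketch contrasts a module-level cache with a cache owned by an instance. The names `cached_hash_global` and `Canonicalizer` are made up for this example and do not come from the PR.

```python
_hash_cache = {}  # module-level: shared by every call and never emptied


def cached_hash_global(value):
    if value not in _hash_cache:
        _hash_cache[value] = hash(value)
    return _hash_cache[value]


class Canonicalizer:
    """Hypothetical class that owns its cache."""

    def __init__(self):
        self.hash_cache = {}  # released together with the instance

    def cached_hash(self, value):
        if value not in self.hash_cache:
            self.hash_cache[value] = hash(value)
        return self.hash_cache[value]


for run in range(3):
    worker = Canonicalizer()  # one instance per simulated canonicalization
    for i in range(100):
        key = (run, i)
        cached_hash_global(key)
        worker.cached_hash(key)

global_size = len(_hash_cache)               # keeps growing across runs
last_instance_size = len(worker.hash_cache)  # only the last run's entries
```

After three runs the global dictionary holds the entries of all of them, while each instance cache only ever holds the entries of its own run.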

uholzer · 2014-12-07T20:33:03Z

If you have some literature references, it would be good to add them as comments to the code. They can help a lot with understanding the code.

There is a reference to another algorithm in the code:

class IsomorphicGraph(ConjunctiveGraph):
    """
    Ported from
    <http://www.w3.org/2001/sw/DataAccess/proto-tests/tools/rdfdiff.py>
    (Sean B Palmer's RDF Graph Isomorphism Tester).
    """

As this is no longer correct, please remove it.

Then I saw the following ancient code in IsomorphicGraph.__eq__:

        elif list(self) == list(other):
            return True  # TODO: really generally cheaper?

This looks horrible. It seems to create a complete copy of both graphs in memory. I think we should remove it, because it is definitely more expensive.
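A small sketch of why the shortcut is of limited use, using plain tuples instead of rdflib terms (an assumption made to keep the example self-contained): list equality only succeeds when both graphs happen to iterate in the same order, and it materializes both sequences first. `same_order` is a hypothetical helper showing the same check done lazily.

```python
from itertools import zip_longest


def same_order(iter_a, iter_b):
    """Compare two iterables pairwise without copying either one."""
    missing = object()  # marks the shorter side running out
    return all(
        x == y for x, y in zip_longest(iter_a, iter_b, fillvalue=missing)
    )


triples = [("ex:s", "ex:p", "ex:o1"), ("ex:s", "ex:p", "ex:o2")]
reordered = list(reversed(triples))

copy_based = list(iter(triples)) == list(iter(reordered))  # builds two lists
lazy = same_order(iter(triples), iter(reordered))          # stops at first mismatch
as_sets = set(triples) == set(reordered)                   # order-independent
```

Both order-sensitive checks report a mismatch here even though the two collections contain the same triples.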

uholzer · 2014-12-07T20:48:22Z

rdflib/compare.py

@@ -219,7 +482,7 @@ def to_canonical_graph(g1):
    deterministical MD5 checksums, correlated with the graph contents.
    """
    graph = Graph()
-    graph += _TripleCanonicalizer(g1, _md5_hash).canonical_triples()
+    graph += _TripleCanonicalizer(g1, _sha256_hash).canonical_triples()


_TripleCanonicalizer takes a hash function that behaves like the ones from hashlib. But _sha256_hash is different. Wouldn't it be correct to omit the second argument, such that the default (hashlib.sha256) is used? If yes, you can even remove the method _sha256_hash altogether.

Also, you should update the docstring above, as it no longer produces "MD5 checksums".
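As a rough illustration of what "behaves like the ones from hashlib" means, the sketch below accepts any hashlib-style constructor and defaults to `hashlib.sha256`. The function `digest_terms` is hypothetical and is not part of rdflib.

```python
import hashlib


def digest_terms(terms, hashfunc=hashlib.sha256):
    """Hash an iterable of strings with any hashlib-style constructor."""
    h = hashfunc()  # calling the constructor yields a fresh hash object
    for term in terms:
        h.update(term.encode("utf-8"))
    return h.hexdigest()


default_digest = digest_terms(["ex:s", "ex:p", "ex:o"])
other_digest = digest_terms(["ex:s", "ex:p", "ex:o"], hashfunc=hashlib.sha512)
```

Because the default already is a hashlib constructor, callers that want SHA-256 can simply leave the argument out.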

joernhees · 2014-12-08T15:27:31Z

the build timeouts for python 3.2 don't seem specific to this PR: i re-ran the previously passing py32 test of the latest commit on master https://travis-ci.org/RDFLib/rdflib/builds/42335741 and they now show the same timeout. I guess it's some temporary problem of py32 with travis / an updated lib that causes this.

As the other tests all pass i'd say "merge" if no one disagrees.

joernhees · 2014-12-08T16:26:05Z

rdflib/compare.py

+    result += td.microseconds / 1000000.0
+    return result
+
+class runtime(object):


are the runtime and call_count decorators needed for anything but the benchmark example?
Should they be private? (Usually classes are meant to be CamelCased according to PEP8)
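For context, instrumentation decorators of this kind could be written as private, lowercase functions, which sidesteps the class-naming question. The sketch below is an assumption about the general shape only: `_runtime`, `_call_count`, and the `stats` keyword are illustrative names, not a copy of the PR's code.

```python
import functools
import time


def _runtime(label):
    """Hypothetical private decorator: record elapsed seconds in a stats dict."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            stats = kwargs.get("stats")
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                if stats is not None:
                    stats[label] = time.perf_counter() - start
        return wrapper
    return decorator


def _call_count(label):
    """Hypothetical private decorator: count calls in a stats dict."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            stats = kwargs.get("stats")
            if stats is not None:
                stats[label] = stats.get(label, 0) + 1
            return func(*args, **kwargs)
        return wrapper
    return decorator


@_runtime("total_seconds")
@_call_count("calls")
def work(n, stats=None):
    return sum(range(n))


stats = {}
result = work(1000, stats=stats)
work(10, stats=stats)
```

When no `stats` dictionary is passed, both decorators do nothing beyond calling the wrapped function, so they stay out of the way outside benchmarks.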

uholzer · 2014-12-10T20:15:43Z

rdflib/compare.py

+
+
+class Color:
+    def __init__(self, nodes, hashfunc, color=(), hash_cache={}):


hash_cache={} is problematic in this case due to the way Python handles default values. Default values are created once, when the function is defined, and reused for every call. For mutable values, this results in effects like the following:

>>> def f(i, s={}):
...     s[i] = 'foo'
...     return s
...
>>> print(f(0))
{0: 'foo'}
>>> print(f(1))
{0: 'foo', 1: 'foo'}
>>> print(f(2))
{0: 'foo', 1: 'foo', 2: 'foo'}

You probably want to do something like this:

def __init__(self, nodes, hashfunc, color=(), hash_cache=None):
    if hash_cache is None:
        hash_cache = {}

color=() on the other hand is okay, because tuples are immutable.

Alternatively, you could also create a copy of the dict:

def __init__(self, nodes, hashfunc, color=(), hash_cache={}):
    hash_cache = dict(hash_cache)

This would mean that the object given by parameter hash_cache is never changed. Instead, a copy is created and modified later on.

Which alternative to choose of course depends on the use case.

Thanks, your first suggestion is what I intended. The idea is that the hash_cache is actually re-used by all colors within a canonicalization, obviating the need to re-hash existing colors. This code review has been very helpful!
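The intended behaviour (one cache shared by all colors of a canonicalization, but no hidden sharing between unrelated calls) can be sketched as follows. This `Color` is a simplified, hypothetical version written for illustration; the `hash_color` body in particular is an assumption and not the PR's implementation.

```python
import hashlib


class Color:
    """Simplified, hypothetical version of the class under review."""

    def __init__(self, nodes, hashfunc, color=(), hash_cache=None):
        if hash_cache is None:
            hash_cache = {}  # fresh dict per call, not one shared default
        self.nodes = nodes
        self.hashfunc = hashfunc
        self.color = color
        self.hash_cache = hash_cache

    def hash_color(self):
        # Hash each distinct color tuple once per cache.
        if self.color not in self.hash_cache:
            h = self.hashfunc()
            for part in self.color:
                h.update(str(part).encode("utf-8"))
            self.hash_cache[self.color] = h.hexdigest()
        return self.hash_cache[self.color]


shared = {}
first = Color(["_:a"], hashlib.sha256, color=("p", 1), hash_cache=shared)
second = Color(["_:b"], hashlib.sha256, color=("p", 1), hash_cache=shared)
first.hash_color()
second.hash_color()  # served from the shared cache

isolated = Color(["_:c"], hashlib.sha256, color=("q", 2))
```

Passing the same dictionary explicitly makes the sharing visible at the call site, while a `Color` created without a cache starts from an empty one.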

uholzer · 2014-12-10T20:38:20Z

Excellent. One of my projects has unit tests that rely heavily on graph isomorphism tests. Granted, they are mostly trivial cases (the old algorithm worked too), but it makes me happy.

Two points that are open (from my point of view) before we merge:

1. Is there a unit test that performs a non-trivial graph isomorphism test, i.e. one that the old algorithm wouldn't have passed?
2. I cannot really make a statement on the readability of the implementation of the algorithm. Can someone else comment? Do you think it is straightforward to understand the code given some background knowledge about the algorithm? Maybe we would need to add some well-placed comments? (Don't add comments before we have some good feedback from others. @joernhees, what do you think?)

Anyway, I think we are quite close to merging this into master.

jpmccu · 2014-12-10T20:47:17Z

On question 1, I have tests for several nontrivial graphs, including this one:

http://pallini.di.uniroma1.it/Introduction.html#lev5

Which has a very tricky cross swap where one equitable partition contains two orbit partitions. This is the main use case that requires the sort of tree search shown in traces. I have other things like duplicate bnodes and bnode loops as well.
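For small cases such as bnode loops, an exhaustive checker can serve as a test oracle to compare against. The sketch below is deliberately not RGDA1 or Traces: it is a brute-force search over blank-node mappings with factorial running time, and it models triples as tuples of strings where blank nodes start with `_:` (an assumption made for this example).

```python
from itertools import permutations


def is_bnode(term):
    return term.startswith("_:")


def bnodes(triples):
    return sorted({t for triple in triples for t in triple if is_bnode(t)})


def isomorphic_bruteforce(g1, g2):
    """Exhaustive blank-node matching; factorial time, tiny graphs only."""
    g1, g2 = set(g1), set(g2)
    b1, b2 = bnodes(g1), bnodes(g2)
    if len(g1) != len(g2) or len(b1) != len(b2):
        return False
    for candidate in permutations(b2):
        mapping = dict(zip(b1, candidate))
        renamed = {tuple(mapping.get(t, t) for t in triple) for triple in g1}
        if renamed == g2:
            return True
    return False


# A three-node blank-node loop, labelled two different ways.
loop_a = [("_:a", "ex:next", "_:b"), ("_:b", "ex:next", "_:c"),
          ("_:c", "ex:next", "_:a")]
loop_b = [("_:x", "ex:next", "_:z"), ("_:z", "ex:next", "_:y"),
          ("_:y", "ex:next", "_:x")]
# Same number of triples and blank nodes, but not a loop.
not_loop = [("_:a", "ex:next", "_:b"), ("_:b", "ex:next", "_:c"),
            ("_:a", "ex:next", "_:c")]
```

Here `isomorphic_bruteforce(loop_a, loop_b)` succeeds and `isomorphic_bruteforce(loop_a, not_loop)` fails, which is the kind of pair a digest-based comparison should agree with.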

As to question 2, I'm actually hoping that this implementation will be more readable than the original paper, which I had to work very hard to understand. I'm very curious about which points may be more or less confusing.

uholzer · 2014-12-10T20:58:45Z

Okay. That answers my questions.

@joernhees I agree that we can merge now.

RGDA1 for RDFlib. Big thanks to @jimmccusker

uholzer · 2014-12-11T20:33:52Z

@jimmccusker, congratulations!

Thank you very much for your hard work on this. It is always very important that results from computer science get incorporated into actual software, because otherwise we just keep reinventing square wheels.

I have one remaining request: Once your work has been published, could you please make a pull request to update the literature reference? This would be awesome.

Jim McCusker and others added 12 commits November 15, 2014 23:37

First pass at a traces implementation, not full pruning yet

2e115b8

Final bugfixes and a round of performance improvements

dad0677

naive pruning of automorphisms

cdc3d6a

added benchmark and related instrumentation

368f352

further benchmark refinements

d911bc4

added multithreading to the benchmark.

0419fa4

switched to multiprocessing

a69c22a

serial loading of ontologies.

86e324e

throttling the bioportal downloads to a max of 4 connections.

ceae542

Forgot to actually put the finished tasks back out of the queue.

0f8adec

Forgot to actually put the finished tasks back out of the queue.

c3ede71

more automorphism detection

ea8ea14

joernhees reviewed Dec 2, 2014

Jim McCusker added 4 commits December 7, 2014 01:04

More tests working, but it looks like JSON-LD doesn't do safe roundtr…

110a0d4

…ip conversion.

Support for python 2.6 and hopefully 3.x

8be0ad2

Another unicode tweak, works locally on python 2.7.7 (anaconda)

3ad72fa

.n3() always returns unicode (right?) and the only other possible thi…

52121a5

…ng is integers, so this should work.

uholzer reviewed Dec 7, 2014

Updated comments and citations, removed badly performing equality mea…

b29e0b0

…sure and default hash function.

minor: code style guides

8e78708

joernhees mentioned this pull request Dec 8, 2014

minor: code style guides jpmccu/rdflib#1

Merged

joernhees reviewed Dec 8, 2014

Privatized some decorators and utility functions.

4be615c

Jim McCusker added 3 commits December 8, 2014 23:34

Merge remote-tracking branch 'joernhees/canonicalization' into canoni…

a4f2999

…calization Conflicts: rdflib/compare.py

Fixed up some of the style adjustments to pass tests.

200c226

misnamed performance decorator

26c2755

uholzer reviewed Dec 10, 2014

removed unintended singleton

bf20429

joernhees added a commit that referenced this pull request Dec 11, 2014

Merge pull request #441 from jimmccusker/canonicalization

175c028

RGDA1 for RDFlib. Big thanks to @jimmccusker

joernhees merged commit 175c028 into RDFLib:master Dec 11, 2014

joernhees mentioned this pull request Dec 15, 2014

incorrect canonical graph algorithm in rdflib.compare #385

Closed

joernhees modified the milestone: rdflib 4.2.0 Feb 19, 2015

joernhees added the enhancement New feature or request label Feb 19, 2015

ocefpaf mentioned this pull request Apr 3, 2015

Updated rdflib. ioos/conda-recipes#177

Merged

joernhees mentioned this pull request Apr 30, 2015

Canonical form of SPARQL Patterns #483

Closed

pyup-bot mentioned this pull request Nov 8, 2016

Update rdflib to 4.2.1 mytardis/mytardis#733

Closed

This was referenced Jan 16, 2017

Initial Update mozilla/addons-server#4303

Closed

Update rdflib to 4.2.1 mozilla/addons-server#4390

Closed

pyup-bot mentioned this pull request Jan 29, 2017

Update rdflib to 4.2.2 mytardis/mytardis#815

Merged

This was referenced Mar 16, 2017

Initial Update mozilla/amo-validator#510

Closed

Update rdflib to 4.2.2 mozilla/amo-validator#515

Closed
