Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RGDA1 for RDFlib #441

Merged
merged 23 commits into from
Dec 11, 2014
Merged

RGDA1 for RDFlib #441

merged 23 commits into from
Dec 11, 2014

Conversation

jpmccu
Copy link
Contributor

@jpmccu jpmccu commented Dec 2, 2014

RGDA1 (RDF Graph Digest Algorithm 1) combines the traces graph labeling algorithm with the Sayers and Karp RDF graph digest algorithm. This algorithm will produce canonical cryptographically unique identifiers for any RDF graph.

@joernhees
Copy link
Member

related to #385

@joernhees
Copy link
Member

@jimmccusker could you have a look at the failing tests? https://travis-ci.org/RDFLib/rdflib/jobs/42700591
(Just in case you don't know: You can update this PR by pushing more commits to the branch jimmccusker:canonicalization. Travis will then automatically re-run the tests.)

@@ -112,65 +146,292 @@ def __eq__(self, other):
return False
elif list(self) == list(other):
return True # TODO: really generally cheaper?
return self.internal_hash() == other.internal_hash()
return self.internal_hash() == other.graph_digest()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess if you want to rename internal_hash() to graph_digest() (as below) this line should be:
return self.graph_digest() == other.graph_digest()

@jpmccu
Copy link
Contributor Author

jpmccu commented Dec 7, 2014

I've aliased graph_digest to internal_hash and addressed the failing tests. Hopefully we are ready to go now. One of the builds errored, but it seems to be a timeout: https://travis-ci.org/RDFLib/rdflib/builds/43242370

self.hash_cache = {}

def key(self):
return (len(self.nodes),self.hash_color())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.hash_cache and self._hash_color seem to be unused. And you are using a global variable _hash_cache. It seems that it is never and is therefore a potential memory leak.

@uholzer
Copy link
Contributor

uholzer commented Dec 7, 2014

If you have some literature references, it would be good to add them as comments to the code. They can help a lot with understanding the code.

There is a reference to another algorithm in the code:

class IsomorphicGraph(ConjunctiveGraph):
    """
    Ported from
    <http://www.w3.org/2001/sw/DataAccess/proto-tests/tools/rdfdiff.py>
    (Sean B Palmer's RDF Graph Isomorphism Tester).
    """

As this is no longer correct, please remove it.

Then I saw the following ancient code in IsomorphicGraph.__eq__:

        elif list(self) == list(other):
            return True  # TODO: really generally cheaper?

This looks horrible. It seems to create a complete copy of both graphs in memory. I think we should remove it, because it is definitely more expensive.

@@ -219,7 +482,7 @@ def to_canonical_graph(g1):
deterministical MD5 checksums, correlated with the graph contents.
"""
graph = Graph()
graph += _TripleCanonicalizer(g1, _md5_hash).canonical_triples()
graph += _TripleCanonicalizer(g1, _sha256_hash).canonical_triples()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_TripleCanonicalizer takes a has function that behaves like the ones from hashlib. But _sha256_hash is different. Wouldn't it be correct to ommit the second argument, such that the default (haslib.sha256) is used? If yes, you can even remove the method _sha256_hash altogether.

Also, you should update the comment above, as its no longer "MD5 checksums".

@joernhees
Copy link
Member

the build timeouts for python 3.2 don't seem specific to this PR: i re-ran the previously passing py32 test of the latest commit on master https://travis-ci.org/RDFLib/rdflib/builds/42335741 and they now show the same timeout. I guess it's some temporary problem of py32 with travis / an updated lib that causes this.

As the other tests all pass i'd say "merge" if no one disagrees.

result += td.microseconds / 1000000.0
return result

class runtime(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the runtime and call_count decorators needed for anything but the benchmark example?
Should they be private? (Usually classes are meant to be CamelCased according to PEP8)



class Color:
def __init__(self, nodes, hashfunc, color=(), hash_cache={}):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hash_cache={} is problematic in this case due to the way python handles default values. In fact, the default values are created once, when the function is constructed, and reused for every call. For mutable values, this results in effects like the following:

>>> def f(i, s={}):
...   s[i] = 'foo'
...   return s
... 
>>> print(f(0))
{0: 'foo'}
>>> print(f(1))
{0: 'foo', 1: 'foo'}
>>> print(f(2))
{0: 'foo', 1: 'foo', 2: 'foo'}

You probably want to do something like this:

def __init__(self, nodes, hashfunc, color=(), hash_cache=None):
    if hash_cache is None:
        hash_cache = {}

color=() on the other hand is okay, because tuples are immutable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you could also create a copy of the dict:

def __init__(self, nodes, hashfunc, color=(), hash_cache={}):
    hash_cache = dict(hash_cache)

This would mean that the object given by parameter hash_cache is never changed. Instead, a copy is created and modified later on.

Which alternative to choose of course depends on the usecase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, your first suggestion is what I intended. The idea is that the hash_cache is actually re-used by all colors within a canonicalization, obviating the need to re-hash existing colors. This code review has been very helpful!

@uholzer
Copy link
Contributor

uholzer commented Dec 10, 2014

Excellent. One of my projects has unit tests that rely heavily on graph isomoprhism tests. Granted, they are mostly trivial cases (the old algorithm worked too), but it makes me happy.

Two points that are open (from my point of view) before we merge:

  1. Is there a unit test that checks does a non-trivial graph isomporhism test? I.e. a unit test that the old algorithm wouldn't have passed?
  2. I can not really make a statement on the readability of the implementation of the algorithm. Can someone else comment? Do you think it is straight forward to understand the code given some background knowledge about the algorithm? Maybe we would need to add some well placed comments? (Don't add comments, before we have some good feedback from others. @joernhees, what do you think?)

Anyway, I think we are quite close to merging this into master.

@jpmccu
Copy link
Contributor Author

jpmccu commented Dec 10, 2014

On question 1, I have tests for several nontrivial graphs, including this one:

http://pallini.di.uniroma1.it/Introduction.html#lev5

Which has a very tricky cross swap where one equitable partition contains two orbit partitions. This is the main use case that requires the sort of tree search shown in traces. I have other things like duplicate bnodes and bnode loops as well.

As to question 2, I'm actually hoping that this implementation will be more readable than the original paper, which I had to work very hard to understand. I'm very curious about which points may be more or less confusing.

@uholzer
Copy link
Contributor

uholzer commented Dec 10, 2014

Okay. That answers my questions.

@joernhees I agree that we can merge now.

joernhees added a commit that referenced this pull request Dec 11, 2014
RGDA1 for RDFlib. Big thanks to @jimmccusker
@joernhees joernhees merged commit 175c028 into RDFLib:master Dec 11, 2014
@uholzer
Copy link
Contributor

uholzer commented Dec 11, 2014

@jimmccusker, congratulations!

Thank you very much for your hard work on this. It is always very important that results from computer science get incorporated into actual software, because otherwise we just keep reinventing square wheels.

I have one remaining request: Once your work has been published, could you please make a pull request to update the literature reference? This would be awesome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants