RGDA1 for RDFlib #441
related to #385
@jimmccusker could you have a look at the failing tests? https://travis-ci.org/RDFLib/rdflib/jobs/42700591
@@ -112,65 +146,292 @@ def __eq__(self, other):
             return False
         elif list(self) == list(other):
             return True  # TODO: really generally cheaper?
-        return self.internal_hash() == other.internal_hash()
+        return self.internal_hash() == other.graph_digest()
I guess if you want to rename internal_hash() to graph_digest() (as below), this line should be:
        return self.graph_digest() == other.graph_digest()
…ng is integers, so this should work.
I've aliased graph_digest to internal_hash and addressed the failing tests. Hopefully we are ready to go now. One of the builds errored, but it seems to be a timeout: https://travis-ci.org/RDFLib/rdflib/builds/43242370
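For readers following along, here is a minimal sketch of what aliasing one method name to another can look like in Python. The class `DigestGraph` and its toy digest are hypothetical and are not the code in rdflib/compare.py; only the names internal_hash and graph_digest come from the discussion above.

```python
class DigestGraph:
    """Hypothetical stand-in for a graph wrapper; not rdflib's class."""

    def __init__(self, triples):
        self.triples = frozenset(triples)

    def internal_hash(self):
        # Toy order-independent digest: sum of per-triple hashes.
        return sum(hash(t) for t in self.triples)

    # Alias: both names are bound to the same function object.
    graph_digest = internal_hash

    def __eq__(self, other):
        if not isinstance(other, DigestGraph):
            return NotImplemented
        return self.graph_digest() == other.graph_digest()

    def __hash__(self):
        return self.graph_digest()
```

One caveat with this style of alias: a subclass that overrides internal_hash does not automatically change graph_digest, because the alias was bound when the class body ran.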
        self.hash_cache = {}

    def key(self):
        return (len(self.nodes),self.hash_color())
self.hash_cache and self._hash_color seem to be unused, and you are using a global variable _hash_cache. It seems that it is never cleared and is therefore a potential memory leak.
If you have some literature references, it would be good to add them as comments to the code. They can help a lot with understanding the code. There is a reference to another algorithm in the code:
As this is no longer correct, please remove it. Then I saw the following ancient code in
This looks horrible. It seems to create a complete copy of both graphs in memory. I think we should remove it, because it is definitely more expensive.
@@ -219,7 +482,7 @@ def to_canonical_graph(g1):
     deterministical MD5 checksums, correlated with the graph contents.
     """
     graph = Graph()
-    graph += _TripleCanonicalizer(g1, _md5_hash).canonical_triples()
+    graph += _TripleCanonicalizer(g1, _sha256_hash).canonical_triples()
_TripleCanonicalizer takes a hash function that behaves like the ones from hashlib. But _sha256_hash is different. Wouldn't it be correct to omit the second argument, such that the default (hashlib.sha256) is used? If yes, you can even remove the method _sha256_hash altogether.
Also, you should update the comment above, as it's no longer "MD5 checksums".
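To make the point about hashlib-style hash functions concrete, here is a small sketch. `TinyCanonicalizer` is a hypothetical class written for this illustration, not `_TripleCanonicalizer`; it assumes the hash function parameter is a hashlib constructor (called with no arguments, then fed via `update()`), so the default can simply be `hashlib.sha256`.

```python
import hashlib


class TinyCanonicalizer:
    """Hypothetical; shows only the hash-function parameter."""

    def __init__(self, triples, hashfunc=hashlib.sha256):
        self.triples = list(triples)
        # Any hashlib-style constructor works: hashlib.md5, hashlib.sha1, ...
        self.hashfunc = hashfunc

    def triple_digest(self, triple):
        h = self.hashfunc()  # fresh hash object per triple
        for term in triple:
            h.update(term.encode("utf-8"))
            h.update(b"\x00")  # separator so ("ab", "c") != ("a", "bc")
        return h.hexdigest()

    def canonical_digests(self):
        # Sorting makes the result independent of input order.
        return sorted(self.triple_digest(t) for t in self.triples)
```

With a default like this, callers that want SHA-256 pass nothing, and a wrapper with a different calling convention is not needed.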
…sure and default hash function.
The build timeouts for Python 3.2 don't seem specific to this PR: I re-ran the previously passing py32 test of the latest commit on master https://travis-ci.org/RDFLib/rdflib/builds/42335741 and it now shows the same timeout. I guess it's some temporary problem of py32 with Travis, or an updated lib, that causes this. As the other tests all pass, I'd say "merge" if no one disagrees.
    result += td.microseconds / 1000000.0
    return result


class runtime(object):
Are the runtime and call_count decorators needed for anything but the benchmark example?
Should they be private? (Classes are usually meant to be CamelCased according to PEP 8.)
…calization Conflicts: rdflib/compare.py


class Color:
    def __init__(self, nodes, hashfunc, color=(), hash_cache={}):
hash_cache={} is problematic in this case due to the way Python handles default values. Default values are created once, when the function is defined, and reused for every call. For mutable values, this results in effects like the following:
>>> def f(i, s={}):
... s[i] = 'foo'
... return s
...
>>> print(f(0))
{0: 'foo'}
>>> print(f(1))
{0: 'foo', 1: 'foo'}
>>> print(f(2))
{0: 'foo', 1: 'foo', 2: 'foo'}
You probably want to do something like this:
def __init__(self, nodes, hashfunc, color=(), hash_cache=None):
    if hash_cache is None:
        hash_cache = {}
color=(), on the other hand, is okay, because tuples are immutable.
Alternatively, you could also create a copy of the dict:
def __init__(self, nodes, hashfunc, color=(), hash_cache={}):
    hash_cache = dict(hash_cache)
This would mean that the object passed as the hash_cache parameter is never changed. Instead, a copy is created and modified later on.
Which alternative to choose of course depends on the use case.
Thanks, your first suggestion is what I intended. The idea is that the hash_cache is actually re-used by all colors within a canonicalization, obviating the need to re-hash existing colors. This code review has been very helpful!
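A minimal sketch of the sharing described here, assuming the cache maps a color tuple to its integer hash. This `Color` is a simplified, hypothetical version for illustration; the real class in rdflib/compare.py does more, and `hash_color` here is only a guess at the general shape.

```python
import hashlib


class Color:
    """Simplified illustration; not the class from the PR."""

    def __init__(self, nodes, hashfunc=hashlib.sha256, color=(), hash_cache=None):
        if hash_cache is None:
            hash_cache = {}  # fresh dict per instance unless one is passed in
        self.nodes = nodes
        self.hashfunc = hashfunc
        self.color = color
        self.hash_cache = hash_cache

    def hash_color(self):
        # Reuse a previously computed hash for an identical color tuple.
        if self.color in self.hash_cache:
            return self.hash_cache[self.color]
        h = self.hashfunc()
        for part in self.color:
            h.update(str(part).encode("utf-8"))
            h.update(b"\x00")
        digest = int(h.hexdigest(), 16)
        self.hash_cache[self.color] = digest
        return digest

    def key(self):
        return (len(self.nodes), self.hash_color())


# One cache per canonicalization run, passed explicitly to every color.
shared = {}
a = Color(["_:b1"], color=("knows", 1), hash_cache=shared)
b = Color(["_:b2"], color=("knows", 1), hash_cache=shared)
a.hash_color()
b.hash_color()  # served from the shared cache
```

Because the cache is created by the caller and handed to each color, it goes away with the canonicalization run instead of living in a module-level global.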
Excellent. One of my projects has unit tests that rely heavily on graph isomorphism tests. Granted, they are mostly trivial cases (the old algorithm worked too), but it makes me happy. Two points that are open (from my point of view) before we merge:
Anyway, I think we are quite close to merging this into master.
On question 1, I have tests for several nontrivial graphs, including this one: http://pallini.di.uniroma1.it/Introduction.html#lev5, which has a very tricky cross swap where one equitable partition contains two orbit partitions. This is the main use case that requires the sort of tree search shown in traces. I have other things like duplicate bnodes and bnode loops as well. As to question 2, I'm actually hoping that this implementation will be more readable than the original paper, which I had to work very hard to understand. I'm very curious about which points may be more or less confusing.
Okay. That answers my questions. @joernhees I agree that we can merge now.
RGDA1 for RDFlib. Big thanks to @jimmccusker
@jimmccusker, congratulations! Thank you very much for your hard work on this. It is always very important that results from computer science get incorporated into actual software, because otherwise we just keep reinventing square wheels. I have one remaining request: Once your work has been published, could you please make a pull request to update the literature reference? This would be awesome.
RGDA1 (RDF Graph Digest Algorithm 1) combines the traces graph labeling algorithm with the Sayers and Karp RDF graph digest algorithm. This algorithm will produce canonical cryptographically unique identifiers for any RDF graph.
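As a rough illustration of the digest half of that description, the sketch below combines per-triple SHA-256 hashes with addition modulo 2^256, which is one way to get a digest that does not depend on triple order. It is a simplified assumption-laden example, not the RGDA1 implementation: it treats triples as plain string tuples and ignores blank nodes entirely, which is the part the graph labeling step is there to handle. The names `triple_hash`, `graph_digest` and `MODULUS` are chosen for this sketch.

```python
import hashlib

MODULUS = 2 ** 256  # keeps the running sum the same width as one SHA-256 value


def triple_hash(triple):
    """Hash one (subject, predicate, object) tuple of strings to an integer."""
    h = hashlib.sha256()
    for term in triple:
        h.update(term.encode("utf-8"))
        h.update(b"\x00")  # separator between terms
    return int(h.hexdigest(), 16)


def graph_digest(triples):
    """Order-independent digest of a set of ground triples."""
    total = 0
    for t in set(triples):  # a graph is a set, so duplicates count once
        total = (total + triple_hash(t)) % MODULUS
    return total


g1 = [("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:c")]
g2 = list(reversed(g1))
same = graph_digest(g1) == graph_digest(g2)
```

Since addition is commutative, `same` does not depend on the order in which the triples are listed; graphs with blank nodes would first need canonical labels before a digest like this is meaningful.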