
Switch to different, stable hash algorithm in Bag #6723

Closed
itamarst opened this issue Oct 9, 2020 · 14 comments · Fixed by #10734
@itamarst commented Oct 9, 2020:

In #6640, it was pointed out that groupby() doesn't work on non-numeric keys. The issue: hash() is used to group, and hash() gives different results in different Python processes. The solution there was to set a fixed hash seed.
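For illustration, here's a minimal sketch of the underlying behavior (assumes a standard CPython install; not from the original report):

import os
import subprocess
import sys

# hash() of a str varies between interpreter runs unless PYTHONHASHSEED
# is pinned, which is exactly what breaks cross-process grouping.
cmd = [sys.executable, "-c", "print(hash('lalala'))"]
print(subprocess.check_output(cmd))  # varies run to run
print(subprocess.check_output(cmd))  # almost certainly different

env = {**os.environ, "PYTHONHASHSEED": "0"}
print(subprocess.check_output(cmd, env=env))  # stable
print(subprocess.check_output(cmd, env=env))  # same value again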

Unfortunately, Distributed has the same issue (dask/distributed#4141), and there it's harder to solve: e.g. a distributed worker started without a nanny has no good way to set a consistent seed.

Given that:

  1. Until now groupby was broken for non-numerics and no one noticed. So probably not a huge use case.
  2. It's hard to solve this with hash().

A better approach might be a different hash algorithm. For example, https://deepdiff.readthedocs.io/en/latest/deephash.html.

deephash seems promising:

from deepdiff import DeepHash

class CustomObj:

    def __init__(self, x):
        self.x = x

    def __repr__(self):
        return f"CustomObj({self.x})"


objects = [
    125,
    "lalala",
    (12, 17),
    CustomObj(17),
    CustomObj(17),
    CustomObj(25),
]

for o in objects:
    print(repr(o), "has hash", DeepHash(o)[o])

Results in:

$ python example.py 
125 has hash 08823c686054787db6befb006d6846453fd5a3cbdaa8d1c4d2f0052a227ef204
'lalala' has hash 2307d4a08f50b743ec806101fe5fc4664e0b83c6c948647436f932ab38daadd3
(12, 17) has hash c7129efa5c7d19b0efabda722ae4cd8d022540b31e8160f9c181d163ce4f1bf6
CustomObj(17) has hash 477631cfc0315644152933579c23ed0c9f27743d632613e86eb7e271f6fc86e4
CustomObj(17) has hash 477631cfc0315644152933579c23ed0c9f27743d632613e86eb7e271f6fc86e4
CustomObj(25) has hash 8fa8427d493399d6aac3166331d78e9e2a2ba608c2b3c92a131ef0cf8882be14

Downsides:

  1. This hashing approach, serializing the object to a string and then hashing the string, is probably much more expensive than hash().
  2. Custom __hash__ won't work.

That said, Bag has used hash() this way for at least four years, and I assume it's been broken with Distributed that whole time, so the second downside is probably less relevant. I'm not sure about the performance aspect, though.

@mrocklin commented:
cc @eriknw @jcrist who might have experience here

@itamarst commented:
Another, faster approach: don't support arbitrary objects; only explicitly support specific types, i.e. enums/int/float/str/tuple/bytes and presumably the NumPy numeric types, plus probably five other types I'm forgetting.
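A minimal sketch of that idea (hypothetical helper, not part of Dask; unsupported types simply raise):

import hashlib

def stable_hash(obj) -> str:
    # Dispatch on an explicit allowlist of types; every branch feeds
    # type-tagged bytes into a deterministic hash.
    if isinstance(obj, bool):  # must come before int (bool subclasses int)
        data = b"bool:" + str(obj).encode()
    elif isinstance(obj, int):
        data = b"int:" + str(obj).encode()
    elif isinstance(obj, float):
        data = b"float:" + obj.hex().encode()
    elif isinstance(obj, str):
        data = b"str:" + obj.encode("utf-8")
    elif isinstance(obj, bytes):
        data = b"bytes:" + obj
    elif isinstance(obj, tuple):
        data = b"tuple:" + b",".join(stable_hash(el).encode() for el in obj)
    else:
        raise TypeError(f"unsupported key type: {type(obj).__name__}")
    return hashlib.sha1(data).hexdigest()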

@jcrist commented Oct 12, 2020:

A different option would be to hijack the pickle infrastructure with a hash function. This has the benefit of supporting arbitrary Python types as long as they're pickleable (and all types used as results in Dask should be pickleable), while also performing acceptably on large objects. The overhead here is mostly in setting up the pickler per call, so small objects (ints, short strings) would be slower, while nested objects should be about the same. There are faster hash algorithms than those in hashlib (and we don't need a cryptographic hash here), but this was quick to write up.

import cloudpickle
import hashlib

class HashFil:
    """File-like sink that feeds everything written to it into a hash."""

    def __init__(self):
        self.hash = hashlib.sha1()

    def write(self, buf):
        self.hash.update(buf)
        return len(buf)

    def buffer_callback(self, buf):
        self.write(buf.raw())


def custom_hash(x):
    fil = HashFil()
    pickler = cloudpickle.CloudPickler(
        fil, protocol=5, buffer_callback=fil.buffer_callback
    )
    pickler.dump(x)
    return fil.hash.hexdigest()

Timing in IPython:

%timeit custom_hash([1, 2, 3, "hello"])
3.56 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Perf could be further improved by special casing a few common types (str, int, float, bytes, ...) and having a fast path for those.
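For instance, a hypothetical fast path wrapping the custom_hash() above might look like:

import hashlib

def fast_hash(x):
    # Special-case a few common types; exact type checks avoid
    # surprises from subclasses (e.g. bool is a subclass of int).
    if type(x) is str:
        return hashlib.sha1(b"str:" + x.encode("utf-8")).hexdigest()
    if type(x) is bytes:
        return hashlib.sha1(b"bytes:" + x).hexdigest()
    if type(x) is int:
        return hashlib.sha1(b"int:" + str(x).encode()).hexdigest()
    return custom_hash(x)  # generic pickle-based fallback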

@itamarst commented:
The worry about pickle is that some implementations might use non-deterministic iteration, e.g. just iterating over an internal dict... and then we're back to the same problem, albeit in a much smaller set of cases.
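A small illustration of the order-dependence that makes this worrying:

import pickle

# Equal dicts built in different insertion orders pickle to different
# bytes, so a pickle-based hash could disagree on equal values.
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}
assert a == b
print(pickle.dumps(a) == pickle.dumps(b))  # False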

@itamarst commented:
Although... dict iteration is actually deterministic these days, isn't it? So maybe it's fine.

@jsignell added the bag label Oct 13, 2020
@jsignell commented:
The joblib solution seems like a good one. There is also some discussion of hashing going on in #6844 and #6843.
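For reference, joblib exposes this as a single function (pickle-based traversal with NumPy special-casing), and its output is stable across processes:

import joblib

print(joblib.hash(("a", 1, None)))                       # md5 hex digest by default
print(joblib.hash({"x": [1.5, "y"]}, hash_name="sha1"))  # or sha1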

@cisaacstern commented Dec 9, 2023:

Thanks all for your work on this so far! Reviving this thread as today @TheNeuralBit discovered that it is the root cause of apache/beam#29365 (thanks Brian!). To summarize the context there, and my motivations:

  1. Apache Beam is a framework for embarrassingly-parallel data processing, which flexibly deploys to various "runners"
  2. Pangeo Forge leverages Beam for cloud-optimization of scientific data (I am a core developer on this project)
  3. Beam can currently deploy to a minimal DaskRunner. This runner is implemented with Dask Bag, however, so the groupby issues discussed here prevent it from grouping on string or other non-numeric keys.[1]
  4. From my perspective (along with many other Pangeo Forgeans), if this DaskRunner were developed into a fully-functional state, it would serve as an ideal onramp for both (a) existing Dask users to run Beam pipelines; and (b) existing Beam users (of which there are a lot!) to become Dask cluster enthusiasts![2]

So with that background out of the way, I'd love to re-open the discussion here as to what folks see as the most viable path forward.

I would gladly contribute to or lead a PR on this topic (or defer to someone else who is motivated to do so), but it looks like the first step is developing a bit more consensus about the path forward.

P.S. @jacobtomlinson and I plan to present a bit on the DaskRunner at the upcoming Dask Demo Day, and I will update our draft presentation to include a mention of this issue.

Footnotes

  1. String keys are used in Beam's groupby tutorial, and other non-numeric keys are also used internally throughout Beam, so this is a non-negotiable if the DaskRunner is to mature into a fully-fledged Beam deployment target.

  2. Beam currently lacks any comparably easy-to-deploy, performant target for running Python-based pipelines on a single machine (their single-machine runner is explicitly described as being for testing data serialization only, and does not scale), on HPC, or frankly even on AWS. (The next-best AWS option is Apache Spark or Flink, but these come with significant (ahem, Java) overhead and are non-trivial for even highly competent Pangeo/Python users to navigate.)

@itamarst commented Dec 9, 2023:

Notes on joblib.hashing

joblib implements the recursive Python object traversal using pickle, with a special case for NumPy: https://github.com/joblib/joblib/blob/master/joblib/hashing.py. Some problems:

Notes on the hash algorithm (separate from the traversal method)

joblib uses md5/sha, which are likely somewhat slower (they're cryptographic, which isn't necessary here) but might also be fast enough. There are hash algorithms specifically designed for stability across machines and time, often aimed at very fast hashing of larger amounts of data where the output will be used as a key in persistent storage or distributed systems. HighwayHash is a good example: https://github.com/google/highwayhash#versioning-and-stability ("input -> hash mapping will not change"); I'm not sure if it takes a seed, but if so we'd just need to fix one. There are other alternatives as well.
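As a stdlib-only illustration of the fixed-seed idea (not HighwayHash itself, which needs a third-party binding), keyed BLAKE2 behaves this way:

import hashlib

# The key plays the role of a fixed seed; the input -> digest mapping
# is then stable across machines, processes, and Python versions.
digest = hashlib.blake2b(b"some bytes", key=b"fixed-seed", digest_size=8)
print(digest.hexdigest())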

Another, perhaps unworkable option: convert Python hash() to a stable hash

Given the seed, which you can extract from CPython internals, it might be mathematically possible to take the output of hash(obj) and turn it into a stable value by undoing the impact of the seed.

Another option: a deterministic/canonical serialization format.

Some serialization formats are specifically designed to be deterministic for a given input, in order to allow cryptographic hashing/signatures and comparison. E.g. https://borsh.io/ is one, with a Python wrapper at https://pypi.org/project/borsh-python/. Unfortunately it requires a schema... so you'd need a self-describing (i.e. messages don't require a schema) deterministic serialization format. Deterministic CBOR is in a draft spec and has at least a Rust implementation: https://docs.rs/dcbor/latest/dcbor/, so that might be a possibility. For best performance what you really want is a streaming self-describing deterministic serializer; dcbor, at least, isn't streaming.

Given a deterministic serialization, you can then just use a hash function of your choice (md5, HighwayHash, whatever) to get the hash.
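As a toy stand-in for such a format, canonical JSON (sorted keys, fixed separators) plus a hash of choice shows the shape of the idea, though it only covers JSON-able types:

import hashlib
import json

def canonical_hash(obj) -> str:
    # Deterministic serialization first, then any hash function.
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.md5(data).hexdigest()

print(canonical_hash({"b": 1, "a": [None, 2.5]}))  # stable everywhere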

@TheNeuralBit commented Dec 9, 2023 via email

@cisaacstern commented:
Finding a solution here is still worthwhile IMO, though it will presumably unfold over a somewhat longer timescale.

In the meantime, I'm pleased to report that dask/distributed#8400 does unblock many use cases, including the linked Beam issue.

@mrocklin commented Dec 16, 2023 via email

@cisaacstern commented Dec 21, 2023:

"dask/distributed#8400 does unblock many use cases, including the linked Beam issue"

I spoke a bit too soon here.

As described at the bottom of apache/beam#29802 (comment), while dask/distributed#8400 did resolve non-deterministic hashing of strings, I subsequently came to understand that it does not resolve non-deterministic hashing of None. hash(None) is deterministic in Python 3.12, but AFAICT there is no way to make it deterministic in Python < 3.12.

hash(None) being non-deterministic in Python < 3.12 means that any otherwise-hashable container type that contains None will also hash non-deterministically between Python processes, and will therefore be unusable as a key for Bag.groupby in a distributed context. This includes, in the simplest case, (None,), but also extends to frozen dataclasses, etc.
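To make the failure mode concrete (behavior on CPython <= 3.11):

# hash(None) is derived from the object's address on <= 3.11, and
# PYTHONHASHSEED does not affect it; containers inherit the problem.
print(hash(None))     # differs from process to process
print(hash((None,)))  # therefore this tuple's hash differs too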

This happens to be the next blocker for me on my motivating Beam issue, but apart from that specific case, I believe this is sufficiently problematic as to warrant moving forward with a fix for this issue.

In terms of a specific proposal, I'm curious about others' thoughts on leveraging Dask's existing tokenize function. This would seem to have many advantages.

There would also be an opportunity to fast-path types that are known to hash deterministically (e.g. ints). Here's a draft PR that sketches out this idea for discussion purposes: #10734.
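A quick look at what that could mean in practice (dask.base.tokenize is the existing function; exact determinism guarantees depend on the Dask version):

from dask.base import tokenize

# tokenize() already produces deterministic tokens for common builtin
# and NumPy types; it is what Dask uses to build task keys.
print(tokenize((None, "a", 1)))
print(tokenize({"x": [1, 2, None]}))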

@cisaacstern commented:
Ping for those following here: I've just opened my proposed fix for this issue, #10734, for review.
