Can we get rid of sharding? #7824

Open

crusaderky opened this issue May 10, 2023 · 2 comments

Comments

@crusaderky
Collaborator

crusaderky commented May 10, 2023

The distributed.comm.shard setting, which defaults to 64 MiB, is supposed to split a buffer that is larger than the shard size into separate buffers. The intent is to prevent issues that commonly happen beyond 2 GiB with various compression and network protocols.
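Conceptually, sharding just slices an oversized frame into views no larger than the shard size. A minimal sketch of the idea (illustrative only; `shard_frame` is a hypothetical name, not distributed's actual implementation):

```python
def shard_frame(frame, shard_size=64 * 2**20):
    """Split a buffer into zero-copy slices of at most shard_size bytes.

    Illustrative sketch only, not distributed's real sharding code.
    """
    mv = memoryview(frame).cast("B")
    return [mv[i:i + shard_size] for i in range(0, mv.nbytes, shard_size)]

# A 128 MiB buffer becomes two 64 MiB memoryviews (no data is copied).
buf = bytearray(2**27)
print([p.nbytes for p in shard_frame(buf)])  # [67108864, 67108864]
```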

This setting does nothing for dask.dataframe:

```python
>>> import numpy
>>> import pandas
>>> from distributed.protocol import serialize_bytelist
>>> a = numpy.random.random((2**23, 2))  # 128 MiB
>>> frames = serialize_bytelist(a)
>>> [memoryview(f).nbytes for f in frames]
[32, 236, 67108864, 67108864]
>>> frames = serialize_bytelist(pandas.DataFrame(a))
>>> [memoryview(f).nbytes for f in frames]
[40, 123, 554, 134217728, 0]
```

This should be a straightforward bug to fix. However, it also strongly suggests that the whole system solves a purely hypothetical problem: no dask.dataframe user, to my knowledge, has ever complained about crashes caused by oversized frames.

Do we have an inventory of protocols that break beyond 2 GiB, and/or create temporary deep copies of whole buffers while they work on them?

Sharding causes major performance problems with compression. Any algorithm that doesn't support a decompress_into function (currently all supported algorithms; the only ones I know of that do are cramjam and blosc2) is plagued by deep copies and memory spikes, because each shard decompresses into its own temporary buffer that must then be copied back into one contiguous frame.
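To illustrate the deep-copy problem, here is a hedged sketch using zlib (which, like the currently supported algorithms, lacks a decompress_into API). Each shard decompresses into a temporary buffer, and reassembling the shards copies the entire payload again; an API like cramjam's or blosc2's decompress_into could instead write each shard straight into a preallocated output buffer:

```python
import zlib

SHARD = 2**20  # hypothetical 1 MiB shard size for this demo
data = bytes(range(256)) * (2**22 // 256)  # 4 MiB payload
shards = [zlib.compress(data[i:i + SHARD]) for i in range(0, len(data), SHARD)]

# Without decompress_into: one temporary bytes object per shard, plus a
# full extra copy of the whole payload when the shards are joined.
tmp = [zlib.decompress(s) for s in shards]
joined = b"".join(tmp)  # deep copy of all 4 MiB
assert joined == data

# With a decompress_into-style API, each shard would land directly in a
# preallocated buffer; emulated here by writing into bytearray slices.
out = bytearray(len(data))
pos = 0
for s in shards:
    chunk = zlib.decompress(s)  # a real decompress_into would skip this temporary
    out[pos:pos + len(chunk)] = chunk
    pos += len(chunk)
assert bytes(out) == data
```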

Sharding also adds complexity to the already bloated serialization layer.

CC @milesgranger

@jacobtomlinson
Member

jacobtomlinson commented May 10, 2023

cc @pentschev @quasiben @madsbk

@pentschev
Member

UCX has an internal mechanism for this, controlled via the UCX_TCP_{RX,TX}_SEG_SIZE environment variables, so I don't believe sharding is of much relevance for UCX. I'm also not aware of any issues with messages larger than 2 GiB, with or without UCX, but to be fair I rarely see dataframes (or rather series) larger than a couple hundred MB being transferred in Dask, so I don't know how common large transfers are in the wild.
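For reference, UCX's segmentation is tuned through environment variables rather than through Dask configuration. The values below are purely illustrative, not recommendations:

```shell
# Illustrative only: cap UCX's internal TCP segment sizes.
# UCX accepts memory-unit suffixes for these variables.
export UCX_TCP_RX_SEG_SIZE=8mb
export UCX_TCP_TX_SEG_SIZE=8mb
```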
