WIP: Concurrent block reads and writes #534
Closed
This commit adds support for optional concurrent executors to allow concurrently
reading or writing blocks to the underlying store. For stores where the IO is
expensive, for example an S3Map, allowing concurrent reads can be a massive
performance improvement.
The API is designed around Executor objects to allow safe composition. An
executor may be threaded to all of the places where concurrency is desired, both
inside of Zarr itself and in the user's code, to ensure that a shared thread
pool is used and no more than the desired number of threads are launched at the
same time. Executors would allow users to safely read or write blocks
concurrently from different Zarr Array objects at the same time, without
worrying about accidentally spawning too many threads.
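
The shared-executor idea can be sketched as follows. This is a minimal illustration only, not the API in this PR: `SlowStore` and `read_blocks` are hypothetical names, and the store is a plain dict with an artificial delay standing in for an expensive backend such as an S3Map.

```python
from concurrent.futures import ThreadPoolExecutor
import time


class SlowStore(dict):
    """Hypothetical stand-in for a store with expensive IO (e.g. an S3Map)."""

    def __getitem__(self, key):
        time.sleep(0.01)  # simulate network latency on every block read
        return super().__getitem__(key)


def read_blocks(store, keys, executor=None):
    """Read the given block keys, serially or through a caller-supplied executor.

    Hypothetical helper: passing the same executor to every call bounds the
    total number of threads, no matter how many stores are being read.
    """
    if executor is None:
        return [store[k] for k in keys]
    # Executor.map returns results in the same order as the input keys.
    return list(executor.map(store.__getitem__, keys))


if __name__ == "__main__":
    store_a = SlowStore({f"a.{i}": bytes([i]) for i in range(16)})
    store_b = SlowStore({f"b.{i}": bytes([i]) for i in range(16)})
    # One pool shared by both reads: at most 4 threads are ever running.
    with ThreadPoolExecutor(max_workers=4) as pool:
        blocks_a = read_blocks(store_a, list(store_a), executor=pool)
        blocks_b = read_blocks(store_b, list(store_b), executor=pool)
    print(len(blocks_a), len(blocks_b))
```

Because the executor is an argument rather than something the helper creates, the caller decides the thread budget once and reuses it everywhere.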
I am opening this PR early, before adding tests or more documentation, to get feedback on this proposed API and to discuss the implications of allowing concurrent access to blocks in a zarr array. I didn't want to invest too much time before getting your feedback, to make sure this is something that the zarr developers are interested in.
This API doesn't necessarily allow users to do something new, because they could add concurrency on top of the existing interfaces, but adding this interface makes it much easier to align your reads and writes on chunk boundaries without needing to know as much about the implementation of Zarr. This is especially true for integer indexers and boolean mask indexers.
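
To illustrate why chunk alignment is awkward to do by hand for integer indexers, here is a rough sketch of the bookkeeping for the one-dimensional case. It is not Zarr's indexer implementation; `group_indices_by_chunk` is a hypothetical helper and a fixed chunk length is assumed.

```python
from collections import defaultdict


def group_indices_by_chunk(indices, chunk_len):
    """Group 1-D integer indices by the chunk that holds them.

    Hypothetical helper: returns {chunk_number: [offsets within that chunk]},
    so each chunk can be fetched once (possibly concurrently) and then indexed.
    """
    groups = defaultdict(list)
    for i in indices:
        groups[i // chunk_len].append(i % chunk_len)
    return dict(groups)


# Indices 0 and 5 fall in chunk 0, 12 and 13 in chunk 1, and 27 in chunk 2.
print(group_indices_by_chunk([0, 5, 12, 13, 27], chunk_len=10))
# → {0: [0, 5], 1: [2, 3], 2: [7]}
```

A user adding concurrency on top of the existing interfaces would need to repeat this kind of grouping for every indexer type and for every dimension.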
One major downside of this API is that it makes it easier for a user to access a non-thread-safe store in a multithreaded context. A user may naively use a `ThreadPoolExecutor` when it is not safe to do so and then get unexpected crashes. Depending on the store, this could also cause data corruption that is hard to debug (or even notice). Another issue is that users may open many duplicate issues about this, which adds a maintenance burden for the zarr developers.

TODO: