async in zarr #536
Comments
Have been wondering the same thing. Thanks for this nice write-up and performing this experiment. 😄 |
This is supported in zarr.js, for the same reasons as you wrote here. Here's the relevant PR and issue, which also explain why it was a good idea (with some timings for less pathological cases, where even there the speedup can be large). Pinging @manzt, who implemented the above functionality; the JS implementation follows Python pretty closely here, so a similar implementation approach may be possible |
Thanks for pinging me! |
That is a crazy speedup, amazing.
I'd be very grateful if someone could give me an intuition for why it's so
much faster. Is it that multiple http requests can basically be opened
concurrently? If so, is there an upper bound on the number of requests that
can be active concurrently, and would we need to manage that ourselves?
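To sketch an answer to the question above (this is not zarr code, just an illustrative stand-in): yes, with asyncio the requests are opened concurrently rather than one after another, and the number in flight can be bounded explicitly with an asyncio.Semaphore, so the client code chooses the limit (aiohttp's connector also caps total connections, 100 by default). The fetch function here simulates network latency instead of doing real HTTP:

```python
import asyncio

async def fetch_chunk(key, semaphore):
    # stand-in for an HTTP GET of one chunk; the semaphore caps
    # how many requests are in flight at any moment
    async with semaphore:
        await asyncio.sleep(0.01)  # simulate network latency
        return f"data-{key}"

async def fetch_all(keys, limit=8):
    semaphore = asyncio.Semaphore(limit)
    tasks = [fetch_chunk(k, semaphore) for k in keys]
    # gather preserves input order even though fetches overlap
    return await asyncio.gather(*tasks)

results = asyncio.run(fetch_all(range(20)))
print(results[:3])  # → ['data-0', 'data-1', 'data-2']
```

The semaphore is created inside the coroutine so it binds to the running event loop; only 8 of the 20 simulated fetches wait at once.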
|
I'm not that familiar with Python's asyncio. In the example above,

```python
async with aiohttp.ClientSession() as session:
    tic = time.time()
    futures = [get_chunk_http_async(n, session) for n in range(za.shape[0] // za.chunks[0])]
    all_data = await asyncio.gather(*futures)
    print(f"{time.time() - tic} seconds")
```

To handle concurrency, in zarr.js we use a dependency called p-queue:

```js
const queue = new PQueue({ concurrency: concurrencyLimit });
for (const proj of indexer.iter()) {
  const chunkValue = this.getChunkValue(proj, indexer, value, selectionShape); // returns promise
  queue.add(() => this.chunkSetItem(proj.chunkCoords, proj.chunkSelection, chunkValue)); // returns promise that queue resolves
}
// guarantees that all work on queue has finished
await queue.onIdle();
```

One concern I have here is that the API for accessing a slice of an array in zarr.js is naturally awaitable, since everything in JavaScript is async. We need to wrap our requests in async functions and await the responses. This in zarr-python,

```python
arr = za[:]
```

is the same as this in zarr.js,

```js
const arr = await za.get(null);
```

regardless of the store type. Ideally you'd hide all the async fetching from the end user and have the same API independent of whether the array is on disk or over HTTP. Someone smarter than me probably knows if this is possible. |
To Alistair's point, if this is just about requests, I wonder if simply a Store leveraging asyncio internally would be sufficient to give you the same benefits |
This is what I had in mind. I think the main performance benefit is when reading / writing many smallish chunks concurrently. We definitely don't want to force zarr users to use async. Maybe a store could declare itself to use async for certain operations, and then zarr would know to use asyncio when performing loops with I/O operations. Otherwise it would default to the current, synchronous behavior. I think there are cases where asyncio would hurt us. I did a few tests with files and found that, in some cases, asyncio was slower. I assume this is because issuing serial reads is more efficient than concurrent reads for some disks. |
I wonder if this would be a good opportunity to add a batch retrieval method to the store API. E.g., something like store.multi_get(keys) -> iterable of values. This would be an optional method, which if not implemented would mean zarr falls back to current behaviour of calling __getitem__ multiple times. If it was implemented, then the store internally could choose to implement via asyncio if appropriate, or not. I.e., the zarr core module would not need to be aware of async, it would be entirely internal to the store implementation. Also, stores could leverage this in other ways. E.g., if a store was wrapping a database, it could batch together all the key requests into a single transaction. Or, e.g., if the store was wrapping a cloud service that provided some API for batching multiple requests into a single HTTP call, it could use that. Just a thought.
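The proposal above can be sketched in a few lines. This is a hypothetical shape for the API, not existing zarr code: multi_get is the name proposed in this comment, and the caller simply probes for it and falls back to repeated __getitem__:

```python
class DictStore:
    # minimal mapping-style store without batch retrieval
    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        return self._data[key]

class BatchStore(DictStore):
    # store that opts in to batch retrieval; internally it could
    # use asyncio, a DB transaction, or a bulk HTTP call
    def multi_get(self, keys):
        return [self._data[k] for k in keys]

def get_many(store, keys):
    # the core module stays unaware of async: it only checks for the
    # optional batch method and falls back to __getitem__ otherwise
    if hasattr(store, "multi_get"):
        return list(store.multi_get(keys))
    return [store[k] for k in keys]

data = {"0.0": b"a", "0.1": b"b"}
assert get_many(DictStore(data), ["0.0", "0.1"]) == [b"a", b"b"]
assert get_many(BatchStore(data), ["0.0", "0.1"]) == [b"a", b"b"]
```

Both store classes produce identical results; only the batching strategy differs, which is the point of keeping the method optional and internal to the store.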
|
Yeah this came up in issue ( #384 ). Agree this would be a good solution to this problem (amongst others). |
@alimanfoo would you envision the stores' API being synchronous, and just using asyncio under the hood for multi-gets? I've found asyncio quite difficult to work with because it "poisons" the entire stack above it, forcing async/await annotations throughout the pipeline. You can get the event loop and resync inside a function call, but it's not pretty. |
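For reference, the "resync inside a function call" pattern being discussed looks roughly like this (a sketch with made-up names, not zarr code): the store's public method is synchronous and spins up an event loop internally, so async never leaks into the call stack above it:

```python
import asyncio

class HttpLikeStore:
    async def _aget_many(self, keys):
        # stand-in for concurrent HTTP fetches of several chunks
        async def one(key):
            await asyncio.sleep(0.01)  # simulate network wait
            return key.upper()
        return await asyncio.gather(*(one(k) for k in keys))

    def get_many(self, keys):
        # synchronous facade: callers never see a coroutine
        return asyncio.run(self._aget_many(keys))

store = HttpLikeStore()
print(store.get_many(["a", "b"]))  # → ['A', 'B']
```

The wart alluded to above: asyncio.run raises if it is called from a thread that already has a running event loop, so this facade breaks when the caller is itself async, which is part of why the pattern is considered ugly.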
Yes that was my thought. |
If we just want to parallelise network requests, we don't need to go full asyncio, right? We could do thread-based parallelism. Not fully convinced we need to buy into the whole asyncio package just to parallelise a few network requests. |
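A thread-based version of the same idea, as a sketch (the fetch function is a placeholder, not a real request): blocking network calls release the GIL while waiting, so a small thread pool overlaps them without any asyncio machinery:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(key):
    # placeholder for a blocking HTTP GET of one chunk
    time.sleep(0.01)  # simulate network latency
    return f"data-{key}"

keys = list(range(20))
with ThreadPoolExecutor(max_workers=8) as pool:
    # map preserves input order even though fetches overlap in time
    results = list(pool.map(fetch_chunk, keys))

assert results == [f"data-{k}" for k in keys]
```

With 8 workers the 20 simulated fetches take roughly three latency periods instead of twenty, without touching async/await anywhere in the calling code.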
That's a good point. We can probably live with something simpler. TBH, whether the mapping happens with async, threads, or otherwise could just be specified as part of the API of this function. |
Async and parallel are not the same thing. From my point of view, I use dask to handle actual parallel I/O. The example I posted here is one where I don't actually want dask involved. It's a tiny amount of data. When using xarray with zarr, the dimension coordinates are always read eagerly and without dask. That's the use case that motivated this issue. Furthermore, I would worry about any internal multithreading within zarr conflicting with dask's own parallelism. This would have to be handled carefully. |
I'm very far from expert in async programming, but my sense is that Ryan
makes an important point here. There may be an important difference between
how asyncio behaves and how a simple multithreaded program behaves. Throw
in the GIL and it gets more complicated. And being mindful of dask is
important, when using dask we don't want a situation where the user runs a
parallel program with 8 dask workers for the 8 cores on their machine, but
each dask worker runs a zarr read which tries to use 8 threads, this leads
to terrible thrashing of I/O and probably other resources. Although of
course not everyone is using dask.
FWIW I would suggest these are all implementation choices that can be made
by a store class. E.g., you could have a store that implements asyncio. Or
you could have a store class which implements multithreaded I/O (although
not recommended for use with dask). I would suggest we don't expose any
such parameters in the multiget() method, rather it's a choice of which
store class to use (or maybe parameters to the store class constructor).
|
I ran across this thread because I was having some very slow write performance for small data using gcsfs |
@nbren12 - thanks for sharing! I would personally encourage us to be trying to make upstream performance improvements to gcsfs. My limited experience with these cloud filesystem-spec implementations suggests that they do a lot of unnecessary checking / caching (e.g. fsspec/s3fs#285). Are your 4x performance enhancements due to async, or something else?
I believe this is what is being worked on in #534. |
I'm definitely open to contributing this to gcsfs, but just wanted to put it out in the wild, since I personally just need the mapping capability, not the "fs". I believe the performance enhancement is due to async from examining the debugging level output for the example code in the README. This output shows the HTTP requests made for each call. fsspec does about 10 HTTP operations each taking about .5s to complete in sequence. OTOH, the async code I wrote seems to complete these same requests...well...asynchronously. I didn't notice any gratuitous list operations like #534, and the directory only had about 2 items in it to begin with. |
Here's an example of this output: https://gist.github.com/nbren12/10bed8494d067c3ff6c880c368561602. Part of the problem is that gcsfs seems to be using resumable uploads (requiring 3 HTTP requests for each chunk), but I still suspect async is speeding things up dramatically. The chunk-uploading only takes 0.3-1 seconds with async, but 13 seconds w/o. |
Very interesting, thanks for sharing. Pinging @martindurant for the gcsfs perspective. |
Thanks @nbren12 for sharing. There isn't support for |
Note that |
Quick comment on the resumable upload in gcsfs: s3fs has implemented a shortcut to defer creation of the resumable upload and use a one-call version when the total file size is small enough (<5MB). The method for the single-call version exists in GCSFile, but the shortcut to execute it is not in place. It could be copied from s3fs. |
Yah it would be great to have generic async support in fsspec. I expect async will have a lot less overhead than a dask worker for small data, but don't have the benchmarking to back this up. |
Async and dask are also not incompatible. We can imagine a use case where we have many zarr chunks for each dask chunk. Using async in zarr would speed up the load time for each dask chunk.
|
Yah. I will just echo Ryan. Basically this is similar to my workflow, dask distributed for coarse-grained parallelism, and then async for faster I/O. |
Dask just indexes into the array with normal __getitem__ calls. Alternatively, it looks like dask can work with asyncio, so you could feasibly use the same event loop to prevent just forking and forking and forking. Then it's up to the executor to decide how many jobs to run concurrently; zarr using asyncio just defines the dependency graph of the operation, not how it's executed. If possible, exposing an async Array API would be great, and seems to be doable with sync_to_async / async_to_sync wrappers:

```python
# has a member called `aio` with async API
array = Array(...)

# - get chunk idxs
# - do `await sync_to_async(store.__getitem__)(chunk_key)` for each
# - assemble chunks into output array and return
out = await array.aio[:]

# - basically just calls async_to_sync(self.aio.__getitem__)(key)
out2 = array[:]
```

There may be some overhead in doing this, of course. |
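The async_to_sync wrapper referenced in the sketch above can be written with the stdlib alone. This is a hypothetical minimal helper (not zarr code, and simpler than production implementations such as asgiref's, which also handle already-running loops):

```python
import asyncio
import functools

def async_to_sync(coro_fn):
    # wrap an async function so plain synchronous code can call it;
    # only valid when no event loop is already running in this thread
    @functools.wraps(coro_fn)
    def wrapper(*args, **kwargs):
        return asyncio.run(coro_fn(*args, **kwargs))
    return wrapper

async def aget(key):
    # stand-in for an async chunk fetch
    await asyncio.sleep(0)
    return f"chunk-{key}"

get = async_to_sync(aget)
assert get("0.0") == "chunk-0.0"
```

This keeps a single async implementation as the source of truth, with the sync API as a thin shim, which is the layering the comment above proposes for the Array class.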
This captures my understanding very clearly. |
Of course not! I only mention that this is probably the case that you are most interested in. |
I believe it is easier to create an async API and syncify it than the opposite. The draft of spec v3 is (so far) completely async and exposes a sync API, like what @martindurant did in fsspec. I think it's more a question of laying down the groundwork for zarr to be async than making it async and having all the internal API be async. |
From the point of view of fsspec (not zarr), the availability of the sync API is essential, both because that is the only one currently available, and because some backends will not be async-able. When that is done, zarr may choose to first use a sync multiget with internal async, and then move to fully async; or whatever works. Is there anything missing in the HTTP POC for zarr to be able to make use of both of these avenues? |
Update:
|
What's the current status of an implementation that calls getitems? |
Currently, zarr calls getitem on each thing it needs, in sequence. To make use of my getitems and any concurrency, the code would need to be changed. I'm not aware of anyone having done this. |
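To make the store-side half of this concrete, here is a sketch (simulated store, hypothetical class name; getitems is the batch method discussed in this thread): the per-key loop becomes a single getitems call, and the store fetches its batch concurrently behind a synchronous facade:

```python
import asyncio

class AsyncBatchStore:
    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        # current code path: one lookup at a time, in sequence
        return self._data[key]

    def getitems(self, keys):
        # sync facade over concurrent fetches, so the calling code
        # never has to know asyncio is involved
        async def fetch(key):
            await asyncio.sleep(0.01)  # simulate per-request latency
            return key, self._data[key]

        async def gather_all():
            return dict(await asyncio.gather(*(fetch(k) for k in keys)))

        return asyncio.run(gather_all())

store = AsyncBatchStore({f"x/{i}": i for i in range(5)})
chunks = store.getitems([f"x/{i}" for i in range(5)])
assert chunks == {f"x/{i}": i for i in range(5)}
```

The change described in the comment above is then on the calling side: zarr would need to hand the whole list of chunk keys to getitems instead of looping over __getitem__.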
I think this is still of interest. Is this something that you would be willing to contribute, @bilts? 🙂 |
@jakirkham I would be / have been willing. I have a prototype implementation in a fork mentioned above for people to try it out. It's proven way, way faster for our needs. I think making a more robust change may require someone more able to make choices about zarr-python's internal organization than I am. |
Hurray! |
That's probably easier/faster to address through the normal review process in a PR 😉 |
I believe we can now close this issue, now that #630 and #606 have both been merged! Just to close the loop, I just did a simple benchmarking exercise from Pangeo Cloud in Google Cloud.

```python
import zarr
import os
import fsspec
import gcsfs
import numpy as np
import uuid

# get a temporary place to write to GCS
uid = uuid.uuid1()
path = os.environ['PANGEO_SCRATCH'] + str(uid) + '.zarr'
gcs = gcsfs.GCSFileSystem()
mapper = gcs.get_mapper(path)

# create a zarr array with many small chunks
shape = (100, 1000)
chunks = (1, 1000)
arr = zarr.create(shape, chunks=chunks, store=mapper)

# time write
%time arr[:] = 9
# time read
%time _ = arr[:]
```

My test environment included the following versions:
My old (pre-async) environment was
Here is a comparison of the speeds
This is a fantastic improvement, particularly for reading (over 20x)! 👏 👏 👏 for all the effort by @martindurant and the other devs who helped make this happen! |
One thing I'll add: parallel fetching of chunks in Zarr (as is supported here by |
@shoyer I have a question about this. I am currently fetching 500 GB to 2 TB chunked datasets (~1MB chunks), running on a 48 physical / 96 logical core machine. When I recreate the cluster with more workers it scales: for example, 8 workers with 12 threads each scales up to 8x the performance of 1 worker, but is still much slower than 48 workers. So it only seems to scale up with processes, not threads? I was wondering if I am doing something wrong? It sounds like the thread-only client is NOT fetching multiple parts / chunks at the same time. I am not using any
|
@tasansal can you please raise a new issue? Also please include a minimal reproducer |
@jakirkham, of course, wasn't sure if it should be an issue of its own. Will do that. |
Thanks! It just makes it easier for us to keep track of these things :) |
I think there are some places where zarr would benefit immensely from some async capabilities when reading and writing data. I will try to illustrate this with the simplest example I can.
Let's consider a zarr array stored in a public S3 bucket, which we can read with fsspec's HTTPFileSystem interface (no special S3 API needed, just regular HTTP calls). Note that this is a highly sub-optimal choice of chunks: the 1D array of shape (6443,) is stored in chunks of only (5,) items, resulting in over 1000 tiny chunks. Reading this data takes forever, over 5 minutes.
I believe fsspec is introducing some major overhead by not reusing a connection pool. But regardless, zarr is iterating synchronously over each chunk to load the data:
zarr-python/zarr/core.py
Lines 1023 to 1028 in 994f244
As a lower bound on how fast this approach could be, we bypass zarr and fsspec and just fetch all the chunks with requests:
As expected, reusing a connection pool sped things up, but it still takes 100 s to read the array.
Finally, we can try the same thing with asyncio
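The original snippets did not survive extraction here; as a dependency-free stand-in illustrating why the asyncio version wins (the real code used aiohttp for the HTTP calls, as shown in a comment below), the latency value and chunk count are illustrative and the network wait is simulated:

```python
import asyncio
import time

LATENCY = 0.02  # pretend each chunk request spends 20 ms waiting on the network

async def get_chunk(i):
    await asyncio.sleep(LATENCY)  # stand-in for an awaited HTTP GET
    return i

async def serial(n):
    # one request at a time: total time is roughly n * LATENCY
    return [await get_chunk(i) for i in range(n)]

async def concurrent(n):
    # all requests in flight at once: waits overlap
    return await asyncio.gather(*(get_chunk(i) for i in range(n)))

n = 50
tic = time.perf_counter()
asyncio.run(serial(n))
t_serial = time.perf_counter() - tic

tic = time.perf_counter()
asyncio.run(concurrent(n))
t_concurrent = time.perf_counter() - tic

# the concurrent version approaches a single latency period
assert t_concurrent < t_serial
```

The serial loop pays the full latency once per chunk, while gather pays it roughly once in total, which is the shape of the speedup reported in this issue.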
This is a MAJOR speedup!
I am aware that using dask could possibly help me here. But I don't have big data here, and I don't want to use dask. I want zarr to support asyncio natively.
I am quite new to async programming and have no idea how hard / complicated it would be to do this. But based on this experiment, I am quite sure there are major performance benefits to be had, particularly when using zarr with remote storage protocols.
Thoughts?
cc @cgentemann