
Pyodide & Zarr #43

Open
jakirkham opened this issue Feb 17, 2022 · 30 comments
@jakirkham
Member

Learned recently that Pyodide is including Zarr, which is great to hear 😄 Also happy to learn folks are able to build Numcodecs.

Would be interested to learn more about how folks are using Pyodide & Zarr together. So opening this issue for that discussion.

@jakirkham
Member Author

cc @rth (in case you have insights here 🙂)

@martindurant
Member

I wonder how many fsspec implementations run in Pyodide? Perhaps none that use async...
Maybe we need a zarr JS frontend less than we thought :)

@jakirkham
Member Author

I didn't think async was an issue, but threads and sockets are ( dask/dask#7764 (comment) ). IDK how requests would work there, but it seems some effort may be needed ( pyodide/pyodide#1956 ), which seems like the bigger blocker to using fsspec with Pyodide.

Yeah this came up in the discussion about Zarr.js ( #24 ). Namely that WASM may eventually solve things here.

@davidbrochart

Pyodide supports async, but not threads.
I'm wondering how far we are from having Xarray and Dask work in JupyterLite. It would be nice for cloud-based data/processing, since the browser would only be used for Dask scheduling and visualization.
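As an aside, the "async but no threads" model means concurrency in Pyodide comes entirely from the event loop. A minimal sketch of that style (the `fetch` coroutine here is a hypothetical stand-in for a real async HTTP call such as Pyodide's `pyfetch`; the sleep simulates IO):

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for an async HTTP request; here we just sleep.
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main():
    # Concurrency via the event loop alone, no threads involved.
    return await asyncio.gather(fetch("a", 0.01), fetch("b", 0.01))

results = asyncio.run(main())
```

Note that in the Pyodide main thread the loop is already running, so one would `await main()` directly rather than call `asyncio.run`.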

@martindurant
Member

Dask uses threads for sure, and fsspec also uses a dedicated IO thread for async.

@jakirkham
Member Author

One could use the single threaded scheduler with Dask. Think that is what people trying Dask with Pyodide have generally done. Some discussion in issue ( dask/dask#7764 ).

@oeway

oeway commented Feb 18, 2022

I think if zarr can somehow support an async store natively, we will be able to create something like fsspec for Pyodide.

Without that, it's going to be very hard, and we will need to either wait for Pyodide to support multi-threading or use some tricks or dark magic described here.
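To make the idea concrete, here is a sketch of what a natively async key/value store interface might look like. This is purely illustrative: `AsyncMemoryStore`, `setitem`, and `getitem` are hypothetical names, not zarr's actual API, and the in-memory dict stands in for a real async backend such as aiohttp.

```python
import asyncio

class AsyncMemoryStore:
    """Hypothetical async key/value store, with the rough shape a
    native async zarr store interface might take (illustrative only)."""

    def __init__(self):
        self._data = {}

    async def setitem(self, key, value):
        self._data[key] = value

    async def getitem(self, key):
        # A real implementation would await aiohttp/aiofiles here.
        return self._data[key]

async def demo():
    store = AsyncMemoryStore()
    await store.setitem("var/0.0", b"\x00\x01")
    return await store.getitem("var/0.0")

chunk = asyncio.run(demo())
```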

@joshmoore
Member

@jakirkham : guess a discussion probably leaves my question at #14 (comment) open :)

@oeway : do you have an idea of what the minimum steps would be to having a working implementation? (Do I need to stroll through zarr-developers/zarr-python#536 and look to revive items?)

@martindurant
Member

Another idea from @d70-t was to enable a fully async mode for zarr, like x = await arr[:], which could then use an async backend such as the browser HTTP.

fsspec/filesystem_spec#960 is a small prototype of an async fsspec filesystem for HTTP; it works, but cannot be called by zarr currently, since the mapper interface makes blocking calls.
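The reason the blocking mapper interface cannot call an async backend in the browser can be shown in a few lines. The classic sync-over-async wrapper (a hypothetical `sync_get` below) drives the event loop itself, which works from ordinary synchronous code but raises as soon as a loop is already running, which is exactly the situation on the browser's single thread:

```python
import asyncio

async def async_get(key):
    await asyncio.sleep(0)          # stands in for an async HTTP fetch
    return key.encode()

def sync_get(key):
    # Sync-over-async wrapper: spins up a loop to run the coroutine.
    return asyncio.run(async_get(key))

async def inside_running_loop():
    # Calling the sync wrapper while a loop is already running fails,
    # mirroring zarr's blocking mapper calls in the browser.
    try:
        sync_get("chunk/0")
    except RuntimeError as exc:
        return str(exc)
    return "no error"

ok = sync_get("chunk/0")            # fine outside any running loop
err = asyncio.run(inside_running_loop())
```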

@d70-t

d70-t commented May 27, 2022

My current understanding of how things work in a browser is roughly this: you simply can't do a blocking wait, whether for IO or for other "threads" (web workers). Instead you always have to use either callbacks or async functions. Even if you do async IO in a separate worker, you can't do a blocking wait in your main thread for the worker to finish a job...

Running fsspec fully async is fine, and @martindurant's backend just needs a little polishing but already works. Using it, what you can do right now is load all the data asynchronously and, once the data is there, use zarr synchronously (e.g. on a MemoryStore). This of course is very clumsy, and what we really want is to delay loading data until e.g. .isel within xarray has happened. For this to work, I believe, we must have an async interface all the way down. That interface probably should be explicit, e.g. the equivalent of x = await arr[...] or even x = await ds.var.isel(...), as mentioned in zarr-developers/zarr-python#536, but that had been postponed as the second step (which seems reasonable, because it will touch a lot of code...).

Maybe there's a way to hack around it using the dark magic mentioned by @oeway (which, as far as I understand, basically tries to implement async without using the async keyword).

As for the fully async implementation (I can't really estimate how feasible that would be), another question is whether we would even need something like x = await (await ds.var).isel(...) instead of x = await ds.var.isel(...). Probably we don't want that (but there are ways to avoid it).
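The "prefetch asynchronously, then use zarr synchronously" workaround described above can be sketched as follows. All names here (`fetch_chunk`, `prefetch`) are hypothetical, and the sleep stands in for a real async range request; the filled dict plays the role of an in-memory store, since zarr can read from any MutableMapping:

```python
import asyncio

async def fetch_chunk(url):
    await asyncio.sleep(0)          # stand-in for an async range request
    return url.encode()

async def prefetch(urls):
    # Load everything concurrently up front...
    blobs = await asyncio.gather(*(fetch_chunk(u) for u in urls))
    return dict(zip(urls, blobs))

keys = ["var/0.0", "var/0.1", ".zarray"]
store = asyncio.run(prefetch(keys))

# ...then read from the filled mapping synchronously, e.g. by handing
# it to zarr as a store. Clumsy, because everything is loaded eagerly.
payload = store["var/0.0"]
```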

@oeway

oeway commented May 27, 2022

@d70-t I think you are right! And I like the idea of going for fully async implementation.

Not just for Pyodide: it can also be super beneficial for running zarr in native Python on the server side, e.g. serving a large number of users. In async programming (e.g. a FastAPI server app), we don't want blocking code, in order to achieve higher concurrency. Having native async support in zarr would be useful when combined with async IO libraries such as aiofiles or aiohttp.

For the current Pyodide (without multi-threading support), the easiest way for me is to use XMLHttpRequest to "convert" an async request to sync, and that works well in e.g. JupyterLite.

I have a working demo where I use Zarr-Python with a ZipStore to fetch and visualize a remote large tissue image with http requests:
https://jupyter.imjoy.io/lab/index.html?load=https://gist.github.com/oeway/391b4352ea57b5682366ce3dc2fa9174&open=1

However, this only works if the Pyodide code runs in a web-worker (like a thread in the browser), and it will either fail or block the web UI when used in the main thread.

PyScript would be a nice way to use zarr-python in the browser, but I am not sure whether it runs Python in a separate web worker or not.

@martindurant
Member

Is there anything preventing the use of that sync file-like object in general pyscript code? Are there any restrictions, can I make an fsspec filesystem out of it? I don't immediately see where a webworker is used, and whether it is necessary for just the IO part. That would immediately enable intake.open, pandas.read_csv, etc. and anything else that currently depends on requests. Of course, async is better for zarr when accessing multiple chunks...

PyScript would be a nice way for using zarr-python in the browser, but I am not sure whether it run python in a separate web-worker or not.

I believe no, but the situation is rapidly changing, and the whole topic of async and/or threads is high on the priority list.

@d70-t

d70-t commented May 27, 2022

I didn't know that a synchronous request actually exists in the Web (but XMLHttpRequest.open() with async=False clearly is one).

Based on this, it should be relatively easy to build a synchronous HTTP fsspec filesystem, which in turn should work with current zarr and xarray. But it will be slow and will likely make for a bad user experience (e.g. due to blocking the user thread).
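The core of such a synchronous filesystem is just a blocking whole-object fetch. A minimal sketch, with urllib standing in for the sync XHR call (the `cat_file` name is illustrative, not fsspec's actual API; the example is exercised against a local file:// URL so no network is needed):

```python
import pathlib
import tempfile
import urllib.request

def cat_file(url):
    # Blocking fetch of a whole object: the sync-XHR analogue.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# Exercise it against a local file:// URL instead of a real server.
tmp = pathlib.Path(tempfile.mkdtemp()) / "chunk.bin"
tmp.write_bytes(b"\x00\x01\x02")
data = cat_file(tmp.as_uri())
```

An fsspec filesystem built this way would fetch one object at a time, blocking between requests, which is exactly the performance concern raised below.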

@d70-t

d70-t commented May 27, 2022

So here's some discussion on synchronous XHR: it's deprecated. (And even more deprecated on the main thread). So it will work 😄, but we shouldn't do it 😬 ...

@martindurant
Member

we shouldn't do it

I am prepared to make this FS for demonstration purposes. You probably shouldn't be loading too much data into the browser VM anyway, right? Since intake only reads little files for catalogs, it would be a nice thing for me; not so much for zarr/xarray.

@martindurant
Member

martindurant commented May 27, 2022

Also, it means that normal python sync code can be OK, so long as it's in a webworker (and there are separate discussions about how to proxy the DOM, etc). It's good enough to create large xarray objects and select small parts of them, not to actually load a lot of data. That's fine for now, for the current state of pyscript. We'll adapt as more things are clarified on their end.

The original discussion about a fully async zarr/xarray is pertinent anyway. One feature I would like, for instance, is to be able to fetch just one chunk, but from many variables, concurrently. That's not possible now, even with an async storage backend.
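The "one chunk from many variables, concurrently" pattern is a natural gather once an async interface exists all the way down. A sketch under that assumption (all names hypothetical; the sleep stands in for one async chunk request per variable):

```python
import asyncio

async def get_chunk(variable, index):
    await asyncio.sleep(0)          # stand-in for one async chunk request
    return f"{variable}/{index}".encode()

async def first_chunk_of_each(variables):
    # One request per variable, all in flight at the same time.
    chunks = await asyncio.gather(*(get_chunk(v, "0.0") for v in variables))
    return dict(zip(variables, chunks))

chunks = asyncio.run(first_chunk_of_each(["temp", "salt", "pressure"]))
```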

@d70-t

d70-t commented May 27, 2022

For a demonstration, maybe... but I guess once that's out, everyone will want to hop on it (I'll try it for sure) and we'll hit huge performance issues. Probably there should at least be some big warning. It's not only the bad UX, but also that it will effectively disable everything that has been gained in zarr-developers/zarr-python#536.

For the size, I think if people know they are working with larger datasets, loading a GB into browser memory might still be fine... (although keeping it low should always be a goal).

@martindurant
Member

I find that fsspec's github implementation is surprisingly popular, despite being sync. Of course, files are limited in size and range requests don't work on github anyway. Python users are used to waiting for an operation to complete, browser users are not.

@d70-t

d70-t commented May 27, 2022

Python users are used to waiting for an operation to complete

😄 yes, that's probably so true... I'm convinced and very happy to try the demonstration 👍.

@dopplershift

One other thing about XHR: it doesn't work for binary data.

@martindurant
Member

@dopplershift , I think you just killed the idea, which is a shame, as I've been working on it

@martindurant
Member

(although YAML and CSV are fine, of course - but this gets rather limiting)

@dopplershift

Well, these docs make it seem like it should be possible, but pyodide's open_url() explicitly says it doesn't support binaries--I assume for a good reason.

I ran into this same issue last week trying to make a drop-in to replace the use of requests in my siphon library so that I could get some other protocol working. I came to the conclusion that async is the only option here, which I'm guessing is why we haven't seen a simple requests-replacement.

@d70-t

d70-t commented May 27, 2022

@dopplershift, thanks for pointing this out. I've been digging a bit deeper, and there are clear intentions behind this. In principle, it's possible to retrieve binary data via XHR. To do so, one has to set the responseType to "arraybuffer" or "blob". But the usage notes say that one can't change the responseType in synchronous mode unless one is in a Worker. The usage notes point out the following:

This restriction is designed in part to help ensure that synchronous operations aren't used for large transactions that block the browser's main thread, thereby bogging down the user experience.

This kind of usage also matches to what @oeway observed previously:

However, this only works if the Pyodide code runs in a web-worker (like a thread in the browser), and it will either fail or block the web UI when used in the main thread.

So probably this is a little less limiting and maybe still worth a shot. But still, it seems like async zarr is much more what we want.

@d70-t

d70-t commented May 27, 2022

I made a little test: A main script, a Worker and a binary file. The main script starts the worker, the worker fetches the binary file synchronously and posts the content back to the main script. The main script shows the content of the array buffer in the DOM. You can run it yourself by starting a little webserver, e.g.:

python -m http.server

binary_sync_request.zip

@d70-t

d70-t commented May 27, 2022

pyodide/issues#400 might be relevant here as well... i.e. if this works, probably open_url() could be extended to support binary reads if being executed in a Worker.

@martindurant
Member

martindurant commented May 27, 2022

The worker is sync, but the main thread is async; so how could this be called from pyscript? I am thinking "not at all" until we (someone else!) figure out how to use workers, threads, etc.

In fsspec/filesystem_spec#960, I put a text-only requests clone for pyodide/pyscript.

Interestingly: you do get data for binary fetches, but the non-ascii characters are rendered as '�', which would be b'\xef\xbf\xbd' in utf8.

@martindurant
Member

So does https://github.com/imjoy-team/imjoy-utils/blob/main/imjoy_utils/__init__.py#L21 do the right thing?

@d70-t

d70-t commented May 27, 2022

Yes... the main thread has to be async, because (as far as I know) you can't synchronously wait for a worker. Most likely because people only invented Workers after discovering that synchronous waits (i.e. XHR) are not so nice.

But one could use this to run the entire Pyodide, zarr, xarray, fsspec, etc. stack within a (synchronous) Worker. One only has to find out how to post the results back to the UI. But maybe that's what JupyterLite does anyway?

@d70-t

d70-t commented May 27, 2022

So does https://github.com/imjoy-team/imjoy-utils/blob/main/imjoy_utils/__init__.py#L21 do the right thing?

Yes, it does. But you have to run it in a Worker.
