
Pyodide & Zarr #43

Open
jakirkham opened this issue Feb 17, 2022 · 30 comments
@jakirkham
Member

Learned recently that Pyodide is including Zarr, which is great to hear 😄 Also happy to learn folks are able to build Numcodecs.

Would be interested to learn more about how folks are using Pyodide & Zarr together. So opening this issue for that discussion.

@jakirkham
Member Author

cc @rth (in case you have insights here 🙂)

@martindurant
Member

I wonder how many fsspec implementations run in Pyodide? Perhaps none that use async...
Maybe we need a zarr JS frontend less than we thought :)

@jakirkham
Member Author

I didn't think async was an issue, but threads and sockets are ( dask/dask#7764 (comment) ). IDK how requests would work there, but it seems some effort may be needed ( pyodide/pyodide#1956 ), which seems like the bigger blocker to using fsspec with Pyodide.

Yeah this came up in the discussion about Zarr.js ( #24 ). Namely that WASM may eventually solve things here.

@davidbrochart

Pyodide supports async, but not threads.
I'm wondering how far we are from having Xarray and Dask work in JupyterLite. It would be nice for cloud-based data/processing, since the browser would only be used for Dask scheduling and visualization.
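As an aside, the "async but no threads" model means concurrency in Pyodide comes entirely from the event loop. A minimal sketch of that style (the `fetch` coroutine here is a hypothetical stand-in for a real async HTTP call such as Pyodide's `pyfetch`; the sleep simulates IO):

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for an async HTTP request; here we just sleep.
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main():
    # Concurrency via the event loop alone, no threads involved.
    return await asyncio.gather(fetch("a", 0.01), fetch("b", 0.01))

results = asyncio.run(main())
```

Note that in the Pyodide main thread the loop is already running, so one would `await main()` directly rather than call `asyncio.run`.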

@martindurant
Member

Dask uses threads for sure, and fsspec also uses a dedicated IO thread for async.

@jakirkham
Member Author

One could use the single threaded scheduler with Dask. Think that is what people trying Dask with Pyodide have generally done. Some discussion in issue ( dask/dask#7764 ).

@oeway

oeway commented Feb 18, 2022

I think if zarr can somehow support an async store natively, we will be able to create something like fsspec for Pyodide.

Without that, it's going to be very hard, and we will need to either wait for Pyodide to support multi-threading or use some tricks or dark magic described here.
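To make the idea concrete, here is a sketch of what a natively async key/value store interface might look like. This is purely illustrative: `AsyncMemoryStore`, `setitem`, and `getitem` are hypothetical names, not zarr's actual API, and the in-memory dict stands in for a real async backend such as aiohttp.

```python
import asyncio

class AsyncMemoryStore:
    """Hypothetical async key/value store, with the rough shape a
    native async zarr store interface might take (illustrative only)."""

    def __init__(self):
        self._data = {}

    async def setitem(self, key, value):
        self._data[key] = value

    async def getitem(self, key):
        # A real implementation would await aiohttp/aiofiles here.
        return self._data[key]

async def demo():
    store = AsyncMemoryStore()
    await store.setitem("var/0.0", b"\x00\x01")
    return await store.getitem("var/0.0")

chunk = asyncio.run(demo())
```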

@joshmoore
Member

@jakirkham : guess a discussion probably leaves my question at #14 (comment) open :)

@oeway : do you have an idea of what the minimum steps would be to having a working implementation? (Do I need to stroll through zarr-developers/zarr-python#536 and look to revive items?)

@martindurant
Member

Another idea from @d70-t was to enable a fully async mode for zarr, like x = await arr[:], which could then use an async backend such as the browser HTTP.

fsspec/filesystem_spec#960 is a small prototype of an async fsspec filesystem for HTTP; it works, but cannot be called by zarr currently, since the mapper interface makes blocking calls.
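The reason the blocking mapper interface cannot call an async backend in the browser can be shown in a few lines. The classic sync-over-async wrapper (a hypothetical `sync_get` below) drives the event loop itself, which works from ordinary synchronous code but raises as soon as a loop is already running, which is exactly the situation on the browser's single thread:

```python
import asyncio

async def async_get(key):
    await asyncio.sleep(0)          # stands in for an async HTTP fetch
    return key.encode()

def sync_get(key):
    # Sync-over-async wrapper: spins up a loop to run the coroutine.
    return asyncio.run(async_get(key))

async def inside_running_loop():
    # Calling the sync wrapper while a loop is already running fails,
    # mirroring zarr's blocking mapper calls in the browser.
    try:
        sync_get("chunk/0")
    except RuntimeError as exc:
        return str(exc)
    return "no error"

ok = sync_get("chunk/0")            # fine outside any running loop
err = asyncio.run(inside_running_loop())
```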

@d70-t

d70-t commented May 27, 2022

My current understanding of how things work in a browser is roughly this: you simply can't do a blocking wait, whether for IO or for other "threads" (web workers). Instead you always have to use either callbacks or async functions. Even if you do async IO in a separate worker, you can't do a blocking wait in your main thread for the worker to finish a job...

Running fsspec fully async is fine, and @martindurant's backend just needs a little polishing but already works. Using it, what you can do right now is load all the data asynchronously and, once the data is there, use zarr synchronously (e.g. on a MemoryStore). This of course is very clumsy, and what we really want is to delay loading data until e.g. .isel within xarray has happened. For this to work, I believe, we must have an async interface all the way down. That interface probably should be explicit, e.g. the equivalent of x = await arr[...] or even x = await ds.var.isel(...), as mentioned in zarr-developers/zarr-python#536, but that had been postponed as the second step (which seems reasonable, because it will touch a lot of code...).

Maybe there's a way to hack around it using the dark magic mentioned by @oeway (which, as far as I understand, basically tries to implement async without using the async keyword).

As for the fully async implementation (I can't really estimate how feasible that would be), another question is whether we would even need something like x = await (await ds.var).isel(...) instead of x = await ds.var.isel(...). Probably we don't want that (but there are ways to avoid it).
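The "prefetch asynchronously, then use zarr synchronously" workaround described above can be sketched as follows. All names here (`fetch_chunk`, `prefetch`) are hypothetical, and the sleep stands in for a real async range request; the filled dict plays the role of an in-memory store, since zarr can read from any MutableMapping:

```python
import asyncio

async def fetch_chunk(url):
    await asyncio.sleep(0)          # stand-in for an async range request
    return url.encode()

async def prefetch(urls):
    # Load everything concurrently up front...
    blobs = await asyncio.gather(*(fetch_chunk(u) for u in urls))
    return dict(zip(urls, blobs))

keys = ["var/0.0", "var/0.1", ".zarray"]
store = asyncio.run(prefetch(keys))

# ...then read from the filled mapping synchronously, e.g. by handing
# it to zarr as a store. Clumsy, because everything is loaded eagerly.
payload = store["var/0.0"]
```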

@oeway

oeway commented May 27, 2022

@d70-t I think you are right! And I like the idea of going for fully async implementation.

Not just for Pyodide: it can also be super beneficial for running zarr in native Python on the server side, e.g. serving a large number of users. In async programming (e.g. a FastAPI server app), we don't want blocking code, in order to achieve higher concurrency. Having native async support in zarr would be useful when combined with async IO libraries such as aiofiles or aiohttp.

For the current Pyodide (without multi-threading support), the easiest way for me is to use XMLHttpRequest to "convert" an async request to sync, and that works well in e.g. JupyterLite.

I have a working demo where I use Zarr-Python with a ZipStore to fetch and visualize a remote large tissue image with http requests:
https://jupyter.imjoy.io/lab/index.html?load=https://gist.github.com/oeway/391b4352ea57b5682366ce3dc2fa9174&open=1

However, this only works if the Pyodide code runs in a web-worker (like a thread in the browser), and it will either fail or block the web UI when used in the main thread.

PyScript would be a nice way to use zarr-python in the browser, but I am not sure whether it runs Python in a separate web worker or not.

@martindurant
Member

Is there anything preventing the use of that sync file-like object in general pyscript code? Are there any restrictions, can I make an fsspec filesystem out of it? I don't immediately see where a webworker is used, and whether it is necessary for just the IO part. That would immediately enable intake.open, pandas.read_csv, etc. and anything else that currently depends on requests. Of course, async is better for zarr when accessing multiple chunks...

PyScript would be a nice way for using zarr-python in the browser, but I am not sure whether it run python in a separate web-worker or not.

I believe no, but the situation is rapidly changing, and the whole topic of async and/or threads is high on the priority list.

@d70-t

d70-t commented May 27, 2022

I didn't know that a synchronous request actually exists in the Web (but XMLHttpRequest.open() with async=False clearly is one).

Based on this, it should be relatively easy to build a synchronous HTTP fsspec filesystem, which in turn should work with current zarr and xarray. But it will be slow and will likely make for a bad user experience (e.g. due to blocking the user thread).
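The core of such a synchronous filesystem is just a blocking whole-object fetch. A minimal sketch, with urllib standing in for the sync XHR call (the `cat_file` name is illustrative, not fsspec's actual API; the example is exercised against a local file:// URL so no network is needed):

```python
import pathlib
import tempfile
import urllib.request

def cat_file(url):
    # Blocking fetch of a whole object: the sync-XHR analogue.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# Exercise it against a local file:// URL instead of a real server.
tmp = pathlib.Path(tempfile.mkdtemp()) / "chunk.bin"
tmp.write_bytes(b"\x00\x01\x02")
data = cat_file(tmp.as_uri())
```

An fsspec filesystem built this way would fetch one object at a time, blocking between requests, which is exactly the performance concern raised below.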

@d70-t

d70-t commented May 27, 2022

So here's some discussion on synchronous XHR: it's deprecated. (And even more deprecated on the main thread). So it will work 😄, but we shouldn't do it 😬 ...

@martindurant
Member

we shouldn't do it

I am prepared to make this FS for demonstration purposes. You probably shouldn't be loading too much data into the browser VM anyway, right? Since intake only reads little files for catalogs, it would be a nice thing for me; not so much for zarr/xarray.

@martindurant
Member

martindurant commented May 27, 2022

Also, it means that normal python sync code can be OK, so long as it's in a webworker (and there are separate discussions about how to proxy the DOM, etc). It's good enough to create large xarray objects and select small parts of them, not to actually load a lot of data. That's fine for now, for the current state of pyscript. We'll adapt as more things are clarified on their end.

The original discussion about a fully async zarr/xarray is pertinent anyway. One feature I would like, for instance, is to be able to fetch just one chunk, but from many variables, concurrently. That's not possible now, even with an async storage backend.
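The "one chunk from many variables, concurrently" pattern is a natural gather once an async interface exists all the way down. A sketch under that assumption (all names hypothetical; the sleep stands in for one async chunk request per variable):

```python
import asyncio

async def get_chunk(variable, index):
    await asyncio.sleep(0)          # stand-in for one async chunk request
    return f"{variable}/{index}".encode()

async def first_chunk_of_each(variables):
    # One request per variable, all in flight at the same time.
    chunks = await asyncio.gather(*(get_chunk(v, "0.0") for v in variables))
    return dict(zip(variables, chunks))

chunks = asyncio.run(first_chunk_of_each(["temp", "salt", "pressure"]))
```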

@d70-t

d70-t commented May 27, 2022

For a demonstration, maybe... but I guess once that's out, everyone will want to hop on it (I'll try it for sure) and we'll hit huge performance issues. Probably there should at least be some big warning. It's not only the bad UX, but also that it will effectively disable everything that has been gained in zarr-developers/zarr-python#536.

For the size, I think if people know they are working with larger datasets, loading a GB into browser memory might still be fine... (although keeping it low should always be a goal).

@martindurant
Member

I find that fsspec's github implementation is surprisingly popular, despite being sync. Of course, files are limited in size and range requests don't work on github anyway. Python users are used to waiting for an operation to complete, browser users are not.

@d70-t

d70-t commented May 27, 2022

Python users are used to waiting for an operation to complete

😄 yes, that's probably so true... I'm convinced and very happy to try the demonstration 👍.

@dopplershift

One other thing about XHR: it doesn't work for binary data.

@martindurant
Member

@dopplershift , I think you just killed the idea, which is a shame, as I've been working on it

@martindurant
Member

(although YAML and CSV are fine, of course - but this gets rather limiting)

@dopplershift

Well, these docs make it seem like it should be possible, but pyodide's open_url() explicitly says it doesn't support binaries--I assume for a good reason.

I ran into this same issue last week trying to make a drop-in to replace the use of requests in my siphon library so that I could get some other protocol working. I came to the conclusion that async is the only option here, which I'm guessing is why we haven't seen a simple requests-replacement.

@d70-t

d70-t commented May 27, 2022

@dopplershift, thanks for pointing this out. I've been digging a bit deeper, and there are clear intentions behind this. In principle, it's possible to retrieve binary data via XHR. To do so, one has to set the responseType to "arraybuffer" or "blob". But the usage notes say that one can't change the responseType in synchronous mode unless one is in a Worker. The usage notes point out the following:

This restriction is designed in part to help ensure that synchronous operations aren't used for large transactions that block the browser's main thread, thereby bogging down the user experience.

This kind of usage also matches to what @oeway observed previously:

However, this only works if the Pyodide code runs in a web-worker (like a thread in the browser), and it will either fail or block the web UI when used in the main thread.

So probably this is a little less limiting and maybe still worth a shot. But still, it seems like async zarr is much more what we want.

@d70-t

d70-t commented May 27, 2022

I made a little test: A main script, a Worker and a binary file. The main script starts the worker, the worker fetches the binary file synchronously and posts the content back to the main script. The main script shows the content of the array buffer in the DOM. You can run it yourself by starting a little webserver, e.g.:

python -m http.server

binary_sync_request.zip

@d70-t

d70-t commented May 27, 2022

pyodide/issues#400 might be relevant here as well... i.e. if this works, probably open_url() could be extended to support binary reads if being executed in a Worker.

@martindurant
Member

martindurant commented May 27, 2022

The worker is sync, but the main thread is async; so how could this be called from pyscript? I am thinking "not at all" until we (someone else!) figure out how to use workers, threads, etc.

In fsspec/filesystem_spec#960, I put a text-only requests clone for pyodide/pyscript.

Interestingly: you do get data for binary fetches, but the non-ascii characters are rendered as '�', which would be b'\xef\xbf\xbd' in utf8.

@martindurant
Member

So does https://github.com/imjoy-team/imjoy-utils/blob/main/imjoy_utils/__init__.py#L21 do the right thing?

@d70-t

d70-t commented May 27, 2022

Yes... the main thread has to be async, because (as far as I know) you can't synchronously wait for a worker. Most likely because people only invented Workers after discovering that synchronous waits (i.e. XHR) are not so nice.

But one could use this to run the entire Pyodide, zarr, xarray, fsspec, etc. stack within a (synchronous) Worker. One only has to find out how to post the results back to the UI. But maybe that's what JupyterLite does anyway?

@d70-t

d70-t commented May 27, 2022

So does https://github.com/imjoy-team/imjoy-utils/blob/main/imjoy_utils/__init__.py#L21 do the right thing?

Yes, it does. But you have to run it in a Worker.
