Pyodide & Zarr #43
cc @rth (in case you have insights here 🙂)
I wonder how many fsspec implementations run in pyodide? Perhaps nothing that uses async...
I didn't think async was an issue, but threads and sockets are ( dask/dask#7764 (comment) ). IDK how …

Yeah, this came up in the discussion about Zarr.js ( #24 ). Namely that WASM may eventually solve things here.
Pyodide supports async, but not threads.
Dask uses threads for sure, and fsspec also uses a dedicated IO thread for async.
One could use the single-threaded scheduler with Dask. I think that is what people trying Dask with Pyodide have generally done. Some discussion in issue ( dask/dask#7764 ).
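For reference, a minimal sketch of forcing Dask's single-threaded (synchronous) scheduler, as mentioned above (assumes dask is installed; the array shape here is purely illustrative):

```python
import dask
import dask.array as da

# build a small chunked array
x = da.ones((10, 10), chunks=(5, 5))

# run the computation with no threads or processes at all,
# which is the mode compatible with Pyodide's no-threads constraint
with dask.config.set(scheduler="synchronous"):
    total = x.sum().compute()

print(total)  # 100.0
```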
I think if zarr can somehow support an async store natively, we will be able to create something like fsspec for pyodide. Without that, it's going to be very hard, and we will need to either wait for pyodide to support multi-threading, or use some tricks or dark magic described here.
@jakirkham : I guess this discussion probably leaves my question at #14 (comment) open :) @oeway : do you have an idea of what the minimum steps would be to get a working implementation? (Do I need to stroll through zarr-developers/zarr-python#536 and look to revive items?)
Another idea from @d70-t was to enable a fully async mode for zarr: fsspec/filesystem_spec#960 is a small prototype of an async fsspec filesystem for HTTP; it works, but cannot be called by zarr currently, since the mapper interface makes blocking calls.
My current understanding of how things in a browser work, roughly, is: you simply can't do any blocking wait for IO or for other "threads" (web workers) to finish. Instead you always have to use either callbacks or async functions. Even if you do async IO in a separate worker, you can't do a blocking wait in your main thread for the worker to finish a job...

Maybe there's a way to hack around it using the dark magic mentioned by @oeway (which, as far as I understand, basically tries to implement …).

If it would be the fully async implementation (I can't really estimate how feasible that would be), another question would be if we even need something like …
@d70-t I think you are right! And I like the idea of going for a fully async implementation. Not just for pyodide: it could also be super beneficial for running zarr in native Python on the server side, e.g. serving a large number of users. In any async programming (e.g. a FastAPI server app), we don't want blocking code, in order to achieve higher concurrency. Having native async support in zarr would be useful when combined with async IO libraries such as aiofiles or aiohttp.

For the current pyodide (without multi-threading support), the easiest way for me is to use XMLHttpRequest to "convert" an async request to sync, and that works well in e.g. JupyterLite. I have a working demo where I use Zarr-Python with a ZipStore to fetch and visualize a large remote tissue image via HTTP requests.

However, this only works if the Pyodide code runs in a web worker (like a thread in the browser); it will either fail or block the web UI when used in the main thread. PyScript would be a nice way to use zarr-python in the browser, but I am not sure whether it runs Python in a separate web worker or not.
Is there anything preventing the use of that sync file-like object in general pyscript code? Are there any restrictions, and can I make an fsspec filesystem out of it? I don't immediately see where a webworker is used, and whether it is necessary for just the IO part. That would immediately enable intake.open, pandas.read_csv, etc. and anything else that currently depends on …
I believe no, but the situation is rapidly changing, and the whole topic of async and/or threads is high on the priority list.
I didn't know that a synchronous request actually exists on the Web (but XMLHttpRequest.open() with … does). Based on this, it should be relatively easy to build a synchronous HTTP fsspec-filesystem, which in turn should then work with current …
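A blocking HTTP fetch with byte-range support could be sketched with only the standard library (the class name SyncHTTPFS and the method shape are hypothetical; a real version would subclass fsspec's AbstractFileSystem and, in the browser, route through XMLHttpRequest instead of urllib):

```python
import urllib.request


def range_header(start, end):
    # HTTP Range is inclusive on both ends, while fsspec-style
    # (start, end) offsets treat `end` as exclusive
    return f"bytes={start or 0}-{'' if end is None else end - 1}"


class SyncHTTPFS:
    """Minimal sketch of a blocking HTTP 'filesystem' shim."""

    def cat_file(self, url, start=None, end=None):
        req = urllib.request.Request(url)
        if start is not None or end is not None:
            req.add_header("Range", range_header(start, end))
        with urllib.request.urlopen(req) as resp:
            return resp.read()
```

The range handling is the part zarr would rely on for partial chunk reads against a server that supports it.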
So here's some discussion on synchronous XHR: it's deprecated. (And even more deprecated on the main thread.) So it will work 😄, but we shouldn't do it 😬 ...
I am prepared to make this FS for demonstration purposes. You probably shouldn't be loading too much data into the browser VM anyway, right? Since intake only reads little files for catalogs, it would be a nice thing for me; not so much for zarr/xarray.
Also, it means that normal synchronous Python code can be OK, so long as it's in a webworker (and there are separate discussions about how to proxy the DOM, etc). It's good enough to create large xarray objects and select small parts of them, not to actually load a lot of data. That's fine for now, for the current state of pyscript; we'll adapt as more things are clarified on their end.

The original discussion about a fully async zarr/xarray is pertinent anyway. One feature I would like, for instance, is to be able to fetch just one chunk, but from many variables, concurrently. That's not possible now even with an async storage backend.
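The concurrent one-chunk-per-variable fetch wished for above could look roughly like this against a hypothetical async store (ToyStore and the "variable/chunk" key scheme are made up for illustration; a real store would do network IO where the sleep is):

```python
import asyncio


class ToyStore:
    """Stand-in for an async zarr store."""

    def __init__(self, data):
        self._data = data

    async def get(self, key):
        await asyncio.sleep(0)  # placeholder for real async network IO
        return self._data[key]


async def fetch_one_chunk_each(store, variables, chunk_key="0.0"):
    # issue all requests at once, then await them together
    keys = [f"{v}/{chunk_key}" for v in variables]
    chunks = await asyncio.gather(*(store.get(k) for k in keys))
    return dict(zip(variables, chunks))


result = asyncio.run(fetch_one_chunk_each(
    ToyStore({"t/0.0": b"T", "p/0.0": b"P"}), ["t", "p"]))
print(result)  # {'t': b'T', 'p': b'P'}
```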
For a demonstration, maybe.... but I guess once that's out, everyone will want to hop on it (I'll try it for sure) and we'll hit huge performance issues. There should probably at least be a big warning. It's not only the bad UX; it will also effectively disable everything that has been gained in zarr-developers/zarr-python#536. As for size: if people know they are working with larger datasets, loading a GB into browser memory might still be fine... (although keeping it low should always be a goal).
I find that fsspec's github implementation is surprisingly popular, despite being sync. Of course, files are limited in size and range requests don't work on github anyway. Python users are used to waiting for an operation to complete; browser users are not.
😄 yes, that's probably so true... I'm convinced and very happy to try the demonstration 👍.
One other thing about XHR: it doesn't work for binary data. |
@dopplershift , I think you just killed the idea, which is a shame, as I've been working on it |
(although YAML and CSV are fine, of course - but this gets rather limiting) |
Well, these docs make it seem like it should be possible, but pyodide's …

I ran into this same issue last week trying to make a drop-in replacement for requests in my siphon library, so that I could get some other protocol working. I came to the conclusion that async is the only option here, which I'm guessing is why we haven't seen a simple requests replacement.
@dopplershift, thanks for pointing this out. I've been digging a bit deeper, and there are clear intentions behind this. In principle, it's possible to retrieve binary data via …

This kind of usage also matches what @oeway observed previously.

So probably this is a little less limiting and maybe still worth a shot. But still, it seems like …
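If the trick in question is the classic overrideMimeType("text/plain; charset=x-user-defined") workaround for getting binary data out of a synchronous XHR (an assumption on my part, not confirmed in the thread), the browser hands back a string in which each character's low byte carries the original byte, and the Python-side decode is just a mask:

```python
def decode_xuser_defined(text: str) -> bytes:
    # with charset=x-user-defined, bytes >= 0x80 come back as characters
    # in the 0xF700-0xF7FF private-use range, while ASCII stays as-is;
    # masking each code point with 0xFF recovers the raw byte either way
    return bytes(ord(c) & 0xFF for c in text)
```

For example, "\uf7ff" decodes to b"\xff", and plain ASCII passes through unchanged.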
I made a little test: A main script, a Worker and a binary file. The main script starts the worker, the worker fetches the binary file synchronously and posts the content back to the main script. The main script shows the content of the array buffer in the DOM. You can run it yourself by starting a little webserver, e.g.: python -m http.server |
pyodide/issues#400 might be relevant here as well... i.e. if this works, probably …
The worker is sync, but the main thread is async; so how could this be called from pyscript? I am thinking "not at all", until we (someone else!) figure out how to use workers, threads, etc. In fsspec/filesystem_spec#960, I put a text-only requests clone for pyodide/pyscript. Interestingly, you do get data for binary fetches, but the non-ASCII characters are rendered as …
So does https://github.com/imjoy-team/imjoy-utils/blob/main/imjoy_utils/__init__.py#L21 do the right thing? |
Yes... the main thread has to be async, because (as far as I know) you can't synchronously wait for a worker. Most likely because people only invented Workers after discovering that synchronous waits (i.e. …).

But one could use this to run the entire pyodide, zarr, xarray, fsspec etc. stuff within a (synchronous) Worker. One only has to find out how to post the results back to the UI. But maybe that's what jupyterlite does anyway?
Yes, it does. But you have to run it in a Worker. |
Learned recently that Pyodide is including Zarr, which is great to hear 😄 Also happy to learn folks are able to build Numcodecs.
Would be interested to learn more about how folks are using Pyodide & Zarr together. So opening this issue for that discussion.