Idea: Xarray interface #334

Open
benbovy opened this issue Oct 13, 2022 · 5 comments


@benbovy

benbovy commented Oct 13, 2022

I have stumbled upon the Open-EO project a few times and find it very interesting. As an Xarray developer I've been wondering whether it would benefit from an Xarray interface.

I see that Xarray is already used here, but as far as I understand it is only used for writing user-defined functions (UDFs). My suggestion is rather to interact with Open-EO directly via the Xarray API, which may be complementary to using Xarray for UDFs.

New Xarray developments are going towards very flexible containers, with the recent addition of IO backends, alternative array backends (cupy, sparse, pytorch...), alternative parallel execution backends, flexible indexes (pydata/xarray#7041, https://github.com/pydata/xarray/projects/1), and accessors.

Leveraging Xarray's flexibility, I can imagine something like this:

import openeo
import xarray as xr

connection = openeo.connect("https://earthengine.openeo.org")

ds = xr.open_dataset(
    connection,
    engine="openeo",
    collection="COPERNICUS/S1_GRD",
    spatial_extent={"west": 16.06, "south": 48.06, "east": 16.65, "north": 48.35},
    temporal_extent=["2017-03-01", "2017-06-01"],
    bands=["VV"],
)

# internally calls DataCube.filter_temporal() via a custom Xarray Index
# attached to dataset and returns a new xarray.Dataset
ds_march = ds.sel(time=slice("2017-03-01", "2017-04-01"))

# internally calls DataCube.mean_time()
ds_mean_march = ds_march.mean("time")

# sends the processing job, waits for its execution and downloads
# the result into a new xarray.Dataset
ds_result = ds_mean_march.compute()

# or only sends the job...
ds_mean_march.persist()

# ...and later waits for the job to finish its execution and downloads the result
ds_result = ds_mean_march.load()

I think that such an Xarray interface could be built on top of this client library (perhaps in another repository). To make things easier, ideally openeo.DataCube would need to implement some duck array API.
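To make the "duck array" idea concrete, here is a minimal sketch of an array-like object that records operations instead of executing them. Everything in it is hypothetical: `LazyCube`, its `graph` attribute and the node layout are invented for illustration and are not part of the openeo client or of Xarray.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class LazyCube:
    """Hypothetical stand-in for a remote datacube (not openeo.DataCube)."""

    shape: Tuple[int, ...]
    dtype: str = "float32"
    # Operations recorded so far; nothing is computed locally.
    graph: Tuple[Dict[str, Any], ...] = ()

    @property
    def ndim(self) -> int:
        return len(self.shape)

    def _then(self, process: str, new_shape: Tuple[int, ...], **arguments: Any) -> "LazyCube":
        # Return a new cube with one more recorded operation.
        node = {"process": process, "arguments": arguments}
        return LazyCube(new_shape, self.dtype, self.graph + (node,))

    def mean(self, axis: int) -> "LazyCube":
        axis = axis % self.ndim
        new_shape = self.shape[:axis] + self.shape[axis + 1:]
        return self._then("mean", new_shape, axis=axis)

    def __add__(self, other: float) -> "LazyCube":
        return self._then("add", self.shape, y=other)


# Assumed layout: (time, y, x)
cube = LazyCube(shape=(92, 300, 400))
result = (cube + 1.0).mean(axis=0)
recorded = [node["process"] for node in result.graph]
```

A container such as Xarray only needs attributes like `shape`, `dtype` and `ndim` plus the array operations it dispatches to, so a wrapper of this kind could in principle keep the actual computation on the back-end until a result is requested.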

The main advantage is that users can interact with OpenEO using an API they are already familiar with (assuming they already know about Xarray). They can also further process the data (results) locally using the same interface.

This is a very rough idea that I just wanted to share here, though. I'm pretty sure that providing an Xarray interface would represent quite some work with lots of challenging issues (and likely things to address in Xarray). I'd be happy to read what you think about this idea! (Sorry, I'm not sure if it is the right place here for discussing this)

@m-mohr
Member

m-mohr commented Oct 13, 2022

On the other hand, if the Array API specification gets adopted in the Python world, that might be the more general choice: https://data-apis.org/array-api/latest/API_specification/index.html

@jdries
Collaborator

jdries commented Oct 14, 2022 via email

@benbovy
Author

benbovy commented Oct 14, 2022

On the other hand, if the Array API specification gets adopted in the Python world, that might be the more general choice

Yes that would make things even easier! If OpenEO datacubes implement (a part of) the Array API Standard, then many things should already work seamlessly through Xarray (linking a discussion about testing the integration of any Array class with Xarray: pydata/xarray#6894).
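As a rough illustration of what "implementing a part of the Array API Standard" could look like, the sketch below gives an array object an `__array_namespace__` method, which is the entry point the standard defines for discovering an array's functions. The `RemoteArray` class, its `ops` attribute and the `temporal_mean` helper are hypothetical; only `__array_namespace__` and the `mean(x, /, *, axis=None, keepdims=False)` signature follow the standard.

```python
import types


class RemoteArray:
    """Hypothetical array whose operations are recorded rather than executed."""

    def __init__(self, shape, ops=()):
        self.shape = tuple(shape)
        self.ops = tuple(ops)

    def __array_namespace__(self, *, api_version=None):
        # Entry point defined by the Array API standard.
        return namespace


def mean(x, /, *, axis=None, keepdims=False):
    # Normalise the reduced axes; axis=None means "reduce over everything".
    ndim = len(x.shape)
    axes = tuple(range(ndim)) if axis is None else (axis % ndim,)
    if keepdims:
        shape = tuple(1 if i in axes else n for i, n in enumerate(x.shape))
    else:
        shape = tuple(n for i, n in enumerate(x.shape) if i not in axes)
    return RemoteArray(shape, x.ops + (("mean", axes),))


# A one-function namespace; a real one would need many more functions.
namespace = types.SimpleNamespace(mean=mean)


def temporal_mean(x):
    """Generic consumer code: it never names the array library directly."""
    xp = x.__array_namespace__()
    return xp.mean(x, axis=0)


out = temporal_mean(RemoteArray((92, 300, 400)))
```

Code written against the namespace, like `temporal_mean` here, works for any array type that provides the same entry point, which is the property that would let a container library treat a remote datacube like any other array.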

For OpenEO datacube integration with Xarray, there are a few more things we could consider beyond the Array API:

  • Integrate the OpenEO remote execution framework with Xarray, i.e., via the DataArray.compute(), DataArray.persist() and/or DataArray.load() methods. These methods are currently specific to Dask or Xarray's IO backends, so we would need to make them more abstract on the Xarray side.
  • Integrate label-based filtering / indexing of OpenEO datacubes with Xarray, i.e., via DataArray.sel() and alignment. This could probably be done now with a custom Xarray index.
  • Propagate dimension names from Xarray to OpenEO datacubes, e.g., how to call openeo.DataCube.mean_time() via xarray.DataArray.mean("time")? This is less clear to me, as Xarray handles the labels independently of any encapsulated array. Maybe this would be possible if an OpenEO datacube holds a mapping of axis position -> dimension name (I guess this would be required for implementing the Array API standard?) and then lets Xarray handle the labels... but I don't know if that is a silly idea?
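The last two points can be sketched together: a wrapper that keeps an axis position -> dimension name mapping, resolves a named reduction to an axis, and turns a time slice into a temporal filter call. `NamedCube` and its `calls` log are hypothetical; the names "filter_temporal" and "mean" are recorded only as plain labels and no openeo method is actually invoked.

```python
class NamedCube:
    """Hypothetical wrapper pairing dimension names with axis positions."""

    def __init__(self, dims, calls=()):
        self.dims = tuple(dims)    # position in the tuple == axis number
        self.calls = tuple(calls)  # log of back-end calls that would be made

    def mean(self, dim):
        # Resolve the dimension name to an axis; raises ValueError if unknown.
        axis = self.dims.index(dim)
        remaining = self.dims[:axis] + self.dims[axis + 1:]
        return NamedCube(remaining, self.calls + (("mean", dim, axis),))

    def sel(self, time):
        # Only slices along "time" are handled in this sketch.
        if not isinstance(time, slice):
            raise TypeError("expected a slice of start/stop dates")
        call = ("filter_temporal", time.start, time.stop)
        return NamedCube(self.dims, self.calls + (call,))


cube = NamedCube(("time", "y", "x"))
out = cube.sel(time=slice("2017-03-01", "2017-04-01")).mean("time")
```

After these two steps `out.dims` no longer contains "time" and `out.calls` lists the temporal filter followed by the reduction, which is the information a back-end would need to build the corresponding remote request.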

@soxofaan
Member

Interesting idea indeed. It could be valuable to align with some kind of standardized array/cube API.

@clausmichele
Member

@benbovy maybe you would be interested in the client-side processing activity we are working on. The main idea is to allow the user to process data with openEO processes locally, and not only in the cloud, using the Xarray implementations of the openEO processes.
The work is in PR #338, and a draft notebook showing some of the implemented functionality is rendered here: https://gistcdn.githack.com/clausmichele/9e2cf9589f6392262bc8626bb7e12a32/raw/b487860a4e8cdb2e7dac6402796eb8fbdefca2ce/client_side_proc_sample.html
