API Design for Xarray Backends #1970

Open
jhamman opened this issue Mar 6, 2018 · 9 comments

Comments

@jhamman
Member

jhamman commented Mar 6, 2018

It has come time to formalize the API for Xarray backends. We now have the following backends implemented in xarray:

| Backend        | Read | Write |
|----------------|------|-------|
| netcdf4-python | x    | x     |
| h5netcdf       | x    | x     |
| pydap          | x    |       |
| pynio          | x    |       |
| scipy          | x    | x     |
| rasterio*      | x    |       |
| zarr           | x    | x     |

* currently does not inherit from backends.AbstractDatastore

And there are conversations about adding additional backends, for example:

However, as anyone who has worked on implementing or optimizing any of our current backends can attest, the existing DataStore API is not particularly user/developer friendly. @shoyer asked me to open an issue to discuss what a more user-friendly backend API would look like, so that is what this issue will be. I have left out a thorough description of the current API because, well, I don't think it can be done in a succinct manner (that's the problem).

Note that @shoyer started down an API refactor some time ago in #1087, but that effort has stalled, presumably because we don't have a well-defined set of development goals here.

cc @pydata/xarray

@shoyer
Member

shoyer commented Mar 6, 2018

Backend needs that have changed since I drafted an API refactor in #1087:

  • pickle support, required to support backends with dask distributed
  • caching for managing open files, required for efficiently loading data from multiple files at once (see file_manager.py)
  • locking, required for thread-safe operations. Note that xarray backends are only thread-safe after files are opened, not during open_dataset (see #4100, "open_dataset is not thread-safe").
  • support for array indexing modes, required for supporting various forms of indexing in xarray
  • customized conventions for encoding/decoding data (e.g., required for zarr)

It would be nice to figure out how to abstract away these details from backend authors as much as possible. Most of the ingredients for these features exist in xarray.backends (e.g., see CachingFileManager), but the lack of a clean separation between internal and public APIs makes it hard to write backends outside of xarray.
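
To make the pickle and locking requirements above concrete, here is a minimal standard-library sketch. It is not xarray's implementation and does not use CachingFileManager; the class name `LazyFileStore` and its methods are hypothetical, chosen only for illustration. The idea is that a store pickles just the information needed to reopen its file (open handles and locks cannot be pickled), and serializes reads through a lock.

```python
import os
import pickle
import tempfile
import threading


class LazyFileStore:
    """Hypothetical store: opens its file lazily and survives pickling."""

    def __init__(self, path, mode="rb"):
        self._path = path
        self._mode = mode
        self._lock = threading.Lock()
        self._file = None  # opened on first read

    def _acquire(self):
        # Call only while holding the lock.
        if self._file is None:
            self._file = open(self._path, self._mode)
        return self._file

    def read(self, offset, size):
        # The lock makes seek + read atomic with respect to other threads.
        with self._lock:
            f = self._acquire()
            f.seek(offset)
            return f.read(size)

    def close(self):
        with self._lock:
            if self._file is not None:
                self._file.close()
                self._file = None

    def __getstate__(self):
        # Keep only what is needed to reopen the file elsewhere.
        return {"path": self._path, "mode": self._mode}

    def __setstate__(self, state):
        # Recreate the lock and defer reopening until the first read.
        self.__init__(state["path"], state["mode"])


# Demo: write a small file, round-trip the store through pickle, read from both.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")

store = LazyFileStore(tmp.name)
clone = pickle.loads(pickle.dumps(store))
first = store.read(2, 3)
second = clone.read(2, 3)
store.close()
clone.close()
os.remove(tmp.name)
```

A real backend would also need a cache of open handles shared across stores, which this sketch leaves out.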

@rabernat
Contributor

rabernat commented Mar 6, 2018

What is the role of the netCDF API in the backend API?

My understanding of the point of h5netcdf was to provide a netCDF-like interface for HDF5, thereby making it easier to interface with xarray. So one potential answer to the backend API question is simply: make a netCDF-like interface for your library and then xarray can use it.

However we still need a separate h5netcdf backend within xarray, so this design is perhaps not as clean as we would like.

@shoyer
Member

shoyer commented Mar 6, 2018

> What is the role of the netCDF API in the backend API?

A netCDF-like API is a good starting place for xarray backends, since our data model is strongly modeled on netCDF. But that's not quite unambiguous enough for us. There are lots of details like indexing, dtypes and locking that need awareness of both how xarray works and the specific backend. So I think we are unlikely to be able to eliminate the need for adapter classes.
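
As a rough illustration of what such an adapter class might look like, here is a self-contained sketch. Every name in it (`FakeFile`, `FakeVariable`, `StoreAdapter`, and the method names) is hypothetical and stands in for a real netCDF-like library and for whatever interface xarray would actually define. The point is only that the adapter, not the underlying library, owns concerns such as locking around indexing.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class FakeVariable:
    """Stand-in for a variable exposed by a netCDF-like library."""
    dimensions: tuple
    data: list
    attrs: dict = field(default_factory=dict)


class FakeFile:
    """Stand-in for a netCDF-like file handle."""

    def __init__(self, variables, attrs=None):
        self.variables = variables
        self.attrs = attrs or {}


class StoreAdapter:
    """Hypothetical adapter between a netCDF-like handle and a consumer."""

    def __init__(self, handle, lock=None):
        self._handle = handle
        self._lock = lock if lock is not None else threading.Lock()

    def get_attrs(self):
        # Return a copy so callers cannot mutate the handle's attributes.
        return dict(self._handle.attrs)

    def get_dimensions(self, name):
        return self._handle.variables[name].dimensions

    def get_values(self, name, key):
        # Indexing goes through the lock, since the wrapped library
        # is not assumed to be thread-safe.
        with self._lock:
            return self._handle.variables[name].data[key]


handle = FakeFile(
    {"t": FakeVariable(("time",), [10, 20, 30, 40], {"units": "K"})},
    {"title": "demo"},
)
adapter = StoreAdapter(handle)
values = adapter.get_values("t", slice(1, 3))
dims = adapter.get_dimensions("t")
```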

> My understanding of the point of h5netcdf was to provide a netCDF-like interface for HDF5, thereby making it easier to interface with xarray.

Yes, this was a large point of h5netcdf, although there are also users of h5netcdf without xarray. The main reason why it's a separate project is to facilitate separation of concerns: xarray backends should be about how to adapt storage systems to work with xarray, not focused on details of another file format.

h5netcdf is now up to about 1500 lines of code (including tests), and that's definitely big enough that I'm happy I wrote it as a separate project. The full netCDF4 data model turns out to involve a fair amount of nuance.

Alternatively, if adaptation to the netCDF data model is easy (e.g., <100 lines of code), then it may not be worth the separate package. This is currently the case for zarr.

@darothen

@jhamman What do you think would be involved in fleshing out the integration between xarray and rasterio in order to output cloud-optimized GeoTiffs? I

@jhamman
Member Author

jhamman commented Mar 13, 2018

@darothen - not sure exactly. We probably just need to put in the machinery for the to_rasterio method. I'm also not sure what necessary considerations would need to be made for the cloud part -- presumably that would fall on rasterio.

@ebo

ebo commented Mar 14, 2018

Not sure what would be involved, but I consistently have to roll my own (typically with GDAL post-processing) to save to GeoTIFFs. One thing missing from the reads so far is that the attributes include only the standard metadata and not user-defined metadata. In particular, if I use DG WV02 imagery, I have not yet figured out how to access the sun-satellite geometry. Even having a first pass for xarray.to_geotiff would be helpful.

@jhamman
Member Author

jhamman commented Jul 26, 2019

@danielballan wrote a blog post on how entry points can be used by third-party libraries to register plugins: https://blog.danallan.com/posts/2019-07-24-use-entrypoints-more/. I'm thinking this could be a particularly convenient way to allow libraries to register new backend engines to open_dataset in a standard way.

Of course, we still need to come up with a standard API for backends but the entrypoint idea could be the solution to hooking them into xarray.
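
For reference, discovering plugins through entry points could look roughly like the sketch below, using only the standard library's `importlib.metadata`. The group name `"xarray.backends"` and the helper `discover_engines` are assumptions made for illustration; nothing in this thread fixes either one.

```python
from importlib.metadata import entry_points


def discover_engines(group="xarray.backends"):
    """Return {engine_name: entry_point} for plugins registered under `group`.

    The default group name is an assumption, not an agreed-upon convention.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a dict mapping group name to entry points.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


engines = discover_engines()
# Each entry point can be loaded lazily, only when its engine is requested:
# backend_cls = engines["some_engine"].load()
```

A third-party package would declare its entry point in its packaging metadata under the same group, and loading would stay deferred until a user actually asks for that engine.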

@snowman2
Contributor

snowman2 commented Jul 5, 2020

Reference for writing from xarray to GeoTIFF or any other GDAL supported format: https://corteva.github.io/rioxarray/stable/examples/convert_to_raster.html

@shoyer
Member

shoyer commented Oct 6, 2020

I wrote up a proposal for grouping together decoding options into a single argument: #4490. Feedback would be very welcome!
