Enriching datacube STAC items with more array metadata #18

Open
ghidalgo3 opened this issue Jul 24, 2024 · 2 comments

@ghidalgo3

Hello, I am interested in expanding the datacube STAC extension to support more multidimensional array metadata for assets, particularly array metadata found in NetCDF, HDF5, and GRIB2 files. I think I'm caught up on the great discussions of the past:

And the STAC catalog items that I've been working with are all hosted on Microsoft's Planetary Computer platform, specifically:

For context, my goal is to one day be able to do something like this with Xarray:

```python
>>> import xarray as xr
>>> items = stac_catalog.search(...)
>>> vds = xr.open_mfdataset(items, engine="stac")  # This call should do no I/O!
>>> vds
<xarray.Dataset> Size: 1GB
Dimensions:  (lat: 600, lon: 1440, time: 360)
Coordinates:
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * time     (time) float64 3kB 3.6e+04 3.6e+04 3.6e+04 ... 3.636e+04 3.636e+04
Data variables:
    pr       (time, lat, lon) float32 1GB ManifestArray<shape=(360, 600, 1440...
>>> vds["pr"].sum()  # Here the I/O runs
42.0
```

In that example, assume that STAC items returned in the search contain assets which are the files themselves. I don't want to actually read the asset, I want the STAC item to contain enough information to create a manipulable dataset that Xarray understands. Reading comes after searching, merging, filtering, and projecting away the variables I'm not interested in.
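
For concreteness, here is roughly the datacube metadata a STAC item carries today (a hand-written sketch with invented values, not copied from a real item):

```python
# Sketch of today's datacube extension fields on a STAC item (values invented).
item_properties = {
    "cube:dimensions": {
        "time": {"type": "temporal", "extent": ["1950-01-01T00:00:00Z", "1979-12-01T00:00:00Z"]},
        "lat": {"type": "spatial", "axis": "y", "extent": [-59.875, 89.875]},
        "lon": {"type": "spatial", "axis": "x", "extent": [0.125, 359.875]},
    },
    "cube:variables": {
        "pr": {"dimensions": ["time", "lat", "lon"], "type": "data"},
    },
    # Nothing here records the dtype, chunk grid, fill value, or compression
    # of "pr", so a reader cannot construct the lazy arrays without opening
    # the asset itself.
}
```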


This proposal is heavily based on Zarr v3, though I believe any multidimensional array handling system needs the same information.

I propose the following additional properties, applying only to cube:variables:

| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| data_type | string | A numpy-parseable datatype |
| chunk_shape | [number] | The size of a chunk, by element count |
| fill_value | number \| string \| null | Needed to handle sparse arrays |
| dimensions | [string] | (optional) The subset of cube:dimensions that index this variable. If not set, all dimensions index this variable. This may happen with single GRIB2 files that contain multiple datacubes. |
| codecs | [object] | An ordered list of codec configurations |

A new property that applies to either cube:variables or cube:dimensions:

| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| attrs | object | Key-value attributes lifted from the original source file. |
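
Put together, an enriched cube:variables entry might look like the following sketch (values are invented, and the codec objects borrow Zarr v3's name/configuration shape):

```python
# Hypothetical enriched "cube:variables" entry (illustrative values only).
cube_variables = {
    "pr": {
        "dimensions": ["time", "lat", "lon"],
        "type": "data",
        "data_type": "float32",          # numpy-parseable datatype
        "chunk_shape": [12, 600, 1440],  # elements per chunk, per dimension
        "fill_value": "NaN",
        "codecs": [                      # ordered, Zarr v3 style
            {"name": "bytes", "configuration": {"endian": "little"}},
            {"name": "zstd", "configuration": {"level": 3}},
        ],
        "attrs": {"units": "mm/day", "long_name": "precipitation"},
    },
}
```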

In the previous discussion on this topic (#8), a suggestion was made to use the files extension to store chunk metadata, but I don't think that extension is appropriate for this purpose. Similarly, I don't think the Bands RFC (radiantearth/stac-spec#1254) addresses this problem; it is solving something entirely different.

CC @TomAugspurger: we can handle chunk manifests later; they are ultimately just assets. Similarly, coordinate transforms are a separate concern and probably better left until GeoZarr standardizes them.

I'd like to know your thoughts on this proposal, or whether this is something worth putting into a hypothetical Zarr extension instead. IMO, the only thing that is very Zarr-specific is the codecs property; everything else maps cleanly onto the underlying source files (and even then, the files themselves define codecs too, though they may not call them that).

@TomAugspurger
Contributor

Thanks. Having all the information needed to construct a Dataset from a STAC item / list of items would be great.

Some comments on the proposed fields:

  1. The raster extension defines a list of data types at https://github.com/stac-extensions/raster?tab=readme-ov-file#data-types. In general, the STAC metadata should probably reuse values already defined by other STAC extensions, and then particular applications (like this xarray STAC engine) can map from the STAC names to the names they need (NumPy dtypes).
  2. For chunk_shape, there is a proposed ZEP for variable chunking: https://zarr.dev/zeps/draft/ZEP0003.html. Instead of a list[int] with a length equal to the number of dimensions, you give a list[list[int]], where the number of inner lists matches the number of dimensions and the length of each inner list is the number of chunks along that dimension (see the sketch after this list). IMO a list[number] is fine, and we can generalize if/when that ZEP is accepted.
  3. For the "optional" part of dimensions, do you have a recommendation for how to interpret dimensions being None? Does that mean all the dims in cube:dimensions apply?
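
To make the chunk_shape options in item 2 concrete, here is a small sketch (the dimension sizes reuse the example dataset above; the variable-chunk split is invented for illustration):

```python
# Regular chunking (list[int]): one chunk size per dimension.
chunk_shape = [12, 600, 1440]

# Variable chunking per ZEP0003 (list[list[int]]): one inner list per
# dimension, listing the size of every chunk along that dimension.
# Here time (360) is split into chunks of 100 + 100 + 160, while lat
# and lon are each a single chunk.
chunk_shape = [[100, 100, 160], [600], [1440]]
```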

@ghidalgo3
Author

  1. Agreed, we should re-use the numeric datatypes from the raster extension.
  2. I suppose it doesn't add any complexity to the extension to specify that chunk_shape can be either a list[int] or a list[list[int]]. The first case will be the most common, but if/when that ZEP is accepted, the extension will already support variable chunking.
  3. Actually I lost my reasoning for making dimensions optional. The current spec says:

> REQUIRED. The dimensions of the variable. This should refer to keys in the cube:dimensions object or be an empty list if the variable has no dimensions.

Maybe I was trying to avoid a possible ambiguity about which cube:dimensions applies: the asset-level dimensions or the item-level dimensions? But until I can recover that line of thought, I withdraw the proposed changes to dimensions.
