Enriching datacube STAC items with more array metadata #18

Open
ghidalgo3 opened this issue Jul 24, 2024 · 2 comments

@ghidalgo3

Hello, I am interested in expanding the datacube STAC extension to support more multidimensional array metadata for assets, particularly array metadata found in NetCDF, HDF5, and GRIB2 files. I think I'm caught up on the great discussions of the past:

And the STAC catalog items that I've been working with are all hosted on Microsoft's Planetary Computer platform, specifically:

For context, my goal is to one day be able to do something like this with Xarray:

```python
>>> import xarray as xr
>>> items = stac_catalog.search(...)
>>> vds = xr.open_mfdataset(items, engine="stac")  # This call should do no I/O!
>>> vds
<xarray.Dataset> Size: 1GB
Dimensions:  (lat: 600, lon: 1440, time: 360)
Coordinates:
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * time     (time) float64 3kB 3.6e+04 3.6e+04 3.6e+04 ... 3.636e+04 3.636e+04
Data variables:
    pr       (time, lat, lon) float32 1GB ManifestArray<shape=(360, 600, 1440...
>>> vds["pr"].sum()  # Here the I/O runs
42.0
```

In that example, assume that STAC items returned in the search contain assets which are the files themselves. I don't want to actually read the asset, I want the STAC item to contain enough information to create a manipulable dataset that Xarray understands. Reading comes after searching, merging, filtering, and projecting away the variables I'm not interested in.
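
For concreteness, here is roughly the datacube metadata a STAC item carries today (a hand-written sketch with invented values, not copied from a real item):

```python
# Sketch of today's datacube extension fields on a STAC item (values invented).
item_properties = {
    "cube:dimensions": {
        "time": {"type": "temporal", "extent": ["1950-01-01T00:00:00Z", "1979-12-01T00:00:00Z"]},
        "lat": {"type": "spatial", "axis": "y", "extent": [-59.875, 89.875]},
        "lon": {"type": "spatial", "axis": "x", "extent": [0.125, 359.875]},
    },
    "cube:variables": {
        "pr": {"dimensions": ["time", "lat", "lon"], "type": "data"},
    },
    # Nothing here records the dtype, chunk grid, fill value, or compression
    # of "pr", so a reader cannot construct the lazy arrays without opening
    # the asset itself.
}
```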


This proposal is heavily based on Zarr v3, though I believe any multidimensional array handling system needs the same information.

I propose the following additional properties, applying only to cube:variables:

| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| data_type | string | A numpy-parseable datatype |
| chunk_shape | [number] | The size of a chunk, by element count |
| fill_value | number \| string \| null | Needed to handle sparse arrays |
| dimensions | [string] | (optional) The subset of cube:dimensions that index this variable. If not set, all dimensions index this variable. This may happen with single GRIB2 files that contain multiple datacubes. |
| codecs | [object] | An ordered list of codec configurations |

A new property that applies to either cube:variables or cube:dimensions:

| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| attrs | object | Key-value attributes lifted from the original source file. |
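
Put together, an enriched cube:variables entry might look like the following sketch (values are invented, and the codec objects borrow Zarr v3's name/configuration shape):

```python
# Hypothetical enriched "cube:variables" entry (illustrative values only).
cube_variables = {
    "pr": {
        "dimensions": ["time", "lat", "lon"],
        "type": "data",
        "data_type": "float32",          # numpy-parseable datatype
        "chunk_shape": [12, 600, 1440],  # elements per chunk, per dimension
        "fill_value": "NaN",
        "codecs": [                      # ordered, Zarr v3 style
            {"name": "bytes", "configuration": {"endian": "little"}},
            {"name": "zstd", "configuration": {"level": 3}},
        ],
        "attrs": {"units": "mm/day", "long_name": "precipitation"},
    },
}
```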

In the previous discussion on this topic (#8), a suggestion was made to use the files extension to store chunk metadata, but I don't think that extension is appropriate for this purpose. Similarly, I don't think the Bands RFC (radiantearth/stac-spec#1254) addresses this problem; it is solving something entirely different.

CC @TomAugspurger: we can handle chunk manifests later; they are ultimately just assets. Similarly, coordinate transforms are a separate concern and probably better left until GeoZarr standardizes them.

I'd like to know your thoughts on this proposal, or whether this is something worth putting into a hypothetical Zarr extension instead. IMO, the only thing that is very Zarr-specific is the codecs property; everything else maps cleanly onto the underlying source files (and even then, the files themselves define codecs too, though they may not call them that).

@TomAugspurger
Contributor

Thanks. Having all the information needed to construct a Dataset from a STAC item / list of items would be great.

Some comments on the proposed fields:

  1. The raster extension defines a list of data types at https://github.com/stac-extensions/raster?tab=readme-ov-file#data-types. In general, the STAC metadata should probably reuse values already defined by other STAC extensions, and then particular applications (like this xarray STAC engine) can map from the STAC names to the names they need (NumPy dtypes).
  2. For chunk_shape, there is a proposed ZEP for variable chunking: https://zarr.dev/zeps/draft/ZEP0003.html. Instead of a list[int] with a length equal to the number of dimensions, you give a list[list[int]], where the number of inner lists matches the number of dimensions and the length of each inner list is the number of chunks along that dimension (see the sketch after this list). IMO a list[number] is fine, and we can generalize if/when that ZEP is accepted.
  3. For the "optional" part of dimensions, do you have a recommendation for how to interpret dimensions being None? Does that mean all the dims in cube:dimensions apply?
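
To make the chunk_shape options in item 2 concrete, here is a small sketch (the dimension sizes reuse the example dataset above; the variable-chunk split is invented for illustration):

```python
# Regular chunking (list[int]): one chunk size per dimension.
chunk_shape = [12, 600, 1440]

# Variable chunking per ZEP0003 (list[list[int]]): one inner list per
# dimension, listing the size of every chunk along that dimension.
# Here time (360) is split into chunks of 100 + 100 + 160, while lat
# and lon are each a single chunk.
chunk_shape = [[100, 100, 160], [600], [1440]]
```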

@ghidalgo3
Author

  1. Agreed, we should re-use the numeric datatypes from the raster extension.
  2. I suppose it doesn't add any complexity to the extension to specify that chunk_shape can be either a list[int] or a list[list[int]]. The first case will be the most common, but if/when that ZEP is accepted, the extension will already support variable chunking.
  3. Actually I lost my reasoning for making dimensions optional. The current spec says:

> REQUIRED. The dimensions of the variable. This should refer to keys in the cube:dimensions object or be an empty list if the variable has no dimensions.

Maybe I was trying to avoid a possible ambiguity about which cube:dimensions applies: the asset-level dimensions or the item-level dimensions? But until I can recover that line of thought, I withdraw the proposed changes to dimensions.
