
Add cube:variables definition #6

Merged
merged 18 commits into stac-extensions:main from feature/variables
Jul 26, 2021

Conversation

TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented May 12, 2021

Apologies for the WIP here, but I haven't worked with JSON schema or a STAC spec before and could use some help. I'll leave specific points inline, but a few general questions:

  1. Is an optional cube:variables the right approach?
  2. Are people OK with requiring dimensions, a list of strings, on each variable?

For reference, I'm trying this out on the daymet dataset. If you have xarray, zarr, and adlfs installed, this should work:

import fsspec
import xarray as xr

store = fsspec.get_mapper('az://daymet-zarr/daily/hi.zarr', account_name="daymeteuwest")
ds = xr.open_zarr(store, consolidated=True)

print(ds)

which prints out the repr

<xarray.Dataset>
Dimensions:                  (nv: 2, time: 14600, x: 284, y: 584)
Coordinates:
    lat                      (y, x) float32 dask.array<chunksize=(292, 284), meta=np.ndarray>
    lon                      (y, x) float32 dask.array<chunksize=(292, 284), meta=np.ndarray>
  * time                     (time) datetime64[ns] 1980-01-01T12:00:00 ... 20...
  * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
  * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
Dimensions without coordinates: nv
Data variables:
    dayl                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    lambert_conformal_conic  int16 ...
    prcp                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    srad                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    swe                      (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(14600, 2), meta=np.ndarray>
    tmax                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    tmin                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    vp                       (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    yearday                  (time) int16 dask.array<chunksize=(14600,), meta=np.ndarray>
Attributes:
    Conventions:       CF-1.6
    Version_data:      Daymet Data Version 4.0
    Version_software:  Daymet Software Version 4.0
    citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
    references:        Please see http://daymet.ornl.gov/ for current informa...
    source:            Daymet Software Version 4.0
    start_year:        [1980]

cc @m-mohr if you have a chance to provide feedback here I'd be extremely grateful.

Closes #1

@m-mohr m-mohr marked this pull request as draft May 12, 2021 14:36
Comment on lines 33 to 40
"data": {
"href": "az://daymet-zarr/annual/hi.zarr",
"title": "Zarr store",
"type": "application/fsspec/zarr",
"description": "Root URL of the Zarr store",
"roles": [
"data"
]
Contributor Author

This is likely wrong, but the intent is to provide a URL that can be loaded with xarray / zarr, which represents the entire collection / dataset.

store = fsspec.get_mapper('az://daymet-zarr/daily/hi.zarr', account_name="daymeteuwest")
ds = xr.open_zarr(store, consolidated=True)

That said, I have no idea what the right media type would be for that.

Contributor

If there's no media type and we can't agree on one with zarr folks, we may just not provide a type for now and instead specify a zarr role, e.g. roles: ["data", "zarr-collection"] or however this should be called...

By the way, what is the az:// protocol?

Contributor Author

By the way, what is the az:// protocol?

It's specific to fsspec / adlfs, unfortunately: https://github.com/dask/adlfs/. No clients outside of that ecosystem would (correctly) interpret it.

]
}
},
"cube:coordinates": {
Contributor

What is cube:coordinates?

Contributor Author

@TomAugspurger TomAugspurger May 12, 2021

This is currently invalid, according to the updated schema.

As described in https://xarray.pydata.org/en/stable/user-guide/terminology.html, this is a non-dimension coordinate. https://xarray.pydata.org/en/stable/examples/multidimensional-coords.html has an example, but it's trying to describe an n-dimensional array that's really a coordinate (rather than an actual data value). The example there is something like latitude, which depends on the (y, x) coordinates.

My suggestion: just put it in cube:variables and somehow mark it as a coordinate with a type field that's either data or coordinate. Does that sound reasonable?

Contributor

@m-mohr m-mohr May 12, 2021

I've found another comment above that describes the issue; I've replied there:

In terms of this extension, coordinates are "Additional dimensions" with type set to "coordinate" (or so), if I understand it correctly.

But maybe I'm not understanding correctly. The data cubes I work with don't have the variables and coordinates concepts. We only have dimensions.

Contributor Author

@TomAugspurger TomAugspurger May 13, 2021

Yes, now that I look again I think you're right... Even in the xarray repr above (#6 (comment)), lat is under "Coordinates" rather than "Data variables". My apologies.

One thing, though: do you have thoughts on allowing (not requiring) the objects in cube:coordinates to include a dimensions field? To correctly describe this dataset, that would need to be allowed.

Contributor

@m-mohr m-mohr May 13, 2021

One thing, though: do you have thoughts on allowing (not requiring) the objects in cube:coordinates to include a dimensions field? To correctly describe this dataset, that would need to be allowed.

I'm confused. I thought coordinates goes away and will be dimensions? And is this the same question as discussed in #6 (comment), last part?

I'm actually also confused what "coordinates" are useful for? Isn't it implicitly clear that the spatial dimensions form the coordinates?

Contributor Author

Sorry, I meant allowing cube:dimensions objects to themselves have dimensions. You're correct that in my latest commit cube:coordinates is gone.

I'm actually also confused what "coordinates" are useful for? Isn't it implicitly clear that the spatial dimensions form the coordinates?

Maybe a bit clearer to say that the values along the spatial (and time, and additional) dimensions form the coordinates. The coordinates are the actual labels / values along that dimension.

In my example dataset, things like latitude and longitude aren't actually a dimension of any of the variables (note in the repr that none of the variables include them as a dimension; e.g. prcp has (time, y, x)). The netCDF docs call this an auxiliary coordinate and xarray calls it a non-dimension coordinate.

I don't know if @rabernat has thoughts here or can ping an xarray / netCDF developer who can explain this better than I am. I'm still hazy on these concepts.
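To make the distinction concrete, here is a small synthetic sketch in xarray (illustrative only, not the Daymet data): `lat` and `lon` vary over `(y, x)` but are not themselves dimensions of any variable.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a curvilinear grid: lat/lon are 2-D coordinates
# that depend on (y, x), but they are not dimensions of the data variable.
ds = xr.Dataset(
    data_vars={"prcp": (("time", "y", "x"), np.zeros((2, 3, 4)))},
    coords={
        "time": np.arange(2),
        "y": np.arange(3),
        "x": np.arange(4),
        "lat": (("y", "x"), np.random.rand(3, 4)),
        "lon": (("y", "x"), np.random.rand(3, 4)),
    },
)

print("lat" in ds.coords)      # True: lat is a (non-dimension) coordinate
print("lat" in ds.dims)        # False: it is not a dimension
print("lat" in ds.data_vars)   # False: it is not a data variable
```

This is exactly the case `cube:variables` with `"type": "auxiliary"` is meant to capture.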

Contributor

Me neither. Our data cubes don't have these concepts; we just have dimensions (and potentially variables derived from these dimensions), but I don't really get the point behind those coordinates.

@rabernat rabernat May 13, 2021

The concept of coordinates comes from the CF Conventions, the gold standard for metadata in weather and climate science.

The commonest use of coordinate variables is to locate the data in space and time, but coordinates may be provided for any other continuous geophysical quantity (e.g. density, temperature, radiation wavelength, zenith angle of radiance, sea surface wave frequency) or discrete category (see Section 4.5, "Discrete Axis", e.g. area type, model level number, ensemble member number) on which the data variable depends.

A special type of coordinate is a "coordinate variable", which is 1D and has the same name as a dimension. But coordinates can be ND. CF calls these "auxiliary coordinate variables", and Xarray calls these "non-dimension coordinates." For example, for a 3D datacube of temperature on a curvilinear grid, the dimensions might be time, y, x, and we would typically have coordinates lat(y, x) and lon(y, x). Note that this treatment of coordinates is considerably different from the typical geospatial raster scenario, which would instead specify a projection rather than explicit arrays of lon / lat. But this is ubiquitous in weather / climate data.

At a pure data level, coordinates are just variables. But at a metadata level, they are treated differently by analysis libraries. For example, in xarray, if I add two datasets together

ds1 + ds2

Their data variables will be added, but their coordinates will not. It would not make sense to add two latitudes.

Pure NetCDF and Zarr (without CF conventions) do not have an explicit way to mark a variable as a coordinate. Xarray follows CF conventions and decodes a variable to the list of coordinates (as opposed to data variables) if it meets either of the two criteria:

  • it is 1D and has the same name as a dimension (coordinate variable)
  • it appears in another variable's coordinates attribute (auxiliary coordinate variable)

I am not certain that the notion of CF coordinates needs to be in the datacube extension. To answer this, I would want to better understand the relationship between this extension and the CF conventions.
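Those two decoding criteria can be demonstrated with a small synthetic example (a sketch assuming a recent xarray; not taken from the discussion). Here `lat` and `lon` start out as plain variables, but because `temp` names them in its CF `coordinates` attribute, `xr.decode_cf` promotes them to coordinates.

```python
import numpy as np
import xarray as xr

# Raw dataset as a plain NetCDF/Zarr reader would see it: "lat" and "lon"
# are ordinary variables, but "temp" lists them in its CF "coordinates" attr.
raw = xr.Dataset(
    {
        "temp": (("y", "x"), np.ones((2, 3)), {"coordinates": "lat lon"}),
        "lat": (("y", "x"), np.zeros((2, 3))),
        "lon": (("y", "x"), np.zeros((2, 3))),
    }
)
decoded = xr.decode_cf(raw)

print(sorted(decoded.coords))     # ['lat', 'lon'] -- promoted to coordinates
print(list(decoded.data_vars))    # ['temp'] -- the only remaining data variable

# Arithmetic operates on data variables; coordinates pass through unchanged.
doubled = decoded + decoded
print(float(doubled["temp"].sum()))  # 12.0 (2 * 6 ones)
```

This is the behavior Ryan describes: at the data level coordinates are just variables, but once decoded, analysis libraries treat them differently.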

@m-mohr
Contributor

m-mohr commented May 12, 2021

Apologies for the WIP here

No problem, but I made this a draft PR now.

but I haven't worked with JSON schema or a STAC spec before and could use some help.

I'm happy to help, especially with JSON Schema.

  1. Is an optional cube:variables the right approach?

Yes, sounds good.

  2. Are people OK with requiring dimensions, a list of strings, on each variable?

I am, but I'm not using variables, so others should weigh in.

@rabernat

Is every netCDF dataset a datacube? I think we need to answer this question.

As an alternative to squeezing every existing netCDF dataset into the datacube extension, which will clearly pose a challenge to the fairly limited scope of this extension, perhaps we could consider a new, distinct extension called "CF". The CF Conventions are the data model that will meet the needs of the weather and climate community. All our important datasets, and 99% of netCDF files out there in the wild, comply with it. Lots of thought has gone into it by people who deeply understand our datasets.

@TomAugspurger
Contributor Author

perhaps we could consider a new, distinct extension called "CF".

I've been wondering about that too. There seem to be some things that don't quite align between different groups' definitions. I'm not sure if that's just a translation issue, or if it reflects something deeper.

Perhaps I'll take a bit of time to sketch out a cf extension, which will give me a chance to better understand JSON Schema and STAC extensions. With concrete examples we'll be able to better evaluate how to handle the overlap between various extensions. I do worry a bit about fragmentation among datacube-like things, so I won't be sad if a cf extension ends up not being needed.

@m-mohr
Contributor

m-mohr commented May 13, 2021

You could also just extend the datacube extension by adding a cf:coordinates that uses things from this extension. They can surely co-exist and work together.

@TomAugspurger
Contributor Author

Inspired by Ryan's comment:

At a pure data level, coordinates are just variables.

I've made another change to facilitate "auxiliary" coordinates ("non-dimension" coordinates in xarray terms) in 0d6c550. Now all cube:variables must have a type that is either data or auxiliary. So for precipitation, a data variable, we have

    "prcp": {
      "type": "data",
      "description": "The total accumulated precipitation over the monthly period of the daily total precipitation. Sum of all forms of precipitation converted to a water-equivalent depth.",
      "extent": [0, null],
      "unit": "mm",
      "dimensions": ["time", "y", "x"]
    }
 

While for lat, an auxiliary coordinate, we have

    "lat": {
      "type": "auxiliary",
      "extent": [17.960035, 23.512327],
      "description": "latitude coordinate",
      "unit": "degrees_north",
      "dimensions": ["y", "x"]
    }
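To make the constraints on these objects concrete, here is a hypothetical pure-Python checker for the two Variable Objects above. The helper name is invented for illustration; the authoritative validation lives in the extension's JSON Schema, not here.

```python
# Hypothetical sketch of the Variable Object constraints discussed above:
# "type" must be "data" or "auxiliary", "dimensions" must be a (possibly
# empty) list of strings, and "extent" (if present) is a [min, max] pair.
def validate_variable(name, obj):
    errors = []
    if obj.get("type") not in ("data", "auxiliary"):
        errors.append(f"{name}: 'type' must be 'data' or 'auxiliary'")
    dims = obj.get("dimensions")
    if not isinstance(dims, list) or not all(isinstance(d, str) for d in dims):
        errors.append(f"{name}: 'dimensions' must be a list of strings")
    extent = obj.get("extent")
    if extent is not None and len(extent) != 2:
        errors.append(f"{name}: 'extent' must be a two-element [min, max] array")
    return errors

prcp = {"type": "data", "extent": [0, None], "unit": "mm",
        "dimensions": ["time", "y", "x"]}
lat = {"type": "auxiliary", "extent": [17.960035, 23.512327],
       "unit": "degrees_north", "dimensions": ["y", "x"]}

assert validate_variable("prcp", prcp) == []
assert validate_variable("lat", lat) == []
assert validate_variable("bad", {"type": "other", "dimensions": []}) != []
```

Note that `null` in the JSON examples maps to Python's `None`, which is why `[0, null]` (an open upper bound) passes the extent check.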

@TomAugspurger TomAugspurger changed the title [WIP]: Add cube:variables definition Add cube:variables definition May 14, 2021
@TomAugspurger TomAugspurger marked this pull request as ready for review May 14, 2021 11:13
@TomAugspurger
Contributor Author

@m-mohr I'm reasonably happy with putting these "auxiliary" / non-dimension coordinates in cube:variables with a type: "auxiliary". What do you think about requiring type for all the values in cube:variables? Since this will be type: "data" for most variables, does it make sense to not require it and let consumers assume / infer what it should be?

I think that CI should pass now. Since this is my first PR GitHub requires maintainer approval to run the CI.

@m-mohr
Contributor

m-mohr commented May 14, 2021

I have no strong opinion on the variables, as they don't exist in the data cubes we use. So hopefully others will weigh in on whether that works for their environment, too.

@TomAugspurger
Contributor Author

@schwehr you commented on #1. Do you have time to look through this proposal for variables to see if it fits your needs?

And maybe @tomkralidis and @jhamman might be interested, since you were involved in radiantearth/stac-spec#713.

@m-mohr
Contributor

m-mohr commented May 14, 2021

I hope these people are not just all working with the same tools (e.g. netCDF) so that we get a variety of tools covered. ;-)

@rabernat

Also pinging CEDA folks, @agstephens @PhilKershaw, who are interested in this.

@rabernat

I hope these people are not just all working with the same tools (e.g. netCDF) so that we get a variety of tools covered. ;-)

Just a clarification that netCDF is not a tool but a data format. What other tools did you have in mind?

@m-mohr
Contributor

m-mohr commented May 14, 2021

Nothing specifically (but for example OGC Coverages, R stars, some other domains), I just want to ensure we get a variety of opinions across datacube tools, formats, models, ...

@TomAugspurger
Contributor Author

TomAugspurger commented May 14, 2021

From the rstac & e-sensing/sits side, cc @gqueiroz, @gilbertocamara, @OldLipe. For reference, the datacube STAC extension currently defines a dimension object for describing a datacube. This pull request updates the extension to also define a variable object describing the values that might be in a datacube. https://github.com/stac-extensions/datacube/blob/c4f4819fe037586067c80c7b9ab7339f6e35dfaf/README.md#variable-object is probably the place to look.

@m-mohr
Contributor

m-mohr commented May 14, 2021

Let's wait for a couple of days and incorporate feedback as it comes in and then merge. If no feedback comes in, then I'll make a final review and we can probably take it as it is...

Contributor

@m-mohr m-mohr left a comment

Added some comments and requests for changes, especially about issues in the JSON Schema.

README.md (excerpt under review):
| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| dimensions | \[string] | **REQUIRED.** The dimensions of the variable. This should refer to keys in the `cube:dimensions` object or be an empty list if the variable has no dimensions. |
| type | string | **REQUIRED.** Type of the variable, either `data` or `auxiliary`. |
| description | string | Detailed multi-line description to explain the dimension. [CommonMark 0.29](http://commonmark.org/) syntax MAY be used for rich text representation. |
| extent | \[number\|string\|null] | If the dimension consists of [ordinal](https://en.wikipedia.org/wiki/Level_of_measurement#Ordinal_scale) values, the extent (lower and upper bounds) of the values as two-dimensional array. Use `null` for open intervals. |
Contributor

Hmm, a string extent would usually be for date-times, right? On the other hand, a step can't be specified as a duration like in the Temporal Dimension. The Variable Object definition seems to try to merge all the kinds of types for which we have different Dimension Objects into a single object, but that doesn't feel very clean. Maybe we need to split this up a bit more?

Contributor Author

Yes, if JSON / JSON Schema models dates as strings, then having strings here seems necessary. So I think step should be allowed to be a string (to be more precise, it should be an ISO 8601 duration if and only if extent is a datetime string, but I'm not sure how to write that in JSON Schema).

That said, I don't think that step is too useful for variables (they're extremely useful for dimensions / coordinates).

I'll defer to you on the best way to model this, but having a single JSON object for all variables feels natural to me. When you're using the data they all end up in n-dimensional arrays, just with different data types.
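One possible way to express that coupling is with JSON Schema's draft-07 `if`/`then`/`else` keywords. This is an untested sketch, not part of the actual schema; `format: "date-time"` is only an annotation unless format validation is enabled:

```json
{
  "if": {
    "required": ["extent"],
    "properties": {
      "extent": { "items": { "type": "string", "format": "date-time" } }
    }
  },
  "then": {
    "properties": { "step": { "type": ["string", "null"] } }
  },
  "else": {
    "properties": { "step": { "type": ["number", "null"] } }
  }
}
```

The `"required": ["extent"]` inside the `if` matters: without it, an object with no extent at all would satisfy the `if` vacuously and fall into the `then` branch.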

Contributor

So maybe we can simplify here and remove step from variables for now? I still need to think through whether to split the Variables Object or not...

Tom Augspurger added 3 commits May 27, 2021 06:20
* Fixed typos
* Dimension -> variable in variable descriptions
@TomAugspurger
Contributor Author

Anything else to do here @m-mohr? IIUC, the main outstanding point is a concern that variable objects might be doing a bit too much? (#6 (comment))

@m-mohr
Contributor

m-mohr commented Jun 3, 2021

The remaining point for me is to check whether the Variables Object should be split similar to the Dimensions (see #6 (comment)), but I'm pretty busy right now.

@TomAugspurger
Contributor Author

Perfect, thanks, just wanted to make sure I wasn't blocking anything. I'm comfortable working off this branch for now, so no rush on my account.

@TomAugspurger
Contributor Author

Hi Matthias, just a friendly ping here if you have time for a decision on #6 (comment). I'm still fine working off a branch, but wanted to make sure this was still somewhere on your probably too-long backlog.

FWIW, I think modeling Variables as a single object makes the most sense. In the context of a library like xarray, multiple datacubes will live together in a Dataset (a container for Variables like this). Each Variable is stored using the same Python object, a DataArray, regardless of their data type.

@m-mohr
Contributor

m-mohr commented Jun 30, 2021

Yes, sorry, I have it on my overly long to do list.

@m-mohr
Contributor

m-mohr commented Jul 23, 2021

Okay, I think this is all good. Thanks for working on it.

@m-mohr
Contributor

m-mohr commented Jul 23, 2021

@TomAugspurger One last issue: Could you please add a CHANGELOG entry?

@m-mohr m-mohr added this to the v2.0.0 milestone Jul 23, 2021
@TomAugspurger
Contributor Author

Thanks, done.

@m-mohr m-mohr merged commit 9bdb50e into stac-extensions:main Jul 26, 2021
@TomAugspurger
Contributor Author

Thanks for all your help with this @m-mohr!

@TomAugspurger TomAugspurger deleted the feature/variables branch July 26, 2021 11:47
@m-mohr
Contributor

m-mohr commented Jul 26, 2021

@TomAugspurger I'm about to fix #4 and then I'll probably release later today.

@m-mohr
Contributor

m-mohr commented Jul 26, 2021

Whoop, whoop: https://github.com/stac-extensions/datacube/releases/tag/v2.0.0

m-mohr added a commit that referenced this pull request Mar 15, 2022
Successfully merging this pull request may close these issues.

Variables
4 participants