
Add cube:variables definition #6

Merged
merged 18 commits into stac-extensions:main from feature/variables
Jul 26, 2021

Conversation

TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented May 12, 2021

Apologies for the WIP here, but I haven't worked with JSON schema or a STAC spec before and could use some help. I'll leave specific points inline, but a few general questions:

  1. Is an optional cube:variables the right approach?
  2. Are people OK with requiring dimensions, a list of strings, on each variable?

For reference, I'm trying this out on the daymet dataset. If you have xarray, zarr, and adlfs installed, this should work:

import fsspec
import xarray as xr

store = fsspec.get_mapper('az://daymet-zarr/daily/hi.zarr', account_name="daymeteuwest")
ds = xr.open_zarr(store, consolidated=True)

print(ds)

which prints out the repr

<xarray.Dataset>
Dimensions:                  (nv: 2, time: 14600, x: 284, y: 584)
Coordinates:
    lat                      (y, x) float32 dask.array<chunksize=(292, 284), meta=np.ndarray>
    lon                      (y, x) float32 dask.array<chunksize=(292, 284), meta=np.ndarray>
  * time                     (time) datetime64[ns] 1980-01-01T12:00:00 ... 20...
  * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
  * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
Dimensions without coordinates: nv
Data variables:
    dayl                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    lambert_conformal_conic  int16 ...
    prcp                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    srad                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    swe                      (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(14600, 2), meta=np.ndarray>
    tmax                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    tmin                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    vp                       (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    yearday                  (time) int16 dask.array<chunksize=(14600,), meta=np.ndarray>
Attributes:
    Conventions:       CF-1.6
    Version_data:      Daymet Data Version 4.0
    Version_software:  Daymet Software Version 4.0
    citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
    references:        Please see http://daymet.ornl.gov/ for current informa...
    source:            Daymet Software Version 4.0
    start_year:        [1980]

cc @m-mohr if you have a chance to provide feedback here I'd be extremely grateful.

Closes #1

@m-mohr m-mohr marked this pull request as draft May 12, 2021 14:36
Comment on lines 33 to 40
"data": {
"href": "az://daymet-zarr/annual/hi.zarr",
"title": "Zarr store",
"type": "application/fsspec/zarr",
"description": "Root URL of the Zarr store",
"roles": [
"data"
]
Contributor Author

This is likely wrong, but the intent is to provide a URL that can be loaded with xarray / zarr, which represents the entire collection / dataset.

store = fsspec.get_mapper('az://daymet-zarr/daily/hi.zarr', account_name="daymeteuwest")
ds = xr.open_zarr(store, consolidated=True)

That said, I have no idea what the right media type would be for that.

Contributor

If there's no media type and we can't agree on one with zarr folks, we may just not provide a type for now and instead specify a zarr role, e.g. roles: ["data", "zarr-collection"] or however this should be called...

By the way, what is the az:// protocol?

Contributor Author

By the way, what is the az:// protocol?

It's specific to fsspec / adlfs, unfortunately: https://github.com/dask/adlfs/. No clients outside of that ecosystem would (correctly) interpret it.

]
}
},
"cube:coordinates": {
Contributor

What is cube:coordinates?

Contributor Author

@TomAugspurger TomAugspurger May 12, 2021

This is currently invalid, according to the updated schema.

As described in https://xarray.pydata.org/en/stable/user-guide/terminology.html, this is a non-dimension coordinate. https://xarray.pydata.org/en/stable/examples/multidimensional-coords.html has an example, but it's trying to describe an n-dimensional array that's really a coordinate (rather than an actual data value). The example there is something like latitude, which depends on the (y, x) coordinates.

My suggestion: just put it in cube:variables and somehow mark it as a coordinate with a type field that's either data or coordinate. Does that sound reasonable?

Contributor

@m-mohr m-mohr May 12, 2021

I've found another comment above that describes the issue; I've replied there:

In terms of this extension, coordinates are "Additional dimensions" with type set to "coordinate" (or so), if I understand it correctly.

But maybe I'm not understanding correctly. The data cubes I work with don't have the variables and coordinates concepts. We only have dimensions.

Contributor Author

@TomAugspurger TomAugspurger May 13, 2021

Yes, now that I look again I think you're right... Even in the xarray repr above (#6 (comment)), lat is under "Coordinates" rather than "Data variables". My apologies.

One thing, though: do you have thoughts on allowing (not requiring) the objects in cube:coordinates to include a dimensions field? To correctly describe this dataset, that would need to be allowed.

Contributor

@m-mohr m-mohr May 13, 2021

One thing, though: do you have thoughts on allowing (not requiring) the objects in cube:coordinates to include a dimensions field? To correctly describe this dataset, that would need to be allowed.

I'm confused. I thought coordinates goes away and will be dimensions? And is this the same question as discussed in #6 (comment), last part?

I'm actually also confused what "coordinates" are useful for? Isn't it implicitly clear that the spatial dimensions form the coordinates?

Contributor Author

Sorry, I meant allowing cube:dimensions objects to themselves have dimensions. You're correct that in my latest commit cube:coordinates is gone.

I'm actually also confused what "coordinates" are useful for? Isn't it implicitly clear that the spatial dimensions form the coordinates?

Maybe a bit clearer to say that the values along the spatial (and time, and additional) dimensions form the coordinates. The coordinates are the actual labels / values along that dimension.

In my example dataset, things like latitude and longitude aren't actually a dimension of any of the variables (note in the repr that none of the variables include them as a dimension; e.g. prcp has (time, y, x)). The netCDF docs call this an auxiliary coordinate and xarray calls it a non-dimension coordinate.

I don't know if @rabernat has thoughts here or can ping an xarray / netCDF developer who can explain this better than I am. I'm still hazy on these concepts.
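To make the distinction concrete, here is a small synthetic sketch in xarray (illustrative only, not the Daymet data): `lat` and `lon` vary over `(y, x)` but are not themselves dimensions of any variable.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a curvilinear grid: lat/lon are 2-D coordinates
# that depend on (y, x), but they are not dimensions of the data variable.
ds = xr.Dataset(
    data_vars={"prcp": (("time", "y", "x"), np.zeros((2, 3, 4)))},
    coords={
        "time": np.arange(2),
        "y": np.arange(3),
        "x": np.arange(4),
        "lat": (("y", "x"), np.random.rand(3, 4)),
        "lon": (("y", "x"), np.random.rand(3, 4)),
    },
)

print("lat" in ds.coords)      # True: lat is a (non-dimension) coordinate
print("lat" in ds.dims)        # False: it is not a dimension
print("lat" in ds.data_vars)   # False: it is not a data variable
```

This is exactly the case `cube:variables` with `"type": "auxiliary"` is meant to capture.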

Contributor

Me neither. Our data cubes don't have these concepts; we just have dimensions (and potentially variables derived from these dimensions), but I don't really get the point behind those coordinates.

@rabernat rabernat May 13, 2021

The concept of coordinates comes from the CF Conventions, the gold standard for metadata in weather and climate science.

The commonest use of coordinate variables is to locate the data in space and time, but coordinates may be provided for any other continuous geophysical quantity (e.g. density, temperature, radiation wavelength, zenith angle of radiance, sea surface wave frequency) or discrete category (see Section 4.5, "Discrete Axis", e.g. area type, model level number, ensemble member number) on which the data variable depends.

A special type of coordinate is a "coordinate variable", which is 1D and has the same name as a dimension. But coordinates can be ND. CF calls these "auxiliary coordinate variables", and Xarray calls these "non-dimension coordinates." For example, for a 3D datacube of temperature on a curvilinear grid, the dimensions might be time, y, x, and we would typically have coordinates lat(y, x) and lon(y, x). Note that this treatment of coordinates is considerably different from the typical geospatial raster scenario, which would instead specify a projection rather than explicit arrays of lon / lat. But this is ubiquitous in weather / climate data.

At a pure data level, coordinates are just variables. But at a metadata level, they are treated differently by analysis libraries. For example, in xarray, if I add two datasets together

ds1 + ds2

Their data variables will be added, but their coordinates will not. It would not make sense to add two latitudes.

Pure NetCDF and Zarr (without CF conventions) do not have an explicit way to mark a variable as a coordinate. Xarray follows CF conventions and decodes a variable to the list of coordinates (as opposed to data variables) if it meets either of the two criteria:

  • it is 1D and has the same name as a dimension (coordinate variable)
  • it appears in another variable's coordinates attribute (auxiliary coordinate variable)

I am not certain that the notion of CF coordinates needs to be in the datacube extension. To answer this, I would want to better understand the relationship between this extension and the CF conventions.
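Those two decoding criteria can be demonstrated with a small synthetic example (a sketch assuming a recent xarray; not taken from the discussion). Here `lat` and `lon` start out as plain variables, but because `temp` names them in its CF `coordinates` attribute, `xr.decode_cf` promotes them to coordinates.

```python
import numpy as np
import xarray as xr

# Raw dataset as a plain NetCDF/Zarr reader would see it: "lat" and "lon"
# are ordinary variables, but "temp" lists them in its CF "coordinates" attr.
raw = xr.Dataset(
    {
        "temp": (("y", "x"), np.ones((2, 3)), {"coordinates": "lat lon"}),
        "lat": (("y", "x"), np.zeros((2, 3))),
        "lon": (("y", "x"), np.zeros((2, 3))),
    }
)
decoded = xr.decode_cf(raw)

print(sorted(decoded.coords))     # ['lat', 'lon'] -- promoted to coordinates
print(list(decoded.data_vars))    # ['temp'] -- the only remaining data variable

# Arithmetic operates on data variables; coordinates pass through unchanged.
doubled = decoded + decoded
print(float(doubled["temp"].sum()))  # 12.0 (2 * 6 ones)
```

This is the behavior Ryan describes: at the data level coordinates are just variables, but once decoded, analysis libraries treat them differently.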

@m-mohr
Contributor

m-mohr commented May 12, 2021

Apologies for the WIP here

No problem, but I made this a draft PR now.

but I haven't worked with JSON schema or a STAC spec before and could use some help.

I'm happy to help, especially with JSON Schema.

  1. Is an optional cube:variables the right approach?

Yes, sounds good.

  2. Are people OK with requiring dimensions, a list of strings, on each variable?

I am, but I'm not using variables, so others should weigh in.

@rabernat

Is every netCDF dataset a datacube? I think we need to answer this question.

As an alternative to squeezing every existing netCDF dataset into the datacube extension, which will clearly pose a challenge to the fairly limited scope of this extension, perhaps we could consider a new, distinct extension called "CF". The CF Conventions are the data model that will meet the needs of the weather and climate community. All our important datasets, and 99% of netCDF files out there in the wild, comply with it. Lots of thought has gone into it by people who deeply understand our datasets.

@TomAugspurger
Contributor Author

perhaps we could consider a new, distinct extension called "CF".

I've been wondering about that too. There seem to be some things that don't quite align between different groups' definitions. I'm not sure if that's just a translation issue, or if it reflects something deeper.

Perhaps I'll take a bit of time to sketch out a cf extension, which will give me a chance to better understand JSON Schema and STAC extensions. With concrete examples we'll be able to better evaluate how to handle the overlap between various extensions. I do worry a bit about fragmentation among datacube-like things, so I won't be sad if a cf extension ends up not being needed.

@m-mohr
Contributor

m-mohr commented May 13, 2021

You could also just extend the datacube extension by adding a cf:coordinates that uses things from this extension. They can surely co-exist and work together.

@TomAugspurger
Contributor Author

Inspired by Ryan's comment:

At a pure data level, coordinates are just variables.

I've made another change to facilitate "auxiliary" coordinates ("non-dimension" coordinates in xarray terms) in 0d6c550. Now all cube:variables must have a type that is either data or auxiliary. So for precipitation, a data variable, we have

    "prcp": {
      "type": "data",
      "description": "The total accumulated precipitation over the monthly period of the daily total precipitation. Sum of all forms of precipitation converted to a water-equivalent depth.",
      "extent": [0, null],
      "unit": "mm",
      "dimensions": ["time", "y", "x"]
    }
 

While for lat, an auxiliary coordinate, we have

    "lat": {
      "type": "auxiliary",
      "extent": [17.960035, 23.512327],
      "description": "latitude coordinate",
      "unit": "degrees_north",
      "dimensions": ["y", "x"]
    }
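To make the constraints on these objects concrete, here is a hypothetical pure-Python checker for the two Variable Objects above. The helper name is invented for illustration; the authoritative validation lives in the extension's JSON Schema, not here.

```python
# Hypothetical sketch of the Variable Object constraints discussed above:
# "type" must be "data" or "auxiliary", "dimensions" must be a (possibly
# empty) list of strings, and "extent" (if present) is a [min, max] pair.
def validate_variable(name, obj):
    errors = []
    if obj.get("type") not in ("data", "auxiliary"):
        errors.append(f"{name}: 'type' must be 'data' or 'auxiliary'")
    dims = obj.get("dimensions")
    if not isinstance(dims, list) or not all(isinstance(d, str) for d in dims):
        errors.append(f"{name}: 'dimensions' must be a list of strings")
    extent = obj.get("extent")
    if extent is not None and len(extent) != 2:
        errors.append(f"{name}: 'extent' must be a two-element [min, max] array")
    return errors

prcp = {"type": "data", "extent": [0, None], "unit": "mm",
        "dimensions": ["time", "y", "x"]}
lat = {"type": "auxiliary", "extent": [17.960035, 23.512327],
       "unit": "degrees_north", "dimensions": ["y", "x"]}

assert validate_variable("prcp", prcp) == []
assert validate_variable("lat", lat) == []
assert validate_variable("bad", {"type": "other", "dimensions": []}) != []
```

Note that `null` in the JSON examples maps to Python's `None`, which is why `[0, null]` (an open upper bound) passes the extent check.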

@TomAugspurger TomAugspurger changed the title [WIP]: Add cube:variables definition Add cube:variables definition May 14, 2021
@TomAugspurger TomAugspurger marked this pull request as ready for review May 14, 2021 11:13
@TomAugspurger
Contributor Author

@m-mohr I'm reasonably happy with putting these "auxiliary" / non-dimension coordinates in cube:variables with a type: "auxiliary". What do you think about requiring type for all the values in cube:variables? Since this will be type: "data" for most variables, does it make sense to not require it and let consumers assume / infer what it should be?

I think that CI should pass now. Since this is my first PR GitHub requires maintainer approval to run the CI.

@m-mohr
Contributor

m-mohr commented May 14, 2021

I have no strong opinion on the variables, as they don't exist in the data cubes we use. So hopefully others will weigh in on whether that works for their environment, too.

@TomAugspurger
Contributor Author

@schwehr you commented on #1. Do you have time to look through this proposal for variables to see if it fits your needs?

And maybe @tomkralidis and @jhamman might be interested, since you were involved in radiantearth/stac-spec#713.

@m-mohr
Contributor

m-mohr commented May 14, 2021

I hope these people are not just all working with the same tools (e.g. netCDF) so that we get a variety of tools covered. ;-)

@rabernat

Also pinging CEDA folks, @agstephens @PhilKershaw, who are interested in this.

@rabernat

I hope these people are not just all working with the same tools (e.g. netCDF) so that we get a variety of tools covered. ;-)

Just a clarification that netCDF is not a tool but a data format. What other tools did you have in mind?

@m-mohr
Contributor

m-mohr commented May 14, 2021

Nothing specifically (but for example OGC Coverages, R stars, some other domains), I just want to ensure we get a variety of opinions across datacube tools, formats, models, ...

@TomAugspurger
Contributor Author

TomAugspurger commented May 14, 2021

From the rstac & e-sensing/sits side, cc @gqueiroz, @gilbertocamara, @OldLipe. For reference, the datacube STAC extension currently defines a dimension object for describing a datacube. This pull request updates the extension to also define a variable object describing the values that might be in a datacube. https://github.com/stac-extensions/datacube/blob/c4f4819fe037586067c80c7b9ab7339f6e35dfaf/README.md#variable-object is probably the place to look.

@m-mohr
Contributor

m-mohr commented May 14, 2021

Let's wait for a couple of days and incorporate feedback as it comes in and then merge. If no feedback comes in, then I'll make a final review and we can probably take it as it is...

Contributor

@m-mohr m-mohr left a comment

Added some comments and requests for changes, especially about issues in the JSON Schema.

README.md (excerpt under review):
| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| dimensions | \[string] | **REQUIRED.** The dimensions of the variable. This should refer to keys in the `cube:dimensions` object or be an empty list if the variable has no dimensions. |
| type | string | **REQUIRED.** Type of the variable, either `data` or `auxiliary`. |
| description | string | Detailed multi-line description to explain the dimension. [CommonMark 0.29](http://commonmark.org/) syntax MAY be used for rich text representation. |
| extent | \[number\|string\|null] | If the dimension consists of [ordinal](https://en.wikipedia.org/wiki/Level_of_measurement#Ordinal_scale) values, the extent (lower and upper bounds) of the values as two-dimensional array. Use `null` for open intervals. |
Contributor

Hmm, a string extent would usually be for date-times, right? On the other hand, a step can't be specified as a duration like in the Temporal Dimension. The Variable Object definition seems to try to merge all the kinds of types for which we have different Dimension Objects into a single object, but that doesn't feel very clean. Maybe we need to split this up a bit more?

Contributor Author

Yes, if JSON / JSON Schema models dates as strings, then having strings here seems necessary. So I think step should be allowed to be a string (to be more precise, it should be an ISO 8601 duration if and only if extent is a datetime string, but I'm not sure how to write that in JSON Schema).

That said, I don't think that step is too useful for variables (they're extremely useful for dimensions / coordinates).

I'll defer to you on the best way to model this, but having a single JSON object for all variables feels natural to me. When you're using the data they all end up in n-dimensional arrays, just with different data types.
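One possible way to express that coupling is with JSON Schema's draft-07 `if`/`then`/`else` keywords. This is an untested sketch, not part of the actual schema; `format: "date-time"` is only an annotation unless format validation is enabled:

```json
{
  "if": {
    "required": ["extent"],
    "properties": {
      "extent": { "items": { "type": "string", "format": "date-time" } }
    }
  },
  "then": {
    "properties": { "step": { "type": ["string", "null"] } }
  },
  "else": {
    "properties": { "step": { "type": ["number", "null"] } }
  }
}
```

The `"required": ["extent"]` inside the `if` matters: without it, an object with no extent at all would satisfy the `if` vacuously and fall into the `then` branch.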

Contributor

So maybe we can simplify here and remove step from variables for now? I still need to think through whether to split the Variables Object or not...

Tom Augspurger added 3 commits May 27, 2021 06:20
* Fixed typos
* Dimension -> variable in variable descriptions
@TomAugspurger
Contributor Author

Anything else to do here @m-mohr? IIUC, the main outstanding point is a concern that variable objects might be doing a bit too much? (#6 (comment))

@m-mohr
Contributor

m-mohr commented Jun 3, 2021

The remaining point for me is to check whether the Variables Object should be split similar to the Dimensions (see #6 (comment)), but I'm pretty busy right now.

@TomAugspurger
Contributor Author

Perfect, thanks, just wanted to make sure I wasn't blocking anything. I'm comfortable working off this branch for now, so no rush on my account.

@TomAugspurger
Contributor Author

Hi Matthias, just a friendly ping here if you have time for a decision on #6 (comment). I'm still fine working off a branch, but wanted to make sure this was still somewhere on your probably too-long backlog.

FWIW, I think modeling Variables as a single object makes the most sense. In the context of a library like xarray, multiple datacubes will live together in a Dataset (a container for Variables like this). Each Variable is stored using the same Python object, a DataArray, regardless of their data type.

@m-mohr
Contributor

m-mohr commented Jun 30, 2021

Yes, sorry, I have it on my overly long to do list.

@m-mohr
Contributor

m-mohr commented Jul 23, 2021

Okay, I think this is all good. Thanks for working on it.

@m-mohr
Contributor

m-mohr commented Jul 23, 2021

@TomAugspurger One last issue: Could you please add a CHANGELOG entry?

@m-mohr m-mohr added this to the v2.0.0 milestone Jul 23, 2021
@TomAugspurger
Contributor Author

Thanks, done.

@m-mohr m-mohr merged commit 9bdb50e into stac-extensions:main Jul 26, 2021
@TomAugspurger
Contributor Author

Thanks for all your help with this @m-mohr!

@TomAugspurger TomAugspurger deleted the feature/variables branch July 26, 2021 11:47
@m-mohr
Contributor

m-mohr commented Jul 26, 2021

@TomAugspurger I'm about to fix #4 and then I'll probably release later today.

@m-mohr
Contributor

m-mohr commented Jul 26, 2021

Whoop, whoop: https://github.com/stac-extensions/datacube/releases/tag/v2.0.0

m-mohr added a commit that referenced this pull request Mar 15, 2022
Successfully merging this pull request may close these issues.

Variables
4 participants