Add cube:variables definition #6
examples/daymet-hi-annual.json
"data": { | ||
"href": "az://daymet-zarr/annual/hi.zarr", | ||
"title": "Zarr store", | ||
"type": "application/fsspec/zarr", | ||
"description": "Root URL of the Zarr store", | ||
"roles": [ | ||
"data" | ||
] |
This is likely wrong, but the intent is to provide a URL that can be loaded with xarray / zarr, which represents the entire collection / dataset.

```python
import fsspec
import xarray as xr

store = fsspec.get_mapper('az://daymet-zarr/daily/hi.zarr', account_name="daymeteuwest")
ds = xr.open_zarr(store, consolidated=True)
```

That said, I have no idea what the right media type would be for that.
If there's no media type and we can't agree on one with the zarr folks, we may just not provide a type for now and instead specify a zarr role, e.g. `roles: ["data", "zarr-collection"]`, or however this should be called...
By the way, what is the `az://` protocol?
> By the way, what is the `az://` protocol?

It's specific to fsspec / adlfs, unfortunately: https://github.com/dask/adlfs/. No clients outside of that ecosystem would (correctly) interpret it.
examples/daymet-hi-annual.json
```json
    ]
  }
},
"cube:coordinates": {
```
What is `cube:coordinates`?
This is currently invalid, according to the updated schema.
As described in https://xarray.pydata.org/en/stable/user-guide/terminology.html, this is a non-dimension coordinate. https://xarray.pydata.org/en/stable/examples/multidimensional-coords.html has an example: it describes an n-dimensional array that's really a coordinate (rather than an actual data value). The example there is something like latitude, which depends on the `(y, x)` coordinates.
My suggestion: just put it in `cube:variables` and mark it as a coordinate with a `type` field that's either `data` or `coordinate`. Does that sound reasonable?
I've found another comment above that describes the issue; I've replied there:
In terms of this extension, coordinates are "Additional dimensions" with `type` set to `"coordinate"` (or so), if I understand it correctly.
But maybe I'm not understanding correctly. The data cubes I work with don't have the variables and coordinates concepts. We only have dimensions.
Yes, now that I look again I think you're right... Even in the xarray repr above (#6 (comment)), `lat` is under "Dimensions" rather than "Data variables". My apologies.
One thing, though: do you have thoughts on allowing (not requiring) the objects in `cube:coordinates` to include a `dimensions` field? To correctly describe this dataset, that would need to be allowed.
> One thing, though: do you have thoughts on allowing (not requiring) the objects in `cube:coordinates` to include a `dimensions` field? To correctly describe this dataset, that would need to be allowed.

I'm confused. I thought coordinates go away and will be dimensions? And is this the same question as discussed in #6 (comment), last part?
I'm actually also confused about what "coordinates" are useful for. Isn't it implicitly clear that the spatial dimensions form the coordinates?
Sorry, I meant allowing `cube:dimensions` objects to themselves have `dimensions`. You're correct that in my latest commit `cube:coordinates` is gone.

> I'm actually also confused what "coordinates" are useful for? Isn't it implicitly clear that the spatial dimensions form the coordinates?

Maybe it's a bit clearer to say that the values along the spatial (and time, and additional) dimensions form the coordinates. The coordinates are the actual labels / values along that dimension.
In my example dataset, things like `latitude` and `longitude` aren't actually a dimension of any of the variables (note in the repr that none of the variables include them as a dimension; e.g. `prcp` has `(time, y, x)`). The netCDF docs call this an auxiliary coordinate and xarray calls it a non-dimension coordinate.
I don't know if @rabernat has thoughts here or can ping an xarray / netCDF developer who can explain this better than I can. I'm still hazy on these concepts.
Me neither; our data cubes don't have these concepts. We just have dimensions (and potentially variables derived from these dimensions), but I don't really get the point behind those coordinates.
The concept of coordinates comes from the CF Conventions, the gold standard for metadata in weather and climate science.

> The commonest use of coordinate variables is to locate the data in space and time, but coordinates may be provided for any other continuous geophysical quantity (e.g. density, temperature, radiation wavelength, zenith angle of radiance, sea surface wave frequency) or discrete category (see Section 4.5, "Discrete Axis", e.g. area type, model level number, ensemble member number) on which the data variable depends.
A special type of coordinate is a "coordinate variable", which is 1D and has the same name as a dimension. But coordinates can be N-D. CF calls these "auxiliary coordinate variables", and Xarray calls these "non-dimension coordinates." For example, for a 3D datacube of temperature on a curvilinear grid, the dimensions might be `time, y, x`, and we would typically have coordinates `lat(y, x)` and `lon(y, x)`. Note that this treatment of coordinates is considerably different from the typical geospatial raster scenario, which would instead specify a projection rather than explicit arrays of lon / lat. But this is ubiquitous in weather / climate data.
At a pure data level, coordinates are just variables. But at a metadata level, they are treated differently by analysis libraries. For example, in xarray, if I add two datasets together with `ds1 + ds2`, their data variables will be added, but their coordinates will not. It would not make sense to add two latitudes.
Pure NetCDF and Zarr (without CF conventions) do not have an explicit way to mark a variable as a coordinate. Xarray follows CF conventions and decodes a variable to the list of coordinates (as opposed to data variables) if it meets either of two criteria:
- it is 1D and has the same name as a dimension (coordinate variable)
- it appears in another variable's `coordinates` attribute (auxiliary coordinate variable)
I am not certain that the notion of CF coordinates needs to be in the datacube extension. To answer this, I would want to better understand the relationship between this extension and the CF conventions.
No problem, but I made this a draft PR now.
I'm happy to help, especially with JSON Schema.
Yes, sounds good.
I am, but I'm not using variables, so others should better sound in.
Is every netCDF dataset a datacube? I think we need to answer this question. As an alternative to squeezing every existing netCDF dataset into the datacube extension, which will clearly pose a challenge to the fairly limited scope of this extension, perhaps we could consider a new, distinct extension called "CF". The CF Conventions are the data model that will meet the needs of the weather and climate community. All our important datasets, and 99% of netCDF files out there in the wild, comply with it. Lots of thought has gone into it by people who deeply understand our datasets.
I've been wondering about that too. There seem to be some things that don't quite align between different groups' definitions. I'm not sure if that's just a translation issue, or if it reflects something deeper. Perhaps I'll take a bit of time to sketch out a ...
You could also just extend the datacube extension by adding a `cf:coordinates` that uses things from this extension. They can surely co-exist and work together.
Inspired by Ryan's comment:
I've made another change to facilitate "auxiliary" coordinates ("non-dimension" coordinates in xarray terms) in 0d6c550. Now all variables live in `cube:variables`; a data variable like `prcp` looks like

```json
"prcp": {
  "type": "data",
  "description": "The total accumulated precipitation over the monthly period of the daily total precipitation. Sum of all forms of precipitation converted to a water-equivalent depth.",
  "extent": [0, null],
  "unit": "mm",
  "dimensions": ["time", "y", "x"]
}
```

while for an auxiliary coordinate like `lat`:

```json
"lat": {
  "type": "auxiliary",
  "extent": [17.960035, 23.512327],
  "description": "latitude coordinate",
  "unit": "degrees_north",
  "dimensions": ["y", "x"]
}
```
@m-mohr I'm reasonably happy with putting these "auxiliary" / non-dimension coordinates in `cube:variables`. I think that CI should pass now. Since this is my first PR, GitHub requires maintainer approval to run the CI.
I have no strong opinion on the variables, as they don't exist in the data cubes we use. So we hopefully get others to sound in on whether that works for their environment, too.
@schwehr you commented on #1. Do you have time to look through this proposal for variables to see if it fits your needs? And maybe @tomkralidis and @jhamman might be interested, since you were involved in radiantearth/stac-spec#713.
I hope these people are not all just working with the same tools (e.g. netCDF) so that we get a variety of tools covered. ;-)
Also pinging CEDA folks, @agstephens @PhilKershaw, who are interested in this.
Just a clarification that netCDF is not a tool but a data format. What other tools did you have in mind?
Nothing specifically (but for example OGC Coverages, R stars, some other domains); I just want to ensure we get a variety of opinions across datacube tools, formats, models, ...
From the rstac & e-sensing/sits side, cc @gqueiroz, @gilbertocamara, @OldLipe.
Let's wait for a couple of days, incorporate feedback as it comes in, and then merge. If no feedback comes in, I'll make a final review and we can probably take it as it is...
Added some comments and requests for changes, especially about issues in the JSON Schema.
README.md
| Field Name | Type | Description |
| ---------- | ---- | ----------- |
| dimensions | \[string] | **REQUIRED.** The dimensions of the variable. This should refer to keys in the `cube:dimensions` object or be an empty list if the variable has no dimensions. |
| type | string | **REQUIRED.** Type of the variable, either `data` or `auxiliary`. |
| description | string | Detailed multi-line description to explain the dimension. [CommonMark 0.29](http://commonmark.org/) syntax MAY be used for rich text representation. |
| extent | \[number\|string\|null] | If the dimension consists of [ordinal](https://en.wikipedia.org/wiki/Level_of_measurement#Ordinal_scale) values, the extent (lower and upper bounds) of the values as two-dimensional array. Use `null` for open intervals. |
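A minimal Variable Object consistent with the table above might look like this (hypothetical values, expressed as a Python dict for illustration):

```python
# Hypothetical Variable Object covering the table's REQUIRED fields.
variable = {
    "dimensions": ["time", "y", "x"],   # REQUIRED: keys of cube:dimensions
    "type": "data",                     # REQUIRED: "data" or "auxiliary"
    "description": "Daily precipitation totals.",
    "extent": [0, None],                # null (None) marks an open interval
    "unit": "mm",
}
required = {"dimensions", "type"}
print(required <= variable.keys())  # True
```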
Hmm, a string extent would usually be for date-times, right? On the other hand, a step can't be specified as a duration like in the Temporal Dimension. The definition of the Variable Object seems to try to merge all the types, for which we have different Dimension Objects, into a single object, which doesn't feel very clean. Maybe we need to split this up a bit more?
Yes, if JSON / JSON Schema models dates as strings, then having strings here seems necessary. So I think `step` should be allowed to be a string (to be more precise, it should be an ISO 8601 duration if and only if `extent` is a date-time string, but I'm not sure how to write that in JSON Schema).
That said, I don't think that `step` is too useful for variables (it's extremely useful for dimensions / coordinates).
I'll defer to you on the best way to model this, but having a single JSON object for all variables feels natural to me. When you're using the data they all end up in n-dimensional arrays, just with different data types.
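For what it's worth, an "ISO 8601 duration iff extent is a date-time string" rule can be approximated in JSON Schema draft-07 with the `if`/`then`/`else` keywords. This is only a rough sketch against this PR's draft field names, and the `^P` pattern is a crude stand-in for a full duration regex; it also ignores `null` entries in open intervals:

```json
{
  "if": {
    "properties": {
      "extent": {
        "items": { "type": "string", "format": "date-time" }
      }
    }
  },
  "then": {
    "properties": {
      "step": { "type": ["string", "null"], "pattern": "^P" }
    }
  },
  "else": {
    "properties": {
      "step": { "type": ["number", "null"] }
    }
  }
}
```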
So maybe we can simplify here and remove `step` from variables for now? I still need to think through whether to split the Variables Object or not...
Anything else to do here @m-mohr? IIUC, the main outstanding point is a concern that variable objects might be doing a bit too much? (#6 (comment))
The remaining point for me is to check whether the Variables Object should be split similarly to the Dimensions (see #6 (comment)), but I'm pretty busy right now.
Perfect, thanks, just wanted to make sure I wasn't blocking anything. I'm comfortable working off this branch for now, so no rush on my account.
Hi Matthias, just a friendly ping here if you have time for a decision on #6 (comment). I'm still fine working off a branch, but wanted to make sure this was still somewhere on your probably too long backlog. FWIW, I think modeling Variables as a single object makes the most sense. In the context of a library like xarray, multiple datacubes will live together in a Dataset (a container for Variables like this). Each Variable is stored using the same Python object.
Yes, sorry, I have it on my overly long to-do list.
Okay, I think this is all good. Thanks for working on it.
@TomAugspurger One last issue: could you please add a CHANGELOG entry?
Thanks, done.
Thanks for all your help with this @m-mohr!
@TomAugspurger I'm about to fix #4 and then I'll probably release later today.
Apologies for the WIP here, but I haven't worked with JSON Schema or a STAC spec before and could use some help. I'll leave specific points inline, but a few general questions:

- Is `cube:variables` the right approach?
- Does it make sense to include `dimensions`, a list of strings, on each variable?

For reference, I'm trying this out on the daymet dataset. If you have xarray, zarr, and adlfs installed, this should work
which prints out the repr
cc @m-mohr if you have a chance to provide feedback here I'd be extremely grateful.
Closes #1