Dimension names as core array metadata #73
Comments
I would very much appreciate having an "official" way to define dimension names. Currently I mimic the xarray conventions in my Julia code, but this feels a bit risky since these conventions are not properly versioned, so a future change in how they are handled could lead to unexpected bugs. So I don't mind whether this is in the core protocol or in some extension, as long as there is a clean way to find out programmatically which convention the dimension names follow. |
I agree with this proposal. It seems like we definitely want to synchronize this with whatever @DennisHeimbigner, @WardF, and the rest of the Unidata crew decide to do about dimension names. |
This crosses a problem discussed in the meeting today.
There is a strong feeling that the v3 spec should support
asynchronous read and write to the degree possible.
This is driven by cloud storage models.
One consequence is that it should be possible for a process to directly
create and write
a variable without having to synchronize with any other process.
However, it is unclear how this applies to shared dimensions. Should
asynchronous creation of a named dimension by a process be allowed?
=Dennis Heimbigner
Unidata
…On 6/3/2020 2:06 PM, Ryan Abernathey wrote:
I agree with this proposal.
It seems like we definitely want to synchronize this with whatever
@DennisHeimbigner <https://github.com/DennisHeimbigner>, @WardF
<https://github.com/WardF>, and the rest of the Unidata crew decide to
do about dimension names.
|
I would suggest that, if we support dimension names in the v3 spec, then they are simply string labels for the dimensions of an array. Nothing else is implied. I.e., if two arrays happen to use the same name for a particular dimension, then at the level of the v3 protocol, that does not imply anything. It could mean that the two arrays have a "shared dimension" in the netCDF sense, it could just be coincidence, at least as far as a vanilla implementation of the v3 protocol is concerned. A library that supports the full netCDF data model might then choose to treat these dimension names as names for shared dimensions, that would be fine and up to the netCDF layer implementation to manage. Hope that makes sense. |
However, the dimension name and size must be stored in the metadata
independent of any variable. So adding a dimension may interfere
with asynchronicity.
=Dennis Heimbigner
Unidata
…On 6/3/2020 3:13 PM, Alistair Miles wrote:
I would suggest that, if we support dimension names in the v3 spec,
then they are simply string labels for the dimensions of an array.
Nothing else is implied. I.e., if two arrays happen to use the same
name for a particular dimension, then at the level of the v3 protocol,
that does not imply anything. It could mean that the two arrays have a
"shared dimension" in the netCDF sense, it could just be coincidence,
at least as far as a vanilla implementation of the v3 protocol is
concerned.
A library that supports the full netCDF data model might then choose
to treat these dimension names as names for shared dimensions, that
would be fine and up to the netCDF layer implementation to manage.
Hope that makes sense.
|
I may need some help from @rabernat here, there are a few different "dimensions" to this problem (sorry for the very bad pun :-)
Note that in this proposal I am simply proposing a metadata property for giving names to the dimensions (axes) of an array. Perhaps the property should be called
E.g., with this feature I could create an array with shape (10, 5) and name the dimensions ("foo", "bar"). In the zarr protocol, it would be totally fine to create another array with shape (100, 5) and name the dimensions ("foo", "qux"). I.e., creating each of these arrays is an independent operation, and the names are just labels for the axes of the arrays, not necessarily shared.
I.e., a vanilla zarr implementation would just offer the ability to provide names for the dimensions (axes) of an array, and might show those names when providing a visual representation of the array, but that would be it.
Now, a higher-level library implementing the netCDF data model might choose to interpret these as names for shared dimensions, under certain circumstances. I.e., if two arrays within the same group both have the name "foo" for one of their dimensions, then assume they are referring to a shared dimension. This is similar to what xarray does currently. The main difference is that xarray uses an attribute called
Perhaps it would be easier to avoid potential confusion, and for zarr to not try to cross into the netCDF space, and rather allow that to be dealt with via a set of usage conventions that properly deal with the netCDF semantics, such as the xarray approach or the nczarr approach. |
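To make the "labels only" reading concrete, here is a minimal Python sketch of the two arrays described above. The `dimension_names` key and the plain-dict metadata are assumptions made for illustration; the property name is still open in this thread.

```python
# Hypothetical array metadata. "dimension_names" is an illustrative key,
# not a name that this thread has agreed on.
array_a = {"shape": (10, 5), "dimension_names": ("foo", "bar")}
array_b = {"shape": (100, 5), "dimension_names": ("foo", "qux")}


def dim_sizes(meta):
    """Map each dimension name to its size, using only this array's metadata."""
    return dict(zip(meta["dimension_names"], meta["shape"]))


# Both arrays label an axis "foo", yet the sizes differ. At the protocol
# level that is fine: the names are labels, not shared dimensions.
sizes_a = dim_sizes(array_a)
sizes_b = dim_sizes(array_b)
```

Each size comes from the array's own shape, so nothing has to be stored or synchronized outside the array. A netCDF-style layer could still choose to reject the mismatch on "foo".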
Also noting that IIUC this is not necessarily true, e.g., the xarray approach does not separately store dimension names and sizes. This is different from the nczarr proposal. Note that I have no opinion on which of these two approaches is best, just noting the difference. |
👍 This is how I have been thinking of it. Rather than calling the axes 0, 1, 2, we can call them time, lat, lon. Additional extensions or application could decide to interpret this in different ways, such as in the netCDF data model.
I don't see why. The dimension size is determined by the shape of the array. |
I am glad we have these kinds of discussions; I am to some degree
captive of the historical development of netcdf and its assumptions.
Does this interpretation seem reasonable WRT the xarray model?
1. the definition of a named dimension is distributed (an important word)
to all of the variables which use it. There is no single
centralized definition
as in netcdf.
2. The costs for the xarray approach are:
a. inconsistency between the distributed named dimension
definitions is possible
b. the cost in storing the named dimension info in multiple variables.
The cost for (2b) seems very small and so is not a big issue.
The (2a) case is no different than any other hidden data used in, say,
netcdf.
Presumably the inconsistencies can only occur if the dataset is modified
outside
of the library.
Since in netcdf, dimensions are scoped by groups, one would need to use the
fully qualified names (FQNs) for named dimensions: e.g., /g1/g2/dim1.
It would seem that some kind of search is needed to guarantee dimension
name uniqueness.
It potentially requires looking at all variables within the group part
of the FQN
of the new dimension to ensure that the name is unique.
Does xarray do a similar search when a client defines a new dimension?
In any case, the distributed approach is attractive because it potentially
allows asynchronous definition of dimensions if certain constraints can
be met
so that search can be avoided or minimized.
Comments?
=Dennis Heimbigner
Unidata
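As a rough illustration of the search discussed above, a layer that wants shared-dimension semantics could scan the arrays in a group for conflicting sizes. This is only a sketch: the `(dimension_names, shape)` pairs stand in for per-array metadata, and `find_inconsistent_dims` is a hypothetical helper, not part of zarr, xarray, or netCDF.

```python
from collections import defaultdict


def find_inconsistent_dims(arrays):
    """Return dimension names whose size differs between arrays in one group.

    `arrays` maps an array name to a (dimension_names, shape) pair. This
    layout is a stand-in for per-array metadata, not a real zarr API.
    """
    seen = defaultdict(set)
    for dim_names, shape in arrays.values():
        for name, size in zip(dim_names, shape):
            seen[name].add(size)
    return {name: sorted(sizes) for name, sizes in seen.items() if len(sizes) > 1}


group = {
    "temp": (("time", "lat"), (12, 90)),
    "wind": (("time", "lat"), (12, 91)),  # "lat" disagrees with "temp"
}
conflicts = find_inconsistent_dims(group)
```

The scan touches every array in the group, which is the cost that an asynchronous writer would want to avoid or minimize.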
…On 6/4/2020 3:49 AM, Alistair Miles wrote:
However, the dimension name and size must be stored in the
metadata independent of any variable.
Also noting that IIUC this is not necessarily true, e.g., the xarray
approach
<http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification>
does not separately store dimension names and sizes. This is different
from the nczarr proposal
<https://drive.google.com/file/d/1UUGcQMpWqKllMdRFCu97CoL7fB_GWXvg/view>.
Note that I have no opinion on which of these two approaches is best,
just noting the difference.
|
Thinking out loud somewhat, I wonder if restricting |
Update RFC to say this is something we'd like input on. |
I would also like to see built-in support for dimension names, and would also suggest that, for simplicity, the zarr specification itself make no assumptions about "shared dimensions" between multiple arrays. Aside from possible constraints on the allowed characters, I think that empty labels should be allowed (and indicate an unnamed dimension), and non-empty labels must be distinct. Not specifying the dimension names at all would be equivalent to specifying all empty strings as the dimension labels. |
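The rule suggested above could be checked along these lines. This is a sketch under the stated assumptions only (empty string means unnamed, `None` means not specified); `normalize_dimension_names` is a hypothetical helper and does not cover any character constraints.

```python
def normalize_dimension_names(names, ndim):
    """Validate labels under the rule proposed above.

    - `names=None` means "not specified" and becomes all empty labels.
    - An empty string marks an unnamed dimension and may repeat.
    - Non-empty labels must be distinct.
    """
    if names is None:
        return ("",) * ndim
    names = tuple(names)
    if len(names) != ndim:
        raise ValueError(f"expected {ndim} labels, got {len(names)}")
    named = [n for n in names if n != ""]
    if len(named) != len(set(named)):
        raise ValueError(f"duplicate non-empty labels in {names!r}")
    return names
```

For example, `normalize_dimension_names(("x", "", ""), 3)` is accepted, while `("x", "x")` for a 2-D array raises `ValueError`.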
What's the advantage of allowing empty labels? |
Given that dimension names would be optional, it seems natural to me to allow that optionality on a per-dimension basis. E.g. maybe you are computing some sort of multiplication or partial reduction between two zarr arrays A and B, where A has labels and B does not. If the result has some dimensions corresponding to dimensions of A and some dimensions corresponding to dimensions of B, we would like to preserve the dimension labels from A without having to invent fake labels for B. However, I don't feel too strongly about allowing empty labels. |
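A small sketch of that use case, assuming a plain 2-D matrix product and using empty strings for unlabeled dimensions; `matmul_labels` is a hypothetical helper for illustration only.

```python
def matmul_labels(labels_a, labels_b):
    """Labels for A @ B with 2-D A and B: rows come from A, columns from B."""
    return (labels_a[0], labels_b[1])


labels_a = ("sample", "feature")  # A carries labels
labels_b = ("", "")               # B is unlabeled (empty string = unnamed)
result_labels = matmul_labels(labels_a, labels_b)
```

Here the result keeps "sample" from A and leaves the second dimension unnamed, so no placeholder label has to be invented for B.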
I assume that this would operate like _ARRAY_DIMENSIONS |
Another point. Unless you require all dimension names to be "global",
|
WRT anonymous dimensions. One approach is to merge the shape and dimension keys |
If we allow anonymous dimensions, then I would say they indeed have to be specified by their index rather than name, but of course named dimensions could also be specified by index. And in many contexts, e.g. for display to a user, I agree that it would be very natural to display just the index in place of the name for anonymous dimensions. Although the dimension names could be quoted to avoid ambiguity, it might also be good to disallow dimension names that consist only of digits 0-9.
However, I'm unclear exactly what you are proposing as far as having dimension names be either strings or integers. Would that just be a concern of a specific implementation, rather than the zarr spec itself?
Also as far as referencing dimensions by path, as far as I can tell nothing in the current spec requires referencing dimensions; I suppose you are thinking from the context of an extension like ome-zarr or a version of netcdf built on top of zarr.
While I agree that the netcdf data model makes a lot of sense in many cases, I'm not sure how well the unique dimension names constraint / consistent size for every named dimension constraint fits with all intended uses of zarr v3. I guess users could always work around that issue by putting each zarr array in a separate zarr repository, but users might wish to get other data organizational advantages of having multiple arrays in a single zarr repository without constraining themselves to the netcdf data model. |
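One way an implementation might handle lookup by index or by name is sketched below. Both helpers are hypothetical, and the digit-only restriction is the suggestion from this comment rather than anything the spec states.

```python
def is_valid_name(name):
    """Reject names made only of digits 0-9 so they cannot look like indices."""
    return not (name.isascii() and name.isdigit())


def resolve_axis(names, key):
    """Return the axis index for `key`, which is an int index or a str name."""
    if isinstance(key, int):
        if not 0 <= key < len(names):
            raise IndexError(f"axis {key} out of range")
        return key
    # Anonymous dimensions (empty label) can only be reached by index.
    if key == "" or key not in names:
        raise KeyError(f"no dimension named {key!r}")
    return names.index(key)


names = ("time", "", "lon")
lon_axis = resolve_axis(names, "lon")
anonymous_axis = resolve_axis(names, 1)
```

Because integers always mean an index and strings always mean a name, the two kinds of key never collide in this sketch, even without the digit-only rule; the rule mainly helps when names are shown to a user.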
That is the reason I made the string vs number distinction. And the fact that netcdf allows |
I do not understand this comment.
Suppose we have another variable v2 in group /g2.
How do we know that the two dim17's refer to the same dimension? Of course, this assumes one wants the shared dimension name semantics |
It seems like just using a unique dimension name might be more natural than specifying a dimension by reference to another array, but I am not sure. Certainly netcdf shared dimension semantics are applicable in some applications, but I think there are other applications where dimension names are useful but the constraint that all dimensions with a given name should have the same extent is not useful. For example:
|
In a sense I agree which is why netcdf declares dimensions separately from variables. |
Your examples still prove my point. You are assuming that the dimensions with the same |
I think that coordinate variables are important in this discussion. Suppose we have the following:
The temp variable represents the temperature at a given latitude and longitude. The longitude values are, say, -1deg. thru 2deg.
This concept of coordinate variables is extremely useful but it relies on |
I agree that shared names to indicate "shared semantics" in some sense is the point of named dimensions, but I think exactly what those "shared semantics" are depends on the application. If zarr were to use the netcdf data model, where shared name means shared domain, then how do you propose to deal with the use case of a single zarr repository where the root group contains a collection of arrays named |
In netcdf, you put the various dimensions in different groups (possibly with the relevant |
This adds support for dimension names (zarr-developers#73) and non-zero origins (zarr-developers#122).
Crosslinking #149 (comment) |
Resolved via #162. |
Several domains make use of named dimensions, i.e., for a given array with N dimensions, each of those N dimensions is given a human-readable name.
Given the broad utility of this, should we include this within the core array metadata in the v3 protocol? E.g., add a "dimensions" property within the array metadata document, whose value should be a list of strings.
One question this raises is how to handle the case where no names are provided, or only some dimensions are named but not others. I.e., dimension names should probably be optional.
The alternative is that we leave this to the community to define a usage convention to store dimension names in the user attributes, e.g., similar to what xarray currently does using the "_ARRAY_DIMENSIONS" attribute name.
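The two options can be compared side by side in a short Python sketch. The "dimensions" key follows the proposal text and "_ARRAY_DIMENSIONS" is the attribute name mentioned above; the remaining fields, the sample names, and `get_dimension_names` are assumptions for illustration, not a complete v3 metadata document.

```python
import json

# Option 1: names in core array metadata. Only the fields needed for the
# example are shown; a real metadata document would carry more.
core_metadata = {
    "shape": [365, 180, 360],
    "dimensions": ["time", "lat", "lon"],
}

# Option 2: names in user attributes, following the xarray convention.
user_attributes = {"_ARRAY_DIMENSIONS": ["time", "lat", "lon"]}


def get_dimension_names(metadata, attributes):
    """Prefer core metadata, fall back to the attribute convention, else None."""
    if "dimensions" in metadata:
        return metadata["dimensions"]
    return attributes.get("_ARRAY_DIMENSIONS")


print(json.dumps(core_metadata, indent=2))
```

Returning `None` when neither source is present is one way to keep names optional; how partially named arrays should be represented is left open here, as in the question above.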