Drop dimension_names from v3 #219

normanrz · 2023-03-06T12:02:21Z

I was going through the v3 spec and noticed the dimension_names attribute. I was wondering if that might better be placed in the upcoming metadata convention ZEP? There are already community-specific conventions for assigning names (and other metadata) to dimensions, e.g. OME-Zarr or xarray zarr dimension encoding.


jbms · 2023-03-08T08:30:03Z

There was already quite a bit of discussion regarding whether to:

1. add this as a core metadata attribute;
2. add this as a user metadata attribute;
3. just leave it up to external tools / specifications to define something, like xarray and ome-zarr.

See previous discussion:
#73
#144
#162

While there wasn't a strong reason to favor 1 over 2, there was strong support for choosing either 1 or 2 over 3. Dimension names are broadly applicable across many different zarr implementations and use cases, and some zarr implementations (such as Neuroglancer and TensorStore) make direct use of them. A single way to specify dimension names promotes better common-denominator interoperability between e.g. xarray, netcdf, and OME-zarr.

My hope is that for zarr v3, xarray, netcdf and OME-zarr all make use of this dimension_names metadata field for specifying array dimension names rather than introducing a new user-defined attribute.

I think the general argument in favor of (1) rather than (2) is that dimension names are more universally applicable than other metadata like units.

normanrz · 2023-03-08T16:29:48Z

Thanks for explaining the previous discussion. I agree that dimension names are broadly applicable. However, a flat list of strings may not be enough to specify all the metadata that a community needs, which will create parallel metadata. For example, in OME-Zarr we already store metadata for axes (=dimensions) with additional fields such as type (e.g. time, space, channel) and unit.
To solve this, dimension names could become user metadata (as you mentioned) or be made extensible, e.g., "dimensions": [{"name": "x", "attributes": {"type": "space", "unit": "nanometer"}}].

jbms · 2023-03-08T19:25:03Z

Agreed that there are many other per-dimension attributes that may be useful.

For example, in addition to the ones you mentioned, I'm planning to propose a zarr extension for non-zero origins, which would need a way to specify a lower bound and grid offset for each dimension.

I'm also planning to propose a "resizable" attribute (#212).

In general, there is the choice of whether to represent per-dimension attributes using a "row"-based organization similar to your example above, e.g.:

{"dimensions": [
  {"name": "x", "attributes": {"unit": "meter", "type": "space"}},
  {"name": "y", "attributes": {"unit": "meter", "type": "space"}},
  {"name":"c"},
  {}
],
...
}

or to use an equivalent columnar representation:

{"dimension_names": ["x", "y", "c", null],
  "attributes": {
    "dimension_units": ["meter", "meter", null, null],
    "dimension_type": ["space", "space", null, null]
  },
  ...
}

For the row-based organization we are also effectively adding the concept of per-dimension user-defined attributes.
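To make the equivalence of the two layouts concrete, here is a minimal Python sketch that converts between them. The helper names `rows_to_columns` and `columns_to_rows` are hypothetical, and the sketch assumes the columnar form uses the same attribute keys as the row form (`unit`, `type`) rather than the `dimension_units` / `dimension_type` spelling above; it is an illustration, not part of any zarr implementation.

```python
def rows_to_columns(dimensions, keys=("unit", "type")):
    """Convert row-based per-dimension metadata into columnar lists.

    Missing names or attributes become None, mirroring the null
    entries in the columnar JSON example.
    """
    names = [dim.get("name") for dim in dimensions]
    attributes = {
        key: [dim.get("attributes", {}).get(key) for dim in dimensions]
        for key in keys
    }
    return names, attributes


def columns_to_rows(names, attributes):
    """Convert columnar lists back into one dict per dimension."""
    rows = []
    for i, name in enumerate(names):
        row = {}
        if name is not None:
            row["name"] = name
        # Keep only the attributes that are actually set for dimension i.
        attrs = {key: col[i] for key, col in attributes.items() if col[i] is not None}
        if attrs:
            row["attributes"] = attrs
        rows.append(row)
    return rows


rows = [
    {"name": "x", "attributes": {"unit": "meter", "type": "space"}},
    {"name": "y", "attributes": {"unit": "meter", "type": "space"}},
    {"name": "c"},
    {},
]
names, attributes = rows_to_columns(rows)
round_trip = columns_to_rows(names, attributes)
```

In this sketch the round trip is lossless only because an absent attribute and an explicit null are treated as the same thing; a real specification would need to say whether that is intended.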

As I see it, we can basically accomplish the same thing with either representation, but there are pros and cons to each approach:

The existing per-dimension zarr metadata, namely shape and chunk_shape, uses a columnar representation. For shape we could easily switch to row representation. For chunk_shape that would not be very natural unless we want to make the grid type a per-dimension attribute (which actually might make sense).
If we want to add a must_understand = False per-dimension attribute to the core metadata, it would become rather verbose:

{"dimensions": [{"name": "x", "coordinate_array": {"must_understand": false, "path": "xxxxx"}}, ...],
 ...}

However, it is not clear whether adding an optional core metadata per-dimension attribute is particularly useful given that it could instead be added as a user attribute.

Some attributes, like the inner chunk grid for sharding, really don't work well with a row representation.
If there is an explicit concept of per-dimension user-defined attributes, then if the zarr implementation supports "virtual views" for operations like transpose, then these attributes can be properly mapped even without knowledge of any particular attribute. However, in some cases this might only partially transform per-dimension metadata. For example, some attributes might relate to multiple dimensions, e.g. a transformation matrix, and could not be easily represented as per-dimension (scalar) metadata. If these virtual views only partially transform the metadata, that may be more confusing than not transforming it at all.
The row representation is more human readable in some cases, especially if the same per-dimension attributes are not present for all dimensions. On the other hand for some attributes like shape it may be less human readable.
The row representation may be problematic for some future extensions: for example, a way to specify certain integer coordinate transforms, such as transposing the dimension order, reversing dimensions, and adding/removing singleton dimensions. With such an extension, some per-dimension attributes might relate to the "input" space while other per-dimension attributes might relate to the "output" space. In my mind this is the biggest problem with the row representation.
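The "virtual view" point above can be sketched in Python as follows. This is only an illustration under assumptions: `transpose_view` is a hypothetical helper, the metadata uses the row-based `dimensions` layout from the earlier example, and `transform` is a made-up attribute standing in for metadata that relates to several dimensions at once.

```python
def transpose_view(metadata, order):
    """Return a copy of `metadata` with the row-based "dimensions" list permuted.

    order[i] is the index of the source dimension that becomes dimension i
    of the view. Nothing outside "dimensions" is touched.
    """
    if sorted(order) != list(range(len(metadata["dimensions"]))):
        raise ValueError("order must be a permutation of the dimension indices")
    view = dict(metadata)  # shallow copy; all other keys are carried over as-is
    view["dimensions"] = [metadata["dimensions"][i] for i in order]
    return view


metadata = {
    "dimensions": [
        {"name": "x", "attributes": {"unit": "meter"}},
        {"name": "y", "attributes": {"unit": "meter"}},
        {"name": "c"},
    ],
    # Hypothetical attribute that spans multiple dimensions.
    "attributes": {"transform": [[2, 0, 0], [0, 3, 0], [0, 0, 1]]},
}

view = transpose_view(metadata, [2, 0, 1])
view_names = [dim.get("name") for dim in view["dimensions"]]
```

The per-dimension rows follow the permutation without the code knowing what `unit` means, but the `transform` matrix is carried over unchanged and no longer matches the view's dimension order, which is the partial-transformation problem described above.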

jstriebel · 2023-03-15T09:53:56Z

> To solve this, dimension names could become user metadata (as mentioned by you) or made extensible, e.g., "dimensions": [{"name": "x", "attributes": {"type": "space", "unit": "nanometer"}}].

This is something that could still be added later, as an alternative to dimension_names, if needed. Also, this could have the following form (either as a top-level field, or in the user attributes):

"dimensions": {
  "x": {                                  // "x" refers to a dimension name
    "type": "space",
    "unit": "nanometer"
  }
}

This would then be in addition to dimension_names, decoupling the order of dimensions and their properties. This might even work nicely for dimensions with the same name, which often have the same attributes (e.g. when being nodes in a graph).
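As a rough sketch of how a reader might combine the two fields, the following Python resolves per-dimension properties by looking up each entry of `dimension_names` in the name-keyed `dimensions` mapping. The function name `properties_in_order` is hypothetical, and treating unnamed or unlisted dimensions as having no properties is an assumption, not something the proposal specifies.

```python
def properties_in_order(dimension_names, dimensions):
    """Return one property dict per array dimension, in dimension order.

    Unnamed dimensions (None) and names missing from `dimensions`
    resolve to an empty dict.
    """
    resolved = []
    for name in dimension_names:
        if name is None:
            resolved.append({})
        else:
            # Copy so callers cannot mutate the shared per-name entry.
            resolved.append(dict(dimensions.get(name, {})))
    return resolved


dimension_names = ["x", "y", "x"]  # two dimensions share the name "x"
dimensions = {"x": {"type": "space", "unit": "nanometer"}}

resolved = properties_in_order(dimension_names, dimensions)
```

Both dimensions named "x" pick up the same properties from a single entry, which is the repeated-name case mentioned above.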

Simply having dimension names without further properties was requested by multiple people in different threads, which I think justifies adding it to v3 atm. I don't think that this is the case for any more complex fields. Since this can be changed and added later as well, I'm strongly in favor of keeping the spec as-is for now.

normanrz · 2023-03-15T10:34:51Z

> I don't think that this is the case for any more complex fields.

Well, OME-NGFF has that. But, I don't want this to block ZEP1, either.

jstriebel · 2023-05-05T15:54:56Z

@normanrz Can we close this issue?

jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Mar 13, 2023

jstriebel added this to ZEP1 Mar 13, 2023

jstriebel moved this to In Discussion in ZEP1 Mar 13, 2023

normanrz closed this as completed May 5, 2023

github-project-automation bot moved this from In Discussion to Done in ZEP1 May 5, 2023
