
codec specification in v3 #293

Open
d-v-b opened this issue May 7, 2024 · 11 comments

Comments

@d-v-b
Contributor

d-v-b commented May 7, 2024

I will summarize a few concerns I have about the way codecs are handled in the v3 spec, and propose some changes that I think could improve this situation.

the codec problem space

We need Zarr implementations across multiple languages to agree on standard JSON serialization for different codecs. This protects users from fragmentation, e.g. a situation where we end up with multiple flavors of JSON serialization for the same popular codec. At the same time, we want to make it easy for users to experiment with and create new codecs; this enables users to get the most from Zarr.

Also, codecs are generally useful for users outside of Zarr. There are plenty of non-Zarr use cases for compressing / rearranging array data. So I think the codec standardization should support these non-Zarr use cases.

concerns with codecs in the v3 spec

  • The v3 spec explicitly states that it does not define a list of codecs, but it does define a list of codecs. We can't have blatant contradictions in the spec, so this needs to be sorted out at a minimum, regardless of whatever decisions we make. The contradiction between the text of the spec and the codec definitions was already a source of confusion in a pull request in zarr-python.
  • Suppose we resolve the above contradiction by stating that zarr v3 does in fact define a fixed set of codecs, which are listed in the spec. This leads to two sub-problems:
    • How does someone design and use a new codec? We cannot require PRs against the spec for every new codec. If writing a new codec started with getting a PR accepted in zarr-specs, nobody would ever write a new codec.
    • What happens if an implementation does not support a codec from the standard list? There is no enforcement mechanism for the requirement that an implementation support that fixed set, so practically the requirement is toothless, which means it cannot be a requirement. Requirements in the spec should be restricted to essential features, but supporting the Gzip compressor is simply not essential, for users who don't work with Gzip-compressed data. So any list of codecs should be a recommendation, not a requirement.
  • The v3 spec states that the unique identifier for a codec must be "... a URI that dereferences to a human-readable specification of the codec".
    Software cannot check if a URI dereferences to a human-readable document. If we want Zarr v3 hierarchies to be validated by software, we must remove this requirement.
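To illustrate the last point, here is a minimal sketch of what software can check about a codec identifier without network access: only its syntax. The function name and the example identifiers are hypothetical and do not come from the spec.

```python
from urllib.parse import urlparse


def looks_like_uri(identifier: str) -> bool:
    """Return True if the identifier has a URI scheme and a network location.

    This is a purely syntactic check. It cannot tell whether the URI
    resolves, or whether the target is a human-readable codec specification.
    """
    parsed = urlparse(identifier)
    return bool(parsed.scheme) and bool(parsed.netloc)


# Hypothetical identifiers, for illustration only.
uri_style = looks_like_uri("https://example.org/codecs/my-codec")
short_name = looks_like_uri("gzip")
```

Even a validator that went further and fetched the URL could only confirm that something was returned, not that it is a readable specification of the codec.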

how to resolve these concerns

I don't think naming a closed set of "official codecs" in the spec is realistic. There is no enforcement mechanism, and ultimately users don't care if an implementation doesn't support a codec they don't use. That is, if an implementation doesn't support codec X, and none of the users of that implementation use codec X, then IMO this is fine.

To express this differently, I think the Zarr spec should not enumerate the features / behavior an implementation must have. The Zarr spec should just describe the Zarr format, and we leave it to implementations to choose how they implement that format.

Extending this logic, the Zarr format is actually agnostic with respect to particular codecs. So specific codecs should not appear in the Zarr spec! I actually think codecs should be defined entirely in another spec, and we refer to this spec in the Zarr spec, e.g. "codecs is a JSON array of JSON objects that implement the Numcodecs spec (link to the numcodecs spec)" (we can choose a different name for the codecs spec, but it shouldn't refer to zarr).

Recall that in Zarr v2, codecs were basically standardized by the behavior of the numcodecs python library, which was a stand-alone library with no Zarr dependency. I think this illustrates the right relationship between codecs and the zarr format, but we shouldn't rely on a python library to define a standard for a cross-language concern. Zarr v3 tries to fix the latter problem by folding codec definition inside the spec itself, but as I have argued, this introduces a different set of problems. The solution is to define codecs separately, and make the zarr spec depend on that codec spec. The codec specification can manage a registry of codecs, etc, thereby abstracting the current behavior of numcodecs in a language-agnostic way.

Another advantage of a separate spec for codecs is that this spec could be used by any project that wants to compress arrays in a standard way. There is nothing Zarr-specific about serializing GZip parameters to JSON, so let's reflect this in the structure of the specification document.
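As a rough sketch of that point, a codec chain can be written and round-tripped as plain JSON with no Zarr library involved. The "name" / "configuration" layout is assumed to follow the v3 codec metadata convention, and the parameter values are arbitrary examples.

```python
import json

# A codec chain as plain JSON-compatible data. Nothing here imports or
# depends on Zarr; the layout and parameter values are assumptions made
# for illustration.
codecs = [
    {"name": "bytes", "configuration": {"endian": "little"}},
    {"name": "gzip", "configuration": {"level": 5}},
]

serialized = json.dumps(codecs, sort_keys=True)
roundtripped = json.loads(serialized)
```

Any project that agrees on this serialization could exchange codec settings, whether or not it stores data in Zarr.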

tl;dr: I think the list of codecs in v3 is trying to solve a problem (a language-agnostic list of codecs) that we can solve in a better way: by migrating the codec specification from Zarr v3 into its own spec.

is this too much churn in the spec

I know it sucks to hear complaints about the spec after it's been finalized. Sorry. But I want zarr v3 to be really good, and I think the way we do codecs in v3 right now is very problematic; if my concerns are valid, then we owe it to users to get this resolved as soon as possible.

@LDeakin

LDeakin commented May 7, 2024

I think one of the biggest shortfalls of Zarr V2 is the lack of codec standardisation. Numcodecs has many codecs, but they are not very useful if they are unsupported by other zarr implementations and data viewers.

A zarr implementation does not need to support every codec to be conformant, but spec'ing codecs and supporting them across more than just one implementation is essential to move forward and increase adoption. What better place to put zarr codec specs than alongside the zarr spec?

We cannot require PRs against the spec for every new codec. If writing a new codec started with getting a PR accepted in zarr-specs, nobody would ever write a new codec.

A codec does not have to start with a spec, it can start with an experimental implementation. That is basically what most of the codecs in numcodecs are. Similarly, I have multiple experimental Zarr V3 codecs implemented in zarrs that I plan to put forward once the new ZEP process has been figured out.

@d-v-b
Contributor Author

d-v-b commented May 7, 2024

I think one of the biggest shortfalls of Zarr V2 is the lack of codec standardisation. Numcodecs has many codecs, but they are not very useful if they are unsupported by other zarr implementations and data viewers.

I agree with this completely. My concern here is not whether we should standardize codecs; it's whether we should standardize codecs inside the Zarr specification document, or in a separate specification document.

What better place to put zarr codec specs than alongside the zarr spec?

I think outside the Zarr spec entirely is the best place to put the codec specs. The codecs don't depend on Zarr; instead, Zarr depends on them.

A codec does not have to start with a spec, it can start with an experimental implementation.

That's a good idea, but technically your codecs cannot start with an experimental implementation. According to the text of the spec, your experimental codec is only valid when it is defined in a separate specification, and you give your codec a URI that resolves to a human-readable specification of the codec. Personally I don't think this is a reasonable requirement for experimental codecs.

@normanrz
Contributor

normanrz commented May 7, 2024

Just copying my response from the zarr-python thread here:

I think it is useful to have a minimal set of codecs that we expect any zarr impl to support (e.g. bytes, transpose, blosc). Other codecs can be optional. I think the zarr specification is actually a good place to list available codec specs.

I feel quite strongly that non-standard codecs need to be labeled as such (e.g. through URI-style naming instead of short names). Having multiple codecs with the same name (even if the encoded format is only slightly different) would be a disaster. Perhaps zarr-python should even enforce that (i.e. don't allow short names for non-standard codecs).

@d-v-b
Contributor Author

d-v-b commented May 7, 2024

@normanrz could you elaborate on these points a bit? Do you think the spec should require or merely suggest that implementations support a fixed set of codecs? If you want this to be a requirement, how would we enforce it?

Given that the spec currently requires that all codecs have a specification, how do we formally distinguish "standard" from "non-standard" codecs? What is the process for converting a "non-standard" codec to a "standard codec", or vice versa?

@normanrz
Contributor

normanrz commented May 7, 2024

Do you think the spec should require or merely suggest that implementations support a fixed set of codecs?

Some codecs are essential to how Zarr works and should be required by all implementations. Most minimally, that is the bytes codec. Other codecs are so popular and general that all implementations should implement them, e.g. blosc, transpose, gzip, zstd, sharding_indexed. Then, there might be codecs that are only relevant for a subset of the community, such as image or segmentation compression codecs. These might be optional from a Zarr pov but required by a higher-level format (e.g. OME-Zarr).

If you want this to be a requirement, how would we enforce it?

I like to think that enforcement of the Zarr spec comes through validation from multiple implementations. When opening an array or group, implementations parse the metadata and therefore implicitly or explicitly validate the metadata.
If you only ever use your data with a single implementation, you might not get that validation. But then you also might not care about the interoperability that the spec provides.
Of course, we could (and maybe should) also provide validation tools alongside the spec (e.g. json schema).
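A minimal sketch of the kind of validation an implementation performs when it parses array metadata, assuming each codec entry is a JSON object with a string "name". The set of supported names and the function name are hypothetical.

```python
# Hypothetical set of codec names that this imaginary implementation supports.
SUPPORTED_CODECS = {"bytes", "gzip", "transpose"}


def validate_codecs(array_metadata: dict) -> list[str]:
    """Return a list of problems found in the "codecs" field; empty means OK."""
    codecs = array_metadata.get("codecs")
    if not isinstance(codecs, list):
        return ['"codecs" must be a JSON array']
    problems = []
    for index, codec in enumerate(codecs):
        # Each entry must be an object that carries a string name.
        if not isinstance(codec, dict) or not isinstance(codec.get("name"), str):
            problems.append(f"codec {index}: missing string 'name'")
            continue
        # A name this implementation does not know cannot be decoded.
        if codec["name"] not in SUPPORTED_CODECS:
            problems.append(f"codec {index}: unsupported codec {codec['name']!r}")
    return problems
```

Two implementations with different supported sets would report different problems for the same metadata, which is how cross-implementation use surfaces interoperability gaps.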

Given that the spec currently requires that all codecs have a specification, how do we formally distinguish "standard" from "non-standard" codecs?

"Standard" codecs get a short name assigned by the Zarr spec (e.g. bytes). "Non-standard" codecs have a URI-style name (e.g. https://zarr.dev/numcodecs/lz4). That way, we minimize the risk of non-standard codecs conflicting with each other. I think we can drop the requirement that the URI points to a human readable codec spec. A unique name should suffice for my concerns.
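A sketch of how an implementation could enforce this naming rule, assuming a reserved set of short names. The set below and the function name are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical list of short names reserved for standard codecs.
STANDARD_NAMES = {"bytes", "transpose", "blosc", "gzip"}


def classify_codec_name(name: str) -> str:
    """Classify a codec name as "standard" or "non-standard".

    Short names are accepted only if they are reserved; any other name
    must be URI-style, otherwise it is rejected.
    """
    if name in STANDARD_NAMES:
        return "standard"
    parsed = urlparse(name)
    if parsed.scheme and parsed.netloc:
        return "non-standard"
    raise ValueError(f"{name!r} is neither a reserved short name nor a URI-style name")
```

With this rule, a short name such as "lz4" that has not been reserved is rejected outright rather than silently treated as a standard codec.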

What is the process for converting a "non-standard" codec to a "standard codec", or vice versa?

I think we can use the ZEP process for that. Implementations that support non-standard codecs might need to support both names once a codec becomes standardized.
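Supporting both names could be as small as an alias table. This is a sketch only: the standardization of lz4 under the short name "lz4" is a hypothetical chosen for illustration.

```python
# Hypothetical alias table: the URI-style name a codec had before
# standardization, mapped to the short name assigned afterwards.
ALIASES = {"https://zarr.dev/numcodecs/lz4": "lz4"}


def canonical_codec_name(name: str) -> str:
    """Map a pre-standardization name to its short name; pass others through."""
    return ALIASES.get(name, name)
```

Metadata written before standardization stays readable, while new metadata can use the short name.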

What better place to put zarr codec specs than alongside the zarr spec?

I think outside the Zarr spec entirely is the best place to put the codec specs. The codecs don't depend on Zarr; instead, Zarr depends on them.

From a theoretical pov, I can see that splitting the codec spec from Zarr might make sense. From a practical pov, I don't see how that would make anything easier or facilitate interoperability among the Zarr impls. I think it is best to keep the codec spec in the Zarr spec.

@d-v-b
Contributor Author

d-v-b commented May 8, 2024

I think it is best to keep the codec spec in the Zarr spec.

Is the current set of codecs inside the zarr spec? I think this is actually the root of my concern.

@d-v-b
Contributor Author

d-v-b commented May 8, 2024

given that the zarr v3 spec document itself says that it doesn't define a list of codecs (and this claim is internally consistent -- that document does not in fact define a list of codecs), what spec are the codec definitions part of?

@normanrz
Contributor

normanrz commented May 8, 2024

Is the current set of codecs inside the zarr spec?

I think they are.

given that the zarr v3 spec document itself says that it doesn't define a list of codecs (and this claim is internally consistent -- that document does not in fact define a list of codecs), what spec are the codec definitions part of?

I think it is unfortunate that the paragraph you cite did not get updated during the v3 spec process (a quick git blame shows that). I agree that it is inconsistent because the spec actually lists codecs. Most implementations have implemented this list of codecs. We should certainly revise this paragraph.

@dstansby

dstansby commented Aug 11, 2024

I think one of the biggest shortfalls of Zarr V2 is the lack of codec standardisation. Numcodecs has many codecs, but they are not very useful if they are unsupported by other zarr implementations and data viewers.

I agree with this from two points of view:

  • As a data provider I want to ensure that the codec I use is supported by all applications that (claim to) open zarr datasets
  • As an implementation writer (this is a hypothetical point of view 😄), I would want to know exactly which codecs I need to implement to support reading all zarr data

So I am in favour of having a finite set of codecs included in the zarr spec that implementations must support.

To come back on some of the concerns above:

If writing a new codec started with getting a PR accepted in zarr-specs, nobody would ever write a new codec.

I'm not sure this is true - most of (all?) the codecs currently used by zarr were developed independently of the zarr spec by teams outside the zarr developers, so would exist regardless of zarr existing.

Requirements in the spec should be restricted to essential features, but supporting the Gzip compressor is simply not essential, for users who don't work with Gzip-compressed data. So any list of codecs should be a recommendation, not a requirement.

Supporting sharding is not essential for users who don't want sharded data, but it is a useful enough feature for enough people that it's worth mandating it as part of the spec, so that users who want to use it know it is guaranteed to be supported. I think the same argument holds for a list of standard codecs - I might not want to use all of them, but I want to be guaranteed that the one I do use is supported by all implementations.

There is no enforcement mechanism

Well, there's no 'enforcement mechanism' for any of the spec, but if someone wants to claim they have written an implementation then they have to implement the whole spec. I'm not sure why codecs would be any different here?

@d-v-b
Contributor Author

d-v-b commented Aug 11, 2024

So it seems like most people in this conversation believe that the v3 spec should specify a set of codecs that Zarr implementations must support. This is at variance with the language of the spec today:

To allow for flexibility to define and implement new codecs, this specification does not define any codecs, nor restrict the set of codecs that may be used. Each codec must be defined via a separate specification. In order to refer to codecs in array metadata documents, each codec must have a unique identifier, which is a URI that dereferences to a human-readable specification of the codec. A codec specification must declare the codec identifier, and describe (or cite documents that describe) the encoding and decoding algorithms and the format of the encoded data.
...
The Zarr core development team maintains a repository of codec specifications, which are hosted alongside this specification in the zarr-specs GitHub repository, and which are published on the zarr-specs documentation Web site. For ease of discovery, it is recommended that codec specifications are contributed to the zarr-specs GitHub repository. However, codec specifications may be maintained by any group or organisation and published in any location on the Web. For further details of the process for contributing a codec specification to the zarr-specs GitHub repository, see ZEP 0 which describes the process for Zarr specification changes.

To make the spec document match the general opinion expressed in this issue (i.e., that the spec should list a required set of codecs), we need to make the following changes:

  • Add language to the spec stating that the spec does contain a set of codec specifications, and all implementations MUST support the codecs defined in this set. We could explicitly list the codecs that implementations must support by name, or simply say "implementations MUST support the codecs defined in link_to_codecs.html".
  • Remove the requirement that codecs MUST have a URI that resolves to a human-readable specification, because this can't be checked. It's also not clear that the codec identifier should be a URI in the first place. See "Add wrappers for zarr v3" (numcodecs#524) for codec implementations where a URI was not used as the identifier.
  • Remove the "under construction" filler text in the header of this document, and replace it with some explanatory content: https://github.com/zarr-developers/zarr-specs/blob/main/docs/v3/codecs.rst
  • Remove this statement: "For ease of discovery, it is recommended that codec specifications are contributed to the zarr-specs GitHub repository."; codecs should only be added to the zarr-specs repo if we believe that all zarr implementations must add support for that codec; in practice, I think adding a codec to the zarr-specs repo would only make sense if all the zarr implementations already supported that codec.

Do these changes seem sufficient? If so, we can start writing up a ZEP.

@zoj613
Contributor

zoj613 commented Sep 11, 2024

Regarding the Zstd draft spec in #256, is the checksum parameter really necessary? I went through the list of implementations in different languages and it seems the large majority do not support adding a checksum to the compressed output. Wouldn't this limit how many Zarr implementations can support this codec's spec? It sounds to me that we need to be careful not to define a codec's spec based mostly on the features provided by its Python implementation and also consider what features many other languages offer.
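To make the concern concrete, here is a sketch of what an implementation built on a zstd binding without checksum support would have to do. It assumes the draft's checksum parameter is a boolean in the codec configuration; the function name and the `checksum_supported` flag are hypothetical.

```python
def check_zstd_configuration(configuration: dict, checksum_supported: bool) -> None:
    """Raise if the configuration asks for a feature this binding lacks.

    Assumes "checksum" is a boolean that defaults to False when absent.
    """
    if configuration.get("checksum", False) and not checksum_supported:
        raise NotImplementedError("this zstd binding cannot write checksums")
```

Such an implementation could support the codec only partially, which is the interoperability cost of tying the spec to a feature that few bindings offer.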
