Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define the list of codecs in the v3 spec #312

Closed
wants to merge 11 commits into from

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Sep 11, 2024

Follow-up from #293

The way the v3 spec describes codecs today is inconsistent with the needs of a large number of users and implementers. Specifically, the spec says that there is no formal list of official codecs, but a number of people believe that a formal list of official codecs is either the true intention of the spec, or simply a much better specification than the alternative.

Changes in this PR:

  • removes the language stating that the spec does not list any codecs. Said language is replaced with a definition of a "core codec" (e.g., the codecs that are de facto part of the spec today), as well as a definition of a "community codec" (e.g., a codec published by a third party).
  • Zarr implementations MUST SHOULD support core codecs
  • Zarr implementations MAY support community codecs.
  • The requirement that codec identifiers be URIs that resolve to a human-readable specification has been removed, as we weren't even following that ourselves and it's not machine-checkable by definition.

Happy to take feedback, in particular on the styling. I'm bad at rst.

I think it's worth considering the application of the same "core" vs "community" distinction for other extension points, e.g. data types.

@zoj613
Copy link
Contributor

zoj613 commented Sep 11, 2024

Does blosc really need to be a necessary codec? I feel as though Gzip is enough for a bytes->bytes core codec since its widely available. Blosc might not have bindings in many languages, making it difficult to be supported by some implementations. For example, zarr-ml cannot support blosc since as of today there aren't any OCaml bindings to the blosc C library. On the other hand, we can confidently assume nearly all modern languages have support for Gzip/Zlib given that it has been around for over 30 years.

@d-v-b
Copy link
Contributor Author

d-v-b commented Sep 11, 2024

Does blosc really need to be a necessary codec?

I cannot answer this question. For context, my personal opinion is that it's not a great idea to have a mandatory list of codecs, for exactly the reasons you list. But in #293 I was in the clear minority -- everyone else participating in the discussion preferred a strictly mandated list of codecs. The core problem is that the current v3 spec is incoherent w.r.t codecs, and we need to make it coherent. This PR represents the solution that the majority of people supported in #293.

@zoj613
Copy link
Contributor

zoj613 commented Sep 11, 2024

Does blosc really need to be a necessary codec?

I cannot answer this question. For context, my personal opinion is that it's not a great idea to have a mandatory list of codecs, for exactly the reasons you list. But in #293 I was in the clear minority -- everyone else participating in the discussion preferred a strictly mandated list of codecs. The core problem is that the current v3 spec is incoherent w.r.t codecs, and we need to make it coherent. This PR represents the solution that the majority of people supported in #293.

I see. I wasn't aware of that discussion. I would like to also caution against defining a spec based on the features offered by the python implementations of a particular codec. For example, I noted here that the Zstd spec includes a checksum parameter that does not seem to be supported by many Zstd implementations across different languages. It appears to me that the parameter is influenced by the feature offered by the Zstd python implementation even though many other implementations do not seem to support it. This would make it difficult to full support that spec across Zarr implementations

@d-v-b
Copy link
Contributor Author

d-v-b commented Sep 11, 2024

@zoj613 would you mind bringing up your concerns in #293 for visibility?

@d-v-b
Copy link
Contributor Author

d-v-b commented Sep 11, 2024

(I think we could salvage this PR by changing some "MUST support" statements to "SHOULD support")

@LDeakin
Copy link

LDeakin commented Sep 11, 2024

(I think we could salvage this PR by changing some "MUST support" statements to "SHOULD support")

This is in line with my thoughts in #293:

A zarr implementation does not need to support every codec to be conformant

and @normanrz:

Some codecs are essential to how Zarr works and should be required by all implementations. Most minimally, that is the bytes codec. Other codecs are so popular and general that all implementations should implement it

@d-v-b d-v-b marked this pull request as ready for review September 12, 2024 11:42
@d-v-b
Copy link
Contributor Author

d-v-b commented Sep 12, 2024

I think this is ready for review. I removed the requirement that zarr implementations MUST support the core codecs. instead, they SHOULD support them.

@d-v-b
Copy link
Contributor Author

d-v-b commented Sep 12, 2024

@zarr-developers/steering-council could someone have a look at this?

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 3, 2024

@zarr-developers/steering-council I really would appreciate some feedback on this PR. It has been ready for review for 3 weeks with no response. What is the blocker here?

For context, in zarr-python there are efforts to implement v3 versions of zarr v2 compression routines. The spec as currently written does not offer clear guidance for this process, and I believe that the changes contained in this PR will offer much needed clarifications that will be of immediate use to at least one implementation (zarr-python). If the strategy for codecs outlined in this PR is viable, then we can apply the same approach to datatypes.

developers and Zarr users deem the codec worth removing, e.g. because of a technical flaw in the
algorithm underlying the codec.

Community codecs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "Community" used elsewhere? Maybe "Extension", which I think is used in other places ("extension data types" at least).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget how I picked the word "community"... I don't think it's a term of art in the spec. If "extension" is a better fit I'm happy to switch to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I favor "extension" over community because it fits with the existing Zarr extension framework.

----------------

Zarr implementations MAY support a codec that is not in the list of core codecs
(hereafter termed a "community codec"), provided the community codec does not use an identifier

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth namespacing codec identifiers somehow? I'm slightly against, since ideally codecs can graduate from "community" (or "extension") to "core" (after careful consideration). It'd be nice to avoid needing to rewrite metadata files when that happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do any namespacing of the identifiers, we should not use any information about whether the codec is core or community / extension, for the reasons you note. Otherwise I don't have a preference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what is written here is fine. Namespacing codecs sounds complicated.

rabernat
rabernat previously approved these changes Oct 4, 2024
Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this great. I support the change "community codec" -> "extension codec"

developers and Zarr users deem the codec worth removing, e.g. because of a technical flaw in the
algorithm underlying the codec.

Community codecs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I favor "extension" over community because it fits with the existing Zarr extension framework.

----------------

Zarr implementations MAY support a codec that is not in the list of core codecs
(hereafter termed a "community codec"), provided the community codec does not use an identifier
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what is written here is fine. Namespacing codecs sounds complicated.

docs/v3/core/v3.0.rst Outdated Show resolved Hide resolved
Co-authored-by: Ryan Abernathey <ryan.abernathey@gmail.com>
@rabernat
Copy link
Contributor

rabernat commented Oct 7, 2024

@d-v-b did you decide to stick with "community" over "extension"? I weakly favor "extension," but I don't want to block this PR on that point.

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 7, 2024

i'm happy to switch over to "extension" but i haven't changed the text yet

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 8, 2024

"community" is out, "extension" is in.

@joshmoore
Copy link
Member

joshmoore commented Oct 9, 2024

Few comments:

  • Thanks to @d-v-b for all the clean up!
  • On the MUST/SHOULD front (link), it might be good to make sure folks from sign off if it's reversing a previous decision.
  • As I mentioned to @d-v-b at a recent meeting, I'd would favor guidance for codec implementors wherever we have a preference on conventions moving forward, e.g., "codec names are cheap; prefer grabbing a new name (my-codec-2) over breaking the schema of your codec configuration."

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 9, 2024

  • I'd would favor guidance for codec implementors wherever we have a preference on conventions moving forward, e.g., "codec names are cheap; prefer grabbing a new name (my-codec-2) over breaking the schema of your codec configuration."

This PR is deliberately non-normative about the content of the extension codecs. The intention is that the maintenance of the extension codecs should fall to their authors. If spec maintainers must enforce a recommendation that extension codec authors avoid breaking changes to their codecs, then we are making spec maintainers assume a maintenance role for those codecs, which is an outcome we want to avoid. In fact, the entire point of this PR is to provide a path for members of the community to publish extension codecs with as little work from spec maintainers as possible. This is the spirit of the following text from the PR:

Publication in the "extension codecs" section does not confer primacy or an official designation to a codec. The list of extension codecs exists expressly as a tool to enable coordinated codec development.

Also, your requested recommendation is not so simple. We cannot simply recommend that codec developers avoid breaking changes to their codecs -- a codec developer can use semver in the configuration for their codec, and they can define in their codec specification how implementations of that codec should handle breaking changes.

It is exactly because this matter is complicated that I do not wish to comment on it in this PR, hence my use of limited language above. If we are to make recommendations to codec developers (and managing versioning is just one of many possible recommendations), it should not be in the zarr v3 specification but rather a "best practices for codec development" document hosted elsewhere. I would be happy to review such a document if anyone wants to open a PR to add one.

Co-authored-by: David Stansby <dstansby@gmail.com>
@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 10, 2024

The goal of this PR is not to carefully map the space of codecs. We should do that somewhere else. The goal of this PR is to clarify the language of the spec in a useful way for codec development, in particular solving two acute, practical problems:

  • defining the status of the codecs that are currently listed in the spec (the "core codecs")
  • defining a minimal framework for publishing codecs that are not listed in the spec ("extension codecs") that solves an active problem: over in zarr-python we are back-porting zarr v2 codecs into zarr v3, and we (the zarr community) needs a place to publish the specs for those codecs so that other implementations can work with them.

So, for the purposes of this PR, there are only two classes of codecs under consideration: those listed in the spec ("core codecs"), and those not listed in the spec ("extension codecs").

For the codecs currently listed in the spec (core codecs), this PR addresses the following question: SHOULD implementations support those codecs, or MUST they support those codecs?

The answer I have reached in this PR is a decisive SHOULD. Suppose someone wants to write a Zarr implementation in a language X that doesn't have any bindings to blosc. If the spec requires that all Zarr implementations MUST support the core codecs, then language X cannot have a valid Zarr implementation until blosc bindings are added, and this is an unfair burden on members of the language X ecosystem. Crucially, this is not a hypothetical scenario. Because the alternative would cause serious problems, this PR alters the spec to state that the core codecs SHOULD be supported by Zarr implementations.

For codecs not listed in the spec (extension codecs), this PR addresses the following question: what is the relationship between extension codecs and core codecs? (answer: they have to use a different name from the core codecs). This PR also defines a service for zarr community members, wherein the zarr spec developers will host their codec. But, as stated in the PR, being listed doesn't confer any special status to an extension codec.

This PR represents my attempt to make the minimal amount of changes to resolve serious ambiguity in the spec document which is a barrier for codec development. If there are proposed changes to this PR that can better solve the problems I am trying to solve then please suggest them.

A reason I think this is important is the relationship to other future extensions. Would we feel comfortable with a similar level of confidence for an "extension data type" that we list in zarr-specs?

I think the changes I am proposing in this PR could work for data types as well. I would use the same two classes, with the same rules for members of each class.

@dstansby
Copy link

Aside from my comments above, I wanted to say that I am 👍 this change overall, and thanks a bunch for opening a concrete proposal to push this forward.

@joshmoore
Copy link
Member

joshmoore commented Oct 10, 2024

The goal of this PR is not to carefully map the space of codecs....

@d-v-b, though I appreciate your immediate goals, since this touches on our shared understanding of terminology it's unfortunately bigger. I was hoping the table would help since I believe a source of conflict here is that the definitions have shifted for some of us: we're not speaking the same language. How would you propose to work towards consensus here?

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 10, 2024

The goal of this PR is not to carefully map the space of codecs....

@d-v-b, though I appreciate your immediate goals, since this touches on our shared understanding of terminology it's unfortunately bigger. I was hoping the table would help since I believe a source of conflict here is that the definitions have shifted for some of us: we're not speaking the same language. How would you propose to work towards consensus here?

I don't know what the disagreement or conflict is -- what changes do you specifically want to see in this PR? We have discussed adding recommendations to codec developers, but since we disagree on the content of that recommendation I think it's out of scope for this PR, and should probably be contained in a separate PR; I'd be happy to continue that particular conversation in such a PR.

@joshmoore
Copy link
Member

we disagree on the content of that recommendation I think it's out of scope for this PR

The fact that we disagree means that we will likely be having this conversation in any follow up PR, so unfortunately, I'd say it's in scope until we at least agree on the definition of what's in this PR.

Questions I have are:

  • core
    • Are there any codecs required by all implementations? (A) - I think we're still waiting on some responses to your ping.
  • extensions
    • what's the stability of "extension codecs"? (C vs D)
    • should codecs that are for "evaluation only" as @LDeakin said have their schema listed (D), just be named (E) or maybe not be part of this repo (F)? If not D, does that point to a different type (e.g. "community") again?

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 10, 2024

Are there any codecs required by all implementations? (A) - I think we're still waiting on some responses to your ping.

no, pending feedback of course.

what's the stability of "extension codecs"? (C vs D)

should codecs that are for "evaluation only" as @LDeakin said have their schema listed (D), just be named (E) or maybe not be part of this repo (F)? If not D, does that point to a different type (e.g. "community") again?

These questions are important but they are out of scope for this PR. As stated in the text of this PR, an extension codec is any codec that is not listed in the list of core codecs, and any extension codec may be submitted for publication on the zarr-specs repo. I have labored to keep this PR simple, but it feels to me that you are broadly suggesting that I make this PR more complex. Given that nothing in this PR precludes additional efforts to refine our definition of extension codecs, I would ask that we put those efforts and that conversation in a separate venue, e.g. an issue or a PR.

@joshmoore
Copy link
Member

These questions are important but they are out of scope for this PR. As stated in the text of this PR, an extension codec is any codec that is not listed in the list of core codecs, and any extension codec may be submitted for publication on the zarr-specs rep

I'm sorry, I disagree. Since this PR is defining these terms, it makes agreeing on that definition important, and will impact those future PRs.

it feels to me that you are broadly suggesting that I make this PR more complex.

No. Apologies. I'm striving to find an agreed understanding of these things you are defining in admittedly nicely simple terms. That's critical since as I described to you on the call and above I feel less than comfortable with the statement "any extension codec maybe be submitted" when the definition of extension is class D above (as an example).

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 10, 2024

These questions are important but they are out of scope for this PR. As stated in the text of this PR, an extension codec is any codec that is not listed in the list of core codecs, and any extension codec may be submitted for publication on the zarr-specs rep

I'm sorry, I disagree. Since this PR is defining these terms, it makes agreeing on that definition important, and will impact those future PRs.

"codecs in the spec" and "codecs not in the spec" are concepts that exist in the spec already -- I am merely giving these concepts names in my PR.

it feels to me that you are broadly suggesting that I make this PR more complex.

No. Apologies. I'm striving to find an agreed understanding of these things you are defining in admittedly nicely simple terms. That's critical since as I described to you on the call and above I feel less than comfortable with the statement "any extension codec maybe be submitted" when the definition of extension is class D above (as an example).

Why does it make you uncomfortable?

@joshmoore
Copy link
Member

Rather than reiterate my points from above, I'll add that @rabernat's suggestion of a maturity level from https://stac-extensions.github.io/ lessened the nervous. So perhaps the change I'd consider here is how do we not close the door on such an addition in a follow-up PR? Perhaps as with @dstansby's suggestion, we try to say less rather than trying to work out the definitions at this point.

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 11, 2024

So perhaps the change I'd consider here is how do we not close the door on such an addition in a follow-up PR?

What specifically in this PR closes the door on follow-up PRs?

@joshmoore
Copy link
Member

In my reading, it's defining extension points in the most permissive way when I would define them more restrictively. If we can agree on text to basically say "there are codecs not in the spec" then we can decide on this maturity categorization later.

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 11, 2024

it's defining extension points in the most permissive way when I would define them more restrictively.

This is exactly the kind of thing that can be changed in a follow-up PR, no?

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 11, 2024

Maybe it's best if we focus on a real example, because that's what this PR is trying to solve: we are actively porting zarr-python v2 codecs into zarr-python v3. To avoid fragmentation of the zarr ecosystem, we should put the specifications for those codecs somewhere outside of zarr-python to ensure that non-python implementations can support them. This PR is an attempt to solve that problem. Do you have a better solution? Do you think this PR fails at solving that problem?

@joshmoore
Copy link
Member

This may be too far in the other direction but this is something I feel more comfortable with:

diff --git a/docs/v3/core/v3.0.rst b/docs/v3/core/v3.0.rst
index 0ab3cf2c..e944d63e 100644
--- a/docs/v3/core/v3.0.rst
+++ b/docs/v3/core/v3.0.rst
@@ -1233,19 +1233,18 @@ An existing core codec SHOULD be removed from the list when a sufficient number
 developers and Zarr users deem the codec worth removing, e.g. because of a technical flaw in the
 algorithm underlying the codec.

-Extension codecs
-----------------
+Other codecs
+------------

 Zarr implementations MAY support a codec that is not in the list of core codecs
-(hereafter termed an "extension codec"), provided the extension codec does not use an identifier
+provided the extension codec does not use an identifier
 that is already used by a core codec; the identifiers of core codecs are reserved.
-
-This specification places no other constraints on extension codecs. It is possible, through discouraged,
+This specification places no other constraints on these codecs. It is possible, through discouraged,
 that separate developers may define distinct codecs that use the same identifier.
 To minimize the impact of such name collisions, codec developers are strongly encouraged
-to publish their codec specifications as additions to the "extension codecs" section of Zarr v3 specification.
-Publication in the "extension codecs" section does not confer primacy or an official designation to a codec.
-The list of extension codecs exists expressly as a tool to enable coordinated codec development.
+to register their codec names in the XXX section of Zarr v3 specification.
+Listing a codec in this way exists expressly as a tool to enable coordinated codec development
+and does not confer primacy or an official designation to a codec.

 Stores
 ======

https://github.com/d-v-b/zarr-specs/compare/fix/list-of-codecs...joshmoore:zarr-specs:fix/list-of-codecs?expand=1

I note that I was also not sure where these codecs should be listed and so left a XXX. One thing that occurs to me is how to prevent future collisions:

  • Do we try to manage incoming names?
  • Do we state that future core codecs may steal back the names?

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 11, 2024

thanks for the suggestions. I think the name "other codecs" is a step down from "extension codecs", but I'm happy to listen to what other people think, in particular those who preferred "extension codecs" over "community codecs" (@TomAugspurger @rabernat).

I note that I was also not sure where these codecs should be listed and so left a XXX.

I think whenever we decide on a concrete location for the non-core codecs we can use a hyperlink to point to that location. So IMO it's fine if the text says "the section for the spec".

One thing that occurs to me is how to prevent future collisions:

Do we try to manage incoming names?
Do we state that future core codecs may steal back the names?

It would be helpful if you could explain in more detail what collisions you are referring to. Name collisions between core codecs and non-core codecs are impossible because the names of core codecs are reserved. Name collisions between non-core codecs are totally possible but discouraged (as per common sense).

Do we try to manage incoming names?

Beyond excluding names that violate github TOS, managing incoming names seems too open-ended. if we attempt to curate the list of non-core codecs in such a way that we try to avoid name collisions between non-core codecs, then we encourage name squatting, e.g. where one implementation of jpeg compression with 1 user reserves the name jpeg before a jpeg codec with 1000 users. This repo does not have the bandwidth to adjudicate these things at this time.

Do we state that future core codecs may steal back the names?

I am guessing here that you are describing a scenario in which a non-core codec is made a core codec, and this results in the name of that codec becoming exclusive, which thereby invalidates any other non-core codecs that were currently using that name. I see a few options:

  • simply allow this outcome
  • prevent this outcome by making the list of core codecs immutable
  • only accept a new core codec if accepting it would not create name collisions
  • avoid name collisions entirely by identifying codecs with UUIDs instead of human-readable names

But before we get into these issues, I want to ask: does this PR prevent us from solving these problems in a follow-up PR?. If so, what changes should we make to this PR so that we can deal with these more complex issues in a follow-up PR? If not, can we have this discussion in a follow-up PR? Because the name collision stuff is entirely hypothetical at this point. For the problems we have today, I think the content of this PR is necessary and sufficient.

@joshmoore
Copy link
Member

I think the name "other codecs" is a step down from "extension codecs"...

Agreed, but I was trying to leave us open to introduce either or both in the follow up.

So IMO it's fine if the text says "the section for the spec".

Just that we might prepare that location in this PR if we're saying to use it.

It would be helpful if you could explain in more detail what collisions you are referring to.

One collision scenario:

  • the wider codec community develops a great new codec (e.g. blosc3)
  • someone is motivated to use it ASAP and registers "blosc3" on the list.
  • core devs decide that they would like to use the well-known name "blosc3"

This repo does not have the bandwidth to adjudicate these things at this time.

Trying to do that would worry me, too. (This is likely the reason for the URI rule in the previous version.)

I am guessing here that you are describing a scenario in which a non-core codec is made a core codec, and this results in the name of that codec becoming exclusive, which thereby invalidates any other non-core codecs that were currently using that name

Yes exactly, thanks.

I see a few options:

💯

  • simply allow this outcome

This could invalidate data, no?

  • prevent this outcome by making the list of core codecs immutable

i.e. pinned to a major version. Certainly lowers the benefit of the list, and might put the SHOULD in question.

  • only accept a new core codec if accepting it would not create name collisions

This would work for the list, but there can still be codecs not in the list (back to data invalidation).

  • avoid name collisions entirely by identifying codecs with UUIDs instead of human-readable names

Ha! I imagine there will be reasons not to, but I like this as a thought experiment.

does this PR prevent us from solving these problems in a follow-up PR?

I think this diff introduces the issue

-Each codec must be defined via a separate
-specification. In order to refer to codecs in array metadata
-documents, each codec must have a unique identifier, which is a URI
-that dereferences to a human-readable specification of the codec.

If so, what changes should we make to this PR so that we can deal with these more complex issues in a follow-up PR? If not, can we have this discussion in a follow-up PR?

If you want to roll back the URI change, then we can deal with that later as well. Another option would be to go to an integration branch and we can collaborate on the overall fix.

Because the name collision stuff is entirely hypothetical at this point. For the problems we have today, I think the content of this PR is necessary and sufficient.

I sincerely and wholeheartedly disagree and I will continue to say this widely. Spec development's impacts are too great to make changes to the main version of this document without considering those hypotheticals. It is exactly the job of zarr-spec core-devs.

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 11, 2024

-Each codec must be defined via a separate
-specification. In order to refer to codecs in array metadata
-documents, each codec must have a unique identifier, which is a URI
-that dereferences to a human-readable specification of the codec.

I removed the codec URI statement because that text conflicts with the actual definitions of the codecs we have today, and the actual practice of codec development.

  • we are not using the codec URIs in array metadata documents; therefore those URIs cannot aid in referring to codecs in said documents.
  • it is not practical to check that a URI resolves to a human-readable document, especially for codecs that people are actively developing. Read strictly, this requirement means that a codec cannot start with an implementation, which means that this requirement is worthless for real codec development.

I am happy to consider changes to the use of URIs in the codec definition, but they have to be consistent with how we are doing things today, unless you are interested in breaking changes to all of our codecs.

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 11, 2024

I think the name "other codecs" is a step down from "extension codecs"...

Agreed, but I was trying to leave us open to introduce either or both in the follow up.

I'm confused -- if you agree that "other codecs" is a worse name than "extension codecs", then why propose it here?

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 11, 2024

Because the name collision stuff is entirely hypothetical at this point. For the problems we have today, I think the content of this PR is necessary and sufficient.

I sincerely and wholeheartedly disagree and I will continue to say this widely. Spec development's impacts are too great to make changes to the main version of this document without considering those hypotheticals. It is exactly the job of zarr-spec core-devs.

This quote omits a crucial part of my statement, which was that we should address the hypothetical issues in a separate PR. I will not say anything more about this.

@joshmoore
Copy link
Member

they have to be consistent with how we are doing things today

I agree, but also with the spec. That's why there's no easy answer.

I'm confused -- if you agree that "other codecs" is a worse name than "extension codecs", then why propose it here?

It was an attempt to keep the names unused for following changes.

This quote omits a crucial part of my statement, which was that we should address the hypothetical issues in a separate PR.

Except that every commit to this repository is for all intents and purposes a release, so (again a hypothetical) someone could read the spec and start reading in that interim period.

@rabernat
Copy link
Contributor

rabernat commented Oct 11, 2024

Maybe it's best if we focus on a real example, because that's what this PR is trying to solve: we are actively porting zarr-python v2 codecs into zarr-python v3. To avoid fragmentation of the zarr ecosystem, we should put the specifications for those codecs somewhere outside of zarr-python to ensure that non-python implementations can support them. This PR is an attempt to solve that problem.

This point of view really resonates with me. As we get closer to the Zarr Python 3 release, it's apparent to me that we currently have a choice between either:

  • Doing things that are out of spec for V3 but needed to support actual users and achieve feature parity with V2 (e.g. consolidated metadata, vlen-utf8 codec, etc.)
  • Evolving the spec somehow to formalize these features

This is forcing us to come face-to-face with problems and inconsistencies in the V3 spec that have been present for a long time. IMO, the root of these problems is the flawed notion that we could somehow write a good spec prior to actually attempting to implement it. But now that we are implementing it, the spec is already frozen and unable to accommodate changes needed to enable these features.

Codecs are the prime example: where do the they belong? How do we add new ones? Does this require ZEPs? The spec is inconsistent and incomplete on this topic, and I'm grateful to @d-v-b for trying to find solutions.


If we accept that the V3 spec is basically immutable, here's the alternative: rather than trying to list the codecs in the spec, we return to the original spirit of the V3 spec, which is to liberally enable extensibility. On the topic of codecs, it says:

To allow for flexibility to define and implement new codecs, this specification does not define any codecs, nor restrict the set of codecs that may be used. Each codec must be defined via a separate specification.

To me, this pairs extremely well with the idea of community-developed extensions. All codecs (except bytes I guess) will be defined via extensions, and those extensions can and should be maintained and evolved separately from the spec and the ZEP process. I think @TomAugspurger's suggestions in #316 are great.

The way we would move forward with this route would be as follows:

  1. Make a ZEP (and possibly clarification update to the spec) to explain (in more precise detail than currently exists) how to use the extension mechanism and the process for creating and sharing extensions.
  2. Define codecs via appropriate extensions, following the process described in 1
  3. Adapt implementations to parse and understand the presence of extensions in Zarr data

This seems especially appropriate since we clearly can't find consensus about which codecs belong in a list of "core" codecs. We need to find a way to move forward with development without such a consensus.


Until something like that happens, Zarr Python will have to continue with the process of implementing things that are out of spec in order to reach feature parity with V2. (Perhaps we could issue warnings in these cases?)

The hope is that we can retroactively adopt the extension approach to bring these features in line with some kind of spec.

How does this sound to people?

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 12, 2024

I am going to close this PR. After #293 I believed that there was broad consensus on two points:

  • the current language of the spec on codecs is incoherent
  • we can make the spec coherent in a non-breaking way with a few clarifying changes to the spec.

Beyond the ongoing harm caused by a specification document that is half-baked, there is also an immediate need for guidance as we develop new codecs for zarr v3 (a process that is encouraged by language in the spec).

Thus I attempted to implement the conclusions of #293 in this PR, and with a few exceptions this PR received broad support. I took great care to ensure that my changes were as minimal as possible, and I adapted the text to helpful feedback. With one notable exception, these changes were approved by the members of this discussion. However, it has since become clear that there is no clear path forward for this PR, and perhaps for any PR against the spec. I don't think it's worth the effort to continue here.

Except that every commit to this repository is for all intents and purposes a release, so (again a hypothetical) someone could read the spec and start reading in that interim period.

@joshmoore if this is something you believe, then I think you should feel great urgency to fix the incoherent parts of the spec. The spec has been broken for over a year. That is the cause of much more harm than anything in this PR.

@d-v-b d-v-b closed this Oct 12, 2024
@rabernat
Copy link
Contributor

Davis, I'm sorry for the frustration this conversation has caused. I really appreciate your efforts, I'm committed to working with you to find our way out of this situation so we can keep development moving forward.

@joshmoore
Copy link
Member

then I think you should feel great urgency to fix the incoherent parts of the spec. The spec has been broken for over a year. That is the cause of much more harm than anything in this PR.

I appreciate and agree with your analysis and I like framing it as a call for urgency. For what it's worth I did invest substantial time in this PR because I support what it's trying to do. I found myself, however, running into root causes which I then tried to resolve, recursively.

  • First our common understanding of what should be required (MUST vs SHOULD)
  • and what our definition of an extension is.
  • Followed by how we will transition extensions to core (incl. namespacing)
  • and even how we version the spec.
  • And underlying it all, how we go about making these decisions.

I hope you can appreciate that I was also frustrated at trying to unravel a year-long ball of yarn with the pressure of this concrete change. That's not your fault, but I was also only realizing it in real-time, surprised to find the state of things.

My urgency, then will be directed at those root causes, largely in reverse order, and I'd encourage those who are still motivated to tackle those issues so we can more efficiently tackle the many remaining issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants