Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

canonicalization: Add recommendations for canonicalization #339

Merged
merged 1 commit into from
Oct 6, 2016

Conversation

wking
Copy link
Contributor

@wking wking commented Sep 23, 2016

Spun off from #259, now that we seem to have reached a consensus for “SHOULD canonical JSON” there. I've set this up so we have space to add canonicalization recommendations for other formats, although the only other basic type discussed in this repository is a gzipped tarball and that's more than I want to bite off at the moment ;).

@wking wking mentioned this pull request Sep 23, 2016
@wking
Copy link
Contributor Author

wking commented Sep 24, 2016

On Sat, Sep 24, 2016 at 10:04:53AM -0700, Jonathan Boulle wrote:

@@ -0,0 +1,19 @@
+# Canonicalization

This feels a bit weird lacking context - could we not wrap it into
the document talking about the json types?

Canonicalization is a way to avoid redundant blobs for a particular
semantic payload. That idea applies to all media types, it's just
that we currently only have canonicalization advice for JSON. Once we
get canonicalization for gzipped tarballs (or any other OCI-related
types), we can add that here too. But as I said in the topic post
here, I'm not clear enough on what gzipped tarball canonicalization
would look like to add it yet ;).

@jonboulle
Copy link
Contributor

OK, that makes sense. How about moving this to a more generic "implementers guide"-type document (as a loose example of this see appc)

@wking
Copy link
Contributor Author

wking commented Sep 26, 2016

On Mon, Sep 26, 2016 at 07:10:31AM -0700, Jonathan Boulle wrote:

OK, that makes sense. How about moving this to a more generic
"implementers guide"-type document (as a loose example of this see
appc)

I'm ok with that, but I prefer a normative SHOULD to informative
“guidelines, tips and suggested best practices”. I expect the
validation tooling to (optionally) warn about violated SHOULDs but not
about violated informative hints. Do you see other distinctions
between SHOULD and informative hints?

Copy link
Contributor

@jonboulle jonboulle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fair point, I think I'm just not quite on board with the wording yet. Added some suggestions.

@@ -0,0 +1,19 @@
# Canonicalization

One benefit of content-addressable storage is easy deduplication.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly more intro context - something like "One of the goals of the OCI Image Specification is to leverage content-addressable storage (CAS), which provides benefits like easy deduplication"


## JSON

[JSON][] content SHOULD be serialized as [canonical JSON][canonical-json].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON content (for example, ...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON content (for example all of the bla bla media types or bla bla manifests) SHOULD be serialized as


One benefit of content-addressable storage is easy deduplication.
Many images might depend on a particular [layer](layer.md), the there will only be one blob in the [store](image-layout.md).
However, that same semantic layer serialized slightly differently would have a different hash, and if both versions of the layer are referenced there will be two blobs with the same semantic content.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that same layer, if serialized in a different way, would have an entirely different hash, and hence if both versions of the layer, there will be two underlying blobs in the store with the same semantic comment.

@wking
Copy link
Contributor Author

wking commented Oct 4, 2016

On Tue, Oct 04, 2016 at 06:02:13AM -0700, Jonathan Boulle wrote:

+One benefit of content-addressable storage is easy deduplication.

Slightly more intro context - something like "One of the goals of
the OCI Image Specification is to leverage content-addressable
storage (CAS), which provides benefits like easy deduplication"

I don't think content-addressability is a goal in itself. It's just a
useful pattern to support for safe, efficient distribution. In
0b0c8e63eb6af7, I've rebased around #365 and added another
lead-in sentence pointing out that the current OCI spec does leverage
content-addressability. Let me know if you'd like further rewording
on this.

+## JSON
+
+[JSON][] content SHOULD be serialized as [canonical JSON][canonical-json].

JSON content (for example, ...)

This line is a normative SHOULD, and if the PR lands I will file
image-validation code that warns if a JSON blob is not canonical. I'd
rather not confuse it by adding an open-ended “for example”. Maybe
what you're looking for is a trailing line in the lead-in paragraph
like:

The following sections contain canonicalization recommendations for
OCI-related media types.

+However, that same semantic layer serialized slightly differently would have a different hash, and if both versions of the layer are referenced there will be two blobs with the same semantic content.

that same layer, if serialized in a different way, would have an
entirely different hash, and hence if both versions of the layer,
there will be two underlying blobs in the store with the same
semantic comment.

“hence if both versions of the layer” seems to be missing a word or
two. I've dropped my “However”, shifted the difference condition to
the front, and dropped “slightly” in 3eb6af7.

@jonboulle
Copy link
Contributor

@wking if someone does an inline review, please reply inline, otherwise it's impossible to follow

# Canonicalization

OCI Images [are](descriptor.md) [content-addressable](image-layout.md).
One benefit of content-addressable storage is easy deduplication.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I stand corrected. I still think the current wording covers that though. Would you still prefer your initial suggestion?


OCI Images [are](descriptor.md) [content-addressable](image-layout.md).
One benefit of content-addressable storage is easy deduplication.
Many images might depend on a particular [layer](layer.md), the there will only be one blob in the [store](image-layout.md).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the there" needs fixing

@@ -47,3 +47,5 @@ The following figure shows how the above media types reference each other:

[Descriptors](descriptor.md) are used for all references.
The manifest list being a "fat manifest" references one or more image manifests per target platform. An image manifest references exactly one target configuration and possibly many layers.

[JSON]: http://json.org/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reference-style link label defining JSON. In the in-flight opencontainers/runtime-spec#584 I change a similar runtime-spec link to point at RFC 7159, and I can do that here too if you like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fine, I asked the wrong question :/. Let me try again: why is it here? (I don't see any JSON references in this document)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a separate file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, so it is. Thanks for banging my head into that until I understood :p.

@jonboulle
Copy link
Contributor

Looks okay overall after a few more fixes

@wking
Copy link
Contributor Author

wking commented Oct 5, 2016

On Wed, Oct 05, 2016 at 05:09:38AM -0700, Jonathan Boulle wrote:

@wking if someone does an inline review, please reply inline,
otherwise it's impossible to follow

I'll try, but GitHub no longer has a way to do that via email. When I
do reply by email, I'll be sure to quote enough context so my reply
makes sense.

@wking
Copy link
Contributor Author

wking commented Oct 5, 2016

I've pushed 3eb6af74729086 which:

  • Fixed “the there” → “but there” 1
  • Added a new sentence linking to media-types.md for our JSON
    types 2. This feels pretty obvious to me, and the SHOULD
    canonicalize also applies to all JSON content regardless of
    whether it's an OCI type, so I'm fine dropping the new
    sentence. Using a new sentence (instead of @jonboulle's
    floated parenthetical) avoids complicating the normative
    SHOULD.


* [Go][]: [github.com/docker/go][], which claims to implement [canonical JSON][canonical-json] except for Unicode normalization.

Of the [OCI Image Format Specification media types](media-types.md), all the types ending in `+json` contain JSON content.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this above the "Implementations" imo.

@wking
Copy link
Contributor Author

wking commented Oct 5, 2016

I've pushed 4729086b695fb3 shifting the media-types JSON link up
before the list of implementations 1.

Spun off from [1], now that we seem to have reached a consensus for
"SHOULD canonical JSON" there.  I've set this up so we have space to
add canonicalization recommendations for other formats, although the
only other basic type discussed in this repository is a gzipped
tarball and that's more than I want to bite off at the moment ;).

[1]: opencontainers#259
     Subject: manifest json fields order

Signed-off-by: W. Trevor King <wking@tremily.us>
@wking
Copy link
Contributor Author

wking commented Oct 5, 2016

I've pushed b695fb31685bae dropping the JSON link label from
media-types.md 1. No idea what I was thinking with that, but thanks
to @jonboulle for pointing it out to me (repeatedly) until I caught
on.

@jonboulle
Copy link
Contributor

jonboulle commented Oct 5, 2016

LGTM for now

Approved with PullApprove

@vbatts
Copy link
Member

vbatts commented Oct 6, 2016

while i'm not in disagreement here, i fell like this is describing something without describing anything.
Like is canonical serialization, the use of symlinks to deduplicate blobs? or something else?

@wking
Copy link
Contributor Author

wking commented Oct 6, 2016

On Thu, Oct 06, 2016 at 09:44:02AM -0700, Vincent Batts wrote:

while i'm not in disagreement here, i fell like this is describing
something without describing anything. Like is canonical
serialization, the use of symlinks to deduplicate blobs? or
something else?

How would symlinks deduplicate blobs? Maybe if you had a cluster of
image-layouts? If so, that's a CAS-engine optimization (reducing the
disk space used to store a set of blobs by leaning on other
image-layouts). This PR is about optimizing the CAS content
(reducing the number of blobs you need to store). I expect both are
useful, but advice about CAS-engine optimization should probably be
part of a separate issue/PR.

@vbatts
Copy link
Member

vbatts commented Oct 6, 2016

LGTM

Approved with PullApprove

@vbatts vbatts merged commit f94caad into opencontainers:master Oct 6, 2016
@stevvooe
Copy link
Contributor

I know we've already merged this, but how can Canonical JSON be considered a viable specification when the only reference is to a wiki page that could be changed at any time? I brought this up at the time when we started using canonical json in docker. Docker Distribution attempts to take this further, but canonical json is insufficient for this purpose.

I am really confused as to why we consider Canonical JSON a viable specification.

@wking
Copy link
Contributor Author

wking commented Oct 12, 2016

On Wed, Oct 12, 2016 at 02:21:25PM -0700, Stephen Day wrote:

I know we've already merged this, but how can Canonical JSON be
considered a viable specification when the only reference is to a
wiki page that could be changed at any time?

I agree that a more formal spec would be nice 1. If you have one,
I'm happy to link to it instead. But this PR landed a SHOULD. In the
event that the wiki mutates under us it isn't a disaster, we just
start warning about any incompatibilities.

And the last change to the spec itself was in 2008 2, so it's fairly
stable.

I brought this up at the time when we started using canonical json
in docker. Docker
Distribution

attempts to take this further, but canonical json is insufficient
for this purpose.

The goal is to reduce the number of blobs carrying the same semantic
content. You don't have to have a complete canonicalization to get
some reduction. In practice, I think the canonical JSON spec will be
sufficient to avoid most duplication. And 3 would be fine too (is
that spec versioned independently of docker/distribution as a whole?).
But either of those are better at reducing duplication than not
providing any canonicalization advice at all.

@wking wking deleted the should-canonical-json branch October 18, 2016 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants