pkg/compression: new zstd variant zstd:chunked #1084
Conversation
(force-pushed 304cb1c → 7ccf65a)
@cyphar what do you think about this idea? :-) I've implemented (still hacky) the deduplication also with host files, e.g. pulling fedora:33 on a Fedora 33 host gives this result: […]
(force-pushed 13435c2 → 9d39fae)
Just an extremely partial skim, I have read < 10% of the code, just to get a basic idea of the broad structure.
- At a first uninformed glance, I’d prefer all of the filesystem knowledge to be in c/storage.
- Should this wait until the c/image/storage parallel pull code (not necessarily the more complex “concurrent pull detection” part) is revived?
- Is there a relationship to Support layer deltas #902? Is this strictly more powerful, or are the two independent optimizations, each with its own advantages, that can live alongside each other?
docker/errors.go (outdated):

```
@@ -17,6 +17,14 @@ var (
	ErrTooManyRequests = errors.New("too many requests to registry")
)

// ErrBadRequest is returned when the status code returned is 400
type ErrBadRequest struct {
```
Is there any user for this? If this is supposed to be a part of the GetBlobAt interface, the caller is probably not going to be checking for a docker-transport-specific error.
storage/chunked_zstd.go is using it.
Oops, GitHub is auto-collapsing files…
I don’t think it makes sense to have a generic interface with a specific error condition like that; compare types.ManifestTypeRejectedError, as an explicitly documented error case in the PutManifest method.
(Alternatively, it would be consistent to just say “this is totally docker+c/storage-specific, there will be no generic interfaces, and this is a DockerGetBlobAt”, and so on. That’s probably not the best design; I expect dir: could easily implement GetBlobAt, for example.)
types/types.go (outdated):

```go
}

// ImageDestinationPartial is a service to store a blob by requesting the missing chunks to a ImageSourceSeekable.
type ImageDestinationPartial interface {
```
We should very seriously discuss whether we want to keep any future additions to the transports a public API, or to just close it down as internal.
I do realize that there are other implementations, mainly various magic wrappers in Buildah; OTOH the need to keep these things stable has been a very notable source of overhead.
I agree and I thought about it as well. I'd prefer to keep it private, especially for now, and have the possibility to iterate on it.
Have you made it private?
it is used by the storage package, so I cannot make it private.
I've marked all the new interfaces as experimental though, so we don't need to increment the major version if we make any breaking change.
(Random drive-by comment, not part of a full review)
it is used by the storage package, so I cannot make it private.
So there’s a circular dependency? Historically that’s been rather problematic. (For starters, how is either of the two PRs going to actually pass CI before merging?)
I've marked all the new interface as experimental though, so we don't need to incement the major version if we make any breaking change.
(I’m skeptical that comments matter that much. Go is going to upgrade and break things without notice. It is possible to copy&paste code without noticing documentation. Yet, we do need some way to experiment… A subpackage with a hard-to-overlook name (c/image/types/unstable?) would be a bit more explicit.)
I think there should be:
- A type defined somewhere in c/storage, so that there isn’t a loop, and so that c/storage itself can at compile-time enforce interface conformance. (There already is one.)
- A completely private type in c/image, in the already-existing c/image/internal/types, that no external callers have business touching. We want to, over time, regain the ability to change the transports interfaces, so let’s not expose any more unless we have to. (We might need an explicit adapter between the two interfaces if they use different names/types; that’s perfectly fine.)
A possible major difficulty with this is Buildah’s (and Podman’s??) alternative/wrapping transport implementations, notably the blob cache; the above would prevent Buildah’s blob cache from using the chunked variant. That’s probably? fine, the thoughts about making transports private have always implied moving the blob cache inside c/image so that it does not break on interface changes, and can benefit from private interfaces. OTOH that blob cache move might then have to happen before this PR.
it is used by the storage package, so I cannot make it private.
So there’s a circular dependency? Historically that’s been rather problematic. (For starters, how is either of the two PRs going to actually pass CI before merging?)
No, with “storage package” I meant c/image/storage.
I'll try to move these definitions under internal/types.
copy/copy.go (outdated):

```go
// copyLayer copies a layer with srcInfo (with known Digest and Annotations and possibly known Size) in src to dest, perhaps compressing it if canCompress,
// and returns a complete blobInfo of the copied layer, and a value for LayerDiffIDs if diffIDIsNeeded
func (ic *imageCopier) copyLayer(ctx context.Context, srcInfo types.BlobInfo, toEncrypt bool, pool *mpb.Progress) (types.BlobInfo, digest.Digest, error) {
	cachedDiffID := ic.c.blobInfoCache.UncompressedDigest(srcInfo.Digest) // May be ""
	// Diffs are needed if we are encrypting an image or trying to decrypt an image
	diffIDIsNeeded := ic.diffIDsAreNeeded && cachedDiffID == "" || toEncrypt || (isOciEncrypted(srcInfo.MediaType) && ic.c.ociDecryptConfig != nil)

	srcAlgo := ""
	if srcInfo.CompressionAlgorithm != nil {
```
srcInfo.Compression* is supposed to never be set at this point; they really don’t belong to BlobInfo at all, it’s an output of copyBlobFromStream to be consumed by UpdatedImage.
storage/storage_image.go (outdated):

```go
	ctx: ctx,
}

data, err := compression.ReadChunkedZstdManifest(&readerAt, srcInfo.Size)
```
The PutBlobPartial interface looks generic. Just blindly hard-coding Zstd here seems surprising. (Maybe all it takes is a MIME type check.)
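A MIME-type check along those lines could be as small as this sketch. The media type is the standard OCI zstd layer type, and the annotation key mirrors the one this series uses to mark chunked blobs, but treat both constants as assumptions of the sketch rather than the PR's exact gate:

```go
package main

import "fmt"

// blobSupportsChunked reports whether a layer blob looks like one we can
// fetch partially, instead of blindly assuming zstd:chunked. The exact
// check used by the PR may differ; this is only an illustration.
func blobSupportsChunked(mediaType string, annotations map[string]string) bool {
	if mediaType != "application/vnd.oci.image.layer.v1.tar+zstd" {
		return false
	}
	// Chunked blobs advertise their embedded metadata via an annotation.
	_, ok := annotations["io.github.containers.zstd-chunked.manifest-checksum"]
	return ok
}

func main() {
	ann := map[string]string{"io.github.containers.zstd-chunked.manifest-checksum": "sha256:abc"}
	fmt.Println(blobSupportsChunked("application/vnd.oci.image.layer.v1.tar+zstd", ann)) // true
	fmt.Println(blobSupportsChunked("application/vnd.oci.image.layer.v1.tar+gzip", ann)) // false
}
```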
storage/chunked_zstd.go (outdated):

```go
	manifest       []byte
	ctx            context.Context
	srcInfo        types.BlobInfo
	layersMetadata map[string][]compression.FileMetadata
```
All of the fields, and especially the maps/arrays, will eventually need documentation.
storage/storage_image.go (outdated):

```go
// FIXME: what to do with the uncompressed digest?
diffOutput.UncompressedDigest = blob.Digest

if err := s.imageRef.transport.store.ApplyDiffFromStagingDirectory(layer.ID, diffOutput.Target, diffOutput, nil); err != nil {
```
Is the two-stage process really required for this?
We’d already like for the image pull process to happen concurrently for layers as they arrive (pull in any order, and create layers in order as soon as possible, instead of currently just staging everything and then doing a lot of work in Commit). If that’s the more natural approach for this feature, it would probably make sense to return to that work and get it working, and have a simpler implementation and a more straightforward c/storage API.
Oh, and: how does digest verification happen with this, or what replaces it to obtain equivalent security?
(force-pushed 9d39fae → d77e0fe)
They are independent. #902 requires deltas to be generated, and they are useful when e.g. pulling an updated image. This PR instead is to improve pulling of a generic layer, doing deduplication with all the files in the storage. Another advantage of this PR is that it performs local deduplication as well: if a file is present locally, it is deduplicated with reflinks when the underlying file system supports them. The two stages ("prepare staging directory" and "commit from staging directory") break the c/storage abstraction that everything is a tarball and can be handled transparently by different drivers (in such case the […]
I think it will simplify the "parallel pull", as each "commit to staging directory" can be done separately without any interference among the different layers. The […]
This part is still not solved and where I'll need your help :-) Do you think it is reasonable to include signatures as part of the metadata (it can be a separate chunk)? Only the manifest must be signed. Since the manifest has the checksum for each file in the layer, we can validate each file individually.
I only briefly skimmed this PR and do not yet understand how the two PRs relate to one another. In case an optimized commit would facilitate this work, let me know; we can prioritize it. If others want to tackle it earlier, I am more than happy :) At the moment, I channel all my mental resources on getting the podman-cp rewrite done, but I hope to have it finished by Monday. Thanksgiving buys me uninterrupted time in the EU :^)
(force-pushed eb4d32b → b7bcbf6)
BTW, if this is letting the creator of the image cause accesses to files on the host filesystem, this worries me a lot. It very likely allows a network observer to determine whether files exist at attacker-directed paths on the host filesystem, and in the extreme (but probably fixable) cases it risks hardware operations (IIRC some files in […]
There needs to be some kind of “chain of authentication” going from the manifest (which is validated against a signature or user-specified digest) all the way to individual on-disk files. I really haven’t read the code; maybe that chain of authentication could be an annotation on the layer in the manifest (which would restrict this feature to manifest schemas that have annotations, but this is probably already OCI-only), with that annotation containing some kind of hash of the “top-level” metadata item in the layer (is there even such a thing?), which further authenticates the individual files / layer chunks, or some groups of individual files / layer chunks.
(force-pushed 393baa8 → 2efaee5)
I'd imagine it to look similar to that: […]
Fixed in the last version.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Add a new custom variant of the zstd compression that permits retrieving each file separately. The idea is based on CRFS and its stargz format for having seekable and indexable tarballs.
One disadvantage of the stargz format is that a custom file is added to the tarball to store the layer metadata, and the metadata file is part of the image itself. Clients that are not aware of the stargz format will propagate the metadata file inside of the containers.
The zstd compression supports embedding additional data as part of the stream that the zstd decompressor will ignore (a skippable frame), so the issue above with CRFS can be solved directly within the zstd compression format. Besides this minor advantage, zstd is much faster and compresses better than gzip, so take this opportunity to push the zstd format further.
The zstd compression is supported by the OCI image specs since August 2019: opencontainers/image-spec#788 and has been supported by containers/image since then. Clients that are not aware of the zstd:chunked format won't notice any difference when handling a blob that uses the variant.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Rebased again. @vrothberg are you ok with the last version of the progress bar?
@giuseppe, I am cool to merge as is (see below), thanks! The race is something I wanted to tackle for a long time but never found/took the time until now.
So anything more left before we can finally merge this? :)
@rhatdan are you fine with it?
LGTM
storage/storage_image.go:

```go
	return err
}

// FIXME: what to do with the uncompressed digest?
```
@giuseppe Why is this (and the related s.blobDiffIDs[blobDigest] = blobDigest assignment), using the compressed values as the supposedly-unique uncompressed ones, correct and safe?
Since access to host files is required for the deduplication, the untar doesn't happen in a separate process running in a chroot; it happens in the original mount namespace. To mitigate possible breakages, openat2(2) is used where available.
Requires: containers/storage#775
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
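Regarding the openat2(2) mitigation mentioned above: RESOLVE_IN_ROOT makes the kernel treat the layer root as "/" for the entire lookup, defeating "../" and symlink escapes atomically. As a purely illustrative userspace analogue (it covers only the lexical part, not symlinks or TOCTOU races, which is exactly why the kernel API is preferred), path clamping looks like this; hostJoin is a hypothetical helper, not code from the PR:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// hostJoin lexically anchors name under root: a leading "/" and any run
// of "../" components are clamped so the result cannot point outside
// root. openat2(2) with RESOLVE_IN_ROOT enforces the same property in
// the kernel, where it additionally covers symlinks and races that no
// lexical check can.
func hostJoin(root, name string) string {
	return filepath.Join(root, filepath.Clean("/"+name))
}

func main() {
	fmt.Println(hostJoin("/var/lib/store", "../../etc/passwd")) // /var/lib/store/etc/passwd
	fmt.Println(hostJoin("/var/lib/store", "usr/bin/env"))      // /var/lib/store/usr/bin/env
}
```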