feature request: image merge/"rechunking" #5717

Open
cgwalters opened this issue Sep 3, 2024 · 6 comments

@cgwalters

See ostreedev/ostree-rs-ext#69 and specifically this blog post I found very inspirational: https://grahamc.com/blog/nix-and-layered-docker-images

A lot more work on this happened in https://github.com/hhd-dev/rechunk/

This issue is about adding generic support for something like this to buildah. What might that look like? One problem with rechunk (which builds on what rpm-ostree does today) is that it currently has an RPM dependency, which gets messy for buildah integration.

Here's a strawman: we create buildah rechunk (again, name totally subject to bikeshedding)...maybe it's "merge"?

This could start by accepting an existing OCI image as input, and trying to do some heuristics on it (splitting large files into their own layers, etc.)
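As a rough illustration of the "large files into their own layers" heuristic, here is a minimal sketch. It is not buildah code: the `FileEntry` type, the `split_large_files` function, and the 64 MiB threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical record for one file in the input image."""
    path: str
    size: int  # size in bytes


def split_large_files(files, threshold=64 * 1024 * 1024):
    """Give every file at or above `threshold` its own layer.

    Returns (dedicated_layers, remainder). The threshold value is an
    arbitrary assumption, not something proposed in this issue.
    """
    # Largest files first, path as a tie-breaker so the result is stable.
    big = sorted(
        (f for f in files if f.size >= threshold),
        key=lambda f: (-f.size, f.path),
    )
    dedicated = [[f] for f in big]
    # Everything else stays together, sorted by path for determinism.
    remainder = sorted((f for f in files if f.size < threshold), key=lambda f: f.path)
    return dedicated, remainder
```

A real implementation would presumably also look at how often a file changes between builds, not only at its size.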

Alternatively, a lower level entrypoint may be accepting a large list of OCI images and "merging" them using many of the same approaches that the Nix builder does, to map them to some configurable number of higher level layers (ostree-rs-ext today caps at 64...whole other thread to discuss going higher than that).
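To make the "map many inputs onto a capped number of layers" idea concrete, here is a minimal sketch loosely in the spirit of the linked Nix post: the highest-ranked components get a dedicated layer each and everything left over shares the final layer. Ranking by size is an assumption chosen for simplicity, and `assign_layers` is a hypothetical name, not an existing buildah or ostree-rs-ext API.

```python
def assign_layers(components, max_layers=64):
    """Map components onto at most `max_layers` layers.

    `components` is a dict of name -> size in bytes. The top-ranked
    components each get their own layer; the rest are merged into one
    final layer. Size-based ranking is an assumption for this sketch.
    """
    if max_layers < 1:
        raise ValueError("max_layers must be at least 1")
    # Largest first, name as a tie-breaker so the layout is deterministic.
    ranked = sorted(components, key=lambda name: (-components[name], name))
    if len(ranked) <= max_layers:
        return [[name] for name in ranked]
    head = ranked[: max_layers - 1]
    tail = sorted(ranked[max_layers - 1:])
    return [[name] for name in head] + [tail]
```

For example, five components with a cap of three layers yield two single-component layers plus one merged layer holding the remaining three.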

@cgwalters
Author

Note that this tool would also need to paper over #5592 (which would still be great to fix).

@antheas

antheas commented Sep 3, 2024

> ostree-rs-ext today caps at 64...

Bazzite uses 70 layers

70 layers is a sound maximum for a large OS image, although for OCI images you'd probably rather stay under 40 (until composefs solves the layering issue).

Seeing the popularity sched_ext has in the Linux community, where everyone can make their own scheduler now, I think the logical first step would be to embed the basic functionality for doing this into buildah. Once there is a good algorithm, the next logical step would be buildah rechunk.

For me, this boils down to the following right now:

  1. Allow reflinking existing files from container storage into a new container
    • This saves a copy, which matters most on exactly the images rechunking is meant for (i.e., very large ones)
  2. Allow defining the output tarstream for a layer while doing that
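Point 1 can be sketched at the single-file level as follows. This is only an illustration of reflinking with a fallback, not how containers storage works internally: `clone_or_copy` is a hypothetical helper, and the sketch assumes Linux, where the `FICLONE` ioctl succeeds only on filesystems with reflink support (for example Btrfs, or XFS created with reflink enabled).

```python
import fcntl
import shutil

# Linux FICLONE ioctl request number, as defined in <linux/fs.h>.
FICLONE = 0x40049409


def clone_or_copy(src, dst):
    """Reflink `src` to `dst` if possible, else fall back to a byte copy.

    Returns True when the reflink succeeded, False when the data was copied.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            # Ask the filesystem to share extents instead of copying data.
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError:
            # Unsupported filesystem, cross-device clone, or non-Linux kernel.
            pass
    shutil.copyfile(src, dst)
    return False
```

When the reflink succeeds no file data is duplicated on disk, which is where the saving on very large images would come from.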

Given my familiarity with OSTree, I know that it can stage a commit in around 20 seconds with hard links.

If 1 is implemented, this means that the rechunk process would take around 50 seconds for an image with a very large number of files, which is negligible, and result in very minimal thrashing. This is around 12x faster than the current ostree-rs-ext process, given the image is placed in containers storage again.

2 would then be the natural extension, allowing reordering the tar stream based on the previous manifest which is needed for zstd:chunked.
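A minimal sketch of what "defining the output tarstream" could mean: entries that existed in the previous build keep their previous relative order, and new entries are appended in sorted order. The function names and the "previous order as a list of paths" input are assumptions for illustration; how the real manifest would be consulted is not specified in this thread.

```python
import io
import tarfile


def ordered_names(current, previous_order):
    """Order paths so that known entries follow the previous build's order."""
    prev_rank = {name: i for i, name in enumerate(previous_order)}
    kept = sorted((n for n in current if n in prev_rank), key=prev_rank.__getitem__)
    new = sorted(n for n in current if n not in prev_rank)
    return kept + new


def write_layer(files, previous_order):
    """Write `files` (dict of path -> bytes) as a tar stream in a stable order.

    Timestamps are zeroed so that identical input gives identical bytes.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name in ordered_names(files, previous_order):
            data = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = 0  # fixed metadata keeps the stream reproducible
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

Paths that were removed since the previous build are simply skipped, since only names present in the current file set are written.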

Even with zstd:chunked, layer invalidation remains important as every changed layer needs to be staged and stored in the registry. Composefs can partly deal with the former but not the latter.

@antheas

antheas commented Sep 3, 2024

Sidenote: on Fedora-based containers the image has to be squashed before pushing to the registry anyway, since every dnf command updates the rpm database, adding 40 MB to 150 MB to every layer, so I do not know how important preserving the layer structure is.

While this is another discussion, perhaps it is worth considering "bricking" the rpm database in containers, forcing dnf to write into the WAL, which may be much smaller.

@cgwalters
Author

> While this is another discussion, perhaps it is worth discussing if it is worth "bricking" the rpm database in containers, forcing dnf to write into WAL, which may be much smaller.

On the general topic of the rpm database and containers,

@cgwalters
Author

In the very short term I think what makes the most sense is for us to just carry forward with "rechunking" work on the ostree-container side.

Generalizing it (what it would look like with buildah/podman in general) is a big topic. It snowballs quickly into the general "reproducible builds" problem, which in turn snowballs into Intelligent Build System territory. That is not what podman or buildah really expose today; it looks more like the projects listed at https://github.com/moby/buildkit?tab=readme-ov-file#used-by

So probably we can just keep this on the back burner unless there is an interested and motivated person to contribute here.

@rhatdan
Member

rhatdan commented Sep 12, 2024

@nalind has started working on designing rechunking into buildah and has some preliminary code for it. How exactly the rechunking would be done is still being discussed.
