YARN-10494 CLI tool for docker-to-squashfs conversion (pure Java). #2513
base: trunk
Conversation
💔 -1 overall
This message was automatically generated.
Fix findbugs issues. Fix checkstyle issues.
Test failure appears to be unrelated. @ericbadger, would you be willing to do a review? I know it's a lot of code.
I'd be happy to do a review. It will likely take a while, though. As you said, it's quite a bit of code.
I've been able to run the tool locally and it seems to work as designed, at least from my initial testing. However, I found that the tool runs quite a bit slower than the docker-to-squash Python script. Notably, I ran both this tool and the docker-to-squash tool on a fairly large image (14.8 GB, 32 layers), and it took 38:13 with this tool versus 21:50 with docker-to-squash.

I'm currently trying to figure out where the differences are that make this tool take so much longer. My first thought is that this tool appears to download layers sequentially, while the docker-to-squash tool does them in parallel (since it uses docker pull). The next step is converting the layers, which is the internal implementation vs. mksquashfs; it's possible that mksquashfs is just faster there. I'll need to do more analysis. And then the last step is the layer upload. I know that this tool is uploading both the sqsh image as well as the tgz file, so that's about double the work, which makes sense why it would take longer.

Anyway, I'm still looking into the performance, but @insideo, feel free to post your insights.
Here's some additional info. The times aren't exact for docker-to-squash, though; they would actually be smaller than the values listed, because those times include the time it takes to convert the layer to squashfs as well as the time to upload it to HDFS. I performed this test on an internal image, so I can't exactly show it to you or tell you what's in it, other than that it's some ML stuff and it's huge. I'll try to find a similar image on Docker Hub that I can use.
@ericbadger I suspect the performance delta is due to the latest mksquashfs code being multi-threaded during encodes; the larger images seem to have higher deltas in your example. We could implement multi-threaded conversion in the Java code as well, but since the engine is designed to be mostly streaming, it would be a pretty big code change. Also, this would make reproducible builds considerably more difficult to ensure. What we could do is process multiple layers in parallel. This would likely close the gap, since most real-world images have several layers that would need conversion, and since each individual layer would still be processed serially, reproducibility would be maintained. Thoughts?
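A per-layer parallelism scheme along these lines could look roughly like the following. This is a minimal sketch, not code from this PR: `convertLayer` is a hypothetical stand-in for the real single-layer tar.gz-to-squashfs conversion, kept single-threaded internally so each layer's output stays byte-for-byte reproducible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLayerConversion {

    // Hypothetical stand-in for the real single-layer conversion; it just
    // tags the layer name here so the sketch is runnable end to end.
    static String convertLayer(String layerDigest) {
        return layerDigest + ".sqsh";
    }

    // Convert several layers concurrently. Each layer is still processed
    // serially inside convertLayer, so per-layer output stays reproducible,
    // and invokeAll preserves submission order, so results are deterministic.
    static List<String> convertAllLayers(List<String> layers, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String layer : layers) {
                tasks.add(() -> convertLayer(layer));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get()); // rethrows any per-layer failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

The key design point is that concurrency lives only between layers, never inside one layer's encode, which is what keeps reproducibility intact.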
@insideo parallel layer conversion would certainly be helpful. I am somewhat worried about some images, though. Generally, docker images are made using fewer layers instead of many layers. And in the runc implementation, there's actually a limit of 37 layers, because of how we name the mounts as well as the 4 KB limit on the arguments to the mount command. So that gives opposing incentives: on one hand, you want more, smaller layers to decrease image conversion time; on the other hand, you want fewer, larger layers to adhere to the layer limit, as well as to follow general convention surrounding docker images (e.g. RUN yum install && yum install && yum install && etc.). Especially anybody who starts building their images using Buildah, or starts taking advantage of the new Docker image feature to define your own layer points, would likely be inclined to make fewer layers instead of more.

When you say the tool is streaming, what exactly do you mean? I asked you this before, and I thought you said that it would start converting the layers as they came in instead of waiting for them to be fully downloaded. But looking at the log, it seems like there is a download stage, a conversion stage, and then an upload stage, and those stages are sequential.

Also, I just realized that I am using squashfs-tools 4.3, which doesn't have reproducible builds on. So it's a slightly unfair comparison, since 4.4 slows things down by removing some (all?) of the multithreaded-ness of mksquashfs. I will retest with squashfs-tools 4.4 with reproducible builds enabled.
The current implementation of the CLI tool is not stream-oriented, but the underlying squashfs code definitely is. The filesystem tree and content are built up dynamically as the tar.gz file is read. To do otherwise would require unpacking the tar.gz file into a temporary location, which was explicitly avoided in the design to minimize unnecessary I/O and avoid issues of UID/GID/timestamp changes in the process.
This would be an interesting comparison for sure.
Ahh, makes sense on the streaming. Thanks for the explanation. I cloned https://github.com/plougher/squashfs-tools and ran make to compile mksquashfs. Then I ran my docker-to-squash script with the new mksquashfs. These are the results; the newest run is under docker-to-squash 4.4. It doesn't show all that much difference, except for the first layer for some reason.
Hey @insideo, so I'm actively reviewing this, but obviously it will be a while before I get through the whole thing. I do have an initial ask, though. When I enable debug logging, my terminal gets bombarded with thousands (more?) of logs that look like the text below. It's probably just a single log call that is going berserk because of a big image over HTTP. Could you look into that?
I think that's Apache HttpClient (could try disabling org.apache.http.wire debug logging). It shouldn't be coming from this code.
Looks like you're right. I set org.apache.http.wire to INFO logging and things look way better |
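For reference, a log4j.properties fragment along these lines would capture that change persistently. This is a sketch assuming a stock log4j 1.x setup with a console appender named `console` (not a config from this PR):

```properties
# Keep overall debug logging, but silence HttpClient's per-byte wire dumps.
log4j.rootLogger=DEBUG, console
log4j.logger.org.apache.http.wire=INFO
# Optionally quiet the rest of HttpClient's internals as well:
log4j.logger.org.apache.http=INFO
```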
Still haven't found an issue with the squashfs creation yet, but there is some inherent parsing that needs to happen to convert OCI images into squashfs filesystems that can be read correctly by OverlayFS. It looks like whiteout files and opaque directories are not implemented in this PR. The issue is that OCI images handle whiteouts/opaque directories in an annoyingly different way than OverlayFS does. OCI uses '.wh.' and '.wh..wh..opq' files, while OverlayFS uses character devices and directory extended attributes. OCI Standard:
OverlayFS:
There's some discussion here as to why the OCI spec didn't choose to use the OverlayFS method of whiteouts. Mostly it appears to be that tar implementations inconsistently support extended attributes, and they didn't want to depend on that.
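To make the mapping concrete, here is a rough sketch of the translation (a hypothetical helper, not code from this PR): a '.wh..wh..opq' entry marks its parent directory opaque via the trusted.overlay.opaque xattr, and a '.wh.&lt;name&gt;' entry becomes a 0/0 character device named &lt;name&gt;.

```java
public class WhiteoutTranslator {

    static final String OPAQUE_MARKER = ".wh..wh..opq";
    static final String WHITEOUT_PREFIX = ".wh.";

    // Describes, as a string, the OverlayFS-side action for one tar entry
    // path. A real implementation would emit squashfs inodes/xattrs instead.
    static String translate(String tarPath) {
        int slash = tarPath.lastIndexOf('/');
        String dir = (slash < 0) ? "" : tarPath.substring(0, slash + 1);
        String name = tarPath.substring(slash + 1);

        // Check the opaque marker first: it also starts with ".wh.".
        if (name.equals(OPAQUE_MARKER)) {
            // OCI opaque dir -> set trusted.overlay.opaque="y" on the parent.
            return "xattr trusted.overlay.opaque=y on /" + dir;
        }
        if (name.startsWith(WHITEOUT_PREFIX)) {
            // OCI whiteout file -> 0/0 character device with the real name.
            String target = name.substring(WHITEOUT_PREFIX.length());
            return "chardev 0:0 at /" + dir + target;
        }
        // Ordinary entry: copy through unchanged.
        return "copy /" + tarPath;
    }
}
```

Note the ordering: the opaque marker has to be matched before the generic '.wh.' prefix, since it starts with the same prefix.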
Initial WIP PR for YARN-10494.