-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use multi-stage build dockerfile in datafusion-cli and reduce image size from 2.16GB to 89.9MB #266
Conversation
related apache/datafusion-ballista#455 |
Codecov Report
@@ Coverage Diff @@
## master apache/arrow-datafusion#266 +/- ##
=======================================
Coverage 76.21% 76.21%
=======================================
Files 140 140
Lines 23553 23553
=======================================
Hits 17950 17950
Misses 5603 5603 Continue to review full report at Codecov.
|
I'm not sure whether there are any expectations around CLI performance in Docker, but it's probably worth seeing whether @andygrove's 2020 article on Rust/Alpine perf issues still holds true: https://andygrove.io/2020/05/why-musl-extremely-slow |
4736c9d
to
c133df0
Compare
i guess dropping alpine part of the change is safer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, excellent savings especially if we're looking at publishing this to Docker Hub soonish :)
And then tried to run the image:
I got an error:
Perhaps we need to update the entry point to be |
@alamb I guess you would need to volume map the /data dir. the error message wasn't specific enough |
Indeed @jimexist you are correct:
works just fine |
minor suggestion not related to this change. i recommend using github container registry instead of dockerhub to host public docker images. dockerhub's rate limit makes it impossible to use the image in any serious production or ci/cd pipeline. |
I agree with @houqp 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
Note that this makes it more difficult to use caching in docker builds on CI. I think that we can do that on a follow-up PR
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example
Closes apache/arrow-datafusion#123
indicates that this PR will close issue #123.Closes #267.
Old image size
2.16GB
, new image size is89.9MB
Rationale for this change
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
We should optimize for the docker image size of
datafusion-cli
. Using a multi-stage build would contribute to a cleaner build with minimal intermediate build artifacts.What changes are included in this PR?
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please add the
breaking change
label.