Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use multi-stage build dockerfile in datafusion-cli and reduce image size from 2.16GB to 89.9MB #266

Merged
merged 5 commits into from
May 6, 2021

Conversation

jimexist
Copy link
Member

@jimexist jimexist commented May 5, 2021

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example Closes apache/arrow-datafusion#123 indicates that this PR will close issue #123.

Closes #267.

Old image size 2.16GB, new image size is 89.9MB

Rationale for this change

Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.

We should optimize for the docker image size of datafusion-cli. Using a multi-stage build would contribute to a cleaner build with minimal intermediate build artifacts.

What changes are included in this PR?

There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please add the breaking change label.

@jimexist
Copy link
Member Author

jimexist commented May 5, 2021

related apache/datafusion-ballista#455

@jimexist jimexist changed the title Update datafusion-cli dockerfile to be minimalist Update datafusion-cli dockerfile to be minimal May 5, 2021
@codecov-commenter
Copy link

codecov-commenter commented May 5, 2021

Codecov Report

Merging #266 (2a82cf2) into master (3be087a) will not change coverage.
The diff coverage is n/a.

❗ Current head 2a82cf2 differs from pull request most recent head d36d768. Consider uploading reports for the commit d36d768 to get more accurate results
Impacted file tree graph

@@           Coverage Diff           @@
##           master     apache/arrow-datafusion#266   +/-   ##
=======================================
  Coverage   76.21%   76.21%           
=======================================
  Files         140      140           
  Lines       23553    23553           
=======================================
  Hits        17950    17950           
  Misses       5603     5603           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3be087a...d36d768. Read the comment docs.

@returnString
Copy link
Contributor

I'm not sure whether there are any expectations around CLI performance in Docker, but it's probably worth seeing whether @andygrove's 2020 article on Rust/Alpine perf issues still holds true: https://andygrove.io/2020/05/why-musl-extremely-slow

@jimexist jimexist changed the title Update datafusion-cli dockerfile to be minimal Use multi-stage build dockerfile in datafusion-cli May 5, 2021
@jimexist
Copy link
Member Author

jimexist commented May 5, 2021

I'm not sure whether there are any expectations around CLI performance in Docker, but it's probably worth seeing whether @andygrove's 2020 article on Rust/Alpine perf issues still holds true: https://andygrove.io/2020/05/why-musl-extremely-slow

i guess dropping alpine part of the change is safer.

@jimexist jimexist changed the title Use multi-stage build dockerfile in datafusion-cli Use multi-stage build dockerfile in datafusion-cli and reduce image size from 2.16GB to 89.9MB May 5, 2021
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Copy link
Contributor

@returnString returnString left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, excellent savings especially if we're looking at publishing this to Docker Hub soonish :)

@alamb
Copy link
Contributor

alamb commented May 5, 2021

 docker build . -f  ./datafusion-cli/Dockerfile -t datafusion-cli

And then tried to run the image:

docker run -it datafusion-cli

I got an error:

(arrow_dev) alamb@ip-10-0-0-124:~/Software/arrow-datafusion$ docker run -it datafusion-cli
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/main.rs:56:34
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Perhaps we need to update the entry point to be /usr/local/bin/datafusion-cli ?

@jimexist
Copy link
Member Author

jimexist commented May 5, 2021

@alamb I guess you would need to volume map the /data dir. the error message wasn't specific enough

@alamb
Copy link
Contributor

alamb commented May 5, 2021

Indeed @jimexist you are correct:

docker run -v /tmp:/data  -it datafusion-cli

works just fine

@houqp
Copy link
Member

houqp commented May 6, 2021

minor suggestion not related to this change. i recommend using github container registry instead of dockerhub to host public docker images. dockerhub's rate limit makes it impossible to use the image in any serious production or ci/cd pipeline.

@jorgecarleitao
Copy link
Member

I agree with @houqp 👍

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

Note that this makes it more difficult to use caching in docker builds on CI. I think that we can do that on a follow-up PR

@alamb alamb merged commit 1c40738 into apache:master May 6, 2021
@jimexist jimexist deleted the datafusion-cli-docker branch May 6, 2021 11:11
@houqp houqp added the datafusion Changes in the datafusion crate label Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use multi-stage build dockerfile in datafusion-cli
7 participants