all: faster builds by caching the build cache ⚡ #41033

Draft
wants to merge 21 commits into main

Conversation

@shmsr (Member) commented Sep 27, 2024

Proposed commit message

I tested the change by building the agentbeat binary for linux/amd64, with my Docker host being darwin/arm64. After debugging, I found that most of the time is spent in the cross-compilation part during packaging. To isolate and benchmark the change, I pulled out only the docker command responsible for building the binaries:

docker run --ulimit nofile=262144:262144 \
  --env EXEC_UID=501 \
  --env EXEC_GID=20 \
  -v /Users/subhamsarkar/go/pkg/mod:/go/pkg/mod:ro \
  --env DEV=false \
  --rm \
  --env GOGC=off \
  --env GOFLAGS="-mod=readonly -buildvcs=false" \
  --env MAGEFILE_VERBOSE=true \
  --env MAGEFILE_TIMEOUT= \
  --env SNAPSHOT=false \
  -v /Users/subhamsarkar/go/src/github.com/elastic/beats:/go/src/github.com/elastic/beats \
  -w /go/src/github.com/elastic/beats/x-pack/agentbeat \
  docker.elastic.co/beats-dev/golang-crossbuild:1.22.6-main-debian10 \
  --build-cmd "build/mage-linux-amd64 golangCrossBuild" \
  --platforms linux/amd64

It was taking ~4m30s before. After the change, the first run is similar to before, but subsequent runs take only ~25s ⚡

Along with introducing build caching, I've made some general improvements, including fixing code smells, improving error messages, and small optimizations.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Author's Checklist


How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) on Sep 27, 2024
@botelastic bot commented Sep 27, 2024

This pull request doesn't have a Team:<team> label.

@shmsr changed the title from "all: super fast crossbuilds ⚡" to "all: super fast crossbuilds by caching the build cache ⚡" on Sep 27, 2024
@mergify mergify bot assigned shmsr Sep 27, 2024
mergify bot (Contributor) commented Sep 27, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @shmsr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify bot (Contributor) commented Sep 27, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify bot added the backport-8.x label (Automated backport to the 8.x branch with mergify) on Sep 27, 2024
Comment on lines 341 to 373
// To speed up cross-compilation, we need to persist the build cache so that subsequent builds
// for the same arch are faster (⚡).
//
// As we want to persist the build cache, we need to mount the cache directory to the Docker host.
// This is done by mounting the host directory to the container.
//
// Path of the cache directory on the host:
// <repoInfo.RootDir>/build/.go-build/<b.Platform>
// Example: <rootdir>/build/.go-build/linux/amd64
//
// As per: https://docs.docker.com/engine/storage/bind-mounts/#differences-between--v-and---mount-behavior
// If the directory doesn't exist, Docker does not automatically create it for you, but generates an error.
// So, we need to create the directory before mounting it.
//
// Also, in the container, the cache directory is mounted to /root/.cache/go-build.
buildCacheHostDir := filepath.Join(repoInfo.RootDir, "build", ".go-build", b.Platform)
buildCacheContainerDir := "/root/.cache/go-build"
if err = os.MkdirAll(buildCacheHostDir, 0755); err != nil {
	return fmt.Errorf("failed to create directory %s: %w", buildCacheHostDir, err)
}

// Common arguments
args = append(args,
	"--rm",
	"--env", "GOFLAGS=-mod=readonly -buildvcs=false",
	"--env", "MAGEFILE_VERBOSE="+verbose,
	"--env", "MAGEFILE_TIMEOUT="+EnvOr("MAGEFILE_TIMEOUT", ""),
	"--env", "SNAPSHOT="+strconv.FormatBool(Snapshot),

	// To persist the build cache, we mount the host cache directory into the container.
	// With docker run, the available mount types are: bind, volume, and tmpfs. For our use case,
	// we have decided to use the bind mount type.
	"--mount", fmt.Sprintf("type=bind,source=%s,target=%s", buildCacheHostDir, buildCacheContainerDir),
@shmsr (Member Author) commented Sep 27, 2024

The speed-up from persisting the build cache happens here.
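
For reference, a quick way to sanity-check the persisted cache locally (a minimal sketch; it assumes the host-side path from the snippet above and a linux/amd64 crossbuild):

ROOT=$(git rev-parse --show-toplevel)
CACHE_DIR="$ROOT/build/.go-build/linux/amd64"
ls -lah "$CACHE_DIR"   # empty or absent before the first crossbuild
# ... run the docker crossbuild command from the PR description ...
du -sh "$CACHE_DIR"    # populated after the first run and reused by subsequent runs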

@shmsr changed the title from "all: super fast crossbuilds by caching the build cache ⚡" to "all: faster crossbuilds by caching the build cache ⚡" on Sep 27, 2024
Comment on lines -71 to -73
echo "~~~ Downloading artifacts"
buildkite-agent artifact download x-pack/agentbeat/build/distributions/** . --step 'agentbeat-package-linux'
ls -lah x-pack/agentbeat/build/distributions/
@shmsr (Member Author) commented

I don't know why we were doing this. For the integration tests, the test binaries will be created, which is already handled in mage goIntegTest.

Comment on lines +559 to +562
// Parallelize conservatively to avoid overloading the host.
if maxParallel >= 2 {
	return maxParallel / 2
}
@shmsr (Member Author) commented Sep 28, 2024

This needs to be discussed. I made this change, i.e. n/2 (compared to ~n before), because there were cases where the BK agent was lost. From the Slack threads and the BK notice about "Agent lost", it seems to be happening a lot, so maybe we should keep this change.

Now it works with the change. Cross-compilations are quite expensive; I think we shouldn't spawn all of them at once.

Another member commented:

I have a suspicion that the BK agents being lost is related to a more infra-level problem; I recall seeing some communication about this.

IMO we should use as much CPU as possible to ensure our builds are as fast as possible. The Buildkite agent's only job is to compile and test things quickly.

@shmsr (Member Author) commented

I agree with your point. However, excessive parallelization can be problematic in certain scenarios. Prior to this change, I frequently encountered the "Agent Lost" error; after the change, none. So I thought I'd keep the change and discuss it here.

Certainly, I can remove this change for now and revisit it later. The primary focus of this PR is to discuss the persistence of the build cache.

@shmsr (Member Author) commented Sep 28, 2024

After experimenting a lot with BK, I think I need some help with how things work in CI. Locally, it has been working great. For example, inside x-pack/agentbeat, running PLATFORMS=linux/amd64 PACKAGES=tar.gz mage -v package takes around ~1m24s (from the second run onwards) compared to ~5m42s earlier, which is a significant drop: roughly more than 4 minutes off the build time. I think we should introduce build caching in CI as well.
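
A rough way to reproduce these numbers locally (a sketch; it assumes Docker is running and that the cache directory location matches the revision being tested, either <repo root>/build/.go-build or $TMPDIR/build/.go-build):

cd x-pack/agentbeat
# Cold run: clear any previously persisted build cache first.
rm -rf "$(git rev-parse --show-toplevel)/build/.go-build" /tmp/build/.go-build
time PLATFORMS=linux/amd64 PACKAGES=tar.gz mage -v package   # first run populates the cache (~5m42s above)
# Warm run: the persisted cache is reused.
time PLATFORMS=linux/amd64 PACKAGES=tar.gz mage -v package   # subsequent runs (~1m24s above)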

Here's what I need help with:

  • Review the changes in the BK pipeline.
  • I see there's some stale module and build cache in BK; from the date, it looks about 2 days old. How is it currently being done?
  • What's the ideal way to use the build cache (per platform) and the module cache across multiple steps? Suppose I have warmed up the cache in the first step and now want it to be available in the second step. I tried uploading it as an artifact after reading about that somewhere, but the upload takes a lot of time (see the sketch after this list).
  • How can I persist the build cache generated in CI? If this happens, it will significantly drop the build time.
  • Also, for integration tests, I see -count=1 being used with go test, so even if we have the cache, it won't be used. Is that intentional? Is it fine to remove it? I think there should be a pipeline that runs the integration tests with -count=1 after merging to main.
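
As a concrete illustration of the third and fourth bullets, sharing the cache between Buildkite steps via artifacts would look roughly like the sketch below (the step name, archive name, and cache location are assumptions, and, as noted above, the upload/download time may eat into the savings):

# Step 1: warm the cache, then publish it as an artifact.
PLATFORMS=linux/amd64 PACKAGES=tar.gz mage -v package
tar -czf go-build-cache-linux-amd64.tgz -C /tmp/build/.go-build linux/amd64
buildkite-agent artifact upload go-build-cache-linux-amd64.tgz

# Step 2: restore the cache before building.
buildkite-agent artifact download go-build-cache-linux-amd64.tgz . --step 'warm-cache'
mkdir -p /tmp/build/.go-build
tar -xzf go-build-cache-linux-amd64.tgz -C /tmp/build/.go-build
PLATFORMS=linux/amd64 PACKAGES=tar.gz mage -v package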

@shmsr shmsr marked this pull request as ready for review September 29, 2024 08:14
@shmsr shmsr requested review from a team as code owners September 29, 2024 08:14
@shmsr shmsr marked this pull request as draft September 29, 2024 08:15
Comment on lines +341 to +379
// To speed up cross-compilation, we need to persist the build cache so that subsequent builds
// for the same arch are faster (⚡).
//
// As we want to persist the build cache, we need to mount the cache directory to the Docker host.
// This is done by mounting the host directory to the container.
//
// Path of the cache directory on the host:
// <os.TempDir>/build/.go-build/<b.Platform>
// Example: /tmp/build/.go-build/linux/amd64
// The reason for using <os.TempDir> and not <repoInfo.RootDir> as the base is that for
// builds happening on CI, the paths look like:
// /opt/buildkite-agent/builds/bk-agent-prod-gcp-1727515099712207954/elastic/beats-xpack-agentbeat/
// where bk-agent-prod-gcp-1727515099712207954 is the agent, which keeps changing. So even if we
// cache the build, it would be useless as the cache directory would be different for every build.
//
// As per: https://docs.docker.com/engine/storage/bind-mounts/#differences-between--v-and---mount-behavior
// If the directory doesn't exist, Docker does not automatically create it for you, but generates an error.
// So, we need to create the directory before mounting it.
//
// Also, in the container, the cache directory is mounted to /root/.cache/go-build.
buildCacheHostDir := filepath.Join(os.TempDir(), "build", ".go-build", b.Platform)
buildCacheContainerDir := "/root/.cache/go-build"

if err = os.MkdirAll(buildCacheHostDir, 0755); err != nil {
	return fmt.Errorf("failed to create directory %s: %w", buildCacheHostDir, err)
}

// Common arguments
args = append(args,
	"--rm",
	"--env", "GOFLAGS=-mod=readonly -buildvcs=false",
	"--env", "MAGEFILE_VERBOSE="+verbose,
	"--env", "MAGEFILE_TIMEOUT="+EnvOr("MAGEFILE_TIMEOUT", ""),
	"--env", "SNAPSHOT="+strconv.FormatBool(Snapshot),

	// To persist the build cache, we mount the host cache directory into the container.
	// With docker run, the available mount types are: bind, volume, and tmpfs. For our use case,
	// we have decided to use the bind mount type.
	"--mount", fmt.Sprintf("type=bind,source=%s,target=%s", buildCacheHostDir, buildCacheContainerDir),
@shmsr (Member Author) commented

This is the part where the build caching happens.

@shmsr (Member Author) commented Sep 30, 2024

@shmsr changed the title from "all: faster crossbuilds by caching the build cache ⚡" to "all: faster builds by caching the build cache ⚡" on Sep 30, 2024
@alexsapran (Contributor) commented

Why are we making this change?

The CI agents are ephemeral. So, any caching between runs is lost unless we share the cache.

Are we improving a local dev workflow, where we want to make re-building faster, or improving CI timing?

@shmsr (Member Author) commented Oct 7, 2024

> Why are we making this change?
>
> The CI agents are ephemeral. So, any caching between runs is lost unless we share the cache.
>
> Are we improving a local dev workflow, where we want to make re-building faster, or improving CI timing?

Yeah, currently the change works well in dev, but not in CI, for the reason you've mentioned.

See the 3rd and 4th point here: #41033 (comment)

So I need some help to do this in CI. We can use this but that would take some more work.

@alexsapran (Contributor) commented

> See the 3rd and 4th point here: #41033 (comment)

Answering each bullet:

> How can I persist the build cache generated in CI? If this happens, it will significantly drop the build time.

Why would we need to persist the build cache in CI, and what problem are we solving?
Are we trying to optimise a specific step that we identified as having an issue, or is it generally good practice?

> So I need some help to do this in CI. We can use #41033 (comment) but that would take some more work.

Again, I agree with you, and this is why I am asking you to define the problem we are solving.

Overall, I see this as follows:
We can fix the local dev use case by improving and reusing the cache, which could conclude this work.
Then, identify which steps' times we want to improve and see what we can do there; that can be a dedicated, separate PR.
Because we use ephemeral agents in CI, caching is a challenge: what if we need to evict something from the cache? Is the cache unique to a specific build, or reusable between different builds?

We maintain a set of custom VM images for Beats that come pre-bundled with container images and Go packages to help improve CI times. Can we improve something there?
It depends on the problem we want to solve.
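
On the pre-bundled VM images, the pre-warming essentially amounts to baking the crossbuild images and the Go module cache into the image at build time, roughly as below (a sketch; the checkout path and image tag are assumptions, not the actual VM image tooling):

# Pre-pull the crossbuild image(s) used for packaging.
docker pull docker.elastic.co/beats-dev/golang-crossbuild:1.22.6-main-debian10
# Pre-populate the Go module cache for the repo.
cd /opt/src/beats && go mod download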

@shmsr (Member Author) commented Oct 8, 2024

Thank you, @alexsapran. That makes sense. I propose we proceed as follows:

  1. General improvements such as fixing code smells and micro-optimizations in dev-tools/mage/* (low priority; we can defer this for now).
  2. Faster builds in the development environment (implementing build caching). This will improve build times considerably, and if the cache needs to be evicted, it is easy to do so in a dev environment. It's a safe change IMO and good to have.
  3. Exploration of faster builds in the CI environment (utilizing GOCACHEPROG or alternative solutions; see the sketch after this list). Lots of discussion is needed here, as you also mentioned.
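
For point 3, a GOCACHEPROG-based setup would mean pointing the go command at an external program that stores cache objects in shared storage instead of the local GOCACHE directory. Enabling it looks roughly like this (a sketch only; beats-cacheprog is a hypothetical binary speaking the cacheprog JSON protocol, and older Go versions gate the feature behind GOEXPERIMENT=cacheprog):

export GOEXPERIMENT=cacheprog                                               # only needed where cacheprog is still experimental
export GOCACHEPROG="beats-cacheprog --bucket gs://example-go-build-cache"   # hypothetical helper, not part of this PR
go build ./...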

Changes for points 1 and 2 are ready; I can split them into separate PRs. Additionally, I'll create an issue to track the work for point 3.

What do you think?

@alexsapran (Contributor) commented

> Changes for points 1 and 2 are ready; I can split them into separate PRs. Additionally, I'll create an issue to track the work for point 3.

From my side, splitting the PR is optional. Thanks in advance for raising the issue for 3.

@shmsr mentioned this pull request Oct 8, 2024
mergify bot (Contributor) commented Oct 9, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b shmsr-improve-builder upstream/shmsr-improve-builder
git merge upstream/main
git push upstream shmsr-improve-builder

1 similar comment from mergify bot on Oct 10, 2024.
