
Migrate Buildkite CI queues from AWS to GKE #878

Merged · 2 commits into master · Apr 18, 2024
Conversation

@mstifflin (Contributor) commented Apr 16, 2024

What changed?

  • Update the Buildkite pipeline YAML to work with the newly provisioned queues in Google Kubernetes Engine.
  • Use the agent-stack-k8s v0.8.0 Helm chart, which expects its own pipeline YAML syntax in order to onboard successfully (see the sketch after this list).
  • Install buildkite-agent in the Dockerfile for use in the code coverage step.
    • The mount that previously worked in AWS's VM-based infra doesn't work in Kubernetes' container-based setup. Installing the CLI directly is simpler.
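For orientation, here is a minimal sketch of the step shape agent-stack-k8s expects, based on the chart's documented `kubernetes` plugin syntax. The queue name, label, and image below are illustrative assumptions, not the exact values from this PR:

```yaml
# Minimal sketch of an agent-stack-k8s style step (values are hypothetical).
steps:
  - label: ":gradle: Unit tests"
    agents:
      queue: kubernetes        # must match the queue the GKE controller watches
    plugins:
      - kubernetes:
          podSpec:             # a standard Kubernetes PodSpec for the build pod
            containers:
              - image: gradle:7-jdk11          # hypothetical build image
                command:
                  - ./gradlew --no-daemon test # the Buildkite command to run
```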

Why?

  • Migrate from AWS to Google Cloud. Buildkite Enterprise recommends GKE (rather than plain VM-based compute) as the way to run a queue with autoscaling compute.

How did you test it?

Potential risks

  • CI builds may be broken or flaky.
    • Can be mitigated by a git revert.

Release notes
n/a

Documentation Changes
n/a

@coveralls

Pull Request Test Coverage Report for Build 2247

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 7 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.04%) to 60.234%

Files with Coverage Reduction | New Missed Lines | %
src/main/java/com/uber/cadence/internal/testservice/TestWorkflowMutableStateImpl.java | 1 | 83.5%
src/main/java/com/uber/cadence/internal/replay/ReplayDecisionTaskHandler.java | 1 | 87.8%
src/main/java/com/uber/cadence/internal/replay/ReplayDecider.java | 5 | 80.27%
Totals
Change from base Build 2245: -0.04%
Covered Lines: 11445
Relevant Lines: 19001
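(For reference, the overall percentage is covered lines over relevant lines: 11445 / 19001 ≈ 60.234%.)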

💛 - Coveralls

```diff
  docker: "*"
  command: "./gradlew --no-daemon test"
- timeout_in_minutes: 15
+ timeout_in_minutes: 30
```
Contributor

do we expect CI jobs to take longer on GKE?

@mstifflin (Contributor, Author)

It does seem a bit flakier when leaving the timeout as-is, so I bumped it. I don't have a concrete root cause, but the current theory is that moving from VM-based infra on AWS, where each build has dedicated bandwidth, to a k8s cluster where many containers can compete for bandwidth, could be part of the issue. CPU and memory resources were configured to closely match what was available on AWS (see the sketch below).
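To illustrate that last point, a hedged sketch of how CPU and memory could be pinned in the step's podSpec to mirror a dedicated VM's capacity. The numbers are hypothetical, not the values used in this migration:

```yaml
# Hypothetical resource pinning; values chosen to mirror a dedicated VM,
# not taken from this PR.
plugins:
  - kubernetes:
      podSpec:
        containers:
          - image: gradle:7-jdk11   # hypothetical build image
            resources:
              requests:             # the scheduler reserves at least this much
                cpu: "4"
                memory: 8Gi
              limits:               # the container is throttled/killed past this
                cpu: "4"
                memory: 8Gi
```

Note that requests/limits cover CPU and memory only; network bandwidth isn't reserved this way, which fits the theory that bandwidth contention can persist even when compute closely matches the old VMs.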

@mstifflin merged commit 00503da into master on Apr 18, 2024
11 checks passed