Switch over to managed Arm64 hosts #16801
Conversation
thx for raising the PR. What's the scale of the pool? In this case, can we probably run the workflow on each PR instead of nightly?
There's a total of 20 build slots available to be shared between all CNCF projects in the pool. On average, how long does the Arm build take? In order for the PR to work, we need a maintainer to install the Actuated app onto the etcd-io GitHub organisation - this enables the runner management and event subscriptions needed for it to work.
Given the scale, and that it's shared by all CNCF projects, it seems that we still need to run it nightly instead of triggering it on each PR.
Any objections? @etcd-io/maintainers-etcd let's do this after we get consensus. thx for the link.
It depends on how quick the PRs are to run. Have you got a timing for the nightly job? I'm seeing about 30 minutes. Depending on how many PRs you tend to get, it may well be fine to run on every PR, until we have the other projects fully onboarded.
Sure, this has been discussed with @dims for a while already; I'm sure he can answer questions if anyone has them.
@alexellis i am not an etcd maintainer per se, so the consensus that @ahrtr mentions among etcd maintainers is important here.
Many thanks for raising this proposal @alexellis. This sounds like a great initiative.
Two questions:
- Is the scale of the pool fixed at 20, or is there the ability to expand it in future if required? (Given there are ~160 CNCF projects that could want to use the service.)
- We require 8 CPUs for the lazyfs robustness CI nightlies in order to run tests at a decent level of QPS. Currently we use large GitHub amd64 runners. Is there an option for a large arm64 runner, i.e. actuated-arm64-8cpu-32gb?
The 20 slots are what Chris A has agreed to pilot. It's up to projects to show demand in order for that to potentially be increased. 32GB RAM is fine if that's what you know you need. There is a free telemetry plugin we can recommend later on which will confirm if 16 or 24GB for instance is being used. And as you can see from this commit, it's a trivial change to revert if you should need to do so for any reason. I'd encourage the team to get this moving forward and we can cover further questions and suggestions via Slack.
Yes please. We need to set the spec to 8 CPU and 32GB RAM to match the GitHub large runners, which we found to be a requirement for our robustness nightly consistency stress tests with lazyfs enabled and a high enough QPS to create confidence. If we can update that spec, then this PR will look good to me.
Ack. You can customise the sizes as you require by editing the label. I'm just here to help get the ball rolling.
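As an illustration of that, a minimal sketch of how the size could be expressed purely through the runner label, assuming the actuated naming convention used in this thread (the smaller label below is a hypothetical example, not a confirmed size):

```yaml
# Per-job sizing is encoded in the label itself (assumed convention: arch-cpu-ram).
runs-on: actuated-arm64-8cpu-32gb
# A lighter job could hypothetically use a smaller label, e.g.:
# runs-on: actuated-arm64-4cpu-16gb
```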
LGTM - Thanks @alexellis. I believe managed on-demand arm64 CI hosts will definitely be a big win for the project. Keen to trial this.
I'm happy to follow up with tweaking the sizing for the machines, though it would certainly be appreciated if you could make that small tweak to save a second PR and extra commit being required. Will defer to maintainers on whether that is deemed a blocking issue or not.
nice! Maybe we can remove the container in the GitHub Actions job as a follow-up? It seems we don't need to worry about leaky resources.
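For context, a hedged sketch of what dropping the container wrapper might look like; the workflow name, image, job name, and test target here are placeholders, not etcd's actual workflow:

```yaml
name: arm64-sketch          # hypothetical workflow name
on: [pull_request]
jobs:
  test-arm64:
    runs-on: actuated-arm64-8cpu-32gb
    # container: golang:1.21   # previously useful for sandboxing on a long-lived
    #                          # self-hosted runner; with one ephemeral VM per job
    #                          # there is no persistent host to leak resources onto
    steps:
      - uses: actions/checkout@v4
      - run: make test        # placeholder for the real test target
```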
Looks great! Maybe we can run it for a while and see if there are any other parts that need to be modified/tweaked.
LGTM
thx @alexellis
The followups:
- Install the Actuated app onto the etcd-io GitHub organisation per Switch over to managed Arm64 hosts #16801 (comment)
- Customise the size per Switch over to managed Arm64 hosts #16801 (comment)
lgtm thanks!
Could you link me to that? Every build will run in an isolated VM with its own short-lived Docker daemon and library.
This change switches over from 2x self-managed runners where side effects are possible between builds, to a pool of servers where each build runs in an isolated VM. The service is provided by actuated.dev, and sponsored by both Ampere and the CNCF. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
Awesome, that's why we're doing this 👍
The idea is to give you access, then for you to manage this and take ownership, but I've force-pushed my commit with the size you asked for. There may be other jobs in etcd-io repositories which do not require as many resources, or that may even require more. The labels are customisable as per the docs: https://docs.actuated.dev/troubleshooting/#a-job-is-running-out-of-ram-or-needs-more-cores
We've not had anyone perform the GitHub App installation yet. What's needed for this to happen?
All arm workflows (for example, our nightly arm64 e2e workflow) are currently periodic, so we cannot see if this change is incompatible with them. Is this something that was missed, or are we ok with merging the PR first and fixing it later? @jmhbnz @ahrtr I would prefer that we test it like we did in previous cases. @alexellis Can you temporarily push a commit that also makes the arm workflows execute on PR? We just need one run to succeed, and then we can remove the commit from the PR.
@@ -23,7 +23,7 @@ jobs:
       count: 80
       testTimeout: 200m
       artifactName: main-arm64
-      runs-on: "['self-hosted', 'Linux', 'ARM64']"
+      runs-on: actuated-arm64-8cpu-32gb
Good catch from @serathius. I believe for robustness this needs to be "['actuated-arm64-8cpu-32gb']"; refer to line 18 above for an example.
We can probably simplify the fromJson that the robustness nightly template does after this PR merges, because the array will no longer be present.
Don't think "['actuated-arm64-8cpu-32gb']" is the correct solution, as GitHub Actions would treat actuated-arm64-8cpu-32gb as a label and not a machine type. I think the proposed value makes sense assuming we will remove fromJson, but I'm not sure. It's a matter of trial and error.
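To make the two options concrete, here is a hedged sketch of how the reusable template's runs-on input could be handled; the input name and default are assumptions, and whether the plain-string form behaves as expected is exactly the trial-and-error point above:

```yaml
on:
  workflow_call:
    inputs:
      runs-on:
        type: string
        default: actuated-arm64-8cpu-32gb   # assumed default; set by the caller in practice

jobs:
  test:
    # Old array form, passed as a JSON-encoded string and expanded at runtime:
    #   runs-on: ${{ fromJson(inputs.runs-on) }}   # "['self-hosted', 'Linux', 'ARM64']"
    # Proposed single-label form, which would let fromJson be dropped:
    runs-on: ${{ inputs.runs-on }}
    steps:
      - run: echo "runner label is ${{ inputs.runs-on }}"
```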
We will need a maintainer to add the GitHub App first before the test can be successful; refer to: https://docs.actuated.dev/register/#install-the-github-app
Done.
It's a good point. It'd be better to verify the change in this PR first (see below). Once it's confirmed, we can change it back. thx
@ahrtr I was going to suggest adding a workflow_dispatch to the job, and getting this merged. I don't think you can test it in a PR until the initial merge of this PR has been performed. See my follow-up commit, which enables the job to be tested without waiting for the cron. Once you've seen it working, I'd suggest removing the extra triggers.
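Roughly, that temporary trigger change could look like the following; the cron expression shown is illustrative rather than etcd's actual schedule:

```yaml
on:
  schedule:
    - cron: "0 0 * * *"     # existing nightly trigger (time shown is illustrative)
  workflow_dispatch: {}     # lets maintainers start the job manually from the Actions tab
  pull_request: {}          # temporary: run on PRs so the change can be verified, then remove
```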
As per your request in #16801 (comment), I've added the requested extra triggers.
@alexellis could you rebase this PR to get rid of the commits coming from
Adding workflow_dispatch as an "on" trigger enables manual testing by maintainers, without having to wait for the nightly cron schedule. @ahrtr requested this temporary change in order to trigger the arm64 jobs via CI. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
@jmhbnz that's been done and the Arm64 jobs have completed as far as I can see.
Recently in etcd-io#16801 we introduced on demand github actions runners for the arm64 platform. Having on demand runner infrastructure in place means we should now have enough capacity to begin running arm64 tests for every pull request. Currently we have: .github/workflows/e2e-arm64-template.yaml - Shared template .github/workflows/e2e-arm64-nightly.yaml - Runs template against both release-3.5 and main branches nightly. Moving forward we can just rename .github/workflows/e2e-arm64-template.yaml to .github/workflows/e2e-arm64.yaml and delete the other file. We can then just make the template file a standard workflow that will run on pull request. Signed-off-by: Ming Li <mli103hawk@gmail.com>
…tcd-io#16912 KubeconNA 2023 Contribfest issue etcd-io#16893 . Recently in etcd-io#16801 we introduced on demand github actions runners for the arm64 platform. Having on demand runner infrastructure in place means we should now have enough capacity to begin running arm64 tests for every pull request. Currently we have: .github/workflows/e2e-arm64-template.yaml - Shared template .github/workflows/e2e-arm64-nightly.yaml - Runs template against both release-3.5 and main branches nightly. Moving forward we can just rename .github/workflows/e2e-arm64-template.yaml to .github/workflows/e2e-arm64.yaml and delete the other file. We can then just make the template file a standard workflow that will run on pull request. Signed-off-by: Ming Li <mli103hawk@gmail.com>
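A hedged sketch of what the renamed .github/workflows/e2e-arm64.yaml described above could look like as a standard pull-request workflow; the trigger, permissions, and test invocation are illustrative assumptions:

```yaml
name: E2E Arm64
on: [pull_request]
permissions: read-all

jobs:
  e2e-arm64:
    runs-on: actuated-arm64-8cpu-32gb
    steps:
      - uses: actions/checkout@v4
      - run: make test-e2e   # placeholder for the actual e2e test invocation
```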
Hi folks 👋 We've seen around 200 jobs scheduled for the org today, with bbolt having also been added on. That's not a very big number and shouldn't be hitting any rate limits; however, for etcd-io and no other customers, we're seeing a 403 / rate-limit error whilst trying to obtain a registration token (GitHub support said the actuated app is within its limit for etcd-io). I don't know if GitHub rolled out a bug or is having a partial outage, but I'm pinging them. It looks like there are 18 jobs queued which can be retried on our end in the morning by @welteki on the actuated team. Just gathering info: do you have any other GitHub apps, bots, or integrations installed/operating on this organisation? Alex
Recently we have kubernetes/test-infra#31218 in progress to finalise kubernetes/org#4498, the k8s ci robot / k8s prow test-infra integration. You can see this start to kick in under a recent PR like #16950, but it is early days. We have an arm64 runner issue under #16948 that we would like actuated assistance on. Lastly, we are planning for the k8s ci robot to be able to trigger actions workflows for new contributors via #16956.
@ahrtr seems like someone on your org has disabled self-hosted runners 🙈 The error is: "Forbidden" / "Repository level self-hosted runners are disabled on this repository." You'll need to look into this and change the setting so that actuated can add runners for CI.
Please send your email address to alex@actuated.dev for a Slack invite, for ongoing support for etcd.
The enterprise has its own set of configurations, so it will need to be configured appropriately to enable the use of repository-level self-hosted runners. The GitHub team told me that the default is to allow them.
Yes, it might be caused by the recent etcd-io org transfer. See the screenshot below as well, but it seems that actuated still has the required permission?
sent.
The issue is not whether the actuated GitHub app has permissions; it's whether your enterprise and organisation allow self-hosted runners, and they do not at present. You'll need to go through the settings at both levels and toggle it over. Here are two screenshots that GitHub's Actions PM sent to me earlier today:
Pls see the screenshot below. I see the message below under the "Runner" section; it seems that I have no permission to do that? @mrbobbytables @palnabarun could you help with this?
@jmhbnz you are also invited to the Slack space
@mrbobbytables could you set the following at the enterprise level please? All of the etcd jobs are backed up since this was changed / transferred. There's a total of 85 that cannot run due to this setting.
Sorry about that - I've flipped it, so etcd-io should be able to add them. If someone could verify, I'd appreciate it 🙏 @ahrtr @serathius @wenjiaswe @jmhbnz - as we integrate etcd-io into k8s more, little bumps like this are bound to happen; is there an existing issue here or in k8s that we could use to surface these issues as they arise? If one does not exist, I'd lean towards creating one in https://github.com/kubernetes/org as the GitHub admins all watch that repo.
Thanks @mrbobbytables. @alexellis The configuration should be correct now; please see the screenshot below. But lots of arm64 workflows are still pending (queued). Should we manually re-trigger all the already-queued workflows, or wait longer for them to be triggered automatically?
Leave it to @jmhbnz to follow up on this. Thanks.
I'm not super familiar with the self-hosted runners ^^;; If they don't re-trigger automatically soon though, it may need to be done manually.
Yes, manually re-triggering works. FYI: etcd-io/bbolt#614
Just manually re-triggered all workflows.
Thanks @ahrtr, @mrbobbytables. I've opened kubernetes/org#4590 to get our large amd64 runners re-enabled. Another side-effect of moving the org.
actuated customers (and members of enrolled GitHub organisations) can re-trigger queued jobs as per the docs: https://docs.actuated.dev/tasks/cli/#schedule-a-repair-to-re-queue-jobs
I can see jobs are running again now.
Description
Switch over to managed Arm64 hosts
The service is provided by actuated.dev, and sponsored by both Ampere and the CNCF.
Details
This change switches over from 2x self-managed runners where side effects are possible between builds, to a pool of servers where each build runs in an isolated VM.
Benefits to the etcd maintainers and community:
@dims and other etcd maintainers have been briefed on the discussion / plans via Slack.
Please feel free to ask questions here too.