
Switch over to managed Arm64 hosts #16801

Merged
merged 2 commits into from
Oct 24, 2023

Conversation

alexellis
Contributor

@alexellis alexellis commented Oct 20, 2023

Description

Switch over to managed Arm64 hosts

The service is provided by actuated.dev, and sponsored by both Ampere and the CNCF.

Details

This change switches over from 2x self-managed runners where side effects are possible between builds, to a pool of servers where each build runs in an isolated VM.

Benefits to the etcd maintainers and community:

  • Securely isolated, immutable environments for every build
  • No more server maintenance, and reduced documentation and process overhead
  • Support from an externally contracted team, and much more efficient use of CNCF credits
  • Potential to enable builds for each PR instead of just nightly

@dims and other etcd maintainers have been briefed on the discussion / plans via Slack.

Please feel free to ask questions here too.

@ahrtr
Member

ahrtr commented Oct 20, 2023

This change switches over from 2x self-managed runners where side effects are possible between builds, to a pool of servers where each build runs in an isolated VM.

Thanks for raising the PR. What's the scale of the pool? In that case, can we perhaps run the workflow on each PR instead of nightly?

@alexellis
Contributor Author

alexellis commented Oct 20, 2023

There's a total of 20 build slots available to be shared between all CNCF projects in the pool.

On average, how long does the Arm build take?

In order for the PR to work, we need a maintainer to install the Actuated app onto the etcd-io GitHub organisation - this enables the runner management and event subscriptions needed for it to work.

@ahrtr
Member

ahrtr commented Oct 20, 2023

There's a total of 20 build slots available to be shared between all CNCF projects in the pool.

Given the scale, and that it's shared by all CNCF projects, it seems that we still need to run it nightly instead of triggering on each PR.

we need a maintainer to install the Actuated app onto the etcd-io GitHub organisation

Any objection? @etcd-io/maintainers-etcd Let's do this after we get consensus. Thanks for the link.

@alexellis
Contributor Author

Given the scale, and that it's shared by all CNCF projects, it seems that we still need to run it nightly instead of triggering on each PR.

It depends on how quick the PRs are to run. Have you got a timing for the nightly job? I'm seeing about 30 minutes.

Depending on how many PRs you tend to get, it may well be fine to run on every PR, until we have the other projects fully onboarded.

let's do this after we get consensus. thx for the link.

Sure, this has been discussed with @dims for a while already, I'm sure he can answer questions if anyone has them.

@dims
Contributor

dims commented Oct 20, 2023

@alexellis I am not an etcd maintainer per se, so the consensus that @ahrtr mentions among etcd maintainers is important here.

Member

@jmhbnz jmhbnz left a comment


Many thanks for raising this proposal @alexellis. This sounds like a great initiative.

Two questions:

  • Is the scale of the pool fixed at 20, or is there the ability to expand it in future if required? (Given there are ~160 CNCF projects that could want to use the service.)
  • We require 8 CPUs for the lazyfs robustness CI nightlies in order to run tests at a decent level of QPS. Currently we use large GitHub amd64 runners. Is there an option for a large arm64 runner, i.e. actuated-arm64-8cpu-32gb?

@alexellis
Contributor Author

The 20 slots are what Chris A has agreed to pilot. It's up to projects to show demand in order for that to potentially be increased.

32GB RAM is fine if that's what you know you need. There is a free telemetry plugin we can recommend later on which will confirm if 16 or 24GB for instance is being used.

And as you can see from this commit, it's a trivial change to revert if you should need to do so for any reason.

I'd encourage the team to get this moving forward and we can cover further questions and suggestions via Slack.

@jmhbnz
Member

jmhbnz commented Oct 21, 2023

32GB RAM is fine if that's what you know you need. There is a free telemetry plugin we can recommend later on which will confirm if 16 or 24GB for instance is being used.

Yes please. We need to set the spec to 8 CPUs and 32GB RAM to match the GitHub large runners, which we found to be a requirement for our robustness nightly consistency stress tests with lazyfs enabled and a high enough QPS to create confidence.

If we can update that spec, then this PR will look good to me.

@alexellis
Contributor Author

Ack

You can customise the sizes as you require by editing the label in runs-on - just put whatever you need. Since the job only runs overnight, I'd say get this merged, then send your own PR after that with what you want.

I'm just here to help get the ball rolling.
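
For reference, a minimal sketch of what customising the label looks like in a workflow file. This is illustrative, not taken from the PR diff: the workflow and step names are hypothetical, and the label format follows the pattern described in the actuated docs.

```yaml
name: arm64-build
on: [push]
jobs:
  build:
    # actuated labels encode the VM size: actuated-<arch>-<cpus>cpu-<ram>gb,
    # so changing the spec is just a matter of editing this one line.
    runs-on: actuated-arm64-8cpu-32gb
    steps:
      - uses: actions/checkout@v4
      - run: make build   # placeholder build step
```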

Member

@jmhbnz jmhbnz left a comment


LGTM - Thanks @alexellis. I believe managed, on-demand arm64 CI hosts will definitely be a big win for the project. Keen to trial this.

I'm happy to follow up with tweaking the sizing for the machines, though it would certainly be appreciated if you could make that small tweak to save a second PR and an extra commit being required. Will defer to maintainers on whether that is deemed a blocking issue or not.

@fuweid
Member

fuweid commented Oct 22, 2023

Nice! Maybe we can remove the container from the GitHub Action in a follow-up? It seems we don't need to worry about leaked resources.

@tao12345666333
Contributor

Looks great! Maybe we can run it for a while and see if there are any other parts that need to be modified or tweaked.

Member

@ahrtr ahrtr left a comment


@wenjiaswe
Contributor

lgtm thanks!

@alexellis
Contributor Author

@fuweid

Nice! Maybe we can remove the container from the GitHub Action in a follow-up? It seems we don't need to worry about leaked resources.

Could you link me to that?

Every build will run in an isolated VM with its own short-lived Docker daemon and library.

This change switches over from 2x self-managed runners where
side effects are possible between builds, to a pool of servers
where each build runs in an isolated VM.

The service is provided by actuated.dev, and sponsored by
both Ampere and the CNCF.

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
@alexellis
Contributor Author

LGTM - Thanks @alexellis. I believe managed, on-demand arm64 CI hosts will definitely be a big win for the project. Keen to trial this.

Awesome, that's why we're doing this 👍

I'm happy to follow up with tweaking the sizing for the machines, though it would certainly be appreciated if you could make that small tweak to save a second PR and an extra commit being required. Will defer to maintainers on whether that is deemed a blocking issue or not.

The idea is to give you access, then for you to manage this and take ownership, but I've force pushed my commit with the size you asked for.

There may be other jobs in etcd-io repositories which do not require as many resources, or that may even require more. The labels are customisable as per the docs - https://docs.actuated.dev/troubleshooting/#a-job-is-running-out-of-ram-or-needs-more-cores

We've not had anyone perform the GitHub App installation yet. What's needed for this to happen?

@jmhbnz jmhbnz requested a review from serathius October 23, 2023 21:02
@serathius
Member

serathius commented Oct 24, 2023

All arm workflows are currently periodic, so we cannot see whether this change is incompatible with them. For example, our fromJSON charade when making runs-on a parameter in the template. I think there is a high risk it will not work with your change. We will only see the results after merging the PR.

Is this something that was missed, or are we ok with merging PR first and fixing later? @jmhbnz @ahrtr

I would prefer that we test it like we did in previous cases. @alexellis Can you temporarily push a commit that also makes the arm workflows execute on PRs? We just need one run to succeed, and then we can remove the commit from the PR.

@@ -23,7 +23,7 @@ jobs:
       count: 80
       testTimeout: 200m
       artifactName: main-arm64
-      runs-on: "['self-hosted', 'Linux', 'ARM64']"
+      runs-on: actuated-arm64-8cpu-32gb
Member


Good catch from @serathius. I believe for robustness this needs to be "['actuated-arm64-8cpu-32gb']" - refer to line 18 above for an example.

We can probably simplify the fromJson that the robustness nightly template does after this PR merges, because the array will no longer be present.

Member


I don't think "['actuated-arm64-8cpu-32gb']" is the correct solution, as GitHub Actions would treat actuated-arm64-8cpu-32gb as a label and not a machine type. I think the proposed value makes sense assuming we will remove fromJson, but I'm not sure. It's a matter of trial and error.
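
To illustrate the distinction being discussed, here is a hedged sketch of the template pattern (the input name is illustrative, not from the etcd workflows). With fromJSON, the caller must pass a valid JSON value, so a JSON array of labels and a bare label behave very differently.

```yaml
# Reusable-workflow sketch: runs-on is passed in as a string input
# and decoded with fromJSON inside the template.
on:
  workflow_call:
    inputs:
      runsOn:
        type: string
jobs:
  test:
    # fromJSON('["self-hosted", "Linux", "ARM64"]') -> array of runner labels
    # fromJSON('"actuated-arm64-8cpu-32gb"')        -> single label string
    # A bare, unquoted actuated-arm64-8cpu-32gb is not valid JSON and
    # would make the expression fail.
    runs-on: ${{ fromJSON(inputs.runsOn) }}
```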

@jmhbnz
Member

jmhbnz commented Oct 24, 2023

I would prefer that we test it like we did in previous cases. @alexellis Can you temporarily push a commit that also makes the arm workflows execute on PR? We just need to have one run succeed and then we can remove the commit from PR.

We will need a maintainer to add the GitHub App first before the test can succeed; refer to: https://docs.actuated.dev/register/#install-the-github-app

@ahrtr
Member

ahrtr commented Oct 24, 2023

We will need a maintainer to add the Github App first before the test can be successful, refer: https://docs.actuated.dev/register/#install-the-github-app

Done.

Only after merging the PR we will see the results.

It's a good point. It'd be better to verify the change in this PR first (see below). Once it's confirmed, we can change it back. Thanks.

$ git diff
diff --git a/.github/workflows/robustness-nightly.yaml b/.github/workflows/robustness-nightly.yaml
index e3e1d51f3..663fd4720 100644
--- a/.github/workflows/robustness-nightly.yaml
+++ b/.github/workflows/robustness-nightly.yaml
@@ -1,11 +1,7 @@
 ---
 name: Robustness Nightly
 permissions: read-all
-on:
-  # schedules always run against the main branch, hence we have to create separate jobs
-  # with individual checkout actions for each of the active release branches
-  schedule:
-    - cron: '25 9 * * *' # runs every day at 09:25 UTC
+on: [push, pull_request]
 jobs:
   main:
     # GHA has a maximum amount of 6h execution time, we try to get done within 3h

@alexellis
Contributor Author

alexellis commented Oct 24, 2023

@ahrtr I was going to suggest adding a workflow_dispatch to the job, and getting this merged.

I don't think you can test it in a PR until the initial merge of this PR has been performed.

See my follow-up commit, to enable the job to be tested without waiting for the cron.

Once you've seen it working, I'd suggest removing the container elements, which force the steps to run inside Docker, from the file, as it'll run much quicker using the native runner that way.
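
For context, removing the container element means dropping a section like the one below from the job, so steps run directly on the runner VM. This is a hypothetical job, not the actual etcd workflow; the image name is illustrative.

```yaml
jobs:
  test:
    runs-on: actuated-arm64-8cpu-32gb
    # Deleting this block makes the steps run natively on the runner VM
    # instead of inside a Docker container:
    # container:
    #   image: golang:1.21
    steps:
      - uses: actions/checkout@v4
```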

@alexellis
Contributor Author

As per your request: #16801 (comment)

I've added the requested extra triggers.

@ahrtr
Member

ahrtr commented Oct 24, 2023

@alexellis could you rebase this PR to get rid of the commits coming from the main branch?

@alexellis
Contributor Author

Looks like the nightly test ran as expected, in about the same time as without any management:

[screenshot: workflow run timing]

Adding workflow_dispatch as an "on" trigger enables
manual testing by maintainers, without having to wait for
the nightly cron schedule.

@ahrtr requested this temporary change in order to trigger
the arm64 jobs via CI.

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
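
The commit message above describes adding workflow_dispatch as a trigger; in workflow terms that amounts to something like the sketch below (the cron line mirrors the one in the diff quoted earlier in the thread).

```yaml
on:
  # allows maintainers to trigger the workflow manually from the Actions tab
  workflow_dispatch: {}
  schedule:
    - cron: '25 9 * * *' # runs every day at 09:25 UTC
```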
@alexellis
Contributor Author

@jmhbnz that's been done and the Arm64 jobs have completed as far as I can see.

mingli103 added a commit to mingli103/etcd that referenced this pull request Nov 10, 2023
Recently in etcd-io#16801 we introduced on-demand GitHub Actions runners for the arm64 platform.
Having on demand runner infrastructure in place means we should now have enough capacity to begin running arm64 tests for every pull request.
Currently we have:
.github/workflows/e2e-arm64-template.yaml - Shared template
.github/workflows/e2e-arm64-nightly.yaml - Runs template against both release-3.5 and main branches nightly.
Moving forward we can just rename .github/workflows/e2e-arm64-template.yaml to .github/workflows/e2e-arm64.yaml and delete the other file. We can then just make the template file a standard workflow that will run on pull request.

Signed-off-by: Ming Li <mli103hawk@gmail.com>
mingli103 added a commit to mingli103/etcd that referenced this pull request Nov 15, 2023
mingli103 added a commit to mingli103/etcd that referenced this pull request Nov 15, 2023
…tcd-io#16912

KubeconNA 2023 Contribfest issue etcd-io#16893.

Signed-off-by: Ming Li <mli103hawk@gmail.com>
@alexellis
Contributor Author

Hi folks 👋

We've seen around 200 jobs scheduled for the org today, with bbolt having also been added on. That's not a very big number and shouldn't be hitting any rate limits; however, for etcd-io (and no other customers) we're seeing a 403 rate-limit error while trying to obtain a registration token (GitHub support said the actuated app is within its limit for etcd-io).

I don't know if GitHub rolled out a bug or is having a partial outage, but I'm pinging them. It looks like there are 18 jobs queued which can be retried on our end in the morning by @welteki on the actuated team.

Just gathering info: do you have any other GitHub apps, bots, or integrations installed or operating on this organisation?

Alex

@jmhbnz
Member

jmhbnz commented Nov 16, 2023

Just gathering info: do you have any other GitHub apps, bots, or integrations installed or operating on this organisation?

As of recently, we have kubernetes/test-infra#31218 in progress to finalise kubernetes/org#4498, the k8s-ci-robot / Prow test-infra integration.

You can see this starting to kick in under a recent PR like #16950, but it is early days.

We have an arm64 runner issue under #16948 that we would like actuated assistance on.

Lastly, we are planning for k8s-ci-robot to be able to trigger Actions workflows for new contributors via #16956.

@alexellis
Contributor Author

@ahrtr seems like someone in your org has disabled self-hosted runners 🙈

"Forbidden", "errors": "Repository level self-hosted runners are disabled on this repository."

You'll need to look into this and change the setting so that actuated can add runners for CI.

@alexellis
Contributor Author

alexellis commented Nov 17, 2023

Please send your email address to alex@actuated.dev for a Slack invite, for ongoing support for etcd.

@serathius
Member

@ahrtr seems like someone on your org has disabled self hosted runners 🙈

Forbidden","errors":"Repository level self-hosted runners are disabled on this repository.

You’ll need to look into this and change the setting so that actuated can add runners for CI.

I think it might be the result of:

[screenshot]

@alexellis
Contributor Author

The enterprise has its own set of configurations, so it will need to be configured appropriately to enable the use of repository-level self-hosted runners.

The GitHub team told me that the default is to allow for them.

@ahrtr
Member

ahrtr commented Nov 17, 2023

@ahrtr seems like someone on your org has disabled self hosted runners 🙈

Forbidden","errors":"Repository level self-hosted runners are disabled on this repository.

You’ll need to look into this and change the setting so that actuated can add runners for CI.

Yes, it might be caused by the recent etcd-io org transfer. See the screenshot below as well:

[screenshot]

But it seems that actuated still has the required permission?
[screenshot]

Please send me your email for a Slack invite to alex@actuated.dev

sent.

@alexellis
Contributor Author

The issue is not whether the actuated GitHub app has permissions; it's whether your enterprise and organisation allow self-hosted runners, and at present they do not.

You'll need to go through the settings at both levels and toggle it over. Here are two screenshots GitHub's actions PM sent to me earlier today:

[screenshots]

@ahrtr
Member

ahrtr commented Nov 17, 2023

Please see the screenshot below. I see the message below under the "Runners" section; it seems that I have no permission to change this? @mrbobbytables @palnabarun could you help with this?

Choose which repositories are allowed to create repository-level self-hosted runners.

This setting has been set by enterprise administrators.

[screenshot]

@ahrtr
Member

ahrtr commented Nov 17, 2023

@jmhbnz you have also been invited to the actuated Slack workspace by @alexellis to discuss and coordinate any arm64-related workflow issues. If you do not receive the email, please feel free to send an email to alex@actuated.dev to request to join. Thanks.

@alexellis
Contributor Author

@mrbobbytables could you set the following at the enterprise level please? All of the etcd jobs are backed up since this was changed / transferred. There's a total of 85 that cannot run due to this setting.

[screenshot]

@mrbobbytables

Sorry about that - I've flipped it so etcd-io should be able to add them. If someone could verify I'd appreciate it 🙏

@ahrtr @serathius @wenjiaswe @jmhbnz - as we integrate etcd-io into k8s, more little bumps like this are bound to happen. Is there an existing issue here or in k8s that we could use to surface these issues as they arise?

If one does not exist, I'd lean towards creating one in https://github.com/kubernetes/org as the GitHub admins all watch that repo.

@ahrtr
Member

ahrtr commented Nov 18, 2023

Thanks @mrbobbytables.

@alexellis The configuration should be correct now; please see the screenshot below. But lots of arm64 workflows are still pending (queued). Should we manually re-trigger all of the already-queued workflows, or wait for them to be triggered automatically?
[screenshot]

If one does not exist, I'd lean towards creating one in https://github.com/kubernetes/org as the GitHub admins all watch that repo.

Leave it to @jmhbnz to follow this. Thanks.

mingli103 added a commit to mingli103/etcd that referenced this pull request Nov 18, 2023
…tcd-io#16912

KubeconNA 2023 Contribfest issue etcd-io#16893.

Signed-off-by: Ming Li <mli103hawk@gmail.com>
@mrbobbytables

I'm not super familiar with the self-hosted runners. If they don't retrigger automatically soon, though, it may need to be done manually.

@ahrtr
Member

ahrtr commented Nov 18, 2023

Yes, manually re-triggering works. FYI. etcd-io/bbolt#614

@ahrtr
Member

ahrtr commented Nov 18, 2023

Just manually re-triggered all workflows.

@jmhbnz
Member

jmhbnz commented Nov 18, 2023

Thanks @ahrtr, @mrbobbytables. I've opened kubernetes/org#4590 to get our large amd64 runners re-enabled. Another side-effect of moving the org.

@alexellis
Contributor Author

actuated customers (and members of enrolled GitHub organisations) can retrigger queued jobs as per the docs:

https://docs.actuated.dev/tasks/cli/#schedule-a-repair-to-re-queue-jobs

I can see jobs are running again now.
