
Don't error if the webhook event only has GitHub owned labels #1348

Closed
toast-gear opened this issue Oct 27, 2021 · 7 comments · Fixed by #1381

Comments

@toast-gear
Contributor

toast-gear commented Oct 27, 2021

I have webhooks set up at the organisation level, as installing them at the repo level would mean managing hundreds of webhooks. When a workflow_job event is sent to my API Gateway, the webhook is marked as failed because it receives an HTTP 403 response. My logs show these two errors over and over again:

2021-10-27T14:46:31.491Z	4ff42805-6636-4c5e-89c3-11b0c1225917	
DEBUG	Received workflow job event with labels: '["ubuntu-latest"]'. 
The event does NOT match the configured labels: 'default,self-hosted'

2021-10-27T14:46:31.491Z	4ff42805-6636-4c5e-89c3-11b0c1225917	
ERROR	Received event contains runner labels 'ubuntu-latest' that are not accepted.

This is confusing, as it makes it look like there is a problem with the configuration of the webhook when really the webhook is fine. I know support for ignoring known GitHub self-hosted labels was added in PR #1244. Given that many (probably the majority of) organisations running self-hosted runners also utilise GitHub's runners, it would be great if, when the webhook lambda receives a workflow_job event whose runs-on labels are all GitHub-owned labels such as ubuntu-latest, it ignored the event, logged the ignore at the debug level (as it does now), and returned an HTTP 200 response instead.

Fixing this so the webhook still receives an HTTP 200 response, marking it as green in the GitHub UI regardless of whether the job is bound for a philips-labs runner or not, would enable an administrator to more easily spot real configuration errors such as a mismatched webhook secret.

EDIT The current GitHub-managed labels can be found here: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources. Workflow jobs marked only for any of these labels should not result in a non-200 HTTP response (nor should any other label combination, imo).
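To make the suggestion concrete, a check like the following could decide when a workflow_job event targets only GitHub-hosted runners and should therefore be ignored with an HTTP 200. This is a sketch only, not the module's actual code: the function name and the label list (taken from the docs page linked above) are my own, and the real list would need to track GitHub's documentation.

```typescript
// Hypothetical sketch: labels GitHub assigns to its own hosted runners,
// per the "supported runners" docs page. Not an exhaustive or official list.
const GITHUB_HOSTED_LABELS = new Set([
  'ubuntu-latest', 'ubuntu-20.04', 'ubuntu-18.04',
  'windows-latest', 'windows-2022', 'windows-2019',
  'macos-latest', 'macos-11', 'macos-10.15',
]);

// Returns true when every runs-on label belongs to GitHub's hosted runners,
// i.e. the job can never target a self-hosted runner and the webhook
// delivery should be acknowledged (HTTP 200) rather than rejected.
function onlyGitHubHostedLabels(labels: string[]): boolean {
  return labels.length > 0 &&
    labels.every((l) => GITHUB_HOSTED_LABELS.has(l.toLowerCase()));
}
```

For example, `onlyGitHubHostedLabels(['ubuntu-latest'])` would be true (ignore quietly), while a job with `['self-hosted', 'linux']` would fall through to the normal label-matching logic.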

EDIT As a more concrete example, here is what some of my webhooks look like atm on the organisation webhook screen:

[screenshot: organisation webhook list]

And here are some Recent Deliveries in one of the webhooks' settings.

[screenshot: Recent Deliveries list]

They are all working fine, but it looks like I have a major configuration issue. This leads to lots of confusion unless the administrator investigating the webhook errors is:

A) familiar with the GitHub Actions platform
B) familiar with the implementation details of this module

@toast-gear
Contributor Author

toast-gear commented Oct 28, 2021

Having looked a bit more at this: would setting https://github.com/philips-labs/terraform-aws-github-runner#input_disable_check_wokflow_job_labels to true prevent the error I've described?

I would say that the default of checking the labels and erroring doesn't make much sense to me. Not erroring seems the better default, because I suspect the majority of organisations using self-hosted runners also use GitHub-hosted runners. Only the most security-centric organisations will demand all workloads run on self-hosted runners, and given GitHub's lack of security features they are unlikely to use the Actions platform to start with, or they will be running GHES, where GitHub-hosted runners aren't supported. It is a feature on the roadmap (github/roadmap#72), but it isn't here today (nor any time soon), and for the reasons I stated previously I don't see GitHub's Actions offering being popular in that kind of environment any time soon.

What was the reason behind the current default config? I can't think of many reasons for wanting to fail the webhook on a label mismatch by default. If you did want that, you would have some sort of custom monitoring on the webhook, so it doesn't make sense to me to base a default setting on a hypothetical monitoring solution.

I can see that the list of GitHub-managed labels is specific to self-hosted runners, so this probably won't stop my errors, right? https://github.com/philips-labs/terraform-aws-github-runner/blob/master/modules/webhook/lambdas/webhook/src/webhook/handler.ts#L116

@toast-gear
Contributor Author

toast-gear commented Oct 29, 2021

@npalm I'm really keen to hear whether this behaviour will be changed or not. If not, I'll need to document the caveats on my end (which, realistically speaking, will probably not get read 😞) to help prevent confusion if other engineers touch the implementation at some point, as it won't be clear why the webhooks are "failing".

@npalm
Member

npalm commented Nov 1, 2021

@toast-gear put it on my list to check.

@toast-gear
Contributor Author

Thanks @npalm, excited to replace our existing, excessively custom solution with this excellent module!

@npalm
Member

npalm commented Nov 2, 2021

First off, returning a 403 is certainly not correct. It should be either an error in the 500 range or an OK in the 200 range.

The check was added for a couple of reasons:

  • To avoid scaling kicking in for jobs that should not be processed by the runners.
  • Since I introduced the check enabled by default, it seemed logical to return an error when the event is ignored.
  • To prepare for supporting different types of runners (configurations).

I think the best way forward is to:

  1. Return 200 (instead of 403), or even better a 202 as the GitHub docs suggest. And maybe add some extra info to the body in case the labels are not matching.
  2. Indeed change the default to disabled.
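The suggested 202-with-extra-info response might look like the sketch below. This is not the module's code: the function name is hypothetical, and the response shape is assumed from the usual API Gateway/Lambda proxy-integration convention (`statusCode` plus a JSON `body`).

```typescript
// Hypothetical sketch: build an "accepted but intentionally ignored"
// response when a workflow_job's labels don't match the configuration.
function ignoredJobResponse(jobLabels: string[], acceptedLabels: string[]) {
  return {
    statusCode: 202, // accepted by the endpoint, deliberately not processed
    body: JSON.stringify({
      message:
        `Ignored workflow_job: labels [${jobLabels.join(', ')}] ` +
        `do not match accepted labels [${acceptedLabels.join(', ')}]`,
    }),
  };
}
```

GitHub records the delivery as successful for any 2XX status, so this keeps the webhook green while still telling a human (via the response body in Recent Deliveries) why the event was skipped.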

@toast-gear
Contributor Author

To avoid scaling kicking in for jobs that should not be processed by the runners.

We certainly don't want to break this! Just prevent a non-2XX response code being returned to GitHub when the labels don't match.

@toast-gear
Contributor Author

toast-gear commented Nov 2, 2021

My apologies, but I think I've misunderstood what disable_check_wokflow_job_labels does. I've deployed my stacks again with it enabled, and now I scale for any job regardless of label.

Perhaps I should describe what I think the ideal behaviour is.

The runs-on labels (a.k.a. the labels attribute in the webhook payload) of the workflow_job event are checked by the webhook lambda when it receives the payload, after validating the payload's signature etc. If it finds a matching label, the job is queued onto the SQS queue. If it does not find a matching label, the job is ignored: it is not added to any SQS queue, the ignore is logged at the debug level, and an HTTP 2XX response code is returned.
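The flow described above could be sketched roughly as follows. This is an illustration, not the module's handler: `sendToQueue` and `logDebug` are hypothetical placeholders for the real SQS and logging calls, and I've assumed "a matching label" means at least one job label appears in the configured set.

```typescript
// Hypothetical sketch of the ideal workflow_job handling flow:
// match -> queue the job; no match -> debug-log and return 2XX.
function handleWorkflowJob(
  jobLabels: string[],
  acceptedLabels: string[],
  sendToQueue: (labels: string[]) => void, // stand-in for the SQS publish
  logDebug: (msg: string) => void,         // stand-in for the logger
): { statusCode: number } {
  const matches = jobLabels.some((l) => acceptedLabels.includes(l));
  if (matches) {
    sendToQueue(jobLabels); // job is bound for a self-hosted runner: queue it
    return { statusCode: 201 };
  }
  // No match: ignore quietly so the webhook stays green in the GitHub UI.
  logDebug(`Ignoring workflow_job with labels: ${jobLabels.join(', ')}`);
  return { statusCode: 200 };
}
```

With this shape, a delivery for `ubuntu-latest` against accepted labels `self-hosted,default` returns 200 without queueing anything, while a `self-hosted` job is queued and acknowledged.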

This just seems like the behaviour everyone would want, to me. The reason the workflow_job webhook was introduced was to expose the runs-on labels in a webhook event, to allow better scaling for projects like terraform-aws-github-runner and actions-runner-controller.

Now that I've redeployed the stacks with disable_check_wokflow_job_labels enabled, I do not think having it enabled by default is right, as it will mean excessive scaling for 99% of implementations: I believe the vast majority of organisations running self-hosted runners are complementing them with GitHub's hosted runners. I can tell you the reason my organisation uses self-hosted runners is so we can use role-based authentication in AWS rather than access/secret keys. We don't have an issue with people using GitHub-hosted runners if they don't require AWS authentication in their workflow. I think this is going to be the most common setup, so the solution should default to settings suited to mixed-fleet environments with optimal scaling, i.e. only scale the appropriate runners for workflows whose labels demand it.

Back to streams of 403 errors then :(

[screenshot: Recent Deliveries showing repeated 403 responses]

https://github.com/philips-labs/terraform-aws-github-runner/blob/develop/modules/webhook/lambdas/webhook/src/webhook/handler.ts#L59-L61 The change is easy enough to do: return 200, change the log level, and adjust the log message slightly to say the job was ignored. But I wouldn't know what I'm doing with the tests, so I wouldn't want to raise the PR myself.
