
Don't error if the webhook event only has GitHub owned labels #1348

Closed
toast-gear opened this issue Oct 27, 2021 · 7 comments · Fixed by #1381

Comments

@toast-gear
Contributor

toast-gear commented Oct 27, 2021

I have webhooks set up at the organisation level, as installing them at the repo level would mean managing hundreds of webhooks. When a workflow_job event is sent to my API Gateway, the webhook is marked as failed because it receives an HTTP 403 response. My logs show these two errors over and over again:

2021-10-27T14:46:31.491Z	4ff42805-6636-4c5e-89c3-11b0c1225917	
DEBUG	Received workflow job event with labels: '["ubuntu-latest"]'. 
The event does NOT match the configured labels: 'default,self-hosted'

2021-10-27T14:46:31.491Z	4ff42805-6636-4c5e-89c3-11b0c1225917	
ERROR	Received event contains runner labels 'ubuntu-latest' that are not accepted.

This is confusing, as it makes it look like there is a problem with the configuration of the webhook when really the webhook is fine. I know support for ignoring known GitHub self-hosted labels was added in PR #1244. Given that many (probably the majority of) organisations running self-hosted runners also utilise GitHub's runners, it would be great if, when the webhook lambda receives a workflow_job event whose runs-on labels are all GitHub-owned labels such as ubuntu-latest, it ignored the event, logged the ignore at the debug level (as it does now), and returned an HTTP 200 response instead.

Fixing this so the webhook still receives an HTTP 200 response, marking it as green in the GitHub UI regardless of whether the job is bound for a philips-labs runner or not, would enable an administrator to more easily spot real configuration errors such as a mismatched webhook secret.

EDIT The current GitHub-managed labels can be found here: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources. Workflow jobs marked only for any of these labels should not result in a non-200 HTTP response (nor should any other label combination, imo).
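To make the suggestion concrete, a check like the following could decide when a workflow_job event targets only GitHub-hosted runners and should therefore be ignored with an HTTP 200. This is a sketch only, not the module's actual code: the function name and the label list (taken from the docs page linked above) are my own, and the real list would need to track GitHub's documentation.

```typescript
// Hypothetical sketch: labels GitHub assigns to its own hosted runners,
// per the "supported runners" docs page. Not an exhaustive or official list.
const GITHUB_HOSTED_LABELS = new Set([
  'ubuntu-latest', 'ubuntu-20.04', 'ubuntu-18.04',
  'windows-latest', 'windows-2022', 'windows-2019',
  'macos-latest', 'macos-11', 'macos-10.15',
]);

// Returns true when every runs-on label belongs to GitHub's hosted runners,
// i.e. the job can never target a self-hosted runner and the webhook
// delivery should be acknowledged (HTTP 200) rather than rejected.
function onlyGitHubHostedLabels(labels: string[]): boolean {
  return labels.length > 0 &&
    labels.every((l) => GITHUB_HOSTED_LABELS.has(l.toLowerCase()));
}
```

For example, `onlyGitHubHostedLabels(['ubuntu-latest'])` would be true (ignore quietly), while a job with `['self-hosted', 'linux']` would fall through to the normal label-matching logic.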

EDIT As a more concrete example, here is what some of my webhooks look like atm on the organisation webhook screen:

[screenshot: organisation webhook list]

And here are some Recent Deliveries in one of the webhooks' settings.

[screenshot: Recent Deliveries list]

They are all working fine, but it looks like I have a major configuration issue. This leads to lots of confusion unless the administrator investigating the webhook errors is:

A) familiar with the GitHub Actions platform
B) familiar with the implementation details of this module

@toast-gear
Contributor Author

toast-gear commented Oct 28, 2021

Having looked a bit more at this: would setting https://github.com/philips-labs/terraform-aws-github-runner#input_disable_check_wokflow_job_labels to true prevent the error I've described?

I would say that the default of checking the labels and erroring doesn't make much sense to me. Not erroring seems the better default, because I suspect the majority of organisations using self-hosted runners also use GitHub-hosted runners. Only the most security-centric organisations will demand all workloads run on self-hosted runners, and given GitHub's lack of security features they are unlikely to use the Actions platform to start with, or they will be running GHES, where GitHub-hosted runners aren't supported. It is a feature on the roadmap (github/roadmap#72), but it isn't here today (nor any time soon), and for the reasons I stated previously I don't see GitHub's Actions offering being popular in that kind of environment any time soon.

What was the reason behind the current default config? I can't think of many reasons for wanting to fail the webhook on a label mismatch by default. If you did want that, you would have some sort of custom monitoring on the webhook, so it doesn't make sense to me to base a default setting on a hypothetical monitoring solution.

I can see that the list of GitHub-managed labels is specific to self-hosted runners, so this probably won't stop my errors, right? https://github.com/philips-labs/terraform-aws-github-runner/blob/master/modules/webhook/lambdas/webhook/src/webhook/handler.ts#L116

@toast-gear
Contributor Author

toast-gear commented Oct 29, 2021

@npalm I'm really keen to hear whether this behaviour will be changed or not. If not, I'll need to document the caveats on my end (which, realistically speaking, will probably not get read 😞) to help prevent confusion if other engineers touch the implementation at some point, as it won't be clear why the webhooks are "failing".

@npalm
Member

npalm commented Nov 1, 2021

@toast-gear put it on my list to check.

@toast-gear
Contributor Author

Thanks @npalm, excited to replace our existing, excessively custom solution with this excellent module!

@npalm
Member

npalm commented Nov 2, 2021

First off, returning a 403 is certainly not correct. It should be either an error in the 500 range or an OK in the 200 range.

The check was added for a couple of reasons:

  • To avoid scaling kicking in for jobs that should not be processed by the runners.
  • Since I introduced the check enabled by default, it seemed logical to return an error when the event is ignored.
  • To prepare for supporting different types of runners (configurations).

I think the best way forward is to:

  1. Return 200 (instead of 403), or even better a 202 as the GitHub docs suggest. And maybe add some extra info to the body in case the labels are not matching.
  2. Indeed change the default to disabled.
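The suggested 202-with-extra-info response might look like the sketch below. This is not the module's code: the function name is hypothetical, and the response shape is assumed from the usual API Gateway/Lambda proxy-integration convention (`statusCode` plus a JSON `body`).

```typescript
// Hypothetical sketch: build an "accepted but intentionally ignored"
// response when a workflow_job's labels don't match the configuration.
function ignoredJobResponse(jobLabels: string[], acceptedLabels: string[]) {
  return {
    statusCode: 202, // accepted by the endpoint, deliberately not processed
    body: JSON.stringify({
      message:
        `Ignored workflow_job: labels [${jobLabels.join(', ')}] ` +
        `do not match accepted labels [${acceptedLabels.join(', ')}]`,
    }),
  };
}
```

GitHub records the delivery as successful for any 2XX status, so this keeps the webhook green while still telling a human (via the response body in Recent Deliveries) why the event was skipped.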

@toast-gear
Contributor Author

To avoid scaling kicking in for jobs that should not be processed by the runners.

We certainly don't want to break this! Just prevent a non-2XX response code being returned to GitHub when the labels don't match.

@toast-gear
Contributor Author

toast-gear commented Nov 2, 2021

My apologies, but I think I've misunderstood what disable_check_wokflow_job_labels does. I've deployed my stacks again with it enabled, and now I scale for any job regardless of label.

Perhaps I should describe what I think the ideal behaviour is.

The runs-on labels (a.k.a. the labels attribute in the webhook payload) of the workflow_job event are checked by the webhook lambda when it receives the payload, after validating the payload's signature etc. If it finds a matching label, the job is queued onto the SQS queue. If it does not find a matching label, the job is ignored: it is not added to any SQS queue, the ignore is logged at the debug level, and an HTTP 2XX response code is returned.
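The flow described above could be sketched roughly as follows. This is an illustration, not the module's handler: `sendToQueue` and `logDebug` are hypothetical placeholders for the real SQS and logging calls, and I've assumed "a matching label" means at least one job label appears in the configured set.

```typescript
// Hypothetical sketch of the ideal workflow_job handling flow:
// match -> queue the job; no match -> debug-log and return 2XX.
function handleWorkflowJob(
  jobLabels: string[],
  acceptedLabels: string[],
  sendToQueue: (labels: string[]) => void, // stand-in for the SQS publish
  logDebug: (msg: string) => void,         // stand-in for the logger
): { statusCode: number } {
  const matches = jobLabels.some((l) => acceptedLabels.includes(l));
  if (matches) {
    sendToQueue(jobLabels); // job is bound for a self-hosted runner: queue it
    return { statusCode: 201 };
  }
  // No match: ignore quietly so the webhook stays green in the GitHub UI.
  logDebug(`Ignoring workflow_job with labels: ${jobLabels.join(', ')}`);
  return { statusCode: 200 };
}
```

With this shape, a delivery for `ubuntu-latest` against accepted labels `self-hosted,default` returns 200 without queueing anything, while a `self-hosted` job is queued and acknowledged.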

This just seems like the behaviour everyone would want, to me. The reason the workflow_job webhook was introduced was to expose the runs-on labels in a webhook event, to allow better scaling for projects like terraform-aws-github-runner and actions-runner-controller.

Now that I've redeployed the stacks with disable_check_wokflow_job_labels enabled, I do not think having it enabled by default is right, as it will mean excessive scaling for 99% of implementations: I believe the vast majority of organisations running self-hosted runners are complementing them with GitHub's hosted runners. I can tell you the reason my organisation uses self-hosted runners is so we can use role-based authentication in AWS rather than access/secret keys. We don't have an issue with people using GitHub-hosted runners if they don't require AWS authentication in their workflow. I think this is going to be the most common setup, so the solution should default to settings suited to mixed-fleet environments with optimal scaling, i.e. only scale the appropriate runners for workflows whose labels demand it.

Back to streams of 403 errors then :(

[screenshot: Recent Deliveries showing repeated 403 responses]

https://github.com/philips-labs/terraform-aws-github-runner/blob/develop/modules/webhook/lambdas/webhook/src/webhook/handler.ts#L59-L61 The change is easy enough to do: return 200, change the log level, and adjust the log message slightly to say the job was ignored. But I wouldn't know what I'm doing with the tests, so I wouldn't want to raise the PR myself.
