feat: make it easier to reason about health check failures #374
Conversation
Force-pushed from 5bb31e2 to 63d83ff
@makkes this looks great! Can you please test this with more objects? Let's say one with a wrong API version, one that passes, and another one that fails due to image not found or a scheduling timeout. Thanks!
Another spec I tested:
Events:
The "succeeds" Deployment is there because its status is "Unknown". I believe the reason for that is that I changed the loop to exit early as soon as an error occurs. I'll see whether it would make sense to store all errors in a slice instead.
So it looks like we can capture only one error. Can you please try with 2 failed deployments and one OK one?
Yep, I'm on it and am also trying to find a neat way to catch all errors. But I need to dig into cli-utils a little more.
https://pkg.go.dev/k8s.io/apimachinery/pkg/util/errors may be of help here to collect an aggregation of errors. |
Force-pushed from 4301524 to 14da055
@stefanprodan this is how it looks with two failing Deployments and one succeeding:
As you can see, it shows all 3 of them as timed out, even though the 'default/dep' one is always healthy. It looks like the timing out of one resource's check leads to errors being reported for all of them. I will test out one more idea for more precise reporting.
Force-pushed from d494266 to 6304f61
Ok, this version works with 1 successful and 2 failing ones:
Force-pushed from 6304f61 to 5bb78ad
@stefanprodan @hiddeco ready for another round of reviews, please. 🙂
Force-pushed from 5bb78ad to f73621b
Whenever a health check times out, the most recently collected error for each resource is now printed as part of the error message. Resources for which no error was reported in the last update are excluded, because when a timeout occurs an error is reported on ALL resources, even those that were previously seen as healthy. Also, this commit omits all successfully checked resources from the error event. Signed-off-by: Max Jonas Werner <mail@makk.es>
Force-pushed from f73621b to bbc4208
LGTM
Thanks @makkes 🏅 I really like that we also show the status condition reason 👌
Huh, I'll need to take a look to find a scenario where that might happen.
@makkes why is the map made of pointers? We should change it to values or check for nil before each operation.
@stefanprodan the main reason is that
But I agree we should just skip nils. I'll put up a PR.
Thanks @makkes
The `lastStatus` map now stores `event.ResourceStatus` values instead of pointers to them, preventing accidental nil pointer dereferences. Refs fluxcd#374. Signed-off-by: Max Jonas Werner <mail@makk.es>
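The difference that commit exploits can be shown with a minimal sketch (the `status` type here is a made-up stand-in, not `event.ResourceStatus`): a map of pointers can hand back nil, while a map of struct values always yields a usable zero value.

```go
package main

import "fmt"

type status struct{ message string }

func main() {
	// With pointer values, a missing or explicitly-nil entry can be
	// dereferenced by accident and panic at runtime:
	ptrs := map[string]*status{"a": nil}
	if s := ptrs["a"]; s != nil { // a nil check is mandatory before use
		fmt.Println(s.message)
	}

	// With struct values, a lookup always yields a usable zero value,
	// so no dereference can panic; the ok flag signals absence instead.
	vals := map[string]status{}
	s, ok := vals["a"]
	fmt.Println(s.message, ok) // zero-value message, ok is false
}
```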
When checking the health status of each declared resource, kstatus might return nil for certain resources (for whatever reason). In that case, this information is now conveyed in the health status event. fluxcd#374 Signed-off-by: Max Jonas Werner <mail@makk.es>
Whenever a health check times out, the most recently collected error for each resource is now printed as part of the error message. Resources for which no error was reported in the last update are excluded, because when a timeout occurs an error is reported on ALL resources, even those that were previously seen as healthy.
Also, this commit omits all successfully checked resources from the error event.
This is how it will look in the API:
Before:
After:
Signed-off-by: Max Jonas Werner mail@makk.es