feat: make it easier to reason about health check failures #374
Conversation
Force-pushed from 5bb31e2 to 63d83ff
@makkes this looks great! Can you please test this with more objects? Let's say one with a wrong API version, one that passes, and another one that fails due to image not found or a scheduling timeout. Thanks!
Another spec I tested:
Events:
The "succeeds" Deployment is there because its status is "Unknown". I believe the reason for that is that I changed the loop to exit early as soon as an error occurs. I'll see whether it would make sense to store all errors in a slice instead.
So it looks like we can capture only one error. Can you please try with 2 failed deployments and one OK one?
Yep, I'm on it and am also trying to find a neat way to catch all errors. But I need to dig into cli-utils a little more.
https://pkg.go.dev/k8s.io/apimachinery/pkg/util/errors may be of help here to collect an aggregation of errors. |
Force-pushed from 4301524 to 14da055
@stefanprodan this is how it looks with two failing Deployments and one succeeding:
As you can see, it shows all 3 of them as timed out, even though the 'default/dep' one is always healthy. It looks like the timing out of one resource's check leads to errors being reported for all of them. I will test out one more idea for more precise reporting.
Force-pushed from d494266 to 6304f61
Ok, this version works with 1 successful and 2 failing ones:
Force-pushed from 6304f61 to 5bb78ad
@stefanprodan @hiddeco ready for another round of reviews, please. 🙂
Force-pushed from 5bb78ad to f73621b
Whenever a health check times out, the most recently collected error for each resource is now printed as part of the error message. Resources for which no error was reported in the last update are excluded, because when a timeout occurs an error is reported on ALL resources, even those that were previously seen as healthy. Also, this commit omits all successfully checked resources from the error event. Signed-off-by: Max Jonas Werner <mail@makk.es>
Force-pushed from f73621b to bbc4208
LGTM
Thanks @makkes 🏅 I really like that we also show the status condition reason 👌
Huh, I'll need to take a look to find a scenario where that might happen.
@makkes why is the map made of pointers? We should change it to values or check for nil before each operation.
@stefanprodan the main reason is that
But I agree we should just skip nils. I'll put up a PR.
Thanks @makkes
The `lastStatus` map now stores `event.ResourceStatus` values instead of pointers to them, preventing accidental nil pointer dereferences. Refs fluxcd#374. Signed-off-by: Max Jonas Werner <mail@makk.es>
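The difference that commit exploits can be shown with a minimal sketch (the `status` type here is a made-up stand-in, not `event.ResourceStatus`): a map of pointers can hand back nil, while a map of struct values always yields a usable zero value.

```go
package main

import "fmt"

type status struct{ message string }

func main() {
	// With pointer values, a missing or explicitly-nil entry can be
	// dereferenced by accident and panic at runtime:
	ptrs := map[string]*status{"a": nil}
	if s := ptrs["a"]; s != nil { // a nil check is mandatory before use
		fmt.Println(s.message)
	}

	// With struct values, a lookup always yields a usable zero value,
	// so no dereference can panic; the ok flag signals absence instead.
	vals := map[string]status{}
	s, ok := vals["a"]
	fmt.Println(s.message, ok) // zero-value message, ok is false
}
```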
When checking the health status of each declared resource, kstatus might return nil for certain resources (for whatever reason). In that case, this information is now conveyed in the health status event. fluxcd#374 Signed-off-by: Max Jonas Werner <mail@makk.es>
Whenever a health check times out, the most recently collected error for each resource is now printed as part of the error message. Resources for which no error was reported in the last update are excluded, because when a timeout occurs an error is reported on ALL resources, even those that were previously seen as healthy.
Also, this commit omits all successfully checked resources from the error event.
This is how it will look in the API:
Before:
After:
Signed-off-by: Max Jonas Werner mail@makk.es