Add healthz endpoint #107
db3a266 to 84539d3
@nilathedragon @Embraser01 Would you like to take a look?
I'm not a maintainer so I can't make the call on this, but I'd welcome the separate healthz endpoint.
Thanks for this! I didn't have time to check this problem yet.
port: 80
path: /healthz
I don't like the idea of having the health check on port 80 (which serves all requests). It would be better to add it either on the metrics endpoint or on a new HTTP server.
If you have it on the metrics endpoint, you wouldn't be able to turn off metrics without breaking the health check. So a separate port might be better.
IIRC the metrics server is always on, but the metrics handler is enabled only when needed. It should be fine to replace the "static_response" handler with a healthz handler.
https://github.com/caddyserver/ingress/blob/master/internal/caddy/global/metrics.go#L28 This line would currently disable the whole server if metrics are turned off. I think I encountered this when I tried disabling metrics and my readiness checks would keep failing.
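The concern above can be sketched in Go. This is only an illustration, not the project's actual code: the `route` type and `buildRoutes` function are hypothetical, and the route shapes are a simplified rendering of Caddy's JSON config (a `static_response` handler for `/healthz`, the `metrics` handler for `/metrics`). The idea is that the server always carries the health route, and only the metrics route depends on the metrics setting.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// route is a simplified, hypothetical stand-in for a Caddy HTTP route
// as it appears in Caddy's JSON config.
type route struct {
	Match  []map[string][]string    `json:"match,omitempty"`
	Handle []map[string]interface{} `json:"handle"`
}

// buildRoutes always registers /healthz and adds /metrics only when
// metrics are enabled, so turning metrics off cannot remove the health check.
func buildRoutes(metricsEnabled bool) []route {
	routes := []route{{
		Match: []map[string][]string{{"path": {"/healthz"}}},
		Handle: []map[string]interface{}{
			{"handler": "static_response", "status_code": 200},
		},
	}}
	if metricsEnabled {
		routes = append(routes, route{
			Match:  []map[string][]string{{"path": {"/metrics"}}},
			Handle: []map[string]interface{}{{"handler": "metrics"}},
		})
	}
	return routes
}

func main() {
	for _, enabled := range []bool{true, false} {
		out, err := json.Marshal(buildRoutes(enabled))
		if err != nil {
			panic(err)
		}
		fmt.Printf("metrics=%v: %s\n", enabled, out)
	}
}
```

With this shape, a readiness check against `/healthz` keeps working whether or not metrics are enabled, which is the behavior the comments above are asking for.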
Let me try to port it to the metrics endpoint and see if disabling the metrics server still works. I'll report back.
I briefly skimmed the code, and it's definitely doable. I will update the PR to accommodate the change.
I have a question though. Is it possible for the ingress_server server to be down while the metrics_server is still active? That would mean the controller shows as healthy but can't actually route any traffic.
@Embraser01 I've moved the /healthz endpoint to the metrics server.
I also added a test that validates the change by comparing the generated Caddy JSON config. Currently, it only has one base test case, and it doesn't seem straightforward to add more.
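Comparing generated JSON config against an expected document is usually easier when the comparison ignores key order and whitespace. A minimal sketch of that idea follows; the `jsonEqual` helper and the sample documents are hypothetical and are not taken from the PR's test.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// jsonEqual reports whether two JSON documents are semantically equal,
// ignoring object key order and whitespace.
func jsonEqual(a, b []byte) (bool, error) {
	var x, y interface{}
	if err := json.Unmarshal(a, &x); err != nil {
		return false, fmt.Errorf("first document: %w", err)
	}
	if err := json.Unmarshal(b, &y); err != nil {
		return false, fmt.Errorf("second document: %w", err)
	}
	return reflect.DeepEqual(x, y), nil
}

func main() {
	// Hypothetical generated and expected route configs for /healthz.
	generated := []byte(`{"handle":[{"handler":"static_response","status_code":200}],"match":[{"path":["/healthz"]}]}`)
	expected := []byte(`{
  "match": [{"path": ["/healthz"]}],
  "handle": [{"handler": "static_response", "status_code": 200}]
}`)

	same, err := jsonEqual(generated, expected)
	if err != nil {
		panic(err)
	}
	fmt.Println("configs match:", same)
}
```

A helper like this could make it cheaper to add more table-driven cases, since each case only needs an input and an expected JSON file.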
> I have a question though. Is it possible for the ingress_server server to be down while the metrics_server is still active? That would mean the controller shows as healthy but can't actually route any traffic.
I don't think that can happen. As far as I know, on config reload Caddy works in an "all or nothing" way: if the ingress_server has not started yet, metrics_server will not serve either. I'm not sure, though (@mholt should know more about this).
123ee43 to bb7ac2f
bb7ac2f to 550ed09
550ed09 to ae14620
@Embraser01 Can you help approve the workflow again? I fixed the build error for goreleaser.
@Embraser01 CI passed ✔️ Should we merge? Or let me know if you have extra feedback.
Yes, really sorry for the long delay.
With the current /metrics endpoint used for the readiness probe, the log is flooded with error logs. This is likely due to the large size of the metrics endpoint's response.
This PR tries to fix that by adding a healthz endpoint and using it for readinessProbe instead of the metrics endpoint.