Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add healthz endpoint #107

Merged
merged 2 commits into from
Dec 6, 2022
Merged

Conversation

xinbinhuang
Copy link
Contributor

@xinbinhuang xinbinhuang commented Sep 9, 2022

With the current /metrics endpoint for the readiness probe. The log is flooded with error log

{"level":"error","ts":1662690783.7314105,"logger":"http.handlers.metrics","msg":"error encoding and sending metric family:write tcp 10.42.55.171:9765->192.168.0.217:55420: write: connection reset by peer"}
{"level":"error","ts":1662690783.7314105,"logger":"http.handlers.metrics","msg":"error encoding and sending metric family:write tcp 10.42.55.171:9765->192.168.0.217:55420: write: connection reset by peer"}
 {"level":"error","ts":1662690813.7306309,"logger":"http.handlers.metrics","msg":"error encoding and sending metric family:write tcp 10.42.55.171:9765->192.168.0.217:56800: write: broken pipe"}

This's likely due to the large size of the metrics endpoint.

This PR try to fix that:

  • Add a lightweight healthz endpoint and use that for readinessProbe instead of the metrics endpoint.
  • Also, fix dev workflow to make it actually working

@xinbinhuang xinbinhuang changed the title add healthz Add healthz endpoint Sep 9, 2022
@xinbinhuang
Copy link
Contributor Author

xinbinhuang commented Sep 9, 2022

@nilathedragon @Embraser01 Would you like to take a look?

@nilathedragon
Copy link
Contributor

I'm not a maintainer so I can't make the call on this, but I'd welcome the separate healthz endpoint.

Copy link
Member

@Embraser01 Embraser01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! Didn't had time to check this problem yet

Comment on lines 89 to 90
port: 80
path: /healthz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea of having the health check on the 80 port (which serve all requests). It would be better to add it either on the metric endpoint or on a new http server

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have it on the metrics endpoint, you wouldn't be able to turn off metrics without breaking the health check. So a separate port might be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC the metrics server is always on but the metrics handler is enabled only when needed. It should be fine to replace the "static_response" handler by an healthz handler

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/caddyserver/ingress/blob/master/internal/caddy/global/metrics.go#L28 This line would currently disable the whole server if metrics are turned off. I think I encountered this when I tried disabling metrics and my readiness checks would keep failing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try to port it to the metrics endpoint and see if disabling the metrics server still works. Will report back

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Briefly skim the code. It's definitely doable. I will update the PR to accomodate the change.

I have a question though. Would there be possibility where the ingress_server server is down, but the metrics_server is still active? This will translate to the controller showing as healthy but can't actually route any traffics.

Copy link
Contributor Author

@xinbinhuang xinbinhuang Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Embraser01 I've moved the /healthz to the metrics server.

And add a test to validate the change by comparing the generated Caddy JSON config. Currently, it only has one base test case, and doesn't seem straightforward to add more

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question though. Would there be possibility where the ingress_server server is down, but the metrics_server is still active? This will translate to the controller showing as healthy but can't actually route any traffics.

I don't think it can happen, at least I think on config reload, Caddy works on a "all or nothing" way where if the ingress_server is not yet started, metrics_server will not serve either although I'm not sure (@mholt should know more on this)

Embraser01
Embraser01 previously approved these changes Oct 12, 2022
@xinbinhuang
Copy link
Contributor Author

xinbinhuang commented Oct 26, 2022

@Embraser01 Can you help approve the workflow again? I fixed the build error for goreleaser.

@xinbinhuang
Copy link
Contributor Author

@Embraser01 CI passed ✔️ Should we merge? Or let me know if you have extra feedback.

@Embraser01 Embraser01 merged commit 6e28cb2 into caddyserver:master Dec 6, 2022
@Embraser01
Copy link
Member

Yes, really sorry for the long delay

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants