Cortex components should have readiness check endpoints #784
Comments
By health checks do you mean liveness? I don't think it should be taken as given that liveness checks are always a good idea. I've had plenty of experiences where they've caused a pod to crash-loop and made it harder to figure out what's broken.

The basic premise of liveness is that the pod enters a failure state it cannot exit. Cortex shouldn't do this: it has retries, reconnect loops, etc. If it does get into a failure state, then replication should route requests elsewhere and we should leave the pod running for later diagnosis.

Which leads us on to readiness checks: these are a good idea. The Ingester already has them, and we should add basic checks (to show the HTTP server is running, perhaps also ping the gRPC server) to the other components.

WDYT?
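For illustration, here is a minimal sketch of what such a readiness endpoint could look like in Go. This is an assumption-laden example, not Cortex's actual wiring: the `/ready` path, the ports, and the handler name are made up, and it assumes the component's gRPC server registers the standard `grpc.health.v1` health service.

```go
package main

import (
	"context"
	"net/http"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health/grpc_health_v1"
)

// readyHandler returns 200 while the HTTP server is up and the co-located
// gRPC server (hypothetical address) answers the standard health-check RPC
// with SERVING.
func readyHandler(grpcAddr string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		conn, err := grpc.DialContext(ctx, grpcAddr, grpc.WithInsecure(), grpc.WithBlock())
		if err != nil {
			http.Error(w, "gRPC server unreachable", http.StatusServiceUnavailable)
			return
		}
		defer conn.Close()

		resp, err := grpc_health_v1.NewHealthClient(conn).Check(ctx, &grpc_health_v1.HealthCheckRequest{})
		if err != nil || resp.Status != grpc_health_v1.HealthCheckResponse_SERVING {
			http.Error(w, "gRPC server not serving", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	// Ports are illustrative only.
	http.HandleFunc("/ready", readyHandler("localhost:9095"))
	http.ListenAndServe(":8080", nil)
}
```

In practice you would dial the gRPC server once at startup and reuse the connection rather than dialing on every probe; it is done inline here only to keep the sketch self-contained.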
I agree that liveness checks are a bit heavy-handed much of the time, and it would be great to be able to diagnose those failures, or to improve Cortex to handle the failure mode more gracefully. I think readiness checks would be ideal: that way the failing component won't be served any requests. The one exception is the ruler, since it has no HA yet; there, a liveness check might be nice to reduce customer impact.
Just had an instance where the distributor HTTP server stopped accepting requests; adding a liveness check would have detected this and restarted it. Also, weaveworks/common#92 would hopefully surface the error and exit gracefully.
Liveness is bad, as referenced here: https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html. But we should do readiness in distributors.
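For the readiness ping against the gRPC server mentioned above to work, the server side has to expose something to query. Here is a minimal sketch, assuming the component runs a stock `google.golang.org/grpc` server (the port is illustrative): registering the off-the-shelf health service answers `grpc.health.v1.Health/Check`, which the readiness handler sketched earlier can call.

```go
package main

import (
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	"google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	lis, err := net.Listen("tcp", ":9095") // illustrative port
	if err != nil {
		panic(err)
	}

	srv := grpc.NewServer()

	// The stock health server implements grpc.health.v1.Health; flip the
	// status to SERVING once the component is ready to take traffic.
	hs := health.NewServer()
	grpc_health_v1.RegisterHealthServer(srv, hs)
	hs.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)

	srv.Serve(lis)
}
```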
Liveness is not always bad, though it may or may not be correct for Cortex components. As mentioned above, there are cases where a liveness check could help as well.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Closed in #2166 |
In order to facilitate Kubernetes self-healing, it would be great to have health checks present in the Cortex components.
It would be great to collect some ideas here of what good health checks look like for the different services.
I am not working on this currently, but may in the future.