Cortex components should have readiness check endpoints #784
Comments
By health checks do you mean liveness? I don't think it should be taken as given that liveness checks are always a good idea. I've had plenty of experiences where they've caused a pod to crash-loop and made it harder to figure out what's broken.

The basic premise of liveness is that the pod enters a failure state it cannot exit. Cortex shouldn't do this: it has retries, reconnect loops, etc. If it does get into a failure state, then replication should route requests elsewhere and we should leave the pod running for later diagnosis.

Which leads us on to readiness checks: these are a good idea. The Ingester already has them, and we should add basic checks (to show the HTTP server is running, perhaps also ping the gRPC server) to the other components.

WDYT?
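For illustration, here is a minimal sketch of what such a readiness endpoint could look like in Go. This is an assumption-laden example, not Cortex's actual wiring: the `/ready` path, the ports, and the handler name are made up, and it assumes the component's gRPC server registers the standard `grpc.health.v1` health service.

```go
package main

import (
	"context"
	"net/http"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health/grpc_health_v1"
)

// readyHandler returns 200 while the HTTP server is up and the co-located
// gRPC server (hypothetical address) answers the standard health-check RPC
// with SERVING.
func readyHandler(grpcAddr string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		conn, err := grpc.DialContext(ctx, grpcAddr, grpc.WithInsecure(), grpc.WithBlock())
		if err != nil {
			http.Error(w, "gRPC server unreachable", http.StatusServiceUnavailable)
			return
		}
		defer conn.Close()

		resp, err := grpc_health_v1.NewHealthClient(conn).Check(ctx, &grpc_health_v1.HealthCheckRequest{})
		if err != nil || resp.Status != grpc_health_v1.HealthCheckResponse_SERVING {
			http.Error(w, "gRPC server not serving", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	// Ports are illustrative only.
	http.HandleFunc("/ready", readyHandler("localhost:9095"))
	http.ListenAndServe(":8080", nil)
}
```

In practice you would dial the gRPC server once at startup and reuse the connection rather than dialing on every probe; it is done inline here only to keep the sketch self-contained.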
I agree that liveness checks are a bit heavy-handed much of the time, and it would be great to be able to diagnose those failures, or to improve Cortex to handle the failure mode more gracefully. I think readiness checks would be ideal: that way the failing component won't be served any requests. The one exception is the ruler, since it has no HA yet; there, a liveness check might be nice to reduce customer impact.
Just had an instance where the distributor HTTP server stopped accepting requests; adding a liveness check would have detected this and restarted it. Also, weaveworks/common#92 would hopefully surface the error and exit gracefully.
Liveness is bad, as referenced here: https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html. But we should do readiness in distributors.
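For the readiness ping against the gRPC server mentioned above to work, the server side has to expose something to query. Here is a minimal sketch, assuming the component runs a stock `google.golang.org/grpc` server (the port is illustrative): registering the off-the-shelf health service answers `grpc.health.v1.Health/Check`, which the readiness handler sketched earlier can call.

```go
package main

import (
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	"google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	lis, err := net.Listen("tcp", ":9095") // illustrative port
	if err != nil {
		panic(err)
	}

	srv := grpc.NewServer()

	// The stock health server implements grpc.health.v1.Health; flip the
	// status to SERVING once the component is ready to take traffic.
	hs := health.NewServer()
	grpc_health_v1.RegisterHealthServer(srv, hs)
	hs.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)

	srv.Serve(lis)
}
```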
Liveness is not always bad, though it may or may not be correct for Cortex components. As mentioned above, there are cases where a liveness check could help as well.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Closed in #2166 |
In order to facilitate Kubernetes self-healing, it would be great to have health checks present in the Cortex components.
It would be great to collect some ideas here of what good health checks look like for the different services.
I am not working on this currently, but may in the future.