
Ready checks are incompatible with Service usage and PDB usage #916

Closed
nkvoll opened this issue May 23, 2019 · 8 comments · Fixed by #1775
Labels: discuss (We need to figure this out)

nkvoll (Member) commented May 23, 2019

Ready checks are currently used to decide whether a Service should include a pod, so pods are considered ready as soon as they can respond to HTTP requests.

Once we start to use PodDisruptionBudgets, readiness also means that the pod is not considered disrupted. In that scenario, the safest bet is to report ready only when the cluster state is green. Yellow might be tolerated in some scenarios (e.g. a one-node cluster), but it seems this should be configurable by the user.

These two uses are not immediately compatible, but since readiness is the only way to indicate to the k8s admins that the cluster is disrupted, that use takes precedence.

To facilitate this, we can tolerate unready pods in the Service. This causes a problem: requests may be forwarded to pods that are not ready to receive requests at all. To fix this, we can use some kind of new ready label on the pods in the Service selector. This label may be controlled by the operator (e.g. pinging individual nodes every X seconds), effectively moving the service-related readiness probe to the operator and separating it from the PDB-related readiness check.
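A minimal sketch of how that operator-managed label could work, assuming a hypothetical label key and a plain-HTTP ping (TLS, authentication and retry handling elided); this is not the actual implementation:

```go
package readiness

import (
	"context"
	"fmt"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeReadyLabel is a hypothetical label the operator would manage instead of
// relying on the kubelet readiness probe for Service membership.
const nodeReadyLabel = "elasticsearch.k8s.elastic.co/node-ready"

// updateReadyLabel pings the Elasticsearch HTTP port of a single pod and
// records the result as a label, which the Service selector would match on.
func updateReadyLabel(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	resp, err := http.Get(fmt.Sprintf("http://%s:9200/", pod.Status.PodIP)) // TLS/auth elided for brevity
	ready := err == nil && resp.StatusCode == http.StatusOK
	if resp != nil {
		resp.Body.Close()
	}
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	pod.Labels[nodeReadyLabel] = fmt.Sprintf("%t", ready)
	return c.Update(ctx, pod)
}
```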

Relates to: #914

sebgl added the discuss label Jul 17, 2019
pebrc (Collaborator) commented Aug 13, 2019

This label may be controlled by the operator (e.g. pinging individual nodes every X seconds)

One disadvantage of this approach is that unavailable pods might be listed as endpoints of the Service, and available pods might be missing from the endpoints, if the operator is not running. The current implementation allows clusters to run (and fail) independently of the failure domain of the operator (unless structural changes are being applied to the cluster via the operator, in which case they are somewhat tied together).

anyasabo (Contributor) commented Aug 13, 2019

Just to be sure I understand the concerns correctly:

Current ready check:

  • Does the pod respond to http requests

Both the pod disruption budget and the Service use the pod's ready status to decide whether the pod is disrupted and whether it is ready to receive requests.

So if the pod gets disrupted, it will be marked as ready as soon as it is up and responding to requests. This might cause a problem: the cluster may still be yellow and recovering (say it is replicating from a primary to get back in sync since it was last up), but from the pod disruption budget's perspective everything looks fine, so it may allow disruption of the pod hosting the primary shard, which will make the cluster even sadder.

If we change the readiness check to both:

  • the pod responds to http requests
  • the cluster is green

Then essentially one pod going down (which would turn the cluster yellow) marks the whole cluster as not ready and removes them all from the service, which is Not Good.

Would it maybe make sense to mark a node as not ready if it is a target node for any active recoveries? I think that might address this in a more idiomatic way, unless there are other reasons for non-readiness I might be missing.


My initial thought was to maybe make use of the ready++ readiness gates enhancement:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0007-pod-ready%2B%2B.md

where we could define multiple readiness gates, but we kind of want the opposite (ready--?) if I understand correctly: we want a subset of readiness checks for Service readiness, but all of them for disruption budget purposes.
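For reference, a readiness gate as described in that KEP would look roughly like this on the pod spec (the condition type below is purely hypothetical); the operator would then have to patch the matching pod status condition itself:

```go
package podspec

import corev1 "k8s.io/api/core/v1"

// withClusterHealthGate adds a hypothetical readiness gate so that the pod is
// only considered Ready once the operator sets the matching status condition,
// in addition to the regular container readiness probe.
func withClusterHealthGate(spec corev1.PodSpec) corev1.PodSpec {
	spec.ReadinessGates = append(spec.ReadinessGates, corev1.PodReadinessGate{
		ConditionType: "elasticsearch.k8s.elastic.co/cluster-health-green", // hypothetical condition type
	})
	return spec
}
```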

pebrc (Collaborator) commented Aug 16, 2019

Then essentially one pod going down (which would turn the cluster yellow) marks the whole cluster as not ready and removes them all from the service, which is Not Good.

I think one detail that was implied in the OP, and not clear to me at first, is that if we were to follow this approach and bind the default readiness check to cluster health, we would of course need to make Service endpoint membership independent of that check by using publishNotReadyAddresses: true. In that case, Service membership would rely only on the labelling of the pods, which we would then use in the selector for the Service.

(the failure domain argument still applies though)
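A minimal sketch of such a Service, assuming the hypothetical operator-managed readiness label from the OP; with publishNotReadyAddresses the endpoints ignore pod readiness entirely, so the label selector becomes the only gate for membership:

```go
package service

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// httpService builds the HTTP Service for a cluster so that membership is
// driven by labels rather than by pod readiness.
func httpService(clusterName, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: clusterName + "-es-http", Namespace: namespace},
		Spec: corev1.ServiceSpec{
			// Keep not-ready pods in the endpoints; readiness then only
			// reflects cluster health for PDB purposes.
			PublishNotReadyAddresses: true,
			Selector: map[string]string{
				"elasticsearch.k8s.elastic.co/cluster-name": clusterName,
				"elasticsearch.k8s.elastic.co/node-ready":   "true", // hypothetical operator-managed label
			},
			Ports: []corev1.ServicePort{{Name: "https", Port: 9200}},
		},
	}
}
```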

anyasabo (Contributor) commented
Ah, got it. I'm curious in what states specifically the node would still be replying to our readiness check as-is (replying to the / endpoint with a 200) while the cluster is actually not ready to have another pod disrupted.

The one I could think of was if there were active recoveries going to that node. If we were to disrupt the node hosting the shard it is recovering from, and there is only one replica (the one being recovered to), we may turn the cluster red.

But on the other hand, recoveries can happen in normal operations too, so I'm not sure taking a node out of the service just because it is receiving a recovery is necessarily the right choice either.

sebgl (Contributor) commented Sep 19, 2019

Based on https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors, it looks like we cannot use maxUnavailable when selecting pods with a non-built-in controller label (e.g. not using a Deployment or StatefulSet label selector).
IIUC that is because the PDB controller would not be able to figure out the "expected" number of pods allowed to run, which makes sense.

So we end up with either:

  1. one PDB per StatefulSet. Not an option for us since we can have N StatefulSets as part of the cluster.
  2. one PDB for all pods, but then we must use minAvailable instead of maxUnavailable.

Working with option 2, the operator needs to reconcile the correct value for minAvailable in the PDB, based on the number of nodes in the cluster.
I think we can do it this way, in the reconciliation loop:

  • If the cluster is green, set minAvailable to the number of nodes minus 1 (allowing one pod to be disrupted)
  • If the cluster is not green, set minAvailable to the number of nodes

We already poll for cluster health information regularly, and reconcile if it changes. This way we don't have to change anything in our health checks and Service usage.
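A sketch of that reconciliation logic, assuming policy/v1beta1 and a cluster-name label selector (the names and the health plumbing are illustrative, not the actual code):

```go
package pdb

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// desiredMinAvailable allows exactly one voluntary disruption when the cluster
// is green, and none otherwise.
func desiredMinAvailable(nodeCount int, clusterGreen bool) int {
	if clusterGreen && nodeCount > 0 {
		return nodeCount - 1
	}
	return nodeCount
}

// expectedPDB builds the single cluster-wide PodDisruptionBudget using
// minAvailable, selecting all pods of the cluster by a cluster-name label.
func expectedPDB(clusterName, namespace string, nodeCount int, clusterGreen bool) *policyv1beta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(desiredMinAvailable(nodeCount, clusterGreen))
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: clusterName + "-es-default", Namespace: namespace}, // hypothetical naming
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"elasticsearch.k8s.elastic.co/cluster-name": clusterName},
			},
		},
	}
}
```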

Now there's of course a race condition here where the PDB would be set for a green cluster whereas the cluster is (in real time) yellow. If the disruption happens before we reconcile the PDB, there's a risk we lose an important node. But I think that's also true when using readiness checks anyway, since they are only executed every X seconds.

Since the PDB would be somewhat "dynamic", we cannot offer to have it specified by the user anymore. Instead we could make it optional (managePDB: false), so users can create their own outside the scope of the operator.

Thoughts?

anyasabo (Contributor) commented
I like your option 2 to just use minAvailable, @sebgl, though as mentioned it's not a direct function of the replicas in the spec, but rather a function of the current pods we should have (say we are migrating to a new StatefulSet), so it is more complicated than it appears at first.

Since the PDB would be somewhat "dynamic", we cannot offer to have it specified by the user anymore. Instead we could make it optional (managePDB: false), so users can create their own outside the scope of the operator.

I think this can be left as an option to add later. This seems like a relatively advanced use case and I'm not sure it's something we need to expose just yet.

pebrc (Collaborator) commented Sep 19, 2019

I think we have a decision at this point to:

  • change the readiness checks to use a node-local API, so that pod readiness is no longer conflated with cluster availability
  • make the PDB use minAvailable, make it dynamic and bound to cluster health in the way @sebgl has described, thereby sidestepping the problem in the OP

sebgl self-assigned this Sep 20, 2019
sebgl (Contributor) commented Sep 20, 2019

PodDisruptionBudgets are immutable in k8s < 1.15 (PR that adds mutability in 1.15).
This means we'll have to delete and then recreate the PDB (at least on k8s < 1.15), which leaves a small time window where things can go wrong.
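So on older clusters the reconciliation would have to look roughly like this (a sketch using the controller-runtime client, not the final implementation):

```go
package pdb

import (
	"context"
	"reflect"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcilePDB creates the expected PDB, or deletes and recreates it when the
// spec changed, since PDB updates are rejected on k8s < 1.15.
func reconcilePDB(ctx context.Context, c client.Client, expected *policyv1beta1.PodDisruptionBudget) error {
	var existing policyv1beta1.PodDisruptionBudget
	err := c.Get(ctx, types.NamespacedName{Name: expected.Name, Namespace: expected.Namespace}, &existing)
	if apierrors.IsNotFound(err) {
		return c.Create(ctx, expected)
	}
	if err != nil {
		return err
	}
	if reflect.DeepEqual(existing.Spec, expected.Spec) {
		return nil // nothing to do
	}
	// Spec changed: delete then recreate, accepting the small unprotected window.
	if err := c.Delete(ctx, &existing); err != nil {
		return err
	}
	return c.Create(ctx, expected)
}
```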
