Ready checks are incompatible with Service usage and PDB usage #916
Comments
One disadvantage of this approach is that unavailable pods might be listed as endpoints of the service and available pods might be missing from the endpoints if the operator is not running. The current implementation allows clusters to run (and fail) independently from the failure domain of the operator (unless structural changes are being applied to the cluster via the operator, in which case they are somewhat tied together) |
Just to be sure I understand the concerns correctly: Current ready check:
Both the pod disruption budget and the service use the pod's ready status: to decide whether the pod is disrupted, and whether the pod is ready to receive requests. So if the pod gets disrupted, it will get marked as ready again as soon as it is up and responding to requests. This might cause a problem because the cluster may still be yellow and recovering (say it is replicating from a primary to get back in sync since it was last up), but the pod disruption budget thinks everything is fine and may disrupt the pod hosting the primary shard, which will make the cluster even sadder. If we change the readiness check to both:
Then essentially one pod going down (which would turn the cluster yellow) marks the whole cluster as not ready and removes all pods from the service, which is Not Good. Would it maybe make sense to mark a node as not ready if it is a target node for any active recoveries? I think that might address this in a more idiomatic way, unless there are other reasons for non-readiness I might be missing. My initial thought was to maybe make use of the ready++ readiness gates enhancement, where we could define multiple readiness gates, but we kind of want the opposite ( |
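For reference, a minimal sketch of what the Pod ready++ readiness gates look like on a pod spec; the condition type below is hypothetical and would have to be patched onto the pod status by the operator. Gates are AND-ed with the container readiness probe, so they can only add reasons for a pod to be not ready, never override a failing probe, which is why they don't quite fit here:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// withClusterHealthGate adds an extra readiness gate to a pod spec. The
// condition type is hypothetical: the operator would have to patch that
// condition onto the pod status itself. Gates are AND-ed with the container
// readiness probe, so they can only make a ready pod unready, not the reverse.
func withClusterHealthGate(spec corev1.PodSpec) corev1.PodSpec {
	spec.ReadinessGates = append(spec.ReadinessGates, corev1.PodReadinessGate{
		ConditionType: corev1.PodConditionType("example.com/es-cluster-health"),
	})
	return spec
}
```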
I think one detail that was implied in the OP and not clear to me is that if we were to follow this approach and bind the default readiness check to cluster health, we would of course need to make service endpoint membership independent of that check by using (the failure domain argument still applies though) |
Ah, got it. I'm curious in which states specifically the node would be replying to our readiness check as is (replying to the The one I could think of was if there were active recoveries going to that node. If we were to disrupt the node hosting the shard it is recovering from, then if there is only one replica (the one that is being recovered to) we may turn the cluster red. But on the other hand, recoveries happen during normal operations too, so I'm not sure taking a node out of the service just because it is receiving a recovery is necessarily the right choice either. |
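As an illustration only (not something the thread settles on), detecting whether a node is the target of any active recovery could look roughly like this, using the Elasticsearch `_recovery` API with `active_only=true`; the `esURL` parameter and function name are placeholders:

```go
package sketch

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// hasActiveRecoveriesTo reports whether any ongoing shard recovery targets the
// given node, based on the _recovery API. Authentication and TLS are omitted.
func hasActiveRecoveriesTo(esURL, nodeName string) (bool, error) {
	resp, err := http.Get(esURL + "/_recovery?active_only=true")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	// Response shape: { "<index>": { "shards": [ { "target": { "name": ... }, ... } ] } }
	var recoveries map[string]struct {
		Shards []struct {
			Target struct {
				Name string `json:"name"`
			} `json:"target"`
		} `json:"shards"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&recoveries); err != nil {
		return false, fmt.Errorf("decoding _recovery response: %w", err)
	}
	for _, idx := range recoveries {
		for _, shard := range idx.Shards {
			if shard.Target.Name == nodeName {
				return true, nil
			}
		}
	}
	return false, nil
}
```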
Based on https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors, it looks like we cannot use maxUnavailable or percentage values with an arbitrary selector, only an integer minAvailable. So we end up with either:
Working with option 2, the operator needs to reconcile the correct value for minAvailable.
We already poll for cluster health information regularly, and reconcile if it changes. This way we don't have to change anything in our health checks and service usage. Now there's of course a race condition here where the PDB would be set for a green cluster whereas the cluster is (in real time) yellow. If the disruption happens before we reconcile the PDB, there's a risk we lose an important node. But I think that's also true when using readiness checks anyway, since they are only executed every X seconds. Since the PDB would be somewhat "dynamic", we can no longer offer to have it specified by the user. Instead we could make it optional ( Thoughts? |
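A rough sketch of what option 2 could look like, assuming the operator maps observed cluster health to an integer minAvailable (the only form allowed for PDBs with arbitrary selectors); the names, labels, and health-to-count mapping below are illustrative, not the actual operator code:

```go
package sketch

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// desiredMinAvailable maps the observed Elasticsearch cluster health to a PDB
// minAvailable value. expectedPods is the number of pods the operator expects
// for the cluster (not necessarily spec replicas, see the migration caveat below).
func desiredMinAvailable(health string, expectedPods int32) intstr.IntOrString {
	switch health {
	case "green":
		// A green cluster can tolerate losing one node.
		return intstr.FromInt(int(expectedPods) - 1)
	default:
		// Yellow, red, or unknown: do not allow any voluntary disruption.
		return intstr.FromInt(int(expectedPods))
	}
}

// desiredPDB builds the PodDisruptionBudget the operator would reconcile,
// using the pre-1.15 policy/v1beta1 API group.
func desiredPDB(clusterName, namespace, health string, expectedPods int32) policyv1beta1.PodDisruptionBudget {
	minAvailable := desiredMinAvailable(health, expectedPods)
	return policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-es-default", // illustrative naming
			Namespace: namespace,
		},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{
					"elasticsearch.k8s.elastic.co/cluster-name": clusterName,
				},
			},
		},
	}
}
```

The absolute count (rather than a percentage) is what makes this work with an arbitrary label selector instead of a controller-owned one.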
I like your option 2 to just use minAvailable, @sebgl, though as mentioned it's not a direct function of the replicas in the spec, but rather a function of the current pods we should have (say we are migrating to a new sset), so it is more complicated than it appears at first.
I think this can be left as an option to add later. This seems like a relatively advanced use case and I'm not sure it's something we need to expose just yet. |
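To illustrate the point above about minAvailable being a function of the pods we currently expect rather than the spec replicas, the expected count could be derived from the StatefulSets the operator owns at the moment (a sketch under that assumption, not the actual implementation):

```go
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
)

// expectedPods sums the desired replicas of all StatefulSets currently owned
// by the cluster. During a migration from an old StatefulSet to a new one,
// this differs from the node count declared in the Elasticsearch spec.
func expectedPods(statefulSets []appsv1.StatefulSet) int32 {
	var total int32
	for _, sset := range statefulSets {
		if sset.Spec.Replicas != nil {
			total += *sset.Spec.Replicas
		}
	}
	return total
}
```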
I think we have a decision at this point to
|
PodDisruptionBudgets are immutable in k8s < 1.15 (PR that adds mutability in 1.15). |
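Given that immutability, updating the PDB on those versions means deleting and recreating it. A minimal controller-runtime sketch of what such a reconcile step could look like (error handling trimmed, names illustrative, not the actual operator code):

```go
package sketch

import (
	"context"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcilePDB makes the cluster's PDB match the desired spec. Because PDB
// specs are immutable before Kubernetes 1.15, any change requires a delete
// followed by a create instead of an in-place update.
func reconcilePDB(ctx context.Context, c client.Client, desired policyv1beta1.PodDisruptionBudget) error {
	var existing policyv1beta1.PodDisruptionBudget
	err := c.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, &existing)
	if apierrors.IsNotFound(err) {
		return c.Create(ctx, &desired)
	}
	if err != nil {
		return err
	}
	if existing.Spec.MinAvailable != nil && desired.Spec.MinAvailable != nil &&
		*existing.Spec.MinAvailable == *desired.Spec.MinAvailable {
		return nil // already up to date
	}
	if err := c.Delete(ctx, &existing); err != nil {
		return err
	}
	return c.Create(ctx, &desired)
}
```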
Ready checks are currently used to decide whether a `Service` should include the pods, so pods are considered ready as soon as they can respond to HTTP requests. Once we start to use `PodDisruptionBudget`s, readiness means that the pod is not considered disrupted. In this scenario, the safest bet is cluster state: green. Yellow might be tolerated in some scenarios (e.g. a 1-node cluster), but it seems this should be configurable by the user. These two are not immediately compatible, but since readiness is the only way to indicate to the k8s admins that the cluster is disrupted, this takes precedence.

To facilitate this, we can tolerate unready pods in the service. This causes a problem: requests may be forwarded to pods that are not ready to receive requests at all. To fix this, we can use some kind of new `ready` label on the pods in the Service selector. This label may be controlled by the operator (e.g. pinging individual nodes every X seconds), effectively moving the service-related readiness probe to the operator and separating it from the PDB-related readiness check.

Relates to: #914
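A sketch of what that could look like: the Service tolerates unready pods (here via publishNotReadyAddresses, which the issue does not name explicitly) and selects on an operator-managed readiness label; the label key and names are hypothetical:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// httpService selects pods via an operator-managed "ready" label instead of
// relying on the kubelet readiness probe alone, and tolerates unready pods so
// that endpoint membership no longer depends on the PDB-oriented readiness check.
func httpService(clusterName, namespace string) corev1.Service {
	return corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-es-http", // illustrative naming
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{
				"elasticsearch.k8s.elastic.co/cluster-name": clusterName,
				"example.com/node-ready":                    "true", // hypothetical operator-managed label
			},
			// One possible way to keep endpoints independent of the kubelet
			// readiness probe (an assumption, not stated in the issue).
			PublishNotReadyAddresses: true,
			Ports: []corev1.ServicePort{
				{Name: "https", Port: 9200},
			},
		},
	}
}
```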