
Ready checks are incompatible with Service usage and PDB usage #916

Closed
nkvoll opened this issue May 23, 2019 · 8 comments · Fixed by #1775
Labels: discuss (We need to figure this out)

nkvoll (Member) commented May 23, 2019

Ready checks are currently used to decide whether a Service should include a pod, so pods are considered ready as soon as they can respond to HTTP requests.

Once we start to use PodDisruptionBudgets, readiness also means that the pod is not considered disrupted. In that scenario, the safest bet is to report ready only when the cluster state is green. Yellow might be tolerated in some scenarios (e.g. a one-node cluster), but it seems this should be configurable by the user.

These two uses are not immediately compatible, but since readiness is the only way to indicate to the k8s admins that the cluster is disrupted, that use takes precedence.

To facilitate this, we can tolerate unready pods in the Service. This causes a problem: requests may be forwarded to pods that are not ready to receive requests at all. To fix this, we can use some kind of new ready label on the pods in the Service selector. This label may be controlled by the operator (e.g. pinging individual nodes every X seconds), effectively moving the service-related readiness probe to the operator and separating it from the PDB-related readiness check.
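A minimal sketch of how that operator-managed label could work, assuming a hypothetical label key and a plain-HTTP ping (TLS, authentication and retry handling elided); this is not the actual implementation:

```go
package readiness

import (
	"context"
	"fmt"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeReadyLabel is a hypothetical label the operator would manage instead of
// relying on the kubelet readiness probe for Service membership.
const nodeReadyLabel = "elasticsearch.k8s.elastic.co/node-ready"

// updateReadyLabel pings the Elasticsearch HTTP port of a single pod and
// records the result as a label, which the Service selector would match on.
func updateReadyLabel(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	resp, err := http.Get(fmt.Sprintf("http://%s:9200/", pod.Status.PodIP)) // TLS/auth elided for brevity
	ready := err == nil && resp.StatusCode == http.StatusOK
	if resp != nil {
		resp.Body.Close()
	}
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	pod.Labels[nodeReadyLabel] = fmt.Sprintf("%t", ready)
	return c.Update(ctx, pod)
}
```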

Relates to: #914

sebgl added the discuss label Jul 17, 2019
pebrc (Collaborator) commented Aug 13, 2019

This label may be controlled by the operator (e.g. pinging individual nodes every X seconds)

One disadvantage of this approach is that unavailable pods might be listed as endpoints of the Service, and available pods might be missing from the endpoints, if the operator is not running. The current implementation allows clusters to run (and fail) independently of the failure domain of the operator (unless structural changes are being applied to the cluster via the operator, in which case they are somewhat tied together).

anyasabo (Contributor) commented Aug 13, 2019

Just to be sure I understand the concerns correctly:

Current ready check:

  • Does the pod respond to http requests

Both the pod disruption budget and the Service use the pod's ready status to decide whether the pod is disrupted and whether it is ready to receive requests.

So if the pod gets disrupted, it will be marked as ready as soon as it is up and responding to requests. This might cause a problem: the cluster may still be yellow and recovering (say it is replicating from a primary to get back in sync since it was last up), but from the pod disruption budget's perspective everything looks fine, so it may allow disruption of the pod hosting the primary shard, which will make the cluster even sadder.

If we change the readiness check to both:

  • the pod responds to http requests
  • the cluster is green

Then essentially one pod going down (which would turn the cluster yellow) marks the whole cluster as not ready and removes them all from the service, which is Not Good.

Would it maybe make sense to mark a node as not ready if it is a target node for any active recoveries? I think that might address this in a more idiomatic way, unless there are other reasons for non-readiness I might be missing.


My initial thought was to maybe make use of the ready++ readiness gates enhancement:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0007-pod-ready%2B%2B.md

where we could define multiple readiness gates, but we kind of want the opposite (ready--?) if I understand correctly: we want a subset of readiness checks for Service readiness, but all of them for disruption budget purposes.
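For reference, a readiness gate as described in that KEP would look roughly like this on the pod spec (the condition type below is purely hypothetical); the operator would then have to patch the matching pod status condition itself:

```go
package podspec

import corev1 "k8s.io/api/core/v1"

// withClusterHealthGate adds a hypothetical readiness gate so that the pod is
// only considered Ready once the operator sets the matching status condition,
// in addition to the regular container readiness probe.
func withClusterHealthGate(spec corev1.PodSpec) corev1.PodSpec {
	spec.ReadinessGates = append(spec.ReadinessGates, corev1.PodReadinessGate{
		ConditionType: "elasticsearch.k8s.elastic.co/cluster-health-green", // hypothetical condition type
	})
	return spec
}
```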

pebrc (Collaborator) commented Aug 16, 2019

Then essentially one pod going down (which would turn the cluster yellow) marks the whole cluster as not ready and removes them all from the service, which is Not Good.

I think one detail that was implied in the OP, and not clear to me at first, is that if we were to follow this approach and bind the default readiness check to cluster health, we would of course need to make Service endpoint membership independent of that check by using publishNotReadyAddresses: true. In that case, Service membership would rely only on the labelling of the pods, which we would then use in the selector for the Service.

(the failure domain argument still applies though)
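A minimal sketch of such a Service, assuming the hypothetical operator-managed readiness label from the OP; with publishNotReadyAddresses the endpoints ignore pod readiness entirely, so the label selector becomes the only gate for membership:

```go
package service

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// httpService builds the HTTP Service for a cluster so that membership is
// driven by labels rather than by pod readiness.
func httpService(clusterName, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: clusterName + "-es-http", Namespace: namespace},
		Spec: corev1.ServiceSpec{
			// Keep not-ready pods in the endpoints; readiness then only
			// reflects cluster health for PDB purposes.
			PublishNotReadyAddresses: true,
			Selector: map[string]string{
				"elasticsearch.k8s.elastic.co/cluster-name": clusterName,
				"elasticsearch.k8s.elastic.co/node-ready":   "true", // hypothetical operator-managed label
			},
			Ports: []corev1.ServicePort{{Name: "https", Port: 9200}},
		},
	}
}
```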

anyasabo (Contributor) commented
Ah, got it. I'm curious in what states specifically the node would still be replying to our readiness check as-is (replying to the / endpoint with a 200) while the cluster is actually not ready to have another pod disrupted.

The one I could think of was if there were active recoveries going to that node. If we were to disrupt the node hosting the shard it is recovering from, and there is only one replica (the one being recovered to), we may turn the cluster red.

But on the other hand, recoveries can happen in normal operations too, so I'm not sure taking a node out of the service just because it is receiving a recovery is necessarily the right choice either.

sebgl (Contributor) commented Sep 19, 2019

Based on https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors, it looks like we cannot use maxUnavailable when selecting pods with a non-built-in controller label (e.g. not using a Deployment or StatefulSet label selector).
IIUC that is because the PDB controller would not be able to figure out the "expected" number of pods allowed to run, which makes sense.

So we end up with either:

  1. one PDB per StatefulSet. Not an option for us since we can have N StatefulSets as part of the cluster.
  2. one PDB for all pods, but then we must use minAvailable instead of maxUnavailable.

Working with option 2, the operator needs to reconcile the correct value for minAvailable in the PDB, based on the number of nodes in the cluster.
I think we can do it this way, in the reconciliation loop:

  • If the cluster is green, set minAvailable to the number of nodes minus 1 (allowing one pod to be disrupted)
  • If the cluster is not green, set minAvailable to the number of nodes

We already poll for cluster health information regularly, and reconcile if it changes. This way we don't have to change anything in our health checks and Service usage.
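A sketch of that reconciliation logic, assuming policy/v1beta1 and a cluster-name label selector (the names and the health plumbing are illustrative, not the actual code):

```go
package pdb

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// desiredMinAvailable allows exactly one voluntary disruption when the cluster
// is green, and none otherwise.
func desiredMinAvailable(nodeCount int, clusterGreen bool) int {
	if clusterGreen && nodeCount > 0 {
		return nodeCount - 1
	}
	return nodeCount
}

// expectedPDB builds the single cluster-wide PodDisruptionBudget using
// minAvailable, selecting all pods of the cluster by a cluster-name label.
func expectedPDB(clusterName, namespace string, nodeCount int, clusterGreen bool) *policyv1beta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(desiredMinAvailable(nodeCount, clusterGreen))
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: clusterName + "-es-default", Namespace: namespace}, // hypothetical naming
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"elasticsearch.k8s.elastic.co/cluster-name": clusterName},
			},
		},
	}
}
```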

Now there's of course a race condition here where the PDB would be set for a green cluster whereas the cluster is (in real time) yellow. If the disruption happens before we reconcile the PDB, there's a risk we lose an important node. But I think that's also true when using readiness checks anyway, since they are only executed every X seconds.

Since the PDB would be somewhat "dynamic", we cannot offer to have it specified by the user anymore. Instead we could make it optional (managePDB: false), so users can create their own outside the scope of the operator.

Thoughts?

anyasabo (Contributor) commented
I like your option 2 to just use minAvailable, @sebgl, though as mentioned it's not a direct function of the replicas in the spec, but rather a function of the current pods we should have (say we are migrating to a new StatefulSet), so it is more complicated than it appears at first.

Since the PDB would be somewhat "dynamic", we cannot offer to have it specified by the user anymore. Instead we could make it optional (managePDB: false), so users can create their own outside the scope of the operator.

I think this can be left as an option to add later. This seems like a relatively advanced use case and I'm not sure it's something we need to expose just yet.

pebrc (Collaborator) commented Sep 19, 2019

I think we have a decision at this point to:

  • change the readiness checks to use a node-local API, so that pod readiness is no longer conflated with cluster availability
  • make the PDB use minAvailable, make it dynamic and bound to cluster health in the way @sebgl has described, thereby sidestepping the problem in the OP

sebgl self-assigned this Sep 20, 2019
sebgl (Contributor) commented Sep 20, 2019

PodDisruptionBudgets are immutable in k8s < 1.15 (PR that adds mutability in 1.15).
This means we'll have to delete and then recreate the PDB (at least on k8s < 1.15), which leaves a small time window where things can go wrong.
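So on older clusters the reconciliation would have to look roughly like this (a sketch using the controller-runtime client, not the final implementation):

```go
package pdb

import (
	"context"
	"reflect"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcilePDB creates the expected PDB, or deletes and recreates it when the
// spec changed, since PDB updates are rejected on k8s < 1.15.
func reconcilePDB(ctx context.Context, c client.Client, expected *policyv1beta1.PodDisruptionBudget) error {
	var existing policyv1beta1.PodDisruptionBudget
	err := c.Get(ctx, types.NamespacedName{Name: expected.Name, Namespace: expected.Namespace}, &existing)
	if apierrors.IsNotFound(err) {
		return c.Create(ctx, expected)
	}
	if err != nil {
		return err
	}
	if reflect.DeepEqual(existing.Spec, expected.Spec) {
		return nil // nothing to do
	}
	// Spec changed: delete then recreate, accepting the small unprotected window.
	if err := c.Delete(ctx, &existing); err != nil {
		return err
	}
	return c.Create(ctx, expected)
}
```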
