Resolve HPA DC scaling races #5597

Closed

deads2k opened this issue Nov 2, 2015 · 20 comments

@deads2k
Contributor

deads2k commented Nov 2, 2015

The HPA controller enabled in #5310 can race with the deployment config controller, producing unexpected scaling results.

The DC itself should have a desired number of replicas that the DC controller maintains to avoid (or significantly tighten) the race window.

@DirectXMan12 @ironcladlou

@ironcladlou
Contributor

The solution we previously discussed for this is to make the DeploymentConfigController responsible for reconciling a deployment's replica count with its associated config. The auto-scaler (or manual scaler) would target only the deploymentConfig. This allows the controller to safely decide whether to scale up now (if the latest deployment is complete) or later (after an in-progress deployment completes).
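
A rough sketch of what that reconciliation could look like (the types and the `reconcileReplicas` helper below are illustrative assumptions, not the real origin API): scalers only ever edit the DC, and the controller applies the desired count to the latest deployment only when nothing is in flight.

```go
package main

import "fmt"

// Illustrative stand-ins for the real API objects.
type DeploymentConfig struct {
	Name     string
	Replicas int // desired count; the HPA or a manual scaler only ever edits this
}

type Deployment struct { // an RC created for one DC version
	Version  int
	Replicas int
	Complete bool
	Running  bool
}

// reconcileReplicas applies dc.Replicas to the latest deployment, but only
// when no deployment is in flight; otherwise it defers to a later sync.
func reconcileReplicas(dc *DeploymentConfig, latest *Deployment) {
	if latest == nil || latest.Running {
		// The deployer currently owns scaling; wait for the next sync.
		return
	}
	if latest.Complete && latest.Replicas != dc.Replicas {
		fmt.Printf("scaling %s-%d from %d to %d\n",
			dc.Name, latest.Version, latest.Replicas, dc.Replicas)
		latest.Replicas = dc.Replicas
	}
}

func main() {
	dc := &DeploymentConfig{Name: "frontend", Replicas: 5}
	latest := &Deployment{Version: 2, Replicas: 3, Complete: true}
	reconcileReplicas(dc, latest) // scaling frontend-2 from 3 to 5
}
```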

This does introduce a fundamental change in the way we've been presenting DC scaling to users. Today, the DC replicas field is only used for new deployments, and subsequent deployments follow the replica count of the last successful deployment. With this change, a user modification to the DC replicas would actually scale up the deployment. I think this is probably better behavior, and brings us closer to the semantics of the new upstream deployment system, but it does warrant discussion because the current behavior is documented and has been unchanged since the introduction of deployments.

cc @smarterclayton @ncdc @Kargakis

@0xmichalis
Contributor

> With this change, a user modification to the DC replicas would actually scale up the deployment.

I find this reasonable.

@ncdc
Contributor

ncdc commented Nov 10, 2015

cc @danmcp @devop-mmcgrath @eparis

@DirectXMan12
Contributor

So, AFAICT, the actual scaling action for our DeploymentConfig implementation occurs in the deployer pod (except for the "cleanup failed deployment" part, which happens in the deployer pod controller), which dies when the deployment is complete. Since this plan would require adding another place that manipulates deployments, we have to be careful that we don't accidentally introduce any more race conditions (my initial conclusion is that this shouldn't be a problem, but it's something to keep in mind).

(For comparison, the upstream Deployment implementation is controlled by the Deployment controller, as opposed to our DeploymentConfig setup, which has 3 controllers and the deployer pod.)

@ironcladlou
Contributor

The DeploymentConfigController might also need to take care of ensuring that deployments older than the latest complete deployment are scaled down to zero (to eventually correct mistaken scaleups concurrent with new deployment creation in an HA setup). I think that might close any remaining gaps, but I won't be surprised if somebody thinks of other issues... 😁
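
A minimal sketch of that cleanup pass, using illustrative types rather than the real origin objects (the `scaleDownOld` helper is an assumption for the sake of the example): everything older than the latest complete deployment gets forced back to zero.

```go
package main

import "fmt"

// Illustrative stand-in for an RC-backed deployment.
type Deployment struct {
	Version  int
	Replicas int
	Complete bool
}

// scaleDownOld forces every deployment older than the latest complete one
// back to zero replicas, correcting scale-ups that raced with a new rollout.
func scaleDownOld(deployments []*Deployment) {
	latest := -1
	for _, d := range deployments {
		if d.Complete && d.Version > latest {
			latest = d.Version
		}
	}
	for _, d := range deployments {
		if d.Version < latest && d.Replicas != 0 {
			d.Replicas = 0
		}
	}
}

func main() {
	ds := []*Deployment{
		{Version: 1, Replicas: 2, Complete: true}, // stray scale-up
		{Version: 2, Replicas: 5, Complete: true}, // latest complete
	}
	scaleDownOld(ds)
	fmt.Println(ds[0].Replicas, ds[1].Replicas) // 0 5
}
```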

@ironcladlou
Contributor

Trello card: https://trello.com/c/CCv7OpxE

@smarterclayton
Contributor

Can you document how this proposal differs from/aligns with the upstream DC today (specifically, what upstream allows w.r.t. changes in scale, and what the upstream controller will do to old RCs)?

@ironcladlou
Contributor

A quick note since the terms are different upstream: In OpenShift, "deployments" are replication controllers, and the template for a deployment is a "deploymentConfig". In Kubernetes, the template for a deployment is called a "deployment", and the thing that's deployed is just a replication controller.

Upstream RC scaling is managed entirely by the deployment controller. The autoscaler targets the deployment (aka the deploymentConfig in OpenShift today). When the controller is reconciling a given deployment, a change in the replica count between reconciliation iterations is handled gracefully: each iteration, the controller scales up what it thinks is currently the new or existing RC and scales down all old RCs. During the next iteration it will work with the latest system state. Replica count updates to the deployment which occur between iterations will be picked up and reconciled next time around.
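
Roughly, each iteration looks like this (a simplified sketch, not the actual upstream controller code; the RC/Deployment shapes and the per-step increment are assumptions):

```go
package main

import "fmt"

// Simplified stand-ins for the upstream objects.
type RC struct {
	Name     string
	Hash     string // hash of the pod template this RC was created from
	Replicas int
}

type Deployment struct {
	TemplateHash string
	Replicas     int
}

// reconcile performs one iteration: scale the RC matching the current
// template toward the desired count and scale every other RC toward zero.
func reconcile(d Deployment, rcs []*RC, step int) {
	for _, rc := range rcs {
		if rc.Hash == d.TemplateHash {
			rc.Replicas = minInt(rc.Replicas+step, d.Replicas)
		} else if rc.Replicas > 0 {
			rc.Replicas = maxInt(rc.Replicas-step, 0)
		}
	}
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	d := Deployment{TemplateHash: "v2", Replicas: 3}
	rcs := []*RC{
		{Name: "rc-v1", Hash: "v1", Replicas: 3},
		{Name: "rc-v2", Hash: "v2", Replicas: 0},
	}
	for i := 0; i < 3; i++ {
		// Replica-count edits made between iterations are simply picked up
		// on the next pass.
		reconcile(d, rcs, 1)
	}
	fmt.Println(rcs[0].Replicas, rcs[1].Replicas) // 0 3
}
```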

If the deployment template changes in such a way that a parallel controller (in an HA setup) starts scaling up a new RC and scaling down "old RCs" while another controller is still scaling up what it thought was the new RC (but which is now old), they would only fight for a single iteration, and the system state should eventually converge to the current template and replica count.

The change here aligns us a little more closely with that model in that it allows the HPA to target deploymentConfigs rather than RCs to prevent fighting. The way we implement that capability is much different from upstream's, so there's not much overlap in how we facilitate that targeting, but conceptually it's a step forward: we'll eventually want to be able to do a direct replacement of deploymentConfigs with deployments and not have to revisit the HPA targeting mechanics.

Hopefully this is a clear enough explanation. I recommend everybody take a look at the upstream deployment controller to help understand this and fact-check me. It's extremely simple compared to what we have in origin: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/deployment/deployment_controller.go

@ironcladlou
Contributor

I outlined another upstream alignment wrt scaling here: #5597 (comment)

@ironcladlou
Contributor

We'll also need to fix the CLI and console so they adjust the deploymentConfig.spec.replicas directly when RCs are owned by a deploymentConfig. cc @jwforres @spadgett

@ironcladlou
Contributor

The scale subresource handling will also need to be updated to scale just the deploymentConfig. The CLI and console should use the subresource instead of manipulating the DC directly, I think.

@jwforres
Member

I would prefer that. Do you have an example of what we would post to the scale subresource?
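
For reference, a minimal sketch of the sort of Scale document a scale subresource accepts: a small object whose spec carries only the desired replica count. The apiVersion and field names below are assumptions modeled on the Kubernetes-style Scale shape; the exact group/version used for deploymentConfigs is not confirmed by this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Scale mirrors the Kubernetes-style Scale shape: metadata plus a spec
// carrying only the desired replica count.
type Scale struct {
	Kind       string `json:"kind"`
	APIVersion string `json:"apiVersion"`
	Metadata   struct {
		Name      string `json:"name"`
		Namespace string `json:"namespace"`
	} `json:"metadata"`
	Spec struct {
		Replicas int `json:"replicas"`
	} `json:"spec"`
}

func main() {
	var s Scale
	s.Kind = "Scale"
	s.APIVersion = "extensions/v1beta1" // assumption; the group/version for DCs may differ
	s.Metadata.Name = "frontend"
	s.Metadata.Namespace = "myproject"
	s.Spec.Replicas = 5

	// This body would be sent to the deploymentConfig's scale subresource
	// instead of patching the DC directly.
	body, _ := json.MarshalIndent(s, "", "  ")
	fmt.Println(string(body))
}
```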

@ironcladlou
Contributor

Talking with @Kargakis, I found a pretty big problem stemming from behavior I had forgotten about, which is failure handling/automatic "rollback". Given:

  1. Deploy RC.version=1 (success)
  2. Deploy RC.version=2 (success)
  3. Deploy RC.version=3 (failed)

OpenShift will:

  1. Scale down RC.version=3 to 0
  2. Scale up RC.version=2 to its previous replica count

This is a problem because now we have a discrepancy that will be very complicated for the controllers to resolve: the "latest" deployment according to the DC is RC.version=3, but the "effective" deployment is RC.version=2. So now scaling operations are going to inadvertently target RC.version=3 due to the DC version not matching the "effective" deployment version.

Here's how this works upstream:

  1. Deploy RC state A (scales up completely, "success")
  2. Deploy RC state B (scales up completely, "success")
  3. Deploy RC state C (scale never achieved, condition is such that this needs to be rectified)

Although upstream has no auto-rectify behavior yet, when it does, it'll be very simple: the deployment will be updated to state B, and the controller will now identify the existing RC with state B as the "new" RC and all other RCs (including RC state C) as "old", scale up B and scale down A and C.

I think the way we're handling it in OpenShift has, in hindsight, painted us into a corner due to the complexity introduced by the inconsistency between DC.version and what's considered the effective RC. Given our use of sequential version numbers, I think we probably should have done this instead:

  1. Roll back the DC template to RC.version=2's template
  2. Deploy RC.version=4

Thinking back on why we did it the current way, I think it was to avoid a new deployer process (@smarterclayton or @abhgupta might remember). I think we might be better off just letting a new deployment happen rather than trying to deal with the idea of the active deployment being some arbitrary RC backwards in time.

@smarterclayton
Contributor

Why would scaling operations target 3?

@ironcladlou
Contributor

> Why would scaling operations target 3?

The scaling operations will target the DC, which is at version=3. Normally we'd then use that as the basis for finding the latest RC to scale. But with the current model, we must now have a special case where we go look for the "latest successful deployment", which could be any version.

That might work, but my confidence level in covering all edge cases is pretty low at this very moment. There are so many places where we assume that the latest deployment at a point in time is the current DC.latestVersion.
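
A minimal sketch of that special case (illustrative types only; `effectiveDeployment` is a hypothetical helper): resolve the scaling target by walking back to the newest complete deployment rather than trusting DC.latestVersion.

```go
package main

import "fmt"

// Illustrative stand-in for an RC-backed deployment.
type Deployment struct {
	Version  int
	Complete bool
}

// effectiveDeployment returns the newest complete deployment, which can be
// several versions behind DC.latestVersion after a failed rollout.
func effectiveDeployment(deployments []*Deployment) *Deployment {
	var best *Deployment
	for _, d := range deployments {
		if d.Complete && (best == nil || d.Version > best.Version) {
			best = d
		}
	}
	return best
}

func main() {
	ds := []*Deployment{
		{Version: 1, Complete: true},
		{Version: 2, Complete: true},
		{Version: 3, Complete: false}, // failed; DC.latestVersion is still 3
	}
	fmt.Println(effectiveDeployment(ds).Version) // prints 2, the "effective" deployment
}
```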

@smarterclayton
Contributor

If we have code that is assuming latest = active, we have a bigger problem.

The scaler sets the DC.

The deployment controller should be able to know that a scale event happened and target the right RC (it's the controller responsible for knowing what the latest RC is). What am I missing?

@ironcladlou
Contributor

The deployer pod controller is also going to be trying to scale RCs in parallel, competing with the deployment controller.

If we want to keep the current general failure handling behavior, I think the way we implement it needs to be reworked so that only the deployment config controller is responsible for scaling outside the deployment process.

@smarterclayton
Contributor

> The deployer pod controller is also going to be trying to scale RCs in parallel, competing with the deployment controller.

The deployer has to back off, or the deployer pod controller has to know the deployer pod is in progress.

> If we want to keep the current general failure handling behavior, I think the way we implement it needs to be reworked so that only the deployment config controller is responsible for scaling outside the deployment process.

I'm not convinced of this. The deployment controller owns RCs and has final say on target scaling. When the deployment controller has given control to the deployer pod, there is no ambiguity about that. The deployer needs to wait until the deployer pod is complete, or update the pod target annotation. You can always heal an RC that is not under active control.

@ironcladlou
Contributor

What I mean is, the logic that's currently in the deployer pod controller needs to be moved into the deployment config controller, so that the deployer pod controller is responsible only for looking at the outcome of a deployer pod and updating the deployment phase. The deployment config controller should be looking at the deployments it owns and making all the scaling decisions.

@smarterclayton
Contributor

Yes. I had a quick discussion with Brian today about potentially not having "failed" deployments per se. I still want it for custom deploy strategies, but maybe for rolling we just always try, and have a condition on the deployment config that says "hasn't reached terminal state in X minutes, warning".
