Resolve HPA DC scaling races #5597

Closed

deads2k opened this issue Nov 2, 2015 · 20 comments

@deads2k
Contributor

deads2k commented Nov 2, 2015

The HPA controller enabled in #5310 can race with the deployment config controller, producing unexpected scaling results.

The DC itself should have a desired number of replicas that the DC controller maintains to avoid (or significantly tighten) the race window.

@DirectXMan12 @ironcladlou

@ironcladlou
Contributor

The solution we previously discussed for this is to make the DeploymentConfigController responsible for reconciling a deployment's replica count with its associated config. The auto-scaler (or manual scaler) would target only the deploymentConfig. This allows the controller to safely decide whether to scale up now (if the latest deployment is complete) or later (after an in-progress deployment completes).
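
A rough sketch of what that reconciliation could look like (the types and the `reconcileReplicas` helper below are illustrative assumptions, not the real origin API): scalers only ever edit the DC, and the controller applies the desired count to the latest deployment only when nothing is in flight.

```go
package main

import "fmt"

// Illustrative stand-ins for the real API objects.
type DeploymentConfig struct {
	Name     string
	Replicas int // desired count; the HPA or a manual scaler only ever edits this
}

type Deployment struct { // an RC created for one DC version
	Version  int
	Replicas int
	Complete bool
	Running  bool
}

// reconcileReplicas applies dc.Replicas to the latest deployment, but only
// when no deployment is in flight; otherwise it defers to a later sync.
func reconcileReplicas(dc *DeploymentConfig, latest *Deployment) {
	if latest == nil || latest.Running {
		// The deployer currently owns scaling; wait for the next sync.
		return
	}
	if latest.Complete && latest.Replicas != dc.Replicas {
		fmt.Printf("scaling %s-%d from %d to %d\n",
			dc.Name, latest.Version, latest.Replicas, dc.Replicas)
		latest.Replicas = dc.Replicas
	}
}

func main() {
	dc := &DeploymentConfig{Name: "frontend", Replicas: 5}
	latest := &Deployment{Version: 2, Replicas: 3, Complete: true}
	reconcileReplicas(dc, latest) // scaling frontend-2 from 3 to 5
}
```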

This does introduce a fundamental change in the way we've been presenting DC scaling to users. Today, the DC replicas field is only used for new deployments, and subsequent deployments follow the replica count of the last successful deployment. With this change, a user modification to the DC replicas would actually scale up the deployment. I think this is probably better behavior, and brings us closer to the semantics of the new upstream deployment system, but it does warrant discussion because the current behavior is documented and has been unchanged since the introduction of deployments.

cc @smarterclayton @ncdc @Kargakis

@0xmichalis
Contributor

> With this change, a user modification to the DC replicas would actually scale up the deployment.

I find this reasonable.

@ncdc
Contributor

ncdc commented Nov 10, 2015

cc @danmcp @devop-mmcgrath @eparis

@DirectXMan12
Contributor

So, AFAICT, the actual scaling action for our DeploymentConfig implementation occurs in the deployer pod (except for the "cleanup failed deployment" part, which happens in the deployer pod controller), which dies when the deployment is complete. Since this plan would require adding another place that manipulates deployments, we have to be careful that we don't accidentally introduce any more race conditions (my initial conclusion is that this shouldn't be a problem, but it's something to keep in mind).

(For comparison, the upstream Deployment implementation is controlled by the Deployment controller, as opposed to our DeploymentConfig setup, which has 3 controllers and the deployer pod.)

@ironcladlou
Contributor

The DeploymentConfigController might also need to take care of ensuring that deployments older than the latest complete deployment are scaled down to zero (to eventually correct mistaken scaleups concurrent with new deployment creation in an HA setup). I think that might close any remaining gaps, but I won't be surprised if somebody thinks of other issues... 😁
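
A minimal sketch of that cleanup pass, using illustrative types rather than the real origin objects (the `scaleDownOld` helper is an assumption for the sake of the example): everything older than the latest complete deployment gets forced back to zero.

```go
package main

import "fmt"

// Illustrative stand-in for an RC-backed deployment.
type Deployment struct {
	Version  int
	Replicas int
	Complete bool
}

// scaleDownOld forces every deployment older than the latest complete one
// back to zero replicas, correcting scale-ups that raced with a new rollout.
func scaleDownOld(deployments []*Deployment) {
	latest := -1
	for _, d := range deployments {
		if d.Complete && d.Version > latest {
			latest = d.Version
		}
	}
	for _, d := range deployments {
		if d.Version < latest && d.Replicas != 0 {
			d.Replicas = 0
		}
	}
}

func main() {
	ds := []*Deployment{
		{Version: 1, Replicas: 2, Complete: true}, // stray scale-up
		{Version: 2, Replicas: 5, Complete: true}, // latest complete
	}
	scaleDownOld(ds)
	fmt.Println(ds[0].Replicas, ds[1].Replicas) // 0 5
}
```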

@ironcladlou
Contributor

Trello card: https://trello.com/c/CCv7OpxE

@smarterclayton
Contributor

Can you document how this proposal differs from/aligns with the upstream DC today (specifically, what upstream allows w.r.t. changes in scale, and what the upstream controller will do to old RCs)?

@ironcladlou
Contributor

A quick note since the terms are different upstream: In OpenShift, "deployments" are replication controllers, and the template for a deployment is a "deploymentConfig". In Kubernetes, the template for a deployment is called a "deployment", and the thing that's deployed is just a replication controller.

Upstream RC scaling is managed entirely by the deployment controller. The autoscaler targets the deployment (aka the deploymentConfig in OpenShift today). When the controller is reconciling a given deployment, a change in the replica count between reconciliation iterations is handled gracefully: each iteration, the controller scales up what it thinks is currently the new or existing RC and scales down all old RCs. During the next iteration it will work with the latest system state. Replica count updates to the deployment which occur between iterations will be picked up and reconciled next time around.
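
Roughly, each iteration looks like this (a simplified sketch, not the actual upstream controller code; the RC/Deployment shapes and the per-step increment are assumptions):

```go
package main

import "fmt"

// Simplified stand-ins for the upstream objects.
type RC struct {
	Name     string
	Hash     string // hash of the pod template this RC was created from
	Replicas int
}

type Deployment struct {
	TemplateHash string
	Replicas     int
}

// reconcile performs one iteration: scale the RC matching the current
// template toward the desired count and scale every other RC toward zero.
func reconcile(d Deployment, rcs []*RC, step int) {
	for _, rc := range rcs {
		if rc.Hash == d.TemplateHash {
			rc.Replicas = minInt(rc.Replicas+step, d.Replicas)
		} else if rc.Replicas > 0 {
			rc.Replicas = maxInt(rc.Replicas-step, 0)
		}
	}
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	d := Deployment{TemplateHash: "v2", Replicas: 3}
	rcs := []*RC{
		{Name: "rc-v1", Hash: "v1", Replicas: 3},
		{Name: "rc-v2", Hash: "v2", Replicas: 0},
	}
	for i := 0; i < 3; i++ {
		// Replica-count edits made between iterations are simply picked up
		// on the next pass.
		reconcile(d, rcs, 1)
	}
	fmt.Println(rcs[0].Replicas, rcs[1].Replicas) // 0 3
}
```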

If the deployment template changes in such a way that a parallel controller (in an HA setup) starts scaling up a new RC and scaling down "old RCs" while another controller is still scaling up what it thought was the new RC (but which is now old), they would only fight for a single iteration, and the system state should eventually converge to the current template and replica count.

The change here aligns us a little more closely with that model in that it allows the HPA to target deploymentConfigs rather than RCs to prevent fighting. The way we implement that capability is much different from upstream's, so there's not much overlap in how we facilitate that targeting, but conceptually it's a step forward: we'll eventually want to be able to do a direct replacement of deploymentConfigs with deployments and not have to revisit the HPA targeting mechanics.

Hopefully this is a clear enough explanation. I recommend everybody take a look at the upstream deployment controller to help understand this and fact-check me. It's extremely simple compared to what we have in origin: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/deployment/deployment_controller.go

@ironcladlou
Contributor

I outlined another upstream alignment wrt scaling here: #5597 (comment)

@ironcladlou
Contributor

We'll also need to fix the CLI and console so they adjust the deploymentConfig.spec.replicas directly when RCs are owned by a deploymentConfig. cc @jwforres @spadgett

@ironcladlou
Contributor

The scale subresource handling will also need to be updated to scale just the deploymentConfig. The CLI and console should use the subresource instead of manipulating the DC directly, I think.

@jwforres
Member

I would prefer that. Do you have an example of what we would post to the scale subresource?
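
For reference, a minimal sketch of the sort of Scale document a scale subresource accepts: a small object whose spec carries only the desired replica count. The apiVersion and field names below are assumptions modeled on the Kubernetes-style Scale shape; the exact group/version used for deploymentConfigs is not confirmed by this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Scale mirrors the Kubernetes-style Scale shape: metadata plus a spec
// carrying only the desired replica count.
type Scale struct {
	Kind       string `json:"kind"`
	APIVersion string `json:"apiVersion"`
	Metadata   struct {
		Name      string `json:"name"`
		Namespace string `json:"namespace"`
	} `json:"metadata"`
	Spec struct {
		Replicas int `json:"replicas"`
	} `json:"spec"`
}

func main() {
	var s Scale
	s.Kind = "Scale"
	s.APIVersion = "extensions/v1beta1" // assumption; the group/version for DCs may differ
	s.Metadata.Name = "frontend"
	s.Metadata.Namespace = "myproject"
	s.Spec.Replicas = 5

	// This body would be sent to the deploymentConfig's scale subresource
	// instead of patching the DC directly.
	body, _ := json.MarshalIndent(s, "", "  ")
	fmt.Println(string(body))
}
```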

@ironcladlou
Contributor

Talking with @Kargakis, I found a pretty big problem stemming from behavior I had forgotten about, which is failure handling/automatic "rollback". Given:

  1. Deploy RC.version=1 (success)
  2. Deploy RC.version=2 (success)
  3. Deploy RC.version=3 (failed)

OpenShift will:

  1. Scale down RC.version=3 to 0
  2. Scale up RC.version=2 to its previous replica count

This is a problem because now we have a discrepancy that will be very complicated for the controllers to resolve: the "latest" deployment according to the DC is RC.version=3, but the "effective" deployment is RC.version=2. So now scaling operations are going to inadvertently target RC.version=3 due to the DC version not matching the "effective" deployment version.

Here's how this works upstream:

  1. Deploy RC state A (scales up completely, "success")
  2. Deploy RC state B (scales up completely, "success")
  3. Deploy RC state C (scale never achieved, condition is such that this needs to be rectified)

Although upstream has no auto-rectify behavior yet, when it does, it'll be very simple: the deployment will be updated to state B, and the controller will now identify the existing RC with state B as the "new" RC and all other RCs (including RC state C) as "old", scale up B and scale down A and C.

I think the way we're handling it in OpenShift has, in hindsight, painted us into a corner due to the complexity introduced by the inconsistency between DC.version and what's considered the effective RC. Given our use of sequential version numbers, I think we probably should have done this instead:

  1. Roll back the DC template to RC.version=2's template
  2. Deploy RC.version=4

Thinking back on why we did it the current way, I think it was to avoid a new deployer process (@smarterclayton or @abhgupta might remember). I think we might be better off just letting a new deployment happen rather than trying to deal with the idea of the active deployment being some arbitrary RC backwards in time.

@smarterclayton
Contributor

Why would scaling operations target 3?

@ironcladlou
Contributor

> Why would scaling operations target 3?

The scaling operations will target the DC, which is at version=3. Normally we'd then use that as the basis for finding the latest RC to scale. But with the current model, we must now have a special case where we go look for the "latest successful deployment", which could be any version.

That might work, but my confidence level in covering all edge cases is pretty low at this very moment. There are so many places where we assume that the latest deployment at a point in time is the current DC.latestVersion.
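
A minimal sketch of that special case (illustrative types only; `effectiveDeployment` is a hypothetical helper): resolve the scaling target by walking back to the newest complete deployment rather than trusting DC.latestVersion.

```go
package main

import "fmt"

// Illustrative stand-in for an RC-backed deployment.
type Deployment struct {
	Version  int
	Complete bool
}

// effectiveDeployment returns the newest complete deployment, which can be
// several versions behind DC.latestVersion after a failed rollout.
func effectiveDeployment(deployments []*Deployment) *Deployment {
	var best *Deployment
	for _, d := range deployments {
		if d.Complete && (best == nil || d.Version > best.Version) {
			best = d
		}
	}
	return best
}

func main() {
	ds := []*Deployment{
		{Version: 1, Complete: true},
		{Version: 2, Complete: true},
		{Version: 3, Complete: false}, // failed; DC.latestVersion is still 3
	}
	fmt.Println(effectiveDeployment(ds).Version) // prints 2, the "effective" deployment
}
```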

@smarterclayton
Contributor

If we have code that is assuming latest = active, we have a bigger problem.

The scaler sets the DC.

The deployment controller should be able to know that a scale event happened and target the right RC (it's the controller responsible for knowing what the latest RC is). What am I missing?

@ironcladlou
Contributor

The deployer pod controller is also going to be trying to scale RCs in parallel, competing with the deployment controller.

If we want to keep the current general failure handling behavior, I think the way we implement it needs to be reworked so that only the deployment config controller is responsible for scaling outside the deployment process.

@smarterclayton
Contributor

> The deployer pod controller is also going to be trying to scale RCs in parallel, competing with the deployment controller.

The deployer has to back off, or the deployer pod controller has to know the deployer pod is in progress.

> If we want to keep the current general failure handling behavior, I think the way we implement it needs to be reworked so that only the deployment config controller is responsible for scaling outside the deployment process.

I'm not convinced of this. The deployment controller owns RCs and has final say on target scaling. When the deployment controller has given control to the deployer pod, there is no ambiguity about that. The deployer needs to wait until the deployer pod is complete, or update the pod target annotation. You can always heal an RC that is not under active control.

@ironcladlou
Contributor

What I mean is, the logic that's currently in the deployer pod controller needs to be moved into the deployment config controller, so that the deployer pod controller is responsible only for looking at the outcome of a deployer pod and updating the deployment phase. The deployment config controller should be looking at the deployments it owns and making all the scaling decisions.

@smarterclayton
Contributor

Yes. I had a quick discussion with Brian today about potentially not having "failed" deployments per se. I still want it for custom deploy strategies, but maybe for rolling we just always try, and have a condition on the deployment config that says "hasn't reached terminal state in X minutes, warning".
