Resolve HPA DC scaling races #5597
The solution we previously discussed for this is to make the DeploymentConfigController responsible for reconciling a deployment's replica count with its associated config. The auto-scaler (or manual scaler) would target only the deploymentConfig. This allows the controller to safely decide whether to scale up now (if the latest deployment is complete) or later (after an in-progress deployment completes).

This does introduce a fundamental change in the way we've been presenting DC scaling to users. Today, the DC's replica count applies only to new deployments, and subsequent deployments follow the replica count of the last successful deployment. With this change, a user modification to the DC replicas would actually scale the current deployment. I think this is probably better behavior, and it brings us closer to the semantics of the new upstream deployment system, but it does warrant discussion since the current behavior is documented and has been unchanged since the introduction of deployments.
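The "scale now or later" decision described above could be sketched roughly like this (a Python sketch with illustrative names, not Origin's actual Go API):

```python
# Hypothetical model of the proposed DC controller behavior: the scaler
# only writes the deploymentConfig's replica count, and the controller
# decides when it is safe to apply that count to the latest deployment.

class Deployment:
    """Stands in for the RC backing a deployment."""
    def __init__(self, version, complete, replicas):
        self.version = version
        self.complete = complete
        self.replicas = replicas

def reconcile_replicas(dc_replicas, latest):
    """Apply the DC's desired replica count to the latest deployment
    only if that deployment is complete; otherwise defer until the
    in-progress deployment finishes. Returns True if applied now."""
    if not latest.complete:
        return False  # scale later, on a future reconcile pass
    latest.replicas = dc_replicas
    return True
```

The key property is that scaling a DC mid-deployment is never lost, only deferred: a later reconcile pass applies it once the deployment completes.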
I find this reasonable.
So, AFAICT, the actual scaling action for our DeploymentConfig implementation occurs in the deployer pod (except for the "cleanup failed deployment" part, which happens in the deployer pod controller), which dies when the deployment is complete. Since this plan would add another place that manipulates deployments, we have to be careful not to accidentally introduce more race conditions (my initial conclusion is that this shouldn't be a problem, but it's something to keep in mind). (For comparison, the upstream Deployment implementation is controlled by the Deployment controller alone, as opposed to our DeploymentConfig setup, which has three controllers plus the deployer pod.)
The DeploymentConfigController might also need to take care of ensuring that deployments older than the latest complete deployment are scaled down to zero (to eventually correct mistaken scale-ups concurrent with new deployment creation in an HA setup). I think that might close any remaining gaps, but I won't be surprised if somebody thinks of other issues... 😁
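The cleanup described above (a sketch with illustrative names, not Origin's actual code) amounts to scaling every deployment older than the latest complete one down to zero:

```python
# Sketch: correct mistaken concurrent scale-ups by zeroing out every
# deployment older than the latest complete deployment.

def scale_down_old(deployments):
    """deployments: list of dicts with 'version', 'complete', 'replicas'.
    Mutates replicas and returns the versions that were scaled to zero."""
    complete = [d for d in deployments if d["complete"]]
    if not complete:
        return []  # nothing has finished yet; leave everything alone
    latest = max(d["version"] for d in complete)
    scaled = []
    for d in deployments:
        if d["version"] < latest and d["replicas"] > 0:
            d["replicas"] = 0
            scaled.append(d["version"])
    return scaled
```

Because this runs on every reconcile pass, a stray scale-up of an old RC by a lagging HA peer is eventually undone rather than persisting.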
Trello card: https://trello.com/c/CCv7OpxE
Can you document how this proposal differs/aligns with the upstream DC?
A quick note since the terms are different upstream: in OpenShift, "deployments" are replication controllers, and the template for a deployment is a "deploymentConfig". In Kubernetes, the template for a deployment is called a "deployment", and the thing that's deployed is just a replication controller.

Upstream, RC scaling is managed entirely by the deployment controller. The autoscaler targets the deployment (aka the deploymentConfig in OpenShift today). When the controller is reconciling a given deployment, a change in the replica count between reconciliation iterations is handled gracefully: each iteration, the controller scales up what it thinks is currently the new or existing RC and scales down all old RCs. During the next iteration it works with the latest system state, so replica count updates to the deployment which occur between iterations are picked up and reconciled next time around. If the deployment template changes in such a way that a parallel controller (in an HA setup) starts scaling up a new RC and scaling down "old RCs" which overlap with another controller still scaling up what it thought was the new RC (but which is now old), they would only fight for a single iteration, and the system state should eventually converge to the current template/replica count.

The change here aligns us a little more closely with that model in that it allows the HPA to target deploymentConfigs rather than RCs to prevent fighting. The way we implement that capability is quite different from upstream, so there's not much overlap in how we facilitate that targeting, but conceptually it's a step forward: we'll eventually want to be able to do a direct replacement of deploymentConfigs with deployments without revisiting the HPA targeting mechanics.

Hopefully this is a clear enough explanation. I recommend everybody take a look at the upstream deployment controller to help understand and fact-check me.
It's extremely simple compared to what we have in origin: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/deployment/deployment_controller.go
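The upstream loop described above can be condensed to a few lines (a Python sketch with illustrative names; the real controller is Go and tracks RCs by pod-template hash):

```python
# Sketch of one upstream reconcile iteration: scale up whichever RC
# matches the current template, scale every other RC down. Because each
# iteration re-reads the latest template, two HA controllers that
# momentarily disagree converge on the next pass.

def reconcile_once(template_hash, desired, rcs):
    """rcs: dict mapping template hash -> replica count (mutated in place)."""
    for h in list(rcs):
        if h == template_hash:
            rcs[h] = desired  # the "new" RC gets the desired count
        else:
            rcs[h] = 0        # every "old" RC is scaled down
    rcs.setdefault(template_hash, desired)  # create the new RC if absent
    return rcs
```

Note there is no per-deployment state machine here: convergence falls out of re-running the same idempotent step against fresh cluster state.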
I outlined another upstream alignment wrt scaling here: #5597 (comment)
The scale subresource handling will also need to be updated to scale just the deploymentConfig. The CLI and console should use the subresource instead of manipulating the DC directly, I think.
I would prefer that. Do you have an example of what we would post to the subresource?
Talking with @Kargakis, I found a pretty big problem stemming from behavior I had forgotten about, which is failure handling/automatic "rollback". Given:
OpenShift will:
This is a problem because now we have a discrepancy that will be very complicated for the controllers to resolve: the "latest" deployment according to the DC is RC.version=3, but the "effective" deployment is RC.version=2. So now scaling operations are going to inadvertently target RC.version=3 due to the DC version not matching the "effective" deployment version. Here's how this works upstream:
Although upstream has no auto-rectify behavior yet, when it does, it'll be very simple: the deployment will be updated to state B, and the controller will then identify the existing RC with state B as the "new" RC and all other RCs (including the RC with state C) as "old", scale up B, and scale down A and C. I think the way we're handling it in OpenShift has, in hindsight, painted us into a corner due to the complexity introduced by the inconsistency between DC.version and what's considered the effective RC. I think given our use of sequential version numbers, we probably should have done this instead:
Thinking back on why we did it the current way, I think it was to avoid a new deployer process (@smarterclayton or @abhgupta might remember). I think we might be better off just letting a new deployment happen than trying to deal with the idea of the active deployment being some arbitrary RC backwards in time.
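The upstream rectify behavior described above needs no special rollback path, because RCs are classified by template match rather than by version number. A sketch (illustrative names, not the real Go code):

```python
# Sketch: after the deployment template is reverted to state B, the
# controller simply re-labels whichever existing RC matches B as "new"
# and treats all other RCs (A and C included) as "old".

def classify(template, rcs):
    """rcs: list of (name, template) pairs.
    Returns (new_rc_name_or_None, old_rc_names)."""
    new = None
    old = []
    for name, t in rcs:
        if t == template and new is None:
            new = name  # an existing RC already embodies the template
        else:
            old.append(name)
    return new, old
```

Scaling up `new` and scaling down everything in `old` is then just the ordinary reconcile step, which is exactly why the upstream model avoids the DC.version/effective-RC discrepancy.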
Why would scaling operations target 3?
The scaling operations will target the DC, which is at version=3. Normally we'd then use that as the basis of finding the latest RC for scaling. But with the current model, we must now have a special case where we go look for the "latest successful deployment", which could be any version. That might work, but my confidence level in covering all edge cases is pretty low at this very moment. There are so many places where we assume that the latest deployment at a point in time is the current DC.latestVersion.
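The special case amounts to a backwards search: the "effective" deployment is the highest-numbered successful one, which may be lower than DC.latestVersion. A sketch (illustrative names):

```python
# Sketch: with automatic rollback, scaling cannot assume the effective
# RC is DC.latestVersion; it must find the newest Complete deployment.

def effective_deployment(deployments):
    """deployments: {version: status}. Returns the highest-numbered
    'Complete' version, or None if nothing has succeeded yet."""
    complete = [v for v, s in deployments.items() if s == "Complete"]
    return max(complete) if complete else None
```

In the scenario above, with versions {1: Complete, 2: Complete, 3: Failed}, this returns 2 while DC.latestVersion is 3, which is precisely the mismatch every "latest = current" assumption in the codebase would trip over.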
If we have code that is assuming latest = active, we have a bigger problem. The scaler sets the DC. The deployment controller should be able to know that a scale event occurred.
The deployer pod controller is also going to be trying to scale RCs in parallel, competing with the deployment controller. If we want to keep the current general failure handling behavior, I think the way we implement it needs to be reworked so that only the deployment config controller is responsible for scaling outside the deployment process.
The deployer has to back off, or the deployer pod controller has to know to defer. I'm not convinced of this. The deployment controller owns RCs and has the authority to scale them.
What I mean is, the logic that's currently in the deployer pod controller needs to be moved into the deployment config controller, so that the deployer pod controller is responsible only for looking at the outcome of a deployer pod and updating the deployment phase. The deployment config controller should be looking at the deployments it owns and making all the scaling decisions.
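The division of labor proposed above is narrow enough to state in one function per controller (a sketch with illustrative names):

```python
# Sketch of the proposed split: the deployer pod controller only
# translates deployer pod outcomes into deployment phases and never
# scales; all scaling lives in the deployment config controller.

def deployer_pod_controller(pod_phase):
    """Map a deployer pod outcome to a deployment phase. No scaling here."""
    return {"Succeeded": "Complete", "Failed": "Failed"}.get(pod_phase, "Running")
```

With only one writer of replica counts, the deployer pod controller and deployment config controller can no longer race each other on scaling.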
Yes. I had a quick discussion with Brian today about potentially not …
The HPA controller enabled in #5310 can race with the deployment config controller, creating unexpected scaling results.
The DC itself should have a desired number of replicas that the DC controller maintains to avoid (or significantly tighten) the race window.
@DirectXMan12 @ironcladlou