CA scale-up delays on clusters with heavy scaling activity #5769
Comments
It looks like we performed scalability tests on CA 6 years ago: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/scalability_tests.md. Those tests, however, don't seem to measure the time needed to scale up nodes. Maybe it's time we did a scalability test to check how much time CA takes to scale up nodes. I guess the first thing to do here is to identify what the bottleneck is (or whether there is one at all).
I performed a scale test using #5820. Here's what I did:
Things you should know:
(If you want to know more details, please let me know.) To start the test, I just did:
Here are the results:
- In the middle of the test: 288 nodes
- After the test: 385 nodes (all pods are scheduled)

The scaleUp 99th percentile shoots up once the number of nodes crosses 300. It seems like the happy-path scenario works as expected, i.e., the 99th-percentile scale-up latency is less than 21.8 seconds. The results might change if we add more scheduling constraints (affinity, selectors, or taints/tolerations). Also note that the unschedulable pods are very similar to each other because they come from the same Deployment; CA is good at grouping similar pods (especially pods sharing the same owner) and scaling up for them.
Note that the actual number of unschedulable pods is 5k test pods plus extra DaemonSet pods for the new nodes.
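In case it helps anyone reproduce a similar test: the exact manifests and commands aren't shown above, but a minimal sketch of that kind of workload (thousands of near-identical pods from a single Deployment, no scheduling constraints) could look like the following. The name, image, and resource requests are assumptions on my part, not the actual test manifest.

```yaml
# Hypothetical test workload: 5k near-identical pods from one Deployment,
# with no scheduling constraints, so CA can treat them as "similar" pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ca-scale-test          # placeholder name
  namespace: default
spec:
  replicas: 5000
  selector:
    matchLabels:
      app: ca-scale-test
  template:
    metadata:
      labels:
        app: ca-scale-test
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:            # sized so each node only fits a handful of pods
            cpu: "500m"
            memory: 512Mi
```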
It might help to know what kind of workload you usually schedule (scheduling constraints like pod affinity, node affinity, node selectors, taints/tolerations, typical memory and CPU requests, etc.) and what kind of taints you use on your nodes, so that we can mimic that in the test workload and run scale tests based on it. Maybe we can reproduce the issue that way.
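For example, those details could be shared as a (purely hypothetical) pod spec like the one below; every label, taint key, and request value here is a placeholder, only meant to illustrate which fields matter for the test workload.

```yaml
# Hypothetical pod illustrating the kind of scheduling constraints worth
# sharing: tolerations, affinity, and resource requests.
apiVersion: v1
kind: Pod
metadata:
  name: example-constrained-pod      # placeholder
  labels:
    app: my-app                      # placeholder
spec:
  tolerations:
  - key: dedicated                   # placeholder taint key
    operator: Equal
    value: batch
    effect: NoSchedule
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: my-app
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
```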
Hey @vadasambar, thanks for running this test! It looks like the images are not displaying for most folks (they 404). We can definitely add more color on the workloads we are running; they are definitely not just straightforward, simple Deployments on this cluster. In the meantime, we would love to see the results in the images.
Adding the color Terry referenced: I think we are hitting an issue with a bunch of things at once, and the pure number of pods isn't our issue (as your test, and tests we have run ourselves, have shown). I think the reasons are:
Here is what we have found so far in our research:
```yaml
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - us-east-1d
```
@tzatwork my bad 🙇. I have re-uploaded the images. You should be able to see them now (confirmed by checking this issue in a private window). Thanks for bringing this to my notice! Let me know if you still can't see them.
@philnielsen I will check the links for 1. Point 2 is very interesting. In my test above there were 7 CA node groups, and each one of them had a max limit of 200 nodes. Is the node selector present on all the deployments, or is it a mixed bag? I would assume it to be a mixed bag.
Thank you for taking a look, we really appreciate your insights! Here is a multiple-node-group example: we have roughly ~30 apps/instance types with isolated node groups (in addition to several general node groups in the cluster that handle other deployments):

```
Name                               Type  InstanceType  Min  Max  Zone
c5.12xlarge.<ISOLATED_APP_NAME>-a  Node  c5.12xlarge   0    200  us-east-1a
c5.12xlarge.<ISOLATED_APP_NAME>-c  Node  c5.12xlarge   0    200  us-east-1c
c5.12xlarge.<ISOLATED_APP_NAME>-d  Node  c5.12xlarge   0    200  us-east-1d
```

The node selectors that put pods on these isolated node groups exist on most of the deployments (though I wouldn't say all), and those pods won't be grouped together as "similar" by CA because they are in different app groups.
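As a sketch of that pattern on the workload side (label keys, values, and sizes below are placeholders, not the real manifests), each isolated app pins itself to its own node group with a node selector, so CA has to scale the one matching dedicated group:

```yaml
# Illustrative only: a Deployment whose node selector restricts its pods to a
# dedicated node group, so other apps' pods can't share the scale-up.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: isolated-app                  # placeholder
spec:
  replicas: 50
  selector:
    matchLabels:
      app: isolated-app
  template:
    metadata:
      labels:
        app: isolated-app
    spec:
      nodeSelector:
        dedicated: isolated-app       # placeholder node-group label
        node.kubernetes.io/instance-type: c5.12xlarge
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "4"
            memory: 16Gi
```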
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: v1.26.1, though we've seen the same behavior in versions 1.24.1 and 1.27.1
What k8s version are you using (kubectl version)?:
1.24
What environment is this in?:
AWS, using kops
What did you expect to happen?:
On k8s clusters with heavy scaling activity, we would expect CA to be able to scale up in a timely manner to clear unschedulable pending pods.
What happened instead?:
There are times when we need CA to process 3k+ pending (unschedulable) pods, and we have seen significant delays in processing, sometimes up to 15 minutes before CA gets through the list and scales up nodes. We have several deployments in the cluster that often scale up and down by hundreds of pods.
During this time frame, looking at CA metrics, we noticed significantly increased latency overall, but especially in the scale-up function, as seen below (in seconds):
Below is a screenshot showing the delay in scale-up time. As mentioned above, you can see we peaked above 3k unschedulable pods with a lack of scaling activity during these periods. We suspect CA is struggling to churn through the list.
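For anyone looking at the same data: the latency above comes from CA's standard cluster_autoscaler_function_duration_seconds histogram. A sketch of a recording rule for the scaleUp 99th percentile is below, assuming the Prometheus Operator's PrometheusRule CRD; the rule and namespace names are placeholders.

```yaml
# Sketch of a recording rule for the p99 duration of CA's scaleUp function,
# the metric behind the latency numbers discussed above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-autoscaler-latency     # placeholder name
  namespace: monitoring                # placeholder namespace
spec:
  groups:
  - name: cluster-autoscaler.rules
    rules:
    - record: cluster_autoscaler:scaleup_duration_seconds:p99
      expr: |
        histogram_quantile(
          0.99,
          sum by (le) (
            rate(cluster_autoscaler_function_duration_seconds_bucket{function="scaleUp"}[5m])
          )
        )
```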
Anything else we need to know?:
We do not seem to be hitting pod- or node-level resource limits; nothing is OOMing, and pods/nodes are not approaching their limits in general. We use node selectors for assignment. We also are not being rate-limited on the cloud provider side from what we can tell; there's just a delay before CA attempts to update the ASGs. Per the defined SLO, we expect CA to scale up within no more than 60s on large clusters like ours.
We looked into running multiple replicas to help us churn through the list, but by default there can only be one leader. I can't find any documentation about how well running multiple replicas in parallel works; we're under the impression that it's not recommended.
Alternatively, we looked into running multiple instances of CA in a single cluster, each focused on separate workloads/resources based on pod labels. I don't believe this is supported in any version of CA at this point?
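For illustration only: as far as I can tell CA has no option to filter pods by label, so the closest adjacent idea is splitting instances by node group rather than by pod label, e.g. giving each CA instance a disjoint set of ASGs via the AWS provider's --nodes flag. The sketch below is not an officially supported or recommended pattern, all names are placeholders, and it leaves open exactly the concerns above (e.g. both instances contending for the same leader-election lease unless that is separated somehow).

```yaml
# Purely illustrative: two CA Deployments, each restricted to a disjoint set
# of ASGs via --nodes. Whether this is safe (shared leader-election lease,
# both instances seeing the same unschedulable pods, etc.) is the open question.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler-general      # placeholder
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels: {app: cluster-autoscaler-general}
  template:
    metadata:
      labels: {app: cluster-autoscaler-general}
    spec:
      serviceAccountName: cluster-autoscaler   # assumed existing RBAC setup
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.1
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --nodes=0:200:general-asg-a          # placeholder ASG names
        - --nodes=0:200:general-asg-b
---
# Second instance scoped to the dedicated/isolated ASGs (placeholder names).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler-isolated
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels: {app: cluster-autoscaler-isolated}
  template:
    metadata:
      labels: {app: cluster-autoscaler-isolated}
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.1
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --nodes=0:200:isolated-app-asg-d
```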