correctly handle lack of capacity of AWS spot ASGs #2235
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
LGTM.
For what it's worth, I've backported exactly these changes (other than those in cluster-autoscaler/core/static_autoscaler.go, which doesn't exist in that version) to a fork of the 1.3.8 CA we've been running on our spot-based clusters for the past week, and it's working beautifully.
BTW, why does the bot think I don't have the CLA signed? I did that a long time ago...
How did you move your commits? Your old commits are probably still linked to your old email address. Please check.
Force-pushed from c8cec12 to 815a3b7
Yeah, my bad, the wrong author was still set on the commits.
I have experienced this style of failure in the past. I have an EKS cluster that is half on-demand, half spot, running under the cluster autoscaler. Let me know if I can help test in any way.
@losipiuk @gjtempleton can we merge it now, before it needs rebasing again?
Unfortunately I don't have merge permissions on the repo, so you'll need someone from the OWNERS file to do it. /assign @losipiuk
// The placeholder ID is just a fixed prefix plus an index, so two ASGs that
// are both short on capacity end up with identically named placeholders.
for i := real; i < desired; i++ {
	id := fmt.Sprintf("%s%d", placeholderInstanceNamePrefix, i)
If you have multiple ASGs with placeholders you'll duplicate IDs. I'm not sure whether it will create problems, but it's probably worth testing.
Good point. I'm not 100% sure it'll hold true for this PR, as we're running a patched version of CA v1.3.8, but we've had this occur over the past couple of days.
Behaviour we've seen: the CA only picks up that there's one unregistered node called i-placeholder-0 at a time, despite two ASGs having placeholder nodes with the name i-placeholder-0, so it only attempts to decrement the size of one of them, but not the other, until the first group is resolved.
@piontec how about adding the AutoScalingGroupName to the end of the placeholder instance name to distinguish these from each other?
Yes, @MaciekPytel has an excellent point. Seems like this is an easy fix (as Guy suggests, simply add the autoscaling group name to the placeholder instance name template).
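As a rough illustration of that suggestion (a hypothetical sketch, not the exact change merged in this PR; the function and variable names below are invented for the example), qualifying the placeholder ID with the ASG name keeps placeholders from different groups distinct:

```go
package main

import "fmt"

const placeholderInstanceNamePrefix = "i-placeholder-"

// createPlaceholderIDs returns synthetic instance IDs for the gap between the
// instances an ASG actually has (real) and the size it asked for (desired).
// Including the ASG name means two capacity-starved ASGs no longer produce
// colliding placeholder names.
func createPlaceholderIDs(asgName string, real, desired int) []string {
	ids := make([]string, 0, desired-real)
	for i := real; i < desired; i++ {
		// e.g. "i-placeholder-spot-asg-a-3" instead of "i-placeholder-3"
		ids = append(ids, fmt.Sprintf("%s%s-%d", placeholderInstanceNamePrefix, asgName, i))
	}
	return ids
}

func main() {
	// Two ASGs that are both short on capacity now get distinct placeholder IDs.
	fmt.Println(createPlaceholderIDs("spot-asg-a", 1, 3))
	fmt.Println(createPlaceholderIDs("spot-asg-b", 0, 2))
}
```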
Yes, great catch, fixing
@losipiuk is on vacation. /lgtm Hold for a single comment I have. Feel free to cancel the hold if you've tested it and it works fine.
Thanks for this PR. I think it addresses the risk I take on by using spot instances for production clusters. I've raised a feature request with my company's AWS Technical Account Manager to see if they can implement some kind of "failover" to on-demand instances if there is no spot capacity available for an autoscaling group. The mixed instances policy goes most of the way to supporting this, but not quite. https://aws.amazon.com/blogs/aws/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options/
I confirmed with the internal AWS ASG team who worked on this feature. It seems Spotinst has this supported, and I'm not sure whether there's a reasonable solution in CA. Let's create a separate issue and discuss it there?
Retry should happen with the current CA error handling, assuming all of those are separate node groups. Either because:
Further improvement would be to return OutOfResources errors in case of missing spot capacity.
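To sketch what that further improvement could look like (assuming the InstanceStatus / InstanceErrorInfo types that recent versions of the cluster-autoscaler cloudprovider package expose; the helper name and error code below are invented for illustration, not code from this PR):

```go
package aws

import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"

// outOfCapacityPlaceholder is a hypothetical helper showing how a placeholder
// instance that the ASG could not fulfil (e.g. no spot capacity) might be
// reported with an out-of-resources error, so the CA core can fail the
// scale-up quickly and try a different node group.
func outOfCapacityPlaceholder(id string) cloudprovider.Instance {
	return cloudprovider.Instance{
		Id: id,
		Status: &cloudprovider.InstanceStatus{
			State: cloudprovider.InstanceCreating,
			ErrorInfo: &cloudprovider.InstanceErrorInfo{
				ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
				ErrorCode:    "placeholder-cannot-be-fulfilled", // illustrative code, not a real AWS error code
				ErrorMessage: "AWS cannot provision the requested spot capacity",
			},
		},
	}
}
```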
Force-pushed from 815a3b7 to 8c93dd4
@MaciekPytel I will file cherry-picks today.
Thanks for the great work, all! Did this ever get cherry-picked to previous versions?
It doesn't look like it has been yet.
@Jeffwan hope I'm not treading on your toes, but I've started raising the cherry-picks of this to 1.12-1.14 as it's of use to us.
CA 1.12: cherry pick of #2235
CA 1.13: cherry pick of #2235
@gjtempleton Thanks. I was busy with on-call this week and didn't file the PRs.
CA 1.14: cherry pick of #2235
I don't think this found its way into release-1.15.
@BradLugo Well caught, I'll raise a PR for cherry-picking into 1.15 now.
CA 1.15: Cherry-pick of #2235
This is a direct continuation of the #2008 PR; I had to switch the source repo.
Please read the PR linked above to get the full context and history. From the last state in that PR, I made the changes mentioned by @losipiuk and @gjtempleton.
Please also note that I no longer have easy access to a spot-based EKS cluster on AWS. @seh, would you be so kind as to test it?