
correctly handle lack of capacity of AWS spot ASGs #2235

Merged: 1 commit, Aug 19, 2019

Conversation

piontec
Contributor

@piontec piontec commented Aug 3, 2019

This is a direct continuation of the #2008 PR. I had to switch the source repo.

Please read the PR linked above to get the full context and history. Starting from the last state in that PR, I made the changes suggested by @losipiuk and @gjtempleton.

Please also note that I no longer have easy access to a spot-based EKS cluster on AWS. @seh, would you be so kind as to test it?

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Aug 3, 2019
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 3, 2019
Member

@gjtempleton gjtempleton left a comment


LGTM.

For what it's worth, in the past week I've backported exactly these changes (other than those in cluster-autoscaler/core/static_autoscaler.go, which doesn't exist in that version) to a fork of the 1.3.8 CA we're running on our spot-based clusters, and it's working beautifully.

@piontec
Contributor Author

piontec commented Aug 3, 2019

BTW, why does the bot think I don't have the CLA signed? I signed it a long time ago...

@Jeffwan
Contributor

Jeffwan commented Aug 3, 2019

BTW, why does the bot think I don't have the CLA signed? I signed it a long time ago...

How did you move your commits? Your old commits are probably linked to your old email address. Please check.

@piontec piontec force-pushed the fix/aws_spots_squashed branch from c8cec12 to 815a3b7 Compare August 5, 2019 09:10
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 5, 2019
@piontec
Contributor Author

piontec commented Aug 5, 2019

BTW, why does the bot think I don't have the CLA signed? I signed it a long time ago...

How did you move your commits? Your old commits are probably linked to your old email address. Please check.

Yeah, my bad; the wrong author was still set on the commits.

@xrl

xrl commented Aug 6, 2019

I have experienced this style of failure in the past. I have an EKS cluster under the cluster autoscaler that is half on-demand and half spot. Let me know if I can help test in any way.

@piontec
Contributor Author

piontec commented Aug 7, 2019

@losipiuk @gjtempleton can we merge it now, before it needs rebasing again?

@gjtempleton
Member

Unfortunately I don't have merge permissions on the repo so you'll need someone from the OWNERS file to do it.

/assign @losipiuk

Review thread on this part of the diff, the loop that creates placeholder instances for capacity that was requested but not yet fulfilled:

```go
}

for i := real; i < desired; i++ {
	id := fmt.Sprintf("%s%d", placeholderInstanceNamePrefix, i)
```
Contributor


If you have multiple ASGs with placeholders, you'll duplicate IDs. I'm not sure whether it will create problems, but it's probably worth testing.

Member


Good point. I'm not 100% sure it will hold true for this PR, as we're running a patched version of CA v1.3.8, but we've had this occur over the past couple of days.

Behaviour we've seen: the CA only picks up that there's one unregistered node called i-placeholder-0 at a time, despite two ASGs having placeholder nodes with the name i-placeholder-0, so it only attempts to decrement the size of one of them, but not the other, until the first group is resolved.

@piontec how about adding the AutoScalingGroupName to the end of the placeholder instance name to distinguish these from each other?

Contributor


Yes, @MaciekPytel has an excellent point. It seems like this is an easy fix (as Guy suggests, simply add the autoscaling group name to the placeholder instance name template).

Contributor Author


Yes, great catch; fixing it now.
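
For illustration only, here is a minimal, self-contained sketch of the naming scheme suggested above; the helper name and the example ASG names are hypothetical, and only placeholderInstanceNamePrefix appears in the diff quoted earlier:

```go
package main

import "fmt"

const placeholderInstanceNamePrefix = "i-placeholder-"

// placeholderID builds a placeholder instance ID that is unique per ASG by
// embedding the group name, so placeholders from different ASGs cannot collide.
// This is a sketch of the suggestion above, not the merged implementation.
func placeholderID(asgName string, index int) string {
	return fmt.Sprintf("%s%s-%d", placeholderInstanceNamePrefix, asgName, index)
}

func main() {
	// Two ASGs waiting on capacity no longer produce the same i-placeholder-0 name.
	fmt.Println(placeholderID("spot-asg-a", 0)) // i-placeholder-spot-asg-a-0
	fmt.Println(placeholderID("spot-asg-b", 0)) // i-placeholder-spot-asg-b-0
}
```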

@MaciekPytel
Contributor

@losipiuk is on vacation.

/lgtm
/approve

Holding for a single comment I have. Feel free to cancel the hold if you've tested it and it works fine.
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 8, 2019
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 8, 2019
@cablespaghetti
Contributor

Thanks for this PR. I think it addresses the risk I have in using spot instances for production clusters.

I've raised a feature request with my company's AWS Technical Account Manager to see if they can implement some kind of "failover" to on-demand instances if there is no spot capacity available for an autoscaling group. The mixed instances policy goes most of the way to supporting this, but not quite. https://aws.amazon.com/blogs/aws/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options/

@Jeffwan
Contributor

Jeffwan commented Aug 12, 2019

Thanks for this PR. I think it addresses the risk I have in using spot instances for production clusters.

I've raised a feature request with my company's AWS Technical Account Manager to see if they can implement some kind of "failover" to on-demand instances if there is no spot capacity available for an autoscaling group. The mixed instances policy goes most of the way to supporting this, but not quite. https://aws.amazon.com/blogs/aws/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options/

I confirmed with the internal AWS ASG team that worked on this feature: failover will not be supported, at least in the short term. If spot has resource constraints, on-demand will have a similar problem. But for CA users, I think the idea is more about retrying with on-demand and, if possible, in other AZs.

It seems Spotinst supports this, and I'm not sure whether there's a reasonable solution in CA. Shall we create a separate issue and discuss it there?

@MaciekPytel
Contributor

But for CA users, I think the idea is more about retrying with on-demand and, if possible, in other AZs.

Retry should happen with the current CA error handling, assuming all of those are separate node groups, either because:

  1. You have one ASG per config rather than using MixedInstancesPolicy.
  2. The AWS cloud provider could represent an ASG with MixedInstancesPolicy / multi-AZ as multiple NodeGroups.

A further improvement would be to return OutOfResources errors in case of missing spot capacity.
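
For context, a rough sketch of what reporting that error class could look like, assuming the InstanceStatus / InstanceErrorInfo types and the OutOfResourcesErrorClass constant from the cluster-autoscaler cloudprovider package; the helper function, error code, and message below are made up for illustration and are not part of this PR:

```go
package aws

import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"

// spotCapacityExhaustedStatus is a hypothetical helper showing how missing
// spot capacity could be surfaced through the cloudprovider error-reporting
// types, so the core autoscaler can back off from the affected node group.
func spotCapacityExhaustedStatus() *cloudprovider.InstanceStatus {
	return &cloudprovider.InstanceStatus{
		State: cloudprovider.InstanceCreating,
		ErrorInfo: &cloudprovider.InstanceErrorInfo{
			ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
			ErrorCode:    "OutOfSpotCapacity",                               // hypothetical code
			ErrorMessage: "AWS could not fulfil the requested spot capacity", // hypothetical message
		},
	}
}
```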

@piontec piontec force-pushed the fix/aws_spots_squashed branch from 815a3b7 to 8c93dd4 Compare August 12, 2019 19:29
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 12, 2019
@MaciekPytel
Contributor

Given the large interest in addressing the underlying issue, we should probably cherry-pick this fix. I'll leave the final call on this to the AWS folks, though.
@jaypipes @Jeffwan

@Jeffwan
Contributor

Jeffwan commented Aug 21, 2019

@MaciekPytel I will file cherry-picks today.

@viggeh
Contributor

viggeh commented Sep 2, 2019

Thanks for the great work, all! Did this ever get cherry-picked to previous versions?

@gjtempleton
Member

It doesn't look like it has been yet.

@gjtempleton
Member

@Jeffwan I hope I'm not treading on your toes, but I've started raising the cherry-picks of this to 1.12-1.14, as it's of use to us.

k8s-ci-robot added a commit that referenced this pull request Sep 3, 2019
k8s-ci-robot added a commit that referenced this pull request Sep 3, 2019
@Jeffwan
Contributor

Jeffwan commented Sep 3, 2019

@gjtempleton Thanks. I was busy with on-call this week and didn't file the PRs.

@BradLugo

I don't think this found its way into release-1.15

@gjtempleton
Member

@BradLugo Well caught, I'll raise a PR for cherry-picking into 1.15 now.

k8s-ci-robot added a commit that referenced this pull request Sep 20, 2019
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.