fix(kwok): prevent quitting when scaling down node group #6336
Conversation
Force-pushed from 3a67a32 to ca0bb24.
	nodeGroup.targetSize += 1
}

nodeGroup.targetSize = newSize
This is for a case in which some nodes are created successfully and some fail.
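In other words (a minimal standalone sketch of the pattern, not the kwok provider's actual code; the function signature and the makeNode callback are assumptions for illustration), incrementing the counter only after each successful create keeps the tracked size consistent with the nodes that really exist:

package kwoksketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// increaseSize creates delta fake nodes one at a time and bumps the tracked
// target size only after each successful create, so that a partial failure
// leaves the counter matching the nodes that actually exist in the cluster.
func increaseSize(client kubernetes.Interface, targetSize *int, delta int,
	makeNode func(i int) *corev1.Node) error {
	if delta <= 0 {
		return fmt.Errorf("size increase must be positive, got %d", delta)
	}
	for i := 0; i < delta; i++ {
		if _, err := client.CoreV1().Nodes().Create(
			context.Background(), makeNode(i), metav1.CreateOptions{}); err != nil {
			// targetSize already reflects only the nodes created so far.
			return err
		}
		*targetSize += 1
	}
	return nil
}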
Makes sense. Should we add a test case around this (for both IncreaseSize and DeleteNodes)?
OK
/assign @vadasambar
Thank you for the PR!
I can't reproduce the issue. I used the same commands and configmap you used. Here's what I did: the kwok provider created 2 fake nodes, and then scaled down the 2 fake nodes. Logs for reference: https://gist.github.com/vadasambar/56ac07f2eedbd97e5d8aaa1424df3481
Maybe you can share the error you saw?
@vadasambar Sorry for the misunderstanding in the title. CA does not panic; it simply quits without any stack information. The last line in your log shows the reason:
	no.GetName())
}

ngName = fmt.Sprintf("%s-%v", ngName, time.Now().Unix())
I think it might be better to keep the node group name unchanged, especially in cases where nodes still remain in the cluster and CA is restarted.
Line 275 is clearly a bug. The original intention was to ensure the targetSize of a nodegroup correctly reflects the number of nodes with a matching annotation/label; making sure the nodegroup has a unique name on every pod restart achieves this. If we already have nodes in the cluster with matching nodegroup annotations/labels and we remove the time.Now().Unix() suffix, then when the nodegroups are created (imagine the CA pod got restarted), the target size won't accurately reflect the actual nodegroup size, because there are now more nodes matching the nodegroup label in the cluster.
One solution could be to implement Refresh and update the nodegroup target size based on the actual matching nodes in the cluster.
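A rough sketch of that suggestion (not the implementation that ended up in the PR; the function shape and the annotation-key parameter are assumptions): count the nodes carrying each node group's annotation and reset every target size to that count on Refresh.

package kwoksketch

import corev1 "k8s.io/api/core/v1"

// refreshTargetSizes recomputes each node group's target size from the nodes
// that actually exist in the cluster, keyed by a node group annotation, so a
// restarted CA starts from reality rather than a stale in-memory counter.
// nodeGroupAnnotation stands in for whatever key the provider actually uses.
func refreshTargetSizes(allNodes []*corev1.Node, targetSizes map[string]int,
	nodeGroupAnnotation string) {
	counts := map[string]int{}
	for _, node := range allNodes {
		ngName := node.GetAnnotations()[nodeGroupAnnotation]
		if ngName == "" {
			continue // not a kwok-managed node
		}
		counts[ngName]++
	}
	// Groups with no matching nodes in the cluster are reset to zero.
	for ngName := range targetSizes {
		targetSizes[ngName] = counts[ngName]
	}
}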
Let me know what you think.
The Kwok provider will calculate the target sizes during startup, and the Cluster Autoscaler will scale these nodegroups to the appropriate size. I don't think it's necessary to calculate the target size in the Refresh function; leaving this work to CA may be the better choice.
The kwok provider will clean up all the fake nodes when CA quits:
// Cleanup cleans up all resources before the cloud provider is removed
func (kwok *KwokCloudProvider) Cleanup() error {
	for _, ng := range kwok.nodeGroups {
		nodeNames, err := ng.getNodeNamesForNodeGroup()
		if err != nil {
			return fmt.Errorf("error cleaning up: %v", err)
		}
		for _, node := range nodeNames {
			err := kwok.kubeClient.CoreV1().Nodes().Delete(context.Background(), node, v1.DeleteOptions{})
			if err != nil {
				klog.Errorf("error cleaning up kwok provider nodes '%v'", node)
			}
		}
	}
	return nil
}
True. I was considering a case when Cleanup doesn't clean up all the nodes (it doesn't work correctly for whatever reason).
> The Kwok provider will calculate the target sizes during startup, and the Cluster Autoscaler will scale these nodegroups to the appropriate size. I think it's not necessary to calculate the target size in the Refresh function, just leave this work to CA may be a better choice.
I think we might have to try this out to confirm if CA can handle such a situation. If you can try it out as a part of this PR, great. If not, we can take care of it in another issue.
Hi @vadasambar, I have implemented the Refresh() function. Please take a look.
This is clearly a bug. Thank you for the explanation!
Also, I think we should add a test case which fails with the current code and passes with the fix.
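For illustration, a hedged sketch of what such a test could look like using client-go's fake clientset (this is not the PR's test code; the reactor-based failure injection and the inline create loop stand in for the provider's IncreaseSize):

package kwoksketch

import (
	"context"
	"errors"
	"fmt"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes/fake"
	k8stesting "k8s.io/client-go/testing"
)

// TestIncreaseSizePartialFailure simulates a scale-up where the second node
// create fails and checks that the tracked target size counts only the nodes
// that were actually created.
func TestIncreaseSizePartialFailure(t *testing.T) {
	client := fake.NewSimpleClientset()
	creates := 0
	client.PrependReactor("create", "nodes",
		func(action k8stesting.Action) (bool, runtime.Object, error) {
			creates++
			if creates == 2 {
				return true, nil, errors.New("simulated create failure")
			}
			return false, nil, nil // let the fake clientset handle the create
		})

	targetSize := 0
	for i := 0; i < 3; i++ {
		node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("fake-node-%d", i)}}
		if _, err := client.CoreV1().Nodes().Create(
			context.Background(), node, metav1.CreateOptions{}); err != nil {
			break // stop on failure, as the provider's IncreaseSize would
		}
		targetSize++
	}

	if targetSize != 1 {
		t.Fatalf("expected targetSize 1 after partial failure, got %d", targetSize)
	}
}

The idea is that such a test fails if the counter is set to the requested size up front and passes once it is incremented per successful create.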
Thanks for your advice; it will be done in a few days. /hold
Force-pushed from ca0bb24 to 3feb6ce.
Force-pushed from 3feb6ce to 1b8f1cc.
Done. /unhold
switch opts.CloudProviderName {
case cloudprovider.KwokProviderName:
	return kwok.BuildKwokCloudProvider(opts, do, rl)(opts, do, rl)
	return kwok.BuildKwok(opts, do, rl, informerFactory)
👍
@qianlei90 apologies for the delay (I was out on vacation). I plan to review this PR this week.
// 	ngs = append(ngs, ng)
// }
for _, node := range allNodes {
	ngName := getNGName(node, kwok.config)
This will lead to klog.Fatal for cases where the node doesn't have the nodegroup label or annotation. You might have to use a filter function to keep only nodes which have the nodegroup label/annotation, OR convert klog.Fatal into a non-fatal error log.
OK. I converted klog.Fatal to klog.Warning and added comments to getNGName.
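As a standalone sketch of that pattern (the annotation-only lookup and the function shape are assumptions, not the exact getNGName in the PR): log a warning and return an empty name instead of calling klog.Fatal, and let callers skip such nodes.

package kwoksketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// getNGNameOrSkip returns the node group name stored in the given annotation,
// or "" (with a warning instead of a fatal log) when the node doesn't carry
// it. Callers are expected to skip nodes that return an empty name.
func getNGNameOrSkip(node *corev1.Node, nodeGroupAnnotation string) string {
	ngName := node.GetAnnotations()[nodeGroupAnnotation]
	if ngName == "" {
		klog.Warningf("node %q has no node group annotation %q; skipping it",
			node.GetName(), nodeGroupAnnotation)
	}
	return ngName
}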
}

for _, ng := range kwok.nodeGroups {
	ng.targetSize = targetSizeInCluster[ng.Id()]
Maybe something for the future: I wonder if we should delete nodes whose ng is not a part of kwok.nodeGroups.
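If that future cleanup were ever added, it could look roughly like this (purely a sketch of the idea, not part of this PR; knownGroups and the annotation-key parameter are assumptions):

package kwoksketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// deleteOrphanNodes removes fake nodes whose node group annotation doesn't
// match any node group the provider currently knows about.
func deleteOrphanNodes(client kubernetes.Interface, allNodes []*corev1.Node,
	knownGroups map[string]bool, nodeGroupAnnotation string) {
	for _, node := range allNodes {
		ngName := node.GetAnnotations()[nodeGroupAnnotation]
		if ngName == "" || knownGroups[ngName] {
			continue // unmanaged node, or it belongs to a known node group
		}
		if err := client.CoreV1().Nodes().Delete(
			context.Background(), node.GetName(), metav1.DeleteOptions{}); err != nil {
			klog.Errorf("error deleting orphan node %q: %v", node.GetName(), err)
		}
	}
}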
assert.NoError(t, err)
assert.NotNil(t, p)

err = p.Refresh()
kwokConfig.status is coming out nil here for some reason, which is why the node4 test case is not throwing an error.
Sorry, it was a misunderstanding on my end. I ran the test again. Looks good to me.
Force-pushed from 1b8f1cc to e71a123.
/lgtm
/unhold
Thank you @qianlei90. LGTM. @BigDarkClown can you please merge the PR 🙏
/assign @towca |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: qianlei90, towca, vadasambar
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
When using the Kwok provider, CA quits when scaling down a node group because the Kwok provider cannot retrieve the node group name from a fake node. This PR primarily aims to fix this issue.
Additionally, I have fixed the target size when scaling up and down the node group.
Which issue(s) this PR fixes:
- kwok-provider-config
- kwok-provider-templates
- starting CA
- scale this deployment to test scale up and down
- CA log
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: