Attempt to hard power off node before it is deleted #816
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sadasu The full list of commands accepted by this bot can be found here.
/assign @andfasano
/test-integration
        expectedPowerState: "",
        expectedError:      "failed to remove host",
    },
}
A couple of additional cases are required to cover the paths where p.hardPowerOff() returns an error.
Using https://github.com/metal3-io/baremetal-operator/blob/master/pkg/provisioner/ironic/testserver/ironic.go#L227, I wasn't able to mock the outcome of hardPowerOff(), which is called from within Delete().
That's a really interesting case, and it's worth some additional explanation.
Adding the mocking function to the test case should be OK, i.e.:
ironic: testserver.NewIronic(t).Node(
    nodes.Node{
        UUID:           nodeUUID,
        ProvisionState: "active",
        Maintenance:    true,
        PowerState:     powerOn,
    },
).WithNodeStatesPowerUpdate(nodeUUID, http.StatusConflict).Delete(nodeUUID),
In this case the expected error is returned. However, this type assertion https://github.com/sadasu/baremetal-operator/blob/4d0f31405af55c03516533cd6e624aa6c32afaa2/pkg/provisioner/ironic/ironic.go#L1624 is going to fail because the real error is wrapped (twice).
As per the recommended Go best practices on errors (https://blog.golang.org/go1.13-errors), the type check must be changed to use the new errors functions, like:
var hostErr *HostLockedError
if errors.As(err, &hostErr) {
    p.log.Info("could not power off host, busy")
    return retryAfterDelay(powerRequeueDelay)
} else {
    return operationFailed("failed to power off host")
}
But that's not yet sufficient, because the underlying changePower returns the error as a value rather than a pointer, so the *HostLockedError target never matches: https://github.com/sadasu/baremetal-operator/blob/4d0f31405af55c03516533cd6e624aa6c32afaa2/pkg/provisioner/ironic/ironic.go#L1681
So that code must be changed as well, to:
return result, &HostLockedError{Address: p.host.Spec.BMC.Address}
This of course means that other points in the code (e.g. PowerOn) will require a deeper review to handle the error properly.
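To make the two points above concrete, here is a minimal, self-contained sketch. It assumes HostLockedError has a value receiver for Error(); the type below is a stand-in, and wrapTwice and isHostLocked are hypothetical helpers, not the provisioner's real code. It shows that a plain type assertion fails on a wrapped error, and that errors.As with a *HostLockedError target only matches when a pointer was returned.

```go
package main

import (
	"errors"
	"fmt"
)

// HostLockedError is a stand-in for the provisioner's error type.
// Assumption: Error() is declared on the value receiver.
type HostLockedError struct {
	Address string
}

func (e HostLockedError) Error() string {
	return fmt.Sprintf("BMC %s is locked", e.Address)
}

// wrapTwice is a hypothetical helper that mimics the two layers of
// wrapping the error picks up on its way back to the caller.
func wrapTwice(err error) error {
	inner := fmt.Errorf("failed to change power state: %w", err)
	return fmt.Errorf("failed to power off host: %w", inner)
}

// isHostLocked walks the wrap chain looking for a *HostLockedError.
func isHostLocked(err error) bool {
	var hostErr *HostLockedError
	return errors.As(err, &hostErr)
}

func main() {
	byValue := wrapTwice(HostLockedError{Address: "bmc.example"})
	byPointer := wrapTwice(&HostLockedError{Address: "bmc.example"})

	// A type assertion only inspects the outermost error.
	_, ok := byPointer.(*HostLockedError)
	fmt.Println("type assertion on wrapped error:", ok) // false

	// errors.As unwraps, but the target type must match exactly.
	fmt.Println("errors.As, value returned:", isHostLocked(byValue))     // false
	fmt.Println("errors.As, pointer returned:", isHostLocked(byPointer)) // true
}
```

The same mismatch would apply in reverse: if the callers targeted a HostLockedError value, every return site would have to return a value. Picking one form and using it everywhere is what matters.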
I have attempted to do this in c941bc0. This commit might need more work. The test for this is here: https://github.com/metal3-io/baremetal-operator/pull/816/files#diff-969e3b93b7bf85d6166287117b76766bd03994870cb36c0f26dce54cf2c11a50R147.
    ),
    priorErrors:        2,
    expectedPowerState: "power off",
    expectedError:      "",
See previous comment
/test-integration
There's one other problem we'll have to address, which is that hardPowerOff() doesn't set (and Delete doesn't check for) an ErrorMessage when there is a LastError from ironic. So if all the ironic calls succeed but ironic cannot actually change the power state, then we will keep retrying forever.
Fun fact: this also means that we currently never report a power management error. I raised #828 to record this. Basically I don't think we will be able to complete this until that issue is fixed.
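A rough sketch of the kind of check this would need, purely for illustration. The node and result types and the checkPowerOff function are hypothetical simplifications, not the provisioner's real API, and the sketch assumes a non-empty LastError on a node that is still powered on refers to the power-off attempt (in practice it could be stale).

```go
package main

import "fmt"

// node is a hypothetical, trimmed-down view of what ironic reports.
type node struct {
	PowerState string
	LastError  string
}

// result is a hypothetical stand-in for the provisioner's result type.
type result struct {
	Dirty        bool   // true means "requeue and check again"
	ErrorMessage string // non-empty means "report a failure"
}

// checkPowerOff decides what to report after a power-off was requested.
// Without the LastError branch, a node that can never be powered off
// would always land in the default branch and be retried forever.
func checkPowerOff(n node) result {
	switch {
	case n.PowerState == "power off":
		return result{}
	case n.LastError != "":
		return result{ErrorMessage: "failed to power off host: " + n.LastError}
	default:
		return result{Dirty: true}
	}
}

func main() {
	fmt.Printf("%+v\n", checkPowerOff(node{PowerState: "power off"}))
	fmt.Printf("%+v\n", checkPowerOff(node{PowerState: "power on", LastError: "IPMI timeout"}))
	fmt.Printf("%+v\n", checkPowerOff(node{PowerState: "power on"}))
}
```

With something like this in place, Delete would have an ErrorMessage to check and could stop retrying or fall back to deleting anyway.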
a7ac374 to 2649e51
/test-integration
Let the controller monitor the number of retries and communicate behavior to provisioner via force flag.
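As a sketch of that split of responsibilities (not the code in this PR): the controller owns the retry count and the provisioner only sees a boolean. The threshold maxPowerOffRetries, the fakeProvisioner type, and the reconcileDelete function are all hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// maxPowerOffRetries is an assumed threshold, not a value from the PR.
const maxPowerOffRetries = 3

var errUnreachable = errors.New("BMC unreachable")

// fakeProvisioner is a hypothetical provisioner whose power-off
// fails with powerOffErr on every attempt.
type fakeProvisioner struct {
	powerOffErr error
	deleted     bool
}

func (p *fakeProvisioner) hardPowerOff() error { return p.powerOffErr }

// Delete tries to power the host off first. With force set, a failed
// power-off no longer blocks the deletion.
func (p *fakeProvisioner) Delete(force bool) error {
	if err := p.hardPowerOff(); err != nil && !force {
		return fmt.Errorf("failed to power off before delete: %w", err)
	}
	p.deleted = true
	return nil
}

// reconcileDelete plays the controller's role for one reconcile pass:
// it turns the error count into the force flag and returns the new count.
func reconcileDelete(p *fakeProvisioner, errorCount int) (int, error) {
	force := errorCount >= maxPowerOffRetries
	if err := p.Delete(force); err != nil {
		return errorCount + 1, err
	}
	return 0, nil
}

func main() {
	p := &fakeProvisioner{powerOffErr: errUnreachable}
	errorCount := 0
	for attempt := 1; !p.deleted; attempt++ {
		var err error
		errorCount, err = reconcileDelete(p, errorCount)
		fmt.Printf("attempt %d: deleted=%v err=%v\n", attempt, p.deleted, err)
	}
}
```

Keeping the count in the controller means the provisioner stays stateless across reconcile passes, and the threshold lives in one place.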
/test-integration
The baremetal host controller checks the host's error count before performing some, but not all, actions. A host's error count should inform whether the controller performs any action on the host. Does the controller have a higher threshold for the overall number of errors, beyond which it does not call any action on that host? And should this be checked before the action is attempted rather than within the logic for each action?
        PowerState: powerOn,
    },
).WithNodeStatesPower(nodeUUID, http.StatusConflict).WithNodeStatesPowerUpdate(nodeUUID, http.StatusConflict),
expectedDirty: false,
These expected values are consistent with what is returned by transientError(). http.StatusConflict is not resulting in HostLockedError.
/test-integration
Hello! What is the status of this PR? Is it still blocked?
@sadasu looks like the PR needs a rebase
@sadasu: PR needs rebase.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
Stale issues close after 30d of inactivity. Reopen the issue with /close
@metal3-io-bot: Closed this PR.
When a BareMetalHost is deleted, power it off before performing
the delete operation.
Fixes #410