Attempt to hard power off node before it is deleted #816

Closed
sadasu wants to merge 3 commits from bmh-delete

Conversation

sadasu
Member

@sadasu sadasu commented Mar 16, 2021

When a BareMetalHost is deleted, power it off before performing
the delete operation.
Fixes #410

@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sadasu
To complete the pull request process, please assign dtantsur
You can assign the PR to them by writing /assign @dtantsur in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 16, 2021
@sadasu
Member Author

sadasu commented Mar 16, 2021

/assign @andfasano

@sadasu
Member Author

sadasu commented Mar 16, 2021

/test-integration

pkg/provisioner/ironic/delete_test.go Outdated Show resolved Hide resolved
expectedPowerState: "",
expectedError: "failed to remove host",
},
}
Member

A couple more cases are required to cover what happens when p.hardPowerOff() returns an error.

Member Author

Using https://github.com/metal3-io/baremetal-operator/blob/master/pkg/provisioner/ironic/testserver/ironic.go#L227, I wasn't able to mock the outcome of hardPowerOff(), which is called from within Delete().

Member

@andfasano andfasano Mar 19, 2021

That's a really interesting case, and it's worth some additional explanation.

Adding the mocking function to the test case should be OK, i.e.:

ironic: testserver.NewIronic(t).Node(
	nodes.Node{
		UUID:           nodeUUID,
		ProvisionState: "active",
		Maintenance:    true,
		PowerState:     powerOn,
	},
).WithNodeStatesPowerUpdate(nodeUUID, http.StatusConflict).Delete(nodeUUID),

In this case the expected error is returned; however, this type assertion https://github.com/sadasu/baremetal-operator/blob/4d0f31405af55c03516533cd6e624aa6c32afaa2/pkg/provisioner/ironic/ironic.go#L1624 is going to fail because the real error is wrapped (twice).

Per the recommended Go practices for error handling (https://blog.golang.org/go1.13-errors), the type check must be changed to use the new errors functions, like:

var hostErr *HostLockedError
if errors.As(err, &hostErr) {
	p.log.Info("could not power off host, busy")
	return retryAfterDelay(powerRequeueDelay)
}
return operationFailed("failed to power off host")

But that's not yet sufficient, because the underlying changePower returns the error as a value rather than as a pointer, so errors.As with a *HostLockedError target will not match it: https://github.com/sadasu/baremetal-operator/blob/4d0f31405af55c03516533cd6e624aa6c32afaa2/pkg/provisioner/ironic/ironic.go#L1681

So that code as well must be changed to:

return result, &HostLockedError{Address: p.host.Spec.BMC.Address}

This means, of course, that other places in the code (e.g. PowerOn) will require a deeper review to handle the error properly.
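
To make the wrapping problem above concrete, here is a minimal, self-contained sketch. It is not the operator's code: the HostLockedError type is redeclared locally with a value receiver, changePower is a hypothetical stand-in that takes a flag to choose between returning a value or a pointer, and the double wrapping is assumed to be done with fmt.Errorf and %w.

```go
package main

import (
	"errors"
	"fmt"
)

// HostLockedError is a local stand-in for the provisioner's error type.
type HostLockedError struct {
	Address string
}

// Error uses a value receiver, so both HostLockedError and
// *HostLockedError satisfy the error interface.
func (e HostLockedError) Error() string {
	return fmt.Sprintf("BMC %s is currently locked", e.Address)
}

// changePower is a hypothetical stand-in that returns the locked error
// wrapped twice, either as a pointer or as a plain value.
func changePower(asPointer bool) error {
	var base error = HostLockedError{Address: "ipmi://192.0.2.1"}
	if asPointer {
		base = &HostLockedError{Address: "ipmi://192.0.2.1"}
	}
	inner := fmt.Errorf("failed to change power state: %w", base)
	return fmt.Errorf("failed to power off host: %w", inner)
}

// isHostLocked walks the wrap chain looking for a *HostLockedError.
func isHostLocked(err error) bool {
	var hostErr *HostLockedError
	return errors.As(err, &hostErr)
}

func main() {
	err := changePower(true)

	// A plain type assertion only inspects the outermost error.
	_, ok := err.(*HostLockedError)
	fmt.Println("type assertion on wrapped error:", ok)

	// errors.As unwraps, but the target type must match exactly.
	fmt.Println("errors.As, error returned as pointer:", isHostLocked(err))
	fmt.Println("errors.As, error returned as value:", isHostLocked(changePower(false)))
}
```

Only the pointer variant is found by errors.As with a *HostLockedError target, which is why both the check and the return statement have to change together.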

Member Author

@sadasu sadasu Mar 25, 2021

pkg/provisioner/ironic/delete_test.go Outdated Show resolved Hide resolved
),
priorErrors: 2,
expectedPowerState: "power off",
expectedError: "",
Member

See previous comment

@sadasu
Member Author

sadasu commented Mar 22, 2021

/test-integration

Member

@zaneb zaneb left a comment

There's one other problem we'll have to address, which is that hardPowerOff() doesn't set (and Delete doesn't check for) an ErrorMessage when there is a LastError from ironic. So if all the ironic calls succeed but ironic cannot actually change the power state, then we will keep retrying forever.

Fun fact: this also means that we currently never report a power management error. I raised #828 to record this. Basically I don't think we will be able to complete this until that issue is fixed.
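
A rough sketch of the missing check described above, under stated assumptions: node carries only the two Ironic fields under discussion, and powerResult and checkPowerOff are hypothetical names rather than the provisioner's real types. The idea is that a non-empty LastError ends the retry loop with an error message instead of requeueing again.

```go
package main

import "fmt"

// node holds only the two Ironic node fields discussed here (hypothetical subset).
type node struct {
	PowerState string
	LastError  string
}

// powerResult is a hypothetical result type: Dirty means "requeue and
// check again", ErrorMessage means "stop retrying and report a failure".
type powerResult struct {
	Dirty        bool
	ErrorMessage string
}

// checkPowerOff decides what to do after a power-off request was accepted.
func checkPowerOff(n node) powerResult {
	switch {
	case n.PowerState == "power off":
		// Target state reached, nothing more to do.
		return powerResult{}
	case n.LastError != "":
		// The API call succeeded but the power change itself failed:
		// surface the error so the caller does not retry forever.
		return powerResult{ErrorMessage: "failed to power off host: " + n.LastError}
	default:
		// Still in progress, check again later.
		return powerResult{Dirty: true}
	}
}

func main() {
	fmt.Printf("%+v\n", checkPowerOff(node{PowerState: "power off"}))
	fmt.Printf("%+v\n", checkPowerOff(node{PowerState: "power on"}))
	fmt.Printf("%+v\n", checkPowerOff(node{PowerState: "power on", LastError: "BMC unreachable"}))
}
```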

pkg/provisioner/ironic/ironic.go Outdated Show resolved Hide resolved
pkg/provisioner/ironic/ironic.go Outdated Show resolved Hide resolved
@sadasu sadasu force-pushed the bmh-delete branch 2 times, most recently from a7ac374 to 2649e51 Compare March 24, 2021 17:17
@sadasu
Member Author

sadasu commented Mar 24, 2021

/test-integration

pkg/provisioner/ironic/ironic.go Outdated Show resolved Hide resolved
controllers/metal3.io/baremetalhost_controller.go Outdated Show resolved Hide resolved
Let the controller monitor the number of retries and communicate
behavior to provisioner via force flag.
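
One way to picture the split described in this commit message is sketched below. Everything in it is an assumption made for illustration: the provisioner interface is cut down to a single Delete(force) call, maxPowerOffRetries is an invented threshold, and force is assumed to mean "skip the power off and delete anyway".

```go
package main

import (
	"errors"
	"fmt"
)

// maxPowerOffRetries is a hypothetical threshold; the PR does not state a value.
const maxPowerOffRetries = 3

// provisioner is a hypothetical, cut-down interface with only the call needed here.
type provisioner interface {
	Delete(force bool) (done bool, err error)
}

// stuckProvisioner simulates a host whose power off never succeeds.
type stuckProvisioner struct{}

func (stuckProvisioner) Delete(force bool) (bool, error) {
	if !force {
		return false, errors.New("failed to power off host")
	}
	// Forced: skip the power off and remove the node anyway.
	return true, nil
}

// reconcileDelete keeps the retry policy in the controller; the
// provisioner only learns the outcome through the force flag.
func reconcileDelete(p provisioner, errorCount int) (bool, error) {
	force := errorCount >= maxPowerOffRetries
	return p.Delete(force)
}

// attemptsUntilDeleted replays the reconcile loop and counts the calls.
func attemptsUntilDeleted(p provisioner) int {
	errorCount := 0
	for attempts := 1; ; attempts++ {
		done, err := reconcileDelete(p, errorCount)
		if done {
			return attempts
		}
		if err != nil {
			errorCount++
		}
	}
}

func main() {
	fmt.Println("attempts until deleted:", attemptsUntilDeleted(stuckProvisioner{}))
}
```

With this layout the provisioner stays stateless with respect to retries, and the threshold can be tuned in one place.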
@sadasu
Member Author

sadasu commented Mar 25, 2021

/test-integration

@sadasu
Member Author

sadasu commented Mar 25, 2021

The baremetal host controller checks the host's error count before performing some, but not all, actions. A host's error count should inform whether the controller performs any action on the host. Does the controller have a higher threshold on the overall number of errors, beyond which it does not call any action on that host? And should this be checked before the action is attempted, rather than within the logic for each action?
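
To illustrate the alternative being asked about, here is a small sketch of a single gate that runs before any action. It is only an illustration of the question, not how the controller works today: host, action, maxErrorCount and runAction are all hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// maxErrorCount is a hypothetical overall threshold.
const maxErrorCount = 5

// host carries only the error counter (hypothetical subset of the status).
type host struct {
	ErrorCount int
}

// action is a hypothetical signature shared by all per-state handlers.
type action func(h *host) error

var errTooManyErrors = errors.New("error threshold reached, action skipped")

// Two trivial actions used to exercise the gate.
var (
	failing    action = func(*host) error { return errors.New("power off failed") }
	succeeding action = func(*host) error { return nil }
)

// runAction applies the threshold once, before dispatching, instead of
// repeating the check inside every action.
func runAction(h *host, a action) error {
	if h.ErrorCount >= maxErrorCount {
		return errTooManyErrors
	}
	if err := a(h); err != nil {
		h.ErrorCount++
		return err
	}
	h.ErrorCount = 0
	return nil
}

func main() {
	h := &host{}
	for i := 0; i < maxErrorCount+1; i++ {
		fmt.Printf("attempt %d: %v (count=%d)\n", i+1, runAction(h, failing), h.ErrorCount)
	}
}
```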

PowerState: powerOn,
},
).WithNodeStatesPower(nodeUUID, http.StatusConflict).WithNodeStatesPowerUpdate(nodeUUID, http.StatusConflict),
expectedDirty: false,
Member Author

These expected values are consistent with what transientError() returns. http.StatusConflict does not result in a HostLockedError.

@sadasu
Member Author

sadasu commented Mar 26, 2021

/test-integration

@maelk
Member

maelk commented Apr 15, 2021

Hello! What is the status of this PR? Is it still blocked?

@sadasu
Member Author

sadasu commented Apr 15, 2021

Hello! What is the status of this PR? Is it still blocked?

@maelk During the discussion in this PR it was determined that #841 needs to be solved first. I am working on that right now. Comments welcome.

@andfasano
Member

@sadasu looks like the PR needs a rebase

@metal3-io-bot metal3-io-bot added the needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD. label May 2, 2021
@metal3-io-bot
Contributor

@sadasu: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 19, 2021
@metal3-io-bot
Contributor

@sadasu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Rerun command
shellcheck c941bc0 link /test shellcheck
markdownlint c941bc0 link /test markdownlint

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@metal3-io-bot
Contributor

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

@metal3-io-bot
Contributor

@metal3-io-bot: Closed this PR.


@furkatgofurov7
Member

@sadasu hi! Would you have time to re-open this PR and fix the conflicts? It should still be a valid fix for #410.

Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Host should be powered down once deleted
6 participants