-
Notifications
You must be signed in to change notification settings - Fork 247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Prevent hosts from getting into infinite power-off loops #1356
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realised that I steered you wrong by suggesting that we need to count the NeedsRegistration errors.
The purpose of NeedsRegistration is not to note that we failed to register; rather it is to retrigger a reconcile so that registration can run again on the new reconcile. This is why it's actionError rather than actionFailure. Any actual failures to register are recorded by registerHost()
on the next reconcile.
Once in StatePoweringOffBeforeDelete, however, we will never call registerHost()
because we bail out from ensureRegistered()
(at line 332). This is a bug - the comment there says "In the deleting state the whole idea is to de-register the host" and that is true for the Deleting state, but not true for PoweringOffBeforeDelete where the host very much needs to be registered in order for us to power off. As things stand, once we need to re-register we can never get out of this loop because registration will never happen again.
So we need to fix that, but we also need to find a way to bail out once the registration errors start to pile up, otherwise we will be stuck trying to re-register. This makes me wonder how repeated registration errors are handled in Deprovisioning, which is in a similar boat (i.e. a thing you have to do before deleting that requires registration)... as far as I can see they're not. Maybe we could check the registration error count in checkInitiateDelete and move the state on to Deleting?
Looking at that, I note one more problem: if the Host is created with incorrect credentials, it will be stuck in the Registering state with failures. When the user then deletes the host, it should immediately go to the deleting state where it can be deleted despite the lack of credentials. However, it now goes into PoweringOffBeforeDelete where it will be stuck forever. I guess that'll also be resolved if we fix the known bug, but I think for defensive reasons prior to the initial registration succeeding (i.e. the states "registering"
, "unmanaged"
, and ""
), checkInitiateDelete()
should send a host with a deletion timestamp directly to the Deleting state.
(aside: I wonder if ExternallyProvisioned should also go straight to Deleting? There's no danger of us leaving a running IPA image when the host is externally provisioned - the reason for this feature - and if we don't own provisioning then maybe we shouldn't mess with the power state except when explicitly asked to by the user? Like, shouldn't it be possible to add an ExternallyProvisioned host and then delete it again without Metal³ ever doing anything to it?)
if hsm.Host.Status.ErrorCount > 3 { | ||
info.log.Info("Giving up on host power off after 3 attempts.") | ||
return skipToDelete() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe if we did something like
return recordActionFailure(info, metal3api.RegistrationError, r.err.Error())
here (after the if block) then we wouldn't need to change the behaviour of every error return point (most of which are likely transient errors).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(but see main comment - on reflection I don't think we need to count at this point at all)
2359fba
to
111f553
Compare
|
This looks good, but we also need to deal with:
because if registration fails (in the PoweringOffBeforeDelete state) then (I think!) it will just keep retrying in ensureRegistered() and never get to any code that would allow it to bail out (not that any exists). You should be able to test this by changing the credentials to invalid values and then deleting the host. |
Once #1312 merges we'll be able to update the state machine diagram here too. Just leaving a note to remind myself. |
The host needs to be registered if we are going to power it off.
Namely registering, unmanaged, and none. This is to prevent a stuck case where incorrect credentials have been provided.
|
2fcf433
to
e97b941
Compare
/test-centos-e2e-integration-main |
// Allow state machine to run to continue deprovisioning. | ||
return false | ||
case metal3api.StateDeleting: | ||
// Already in deleting state. Allow state machine to run. | ||
return false | ||
case metal3api.StatePoweringOffBeforeDelete: | ||
if hsm.Host.Status.ErrorType == metal3api.RegistrationError && hsm.Host.Status.ErrorCount > 3 { | ||
hsm.NextState = metal3api.StateDeleting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't return true here, then we'll continue to process the state machine as if the current state were already Deleting, but without updating it in the Status unless handleDeleting() happens to return actionUpdate or actionComplete. Functionally this is OK - ensureRegistered() will look at the next state and not bother trying to re-register. But for the sake of consistency with all the other cases of changing hsm.NextState in this function, I think we should return true when we set it so that the user can explicitly see what is happening at each step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return
statement added here, and also at the deprovisioning state
e97b941
to
4156f61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I think we should also change
if r.NeedsRegistration() && !hsm.haveCreds { |
My reasoning is that we know the BMH is marked for deletion already since we are in handlePoweringOffBeforeDelete
, therefore we should skip to delete if it needs registration OR if we lack credentials to power it off. I don't see a case where we should ever register the host when it is already marked for deletion.
(The reason for deleting could be that registration failed.)
/test-centos-e2e-integration-main |
/approve Please do consider the suggestion above. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dtantsur The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
The case is when the ironic DB is dropped in mid-operation while we were trying to turn the power off. |
This is to cover the case where we need registration, and we have credentials, and yet we can't power the node off.
Also note that this is currently rebased on top of #1312, and only the latest commit is relevant.