OCPBUGS-31868: allow for some errors checking namespace delete #28761
Conversation
@jluhrsen: This pull request references Jira Issue OCPBUGS-31868, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
log.Errorf("Timed out after 20 minutes waiting for deleted namespace: %s, %s", w.namespaceName, err) | ||
return err | ||
} else { | ||
log.Errorf("Encountered error while waiting for deleted namespace: %s, %s", w.namespaceName, err) |
Could keep a successive error count, clear the value any time we don't get an error, and bail when it gets to 10 or something, just so we don't spin for 15 minutes if things are really busted.
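Spelled out, that idea would look something like the sketch below. To be clear, this is hypothetical and not the code in this PR: consecutiveErrors is an invented field on the availability struct, and the apierrors (k8s.io/apimachinery/pkg/api/errors), metav1, and logrus references are assumed to match the surrounding monitor code.

func (w *availability) namespaceDeleted(ctx context.Context) (bool, error) {
    _, err := w.kubeClient.CoreV1().Namespaces().Get(ctx, w.namespaceName, metav1.GetOptions{})
    if apierrors.IsNotFound(err) {
        return true, nil // namespace is gone, polling is done
    }
    if err != nil {
        w.consecutiveErrors++ // hypothetical counter, cleared below on any successful check
        logrus.Errorf("Error checking for deleted namespace: %s, %s", w.namespaceName, err.Error())
        if w.consecutiveErrors >= 10 {
            return false, err // bail out of the poll after 10 errors in a row
        }
        return false, nil // tolerate the error; the poller will call us again
    }
    w.consecutiveErrors = 0 // the check worked (namespace still exists), reset the count
    return false, nil
}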
I realize this doesn't work after thinking about your other comment. I have a new idea coming.
@@ -281,7 +282,6 @@ func (w *availability) namespaceDeleted(ctx context.Context) (bool, error) {
    }

    if err != nil {
        logrus.Errorf("Error checking for deleted namespace: %s, %s", w.namespaceName, err.Error())
Do you mean to suppress the return false here? Looks like this just moves the logging up, but the comment in the Jira had me thinking you wanted to keep polling.
I suppose it would be return false, nil to keep polling but still log the error?
I think I confused myself that a return false, err would not stop the polling, but that's actually what was already happening and how we got to this bug. But if we return false, nil then I'm not sure we have a way to keep polling. 🤔
I had to go look it up earlier myself. I think by returning false, nil it will continue to process. So you could keep the log entry but return nil for the error. This doesn't accomplish my other suggestion, but that was just a thought.
PollUntilContextCancel tries a condition func until it returns true, an error, or the context is cancelled or hits a deadline
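For reference, here is a minimal, self-contained sketch of those semantics using PollUntilContextTimeout from k8s.io/apimachinery/pkg/util/wait (the deadline-wrapped variant of PollUntilContextCancel). It only illustrates the return-value behavior of the condition func, nothing specific to this PR.

package main

import (
    "context"
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    attempts := 0
    // The condition func drives the poller:
    //   true, nil   -> stop, success
    //   false, nil  -> keep polling on the next interval
    //   false, err  -> stop immediately and surface err to the caller
    err := wait.PollUntilContextTimeout(context.Background(), time.Second, 10*time.Second, true,
        func(ctx context.Context) (bool, error) {
            attempts++
            if attempts < 3 {
                return false, nil // tolerated "failure": the poller just tries again
            }
            return true, nil // done on the third attempt
        })
    fmt.Println(attempts, err) // prints: 3 <nil>
}

So returning false, nil is the "tolerate it and try again" path, while returning a non-nil error aborts the poll right away.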
@neisw, how about something like this: I bumped the poller to 30s and added 6 tries in namespaceDeleted() in case it hits an error. So if we hit some error 6 times in a row, we log all of them and pass the last one back out to the poller, which will exit and fail the test case.
        return true, nil
    }
    for retry := 0; retry < 6; retry++ {
        _, err := w.kubeClient.CoreV1().Namespaces().Get(ctx, w.namespaceName, metav1.GetOptions{})
I think being in the loop here will just hammer it 6 times in a row without a pause in between and then exit. It is the PollUntilContextTimeout that waits in between the calls.
err := wait.PollUntilContextTimeout(ctx, 30*time.Second, 20*time.Minute, true, w.namespaceDeleted)
Well, I missed the sleep that you have, but I'm still not sure this is the best way.
OK, so if the error is nil we return false, nil. If the error is not nil and not IsNotFound, we try up to 6 times with a 5-second sleep in between, but we still always return false, nil. So really we just try a few more times in the same call, with a smaller interval, when we see a non-IsNotFound error, but ultimately keep going until we return true or time out.
Are you wanting to quit polling after 6 consecutive failures? Maybe preserve the error and return it at the end: declare var err error outside the loop and return false, err at the end of the function. It should only return there after the 6 tries with no IsNotFound or nil error.
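Put together, that suggestion would make namespaceDeleted() look roughly like the sketch below. This is a sketch under the assumptions already in this thread (apierrors, metav1, logrus, and time imports from the surrounding file), not necessarily the exact code that merged: err is declared outside the loop so the last failure survives, and it only reaches the poller after all six tries fail.

func (w *availability) namespaceDeleted(ctx context.Context) (bool, error) {
    var err error
    for retry := 0; retry < 6; retry++ {
        _, err = w.kubeClient.CoreV1().Namespaces().Get(ctx, w.namespaceName, metav1.GetOptions{})
        if apierrors.IsNotFound(err) {
            return true, nil // namespace is gone; stop polling with success
        }
        if err == nil {
            return false, nil // namespace still exists; let the poller wait and check again
        }
        logrus.Errorf("Error checking for deleted namespace: %s, %s", w.namespaceName, err.Error())
        time.Sleep(5 * time.Second) // short pause between the in-call retries
    }
    // six consecutive errors: return the last one so PollUntilContextTimeout exits and the test fails
    return false, err
}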
return false, nil
Yes, that was my intention originally. I've updated it now. Look OK?
Yep, I figured that was what you were going for; it just took me a couple of read-throughs to get caught up...
Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>
/test e2e-azure-ovn-upgrade
Job Failure Risk Analysis for sha: 4979bad
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jluhrsen, neisw. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@jluhrsen: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Merged commit ab28660 into openshift:master.
@jluhrsen: Jira Issue OCPBUGS-31868: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-31868 has been moved to the MODIFIED state. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[ART PR BUILD NOTIFIER] This PR has been included in build openshift-enterprise-tests-container-v4.17.0-202405040320.p0.gab28660.assembly.stream.el9 for distgit openshift-enterprise-tests.
Fix included in accepted release 4.16.0-0.nightly-2024-05-04-214435
No description provided.