OCPBUGS-31868: allow for some errors checking namespace delete #28761

jluhrsen · 2024-05-01T00:44:15Z

No description provided.

openshift-ci-robot · 2024-05-01T00:44:22Z

@jluhrsen: This pull request references Jira Issue OCPBUGS-31868, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

neisw · 2024-05-01T18:21:16Z

pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go

+				log.Errorf("Timed out after 20 minutes waiting for deleted namespace: %s, %s", w.namespaceName, err)
+				return err
+			} else {
+				log.Errorf("Encountered error while waiting for deleted namespace: %s, %s", w.namespaceName, err)


We could keep a count of consecutive errors, reset it any time a call succeeds, and bail out when it reaches 10 or so, just so we don't spin for 15 minutes if things are really busted.

I realize this doesn't work after thinking about your other comment. I have a new idea coming.

neisw · 2024-05-01T18:38:13Z

pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go

@@ -281,7 +282,6 @@ func (w *availability) namespaceDeleted(ctx context.Context) (bool, error) {
 	}

 	if err != nil {
-		logrus.Errorf("Error checking for deleted namespace: %s, %s", w.namespaceName, err.Error())


Do you mean to suppress the return false here? It looks like this just moves the logging up, but the comment in the Jira had me thinking you wanted to keep polling.

I suppose it would be return false, nil to keep polling but still log the error?

I think I confused myself into thinking that return false, err would not stop the polling, but that's actually what was already happening and how we got to this bug.

but, if we return false, nil then I'm not sure we have a way to keep polling. 🤔

I had to go look it up earlier myself. I think by returning false, nil it will continue to process. So you could keep the log entry but return nil for the error. This doesn't accomplish my other suggestion but that was just a thought.

PollUntilContextTimeout

PollUntilContextCancel tries a condition func until it returns true, an error, or the context is cancelled or hits a deadline
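
To make the return-value semantics discussed above concrete, here is a minimal Go sketch. It deliberately does not import k8s.io/apimachinery: pollUntil, step, runPoll, describe, and errTransient are hypothetical stand-ins written only for illustration, assuming the documented behavior quoted above (the condition is retried until it returns true, returns an error, or the context ends).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical error standing in for a flaky API call.
var errTransient = errors.New("transient API error")

// conditionFunc has the same shape as the condition handed to the poller:
// (true, nil) stops successfully, a non-nil error stops with that error,
// and (false, nil) asks for another attempt after the interval.
type conditionFunc func(ctx context.Context) (done bool, err error)

// pollUntil is a simplified stand-in for the real poller, not its implementation.
func pollUntil(ctx context.Context, interval time.Duration, cond conditionFunc) error {
	for {
		done, err := cond(ctx)
		if err != nil {
			return err // any error ends polling immediately
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

// step is one scripted result of the condition function.
type step struct {
	done bool
	err  error
}

// runPoll replays the scripted steps and reports how many times the
// condition was called and what the poller returned.
func runPoll(steps []step) (calls int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err = pollUntil(ctx, time.Millisecond, func(ctx context.Context) (bool, error) {
		if calls >= len(steps) {
			return true, nil
		}
		s := steps[calls]
		calls++
		return s.done, s.err
	})
	return calls, err
}

func describe(calls int, err error) string {
	return fmt.Sprintf("calls=%d err=%v", calls, err)
}

func main() {
	// Returning (false, nil) keeps the poller going until the condition is true.
	fmt.Println(describe(runPoll([]step{{false, nil}, {false, nil}, {true, nil}}))) // prints "calls=3 err=<nil>"
	// Returning (false, err) ends polling on the first error.
	fmt.Println(describe(runPoll([]step{{false, nil}, {false, errTransient}, {true, nil}}))) // prints "calls=2 err=transient API error"
}
```

In this sketch, logging the error and returning false, nil keeps polling alive, while returning false, err stops it at once, which matches the behavior described in the comments above.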

jluhrsen · 2024-05-01T22:04:33Z

pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go

@neisw, how about something like this?

I bumped the poller interval to 30s and added up to 6 tries inside namespaceDeleted() in case it hits an error. If we hit an error 6 times in a row, we log all of them and pass the last one back out to the poller, which will exit and fail the test case.

neisw · 2024-05-01T22:37:52Z

pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go

-		return true, nil
-	}
+	for retry := 0; retry < 6; retry++ {
+		_, err := w.kubeClient.CoreV1().Namespaces().Get(ctx, w.namespaceName, metav1.GetOptions{})


I think being in the loop here will just hammer it 6 times in a row without a pause in between and then exit. It is the PollUntilContextTimeout that waits in between the calls.

origin/pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go

Line 304 in e07b8e3

err := wait.PollUntilContextTimeout(ctx, 30*time.Second, 20*time.Minute, true, w.namespaceDeleted)

Well, I missed the sleep that you have but still not sure this is the best way.

Ok, so if the error is nil we return false, nil

If the error is not nil and not IsNotFound, we try up to 6 times with a 5 second sleep in between. But we still always return false, nil. So really we just try a few more times in the same call, with a smaller interval, when we see a non-IsNotFound error, but ultimately keep going until we return true or time out.

Are you wanting to quit polling after 6 consecutive failures? Maybe preserve the error and return it at the end: declare var err error outside the loop and return false, err at the end of the function. We should only reach that return after 6 tries with neither an IsNotFound nor a nil error.

origin/pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go

Line 290 in e07b8e3

return false, nil

Yes, that was my intention originally. I've updated it now. Look OK?

Yep I figured that was what you were going for, it just took me a couple of read throughs to get caught up...
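
The shape this thread converged on can be sketched roughly as follows. This is not the merged code: it is a stdlib-only illustration in which the get callback, errNotFound, errTransient, and flakyGet are hypothetical stand-ins (the real function calls the Kubernetes client and would check the error with an IsNotFound helper). It assumes up to N tries with a pause in between, with the last error preserved and returned only when every try failed.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors used only for this sketch.
var (
	errNotFound  = errors.New("namespace not found")
	errTransient = errors.New("transient API error")
)

// namespaceDeleted tries the lookup up to `retries` times.
//   - nil error: the namespace still exists, so return (false, nil) and let
//     the outer poller call again later.
//   - not-found error: the namespace is gone, so return (true, nil).
//   - any other error: log it, pause, and try again; if every try fails,
//     return the last error so the outer poller stops and the test fails.
func namespaceDeleted(get func() error, retries int, pause time.Duration) (bool, error) {
	var lastErr error
	for attempt := 0; attempt < retries; attempt++ {
		lastErr = get()
		switch {
		case lastErr == nil:
			return false, nil
		case errors.Is(lastErr, errNotFound):
			return true, nil
		}
		fmt.Printf("attempt %d: error checking for deleted namespace: %v\n", attempt+1, lastErr)
		if attempt < retries-1 {
			time.Sleep(pause) // pause between tries, not after the last one
		}
	}
	return false, lastErr
}

// flakyGet returns a lookup that fails `failures` times with errTransient
// and then returns `final`.
func flakyGet(failures int, final error) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errTransient
		}
		return final
	}
}

func main() {
	// Two transient errors, then the namespace is reported gone.
	done, err := namespaceDeleted(flakyGet(2, errNotFound), 6, time.Millisecond)
	fmt.Println(done, err)

	// Six errors in a row: the last one is handed back to the caller.
	done, err = namespaceDeleted(flakyGet(6, errNotFound), 6, time.Millisecond)
	fmt.Println(done, err)
}
```

Declaring lastErr outside the loop is what lets the final return carry the error from the sixth failed try, as suggested above.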

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>

jluhrsen · 2024-05-02T21:07:17Z

/test e2e-azure-ovn-upgrade

openshift-trt-bot · 2024-05-03T01:13:56Z

Job Failure Risk Analysis for sha: 4979bad

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6	IncompleteTests Tests for this run (23) are below the historical average (1196): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

jluhrsen · 2024-05-03T17:48:54Z

/retest

jluhrsen · 2024-05-03T17:50:15Z

@neisw, good with you now? Here is the test log file. It didn't hit any errors in the
Get() that checks the namespace, but at least it validates that the changes are not so bad
that they break the tests :)

neisw · 2024-05-03T23:00:15Z

/lgtm

openshift-ci · 2024-05-03T23:00:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jluhrsen, neisw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [neisw]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2024-05-04T03:15:03Z

@jluhrsen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-single-node-upgrade	`4979bad`	link	false	`/test e2e-aws-ovn-single-node-upgrade`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci-robot · 2024-05-04T03:19:37Z

@jluhrsen: Jira Issue OCPBUGS-31868: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-31868 has been moved to the MODIFIED state.

openshift-bot · 2024-05-04T08:58:39Z

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-enterprise-tests-container-v4.17.0-202405040320.p0.gab28660.assembly.stream.el9 for distgit openshift-enterprise-tests.
All builds following this will include this PR.

openshift-merge-robot · 2024-05-05T04:46:24Z

Fix included in accepted release 4.16.0-0.nightly-2024-05-04-214435

openshift-ci-robot added the jira/valid-reference and jira/valid-bug labels May 1, 2024

openshift-ci bot requested review from p0lyn0mial and soltysh May 1, 2024 00:45

jluhrsen force-pushed the OCPBUGS-31868 branch from a14d10b to 9f79014 Compare May 1, 2024 00:57

neisw reviewed May 1, 2024

jluhrsen force-pushed the OCPBUGS-31868 branch from 9f79014 to e07b8e3 Compare May 1, 2024 22:00

jluhrsen commented May 1, 2024

neisw reviewed May 1, 2024

OCPBUGS-31868: allow for some errors checking namespace delete

4979bad

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>

jluhrsen force-pushed the OCPBUGS-31868 branch from e07b8e3 to 4979bad Compare May 2, 2024 17:10

jluhrsen mentioned this pull request May 3, 2024

trt-1538: Wait for monitor resources cleanup #28760

Closed

openshift-ci bot assigned neisw May 3, 2024

openshift-ci bot added the lgtm label May 3, 2024

openshift-ci bot added the approved label May 3, 2024

openshift-merge-bot bot merged commit ab28660 into openshift:master May 4, 2024
22 of 23 checks passed
