
Avoid race condition during importer termination #3116

Merged: 2 commits into kubevirt:main on Mar 12, 2024

Conversation

alromeros (Collaborator)

What this PR does / why we need it:

This pull request sets the exit code to 0 when scratch space is required, so that deleting the importer pod after it exits for scratchSpaceNeeded no longer triggers unwanted automatic restarts (RestartPolicyOnFailure). This conflicting behavior caused issues during termination while the pod was already restarting after scratchSpaceNeeded.

Importer pod lifecycle for a regular DV before the fix:

# k get pod importer-cirros-dv-source -woyaml | grep exitCode
selecting docker as container runtime
        exitCode: 42
        exitCode: 42
        exitCode: 42
        exitCode: 42
        exitCode: 2
        exitCode: 42
        exitCode: 2
        exitCode: 42
        exitCode: 2
        exitCode: 0
        exitCode: 0

After the fix:

# k get pod importer-cirros-dv-source -woyaml | grep exitCode
selecting docker as container runtime
        exitCode: 0
        exitCode: 0
        exitCode: 0
        exitCode: 0
        exitCode: 0
        exitCode: 0
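
For context, a minimal sketch of the termination path described above, assuming illustrative names (errRequiresScratchSpace, writeTerminationMessage, runImport are not CDI's real identifiers): on the scratch-space case the importer writes its termination message and exits 0, so RestartPolicyOnFailure never kicks in while the controller deletes the pod.

package main

import (
	"errors"
	"os"
)

// Illustrative names for this sketch, not CDI's real identifiers.
var errRequiresScratchSpace = errors.New("scratch space required and none found")

func writeTerminationMessage(msg string) {
	// The kubelet surfaces this file's content as the container's termination message.
	_ = os.WriteFile("/dev/termination-log", []byte(msg), 0644)
}

func runImport() error {
	// Stand-in for the real import flow; here it always hits the scratch-space case.
	return errRequiresScratchSpace
}

func main() {
	if err := runImport(); errors.Is(err, errRequiresScratchSpace) {
		writeTerminationMessage("scratch space required")
		// Exit 0 instead of a dedicated non-zero exit code: the pod ends up Succeeded,
		// so deleting and recreating it with scratch space does not race an automatic
		// RestartPolicyOnFailure restart.
		os.Exit(0)
	} else if err != nil {
		writeTerminationMessage(err.Error())
		os.Exit(1)
	}
}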

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2240963 (follow-up)

Special notes for your reviewer:

I think this is a better solution than #3044 and #3060. Let me know what you think.

Release note:

Bugfix: Avoid race condition during importer termination

@kubevirt-bot added the release-note (Denotes a PR that will be considered when it comes time to generate release notes.) and dco-signoff: yes (Indicates the PR's author has DCO signed all their commits.) labels on Feb 29, 2024
@alromeros (Collaborator, Author)

This is the third PR trying to accomplish this, but I think it's the best solution yet. #3060 led to nasty behavior when the pod failed with errors other than scratch space required, and the 0 grace period in #3044 wasn't behaving well, leaving containers stuck on the ceph lane.

@alromeros (Collaborator, Author)

/cc @akalenyu
/cc @awels

@akalenyu (Collaborator) left a comment

I think this is a really good idea!

You are saying the way we have it today (non-zero) is plain wrong,
as up until now we were restarting indefinitely over the scratch exit code?

@akalenyu (Collaborator)

/test pull-containerized-data-importer-e2e-nfs

@akalenyu (Collaborator)

BTW - there is work to standardize the communication between importer & controllers at #3103

@alromeros (Collaborator, Author)

I think this is a really good idea!

You are saying the way we have it today (non-zero) is plain wrong, as up until now we were restarting indefinitely over the scratch exit code?

Yeah, when exiting with non-zero the pod keeps restarting indefinitely, even when we just want to manually delete and recreate it. It isn't really an issue since it resolves itself automatically, but the deletion/creation is definitely racy.
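
A minimal sketch of the restart policy at play, assuming the importer pod is built roughly like this (the container name and image here are placeholders): with RestartPolicyOnFailure the kubelet restarts a container that exits non-zero, which is what races against the manual delete described above.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func importerPodSketch() *corev1.Pod {
	return &corev1.Pod{
		Spec: corev1.PodSpec{
			// OnFailure restarts the container (with backoff) only when it exits non-zero.
			// A container that exits 0 stays terminated and the pod ends up Succeeded,
			// so a manual delete/recreate no longer races an automatic restart.
			RestartPolicy: corev1.RestartPolicyOnFailure,
			Containers: []corev1.Container{{
				Name:  "importer",
				Image: "importer-image", // placeholder image
			}},
		},
	}
}

func main() {
	fmt.Println(importerPodSketch().Spec.RestartPolicy)
}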

@akalenyu (Collaborator) left a comment

Just questions

/approve
/cc @mhenriks

pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode > 0 {
	log.Info("Pod termination code", "pod.Name", pod.Name, "ExitCode", pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode)
	if pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode == common.ScratchSpaceNeededExitCode {
pod.Status.ContainerStatuses[0].State.Terminated != nil {
Collaborator

@0xFelix I am pretty sure there is no collision between this and 8462345, just making sure

pod.Status.ContainerStatuses[0].LastTerminationState.Terminated != nil &&
	pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode > 0 {
	log.Info("Pod termination code", "pod.Name", pod.Name, "ExitCode", pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode)
	if pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode == common.ScratchSpaceNeededExitCode {
Collaborator

So LastTerminationState is not being set because requiring scratch now results in a "successful" pod?

Collaborator (Author)

Right

State: v1.ContainerState{
	Terminated: &corev1.ContainerStateTerminated{
		ExitCode: 1,
		Message:  "I went poof",
Collaborator

Why drop the message and reason? If I follow the code correctly they'll still get set

Collaborator (Author)

Since the pod doesn't error with scratch space needed, the LastTerminationState field won't be populated, just the Terminated field. AFAIK this field only contains a single terminated state, so both the bound and running conditions would be fetched from the scratch space termination state.
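
A hedged sketch of the controller-side check this implies, not the exact CDI code: with exit code 0 the scratch case shows up in State.Terminated rather than LastTerminationState, and is recognized via the termination message. scratchSpaceNeededMessage below is an assumed placeholder for the constant in CDI's common package.

package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// Assumed placeholder for whatever message the importer reports; the real constant
// lives in CDI's common package and may differ.
const scratchSpaceNeededMessage = "scratch space required"

func podRequiresScratch(pod *corev1.Pod) bool {
	if len(pod.Status.ContainerStatuses) == 0 {
		return false
	}
	term := pod.Status.ContainerStatuses[0].State.Terminated
	// With the fix the scratch case ends with exit code 0, so it is found in
	// State.Terminated; LastTerminationState is only populated after a restart.
	return term != nil && term.ExitCode == 0 && strings.Contains(term.Message, scratchSpaceNeededMessage)
}

func main() {
	fmt.Println(podRequiresScratch(&corev1.Pod{}))
}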

@kubevirt-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akalenyu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label on Mar 6, 2024
@alromeros (Collaborator, Author)

/hold
Testing locally: this works fine with rook-ceph-block, but there is still an error code in the importer when using a local storage class. It might be expected, but I'll make sure first.

@kubevirt-bot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label on Mar 7, 2024
@0xFelix (Member) left a comment

Actually, I think there is a collision between this and #3101. If the importer wants to signal that scratch space is required, IMO it would be cleaner to do that with the proposed communication struct instead of matching the termination message string.

@0xFelix (Member)

0xFelix commented Mar 7, 2024

During my work on #3101 and #3103 I also noticed that exit codes other than 0 do not work well in Kubernetes.

…ode when scratch space is required

The restart policy on failure along with manual pod deletion caused some issues after the importer exited with scratch space needed.

This commit sets the exit code to 0 when exiting for scratch space required, so we can manually delete the pod and avoid the described race condition.

Signed-off-by: Alvaro Romero <alromero@redhat.com>
@alromeros (Collaborator, Author)

/unhold

@akalenyu Fixed the issue commented above: returning 0 through the normal return path clashed with cleanup functions that assume the imported file will be there during regular termination. Calling os.Exit(0) instead avoids the issue.
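
For reference, the Go behavior being relied on here: os.Exit terminates the process immediately without running deferred functions, so deferred cleanup that expects the imported file is simply skipped. A tiny self-contained illustration:

package main

import (
	"fmt"
	"os"
)

func main() {
	// This deferred "cleanup" stands in for the functions that assume the imported
	// file exists; it never runs, because os.Exit terminates the process before any
	// deferred calls. A plain return from main would have run it.
	defer fmt.Println("cleanup that expects the imported file")

	fmt.Println("scratch space required, exiting 0")
	os.Exit(0)
}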

@kubevirt-bot removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label on Mar 7, 2024
@akalenyu (Collaborator)

akalenyu commented Mar 7, 2024

/unhold

@akalenyu Fixed the issue commented above: returning 0 through the normal return path clashed with cleanup functions that assume the imported file will be there during regular termination. Calling os.Exit(0) instead avoids the issue.

Ah, I see... how come this didn't fail any of the e2e tests?

@alromeros (Collaborator, Author)

/unhold
@akalenyu Fixed the issue commented above: returning 0 through the normal return path clashed with cleanup functions that assume the imported file will be there during regular termination. Calling os.Exit(0) instead avoids the issue.

Ah, I see... how come this didn't fail any of the e2e tests?

Since the failing function runs in a defer, the scratch space required termination message was still written. After a couple of restarts, the pod with scratch space was created and the import succeeded anyway.

@alromeros (Collaborator, Author)

/retest-required

@0xFelix (Member)

0xFelix commented Mar 8, 2024

Is there a chance for this to be part of #3101? IMO both this PR and #3101 try to solve very similar issues.

@alromeros (Collaborator, Author)

Is there a chance for this to be part of #3101? IMO both this PR and #3101 try to solve very similar issues.

Hey @0xFelix, yeah, it makes sense to eventually integrate this into #3101, probably using a specific field (a ScratchSpaceRequired boolean, maybe?) to avoid parsing the message in the controller. That said, I would prefer to prioritize merging this first, since it is part of a bugfix we plan to backport to v1.57 and it'd be safer to avoid backporting a larger feature. Thanks!
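
A rough sketch of what such a field could look like inside a structured termination message, purely illustrative: the struct and field names are assumptions, not the actual #3101/#3103 proposal.

package main

import (
	"encoding/json"
	"os"
)

// terminationInfo sketches a structured importer-to-controller message; the name and
// fields are assumptions for illustration only.
type terminationInfo struct {
	ScratchSpaceRequired bool   `json:"scratchSpaceRequired"`
	Message              string `json:"message,omitempty"`
}

func writeTermination(info terminationInfo) error {
	data, err := json.Marshal(info)
	if err != nil {
		return err
	}
	// The kubelet surfaces this file as the container's termination message; the
	// controller could unmarshal it instead of matching a free-form string.
	return os.WriteFile("/dev/termination-log", data, 0644)
}

func main() {
	_ = writeTermination(terminationInfo{ScratchSpaceRequired: true})
}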

@0xFelix (Member)

0xFelix commented Mar 8, 2024

Alright, makes sense if you want to backport this.

@alromeros (Collaborator, Author)

Added a new commit (3db5b72) to address a flake caused by the new behavior:

The test relied on the assumption that deleting the image from the HTTP server would always cause the DV to restart at least once. However, with the importer's new faster recovery time, a race condition happened where the file was deleted while the download had already started, which just caused the polling to keep retrying without failing. Since we recreate the file fast enough, there was no time for the DV to error, so no restart was needed.

This test could also rely on false positives, since the importer pod failing for scratch space always caused the DV to restart.

To fix this I deleted the part of the test that checks for DV restarts to be >= 1, and to make sure everything is working as expected I added an md5sum check at the end.
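
A generic sketch of the kind of md5 check added at the end of the test; this is not the CDI test framework's actual helper (whose name isn't shown in this thread), and the path below is a placeholder. In the e2e suite the hash would be computed against the target PVC's disk image.

package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileMD5 streams a file through an md5 hash and returns the hex digest.
func fileMD5(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	sum, err := fileMD5("/pvc/disk.img") // placeholder path for the imported image
	if err != nil {
		panic(err)
	}
	fmt.Println("imported image md5:", sum)
}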

Test [test_id:1990] relied on the assumption that deleting the file from an HTTP server would always cause the DV to restart.

The old scratch space required mechanism always caused restarts on the DV, masking some false positives: this doesn't happen in all cases, since the polling from the server can keep retrying without failing if the file is restored fast enough.

This commit adapts the test to work with faster importer recoveries and adds an md5sum check to make sure the import ends up being successful despite removing the file.

Signed-off-by: Alvaro Romero <alromero@redhat.com>
@akalenyu (Collaborator)

Added a new commit (3db5b72) to address a flake caused by the new behavior:

The test relied on the assumption that deleting the image from the HTTP server would always cause the DV to restart at least once. However, with the importer's new faster recovery time, a race condition happened where the file was deleted while the download had already started, which just caused the polling to keep retrying without failing. Since we recreate the file fast enough, there was no time for the DV to error, so no restart was needed.

This test could also rely on false positives, since the importer pod failing for scratch space always caused the DV to restart.

To fix this I deleted the part of the test that checks for DV restarts to be >= 1, and to make sure everything is working as expected I added an md5sum check at the end.

Makes sense
/test pull-containerized-data-importer-e2e-nfs

@akalenyu (Collaborator)

akalenyu commented Mar 10, 2024

/test pull-containerized-data-importer-fossa
weird one, fossa yells about github.com/gorhill/cronexpr which has been around for ages
EDIT:
#3127

@akalenyu (Collaborator)

/test pull-containerized-data-importer-e2e-nfs

1 similar comment
@akalenyu (Collaborator)

/test pull-containerized-data-importer-e2e-nfs

@akalenyu (Collaborator)

/test pull-containerized-data-importer-fossa

@mhenriks (Member)

/lgtm

@kubevirt-bot added the lgtm (Indicates that a PR is ready to be merged.) label on Mar 12, 2024
@alromeros (Collaborator, Author)

/test pull-containerized-data-importer-e2e-hpp-latest

@kubevirt-bot merged commit 7fbe1c3 into kubevirt:main on Mar 12, 2024
18 checks passed
@alromeros (Collaborator, Author)

There have been some conversations about the possibility of not backporting this. I'll leave it like this until we make a decision.

@akalenyu (Collaborator)

There have been some conversations about the possibility of not backporting this. I'll leave it like this until we make a decision.

Yeah, not a fan of backporting this; it's a subtle change, so unless someone is desperately asking for a backport I wouldn't do it.
