
Avoid race condition during importer termination #3116

Merged: 2 commits into kubevirt:main on Mar 12, 2024

Conversation

alromeros (Collaborator)

What this PR does / why we need it:

This pull request sets the exit code to 0 when scratch space is required, so that deleting the importer pod after it exits for scratchSpaceNeeded no longer triggers unwanted automatic restarts (RestartPolicyOnFailure). This conflicting behavior caused issues during termination while the pod was already restarting after scratchSpaceNeeded.

Importer pod lifecycle for a regular DV before the fix:

# k get pod importer-cirros-dv-source -woyaml | grep exitCode
selecting docker as container runtime
        exitCode: 42
        exitCode: 42
        exitCode: 42
        exitCode: 42
        exitCode: 2
        exitCode: 42
        exitCode: 2
        exitCode: 42
        exitCode: 2
        exitCode: 0
        exitCode: 0

After the fix:

# k get pod importer-cirros-dv-source -woyaml | grep exitCode
selecting docker as container runtime
        exitCode: 0
        exitCode: 0
        exitCode: 0
        exitCode: 0
        exitCode: 0
        exitCode: 0
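
For context, a minimal sketch of the termination path described above, assuming illustrative names (errRequiresScratchSpace, writeTerminationMessage, runImport are not CDI's real identifiers): on the scratch-space case the importer writes its termination message and exits 0, so RestartPolicyOnFailure never kicks in while the controller deletes the pod.

package main

import (
	"errors"
	"os"
)

// Illustrative names for this sketch, not CDI's real identifiers.
var errRequiresScratchSpace = errors.New("scratch space required and none found")

func writeTerminationMessage(msg string) {
	// The kubelet surfaces this file's content as the container's termination message.
	_ = os.WriteFile("/dev/termination-log", []byte(msg), 0644)
}

func runImport() error {
	// Stand-in for the real import flow; here it always hits the scratch-space case.
	return errRequiresScratchSpace
}

func main() {
	if err := runImport(); errors.Is(err, errRequiresScratchSpace) {
		writeTerminationMessage("scratch space required")
		// Exit 0 instead of a dedicated non-zero exit code: the pod ends up Succeeded,
		// so deleting and recreating it with scratch space does not race an automatic
		// RestartPolicyOnFailure restart.
		os.Exit(0)
	} else if err != nil {
		writeTerminationMessage(err.Error())
		os.Exit(1)
	}
}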

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2240963 (follow-up)

Special notes for your reviewer:

I think this is a better solution than #3044 and #3060. Let me know what you think.

Release note:

Bugfix: Avoid race condition during importer termination

@kubevirt-bot added the release-note (Denotes a PR that will be considered when it comes time to generate release notes.) and dco-signoff: yes (Indicates the PR's author has DCO signed all their commits.) labels on Feb 29, 2024
@alromeros (Collaborator, Author)

This is the third PR trying to accomplish this, but I think it's the best solution yet. #3060 led to nasty behavior when the pod failed with errors other than scratch space required, and the 0 grace period in #3044 wasn't behaving well, leaving containers stuck on the ceph lane.

@alromeros (Collaborator, Author)

/cc @akalenyu
/cc @awels

@akalenyu (Collaborator) left a comment

I think this is a really good idea!

You are saying the way we have it today (non-zero) is plain wrong,
as up until now we were restarting indefinitely over the scratch exit code?

@akalenyu (Collaborator)

/test pull-containerized-data-importer-e2e-nfs

@akalenyu (Collaborator)

BTW - there is work to standardize the communication between importer & controllers at #3103

@alromeros (Collaborator, Author)

I think this is a really good idea!

You are saying the way we have it today (non-zero) is plain wrong, as up until now we were restarting indefinitely over the scratch exit code?

Yeah, when exiting with non-zero the pod keeps restarting indefinitely, even when we just want to manually delete and recreate it. It isn't really an issue since it resolves itself automatically, but the deletion/creation is definitely racy.
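
A minimal sketch of the restart policy at play, assuming the importer pod is built roughly like this (the container name and image here are placeholders): with RestartPolicyOnFailure the kubelet restarts a container that exits non-zero, which is what races against the manual delete described above.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func importerPodSketch() *corev1.Pod {
	return &corev1.Pod{
		Spec: corev1.PodSpec{
			// OnFailure restarts the container (with backoff) only when it exits non-zero.
			// A container that exits 0 stays terminated and the pod ends up Succeeded,
			// so a manual delete/recreate no longer races an automatic restart.
			RestartPolicy: corev1.RestartPolicyOnFailure,
			Containers: []corev1.Container{{
				Name:  "importer",
				Image: "importer-image", // placeholder image
			}},
		},
	}
}

func main() {
	fmt.Println(importerPodSketch().Spec.RestartPolicy)
}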

@akalenyu (Collaborator) left a comment

Just questions

/approve
/cc @mhenriks

pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode > 0 {
	log.Info("Pod termination code", "pod.Name", pod.Name, "ExitCode", pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode)
	if pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode == common.ScratchSpaceNeededExitCode {
pod.Status.ContainerStatuses[0].State.Terminated != nil {
Collaborator

@0xFelix I am pretty sure there is no collision between this and 8462345, just making sure

pod.Status.ContainerStatuses[0].LastTerminationState.Terminated != nil &&
	pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode > 0 {
	log.Info("Pod termination code", "pod.Name", pod.Name, "ExitCode", pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode)
	if pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode == common.ScratchSpaceNeededExitCode {
Collaborator

So LastTerminationState is not being set because requiring scratch now results in a "successful" pod?

Collaborator (Author)

Right

State: v1.ContainerState{
	Terminated: &corev1.ContainerStateTerminated{
		ExitCode: 1,
		Message:  "I went poof",
Collaborator

Why drop the message and reason? If I follow the code correctly they'll still get set

Collaborator (Author)

Since the pod doesn't error with scratch space needed, the LastTerminationState field won't be populated, just the Terminated field. AFAIK this field only contains a single terminated state, so both the bound and running conditions would be fetched from the scratch space termination state.
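
A hedged sketch of the controller-side check this implies, not the exact CDI code: with exit code 0 the scratch case shows up in State.Terminated rather than LastTerminationState, and is recognized via the termination message. scratchSpaceNeededMessage below is an assumed placeholder for the constant in CDI's common package.

package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// Assumed placeholder for whatever message the importer reports; the real constant
// lives in CDI's common package and may differ.
const scratchSpaceNeededMessage = "scratch space required"

func podRequiresScratch(pod *corev1.Pod) bool {
	if len(pod.Status.ContainerStatuses) == 0 {
		return false
	}
	term := pod.Status.ContainerStatuses[0].State.Terminated
	// With the fix the scratch case ends with exit code 0, so it is found in
	// State.Terminated; LastTerminationState is only populated after a restart.
	return term != nil && term.ExitCode == 0 && strings.Contains(term.Message, scratchSpaceNeededMessage)
}

func main() {
	fmt.Println(podRequiresScratch(&corev1.Pod{}))
}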

@kubevirt-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akalenyu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label on Mar 6, 2024
@alromeros (Collaborator, Author)

/hold
Testing locally: this works fine with rook-ceph-block, but there is still an error code in the importer when using a local storage class. It might be expected, but I'll make sure first.

@kubevirt-bot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label on Mar 7, 2024
@0xFelix (Member) left a comment

Actually, I think there is a collision between this and #3101. If the importer wants to signal that scratch space is required, IMO it would be cleaner to do that with the proposed communication struct instead of matching the termination message string.

@0xFelix (Member)

0xFelix commented Mar 7, 2024

During my work on #3101 and #3103 I also noticed that exit codes other than 0 do not work well in Kubernetes.

…ode when scratch space is required

The restart policy on failure along with manual pod deletion caused some issues after the importer exited with scratch space needed.

This commit sets the exit code to 0 when exiting for scratch space required, so we can manually delete the pod and avoid the described race condition.

Signed-off-by: Alvaro Romero <alromero@redhat.com>
@alromeros (Collaborator, Author)

/unhold

@akalenyu Fixed the issue commented above: returning 0 through the normal return path clashed with cleanup functions that assume the imported file will be there during regular termination. Calling os.Exit(0) instead avoids the issue.
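
For reference, the Go behavior being relied on here: os.Exit terminates the process immediately without running deferred functions, so deferred cleanup that expects the imported file is simply skipped. A tiny self-contained illustration:

package main

import (
	"fmt"
	"os"
)

func main() {
	// This deferred "cleanup" stands in for the functions that assume the imported
	// file exists; it never runs, because os.Exit terminates the process before any
	// deferred calls. A plain return from main would have run it.
	defer fmt.Println("cleanup that expects the imported file")

	fmt.Println("scratch space required, exiting 0")
	os.Exit(0)
}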

@kubevirt-bot removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label on Mar 7, 2024
@akalenyu (Collaborator)

akalenyu commented Mar 7, 2024

/unhold

@akalenyu Fixed the issue commented above: returning 0 through the normal return path clashed with cleanup functions that assume the imported file will be there during regular termination. Calling os.Exit(0) instead avoids the issue.

Ah, I see... how come this didn't fail any of the e2e tests?

@alromeros (Collaborator, Author)

/unhold
@akalenyu Fixed the issue commented above: returning 0 through the normal return path clashed with cleanup functions that assume the imported file will be there during regular termination. Calling os.Exit(0) instead avoids the issue.

Ah, I see... how come this didn't fail any of the e2e tests?

Since the failing function runs in a defer, the scratch space required termination message was still written. After a couple of restarts, the pod with scratch space was created and the import succeeded anyway.

@alromeros (Collaborator, Author)

/retest-required

@0xFelix (Member)

0xFelix commented Mar 8, 2024

Is there a chance for this to be part of #3101? IMO both this PR and #3101 try to solve very similar issues.

@alromeros (Collaborator, Author)

Is there a chance for this to be part of #3101? IMO both this PR and #3101 try to solve very similar issues.

Hey @0xFelix, yeah, it makes sense to eventually integrate this into #3101, probably using a specific field (a ScratchSpaceRequired boolean, maybe?) to avoid parsing the message in the controller. That said, I would prefer to prioritize merging this first, since it is part of a bugfix we plan to backport to v1.57 and it'd be safer to avoid backporting a larger feature. Thanks!
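
A rough sketch of what such a field could look like inside a structured termination message, purely illustrative: the struct and field names are assumptions, not the actual #3101/#3103 proposal.

package main

import (
	"encoding/json"
	"os"
)

// terminationInfo sketches a structured importer-to-controller message; the name and
// fields are assumptions for illustration only.
type terminationInfo struct {
	ScratchSpaceRequired bool   `json:"scratchSpaceRequired"`
	Message              string `json:"message,omitempty"`
}

func writeTermination(info terminationInfo) error {
	data, err := json.Marshal(info)
	if err != nil {
		return err
	}
	// The kubelet surfaces this file as the container's termination message; the
	// controller could unmarshal it instead of matching a free-form string.
	return os.WriteFile("/dev/termination-log", data, 0644)
}

func main() {
	_ = writeTermination(terminationInfo{ScratchSpaceRequired: true})
}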

@0xFelix (Member)

0xFelix commented Mar 8, 2024

Alright, makes sense if you want to backport this.

@alromeros (Collaborator, Author)

Added a new commit (3db5b72) to address a flake caused by the new behavior:

The test relied on the assumption that deleting the image from the HTTP server would always cause the DV to restart at least once. However, with the importer's new faster recovery time, a race condition happened where the file was deleted while the download had already started, which just caused the polling to keep retrying without failing. Since we recreate the file fast enough, there was no time for the DV to error, so no restart was needed.

This test could also rely on false positives, since the importer pod failing for scratch space always caused the DV to restart.

To fix this I deleted the part of the test that checks for DV restarts to be >= 1, and to make sure everything is working as expected I added an md5sum check at the end.
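
A generic sketch of the kind of md5 check added at the end of the test; this is not the CDI test framework's actual helper (whose name isn't shown in this thread), and the path below is a placeholder. In the e2e suite the hash would be computed against the target PVC's disk image.

package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileMD5 streams a file through an md5 hash and returns the hex digest.
func fileMD5(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	sum, err := fileMD5("/pvc/disk.img") // placeholder path for the imported image
	if err != nil {
		panic(err)
	}
	fmt.Println("imported image md5:", sum)
}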

Test [test_id:1990] relied on the assumption that deleting the file from an HTTP server would always cause the DV to restart.

The old scratch space required mechanism always caused restarts on the DV, masking some false positives: this doesn't happen in all cases, since the polling from the server can keep retrying without failing if the file is restored fast enough.

This commit adapts the test to work with faster importer recoveries and adds an md5sum check to make sure the import ends up being successful despite removing the file.

Signed-off-by: Alvaro Romero <alromero@redhat.com>
@akalenyu (Collaborator)

Added a new commit (3db5b72) to address a flake caused by the new behavior:

The test relied on the assumption that deleting the image from the HTTP server would always cause the DV to restart at least once. However, with the importer's new faster recovery time, a race condition happened where the file was deleted while the download had already started, which just caused the polling to keep retrying without failing. Since we recreate the file fast enough, there was no time for the DV to error, so no restart was needed.

This test could also rely on false positives, since the importer pod failing for scratch space always caused the DV to restart.

To fix this I deleted the part of the test that checks for DV restarts to be >= 1, and to make sure everything is working as expected I added an md5sum check at the end.

Makes sense
/test pull-containerized-data-importer-e2e-nfs

@akalenyu (Collaborator)

akalenyu commented Mar 10, 2024

/test pull-containerized-data-importer-fossa
weird one, fossa yells about github.com/gorhill/cronexpr which has been around for ages
EDIT:
#3127

@akalenyu (Collaborator)

/test pull-containerized-data-importer-e2e-nfs

1 similar comment
@akalenyu (Collaborator)

/test pull-containerized-data-importer-e2e-nfs

@akalenyu (Collaborator)

/test pull-containerized-data-importer-fossa

@mhenriks (Member)

/lgtm

@kubevirt-bot added the lgtm (Indicates that a PR is ready to be merged.) label on Mar 12, 2024
@alromeros (Collaborator, Author)

/test pull-containerized-data-importer-e2e-hpp-latest

@kubevirt-bot merged commit 7fbe1c3 into kubevirt:main on Mar 12, 2024
18 checks passed
@alromeros (Collaborator, Author)

There have been some conversations about the possibility of not backporting this. I'll leave it like this until we make a decision.

@akalenyu (Collaborator)

There have been some conversations about the possibility of not backporting this. I'll leave it like this until we make a decision.

Yeah, not a fan of backporting this; it's a subtle change, so unless someone is desperately asking for a backport I wouldn't do it.
