Avoid race condition during importer termination #3116
Conversation
64641fb to a2bc9a5 (Compare)
I think this is a really good idea!
You are saying the way we have it today (non-zero) is plain wrong, as up until now we were restarting indefinitely over the scratch exit code?
/test pull-containerized-data-importer-e2e-nfs
BTW - there is work to standardize the communication between importer & controllers at #3103
Yeah, when exiting with non-zero the pod keeps restarting indefinitely even when we just want to manually delete and recreate it. It isn't really an issue as it solves itself automatically, but the deletion/creation is definitely racy.
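To make the restart loop being discussed concrete, here is a minimal sketch, assuming the standard corev1 types, of where that exit code surfaces on the pod object; `lastImporterExitCode` is a made-up helper name, not CDI code:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// lastImporterExitCode is a hypothetical helper showing where the exit code
// discussed above lives. With restartPolicy: OnFailure, kubelet restarts the
// container on any non-zero exit code, and the previous attempt's code is
// reported under LastTerminationState rather than State.
func lastImporterExitCode(pod *corev1.Pod) (int32, bool) {
	if len(pod.Status.ContainerStatuses) == 0 {
		return 0, false
	}
	term := pod.Status.ContainerStatuses[0].LastTerminationState.Terminated
	if term == nil {
		return 0, false
	}
	return term.ExitCode, true
}
```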
Just questions
/approve
/cc @mhenriks
pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode > 0 {
	log.Info("Pod termination code", "pod.Name", pod.Name, "ExitCode", pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode)
	if pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode == common.ScratchSpaceNeededExitCode {
pod.Status.ContainerStatuses[0].State.Terminated != nil {
pod.Status.ContainerStatuses[0].LastTerminationState.Terminated != nil &&
	pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode > 0 {
	log.Info("Pod termination code", "pod.Name", pod.Name, "ExitCode", pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode)
	if pod.Status.ContainerStatuses[0].LastTerminationState.Terminated.ExitCode == common.ScratchSpaceNeededExitCode {
So LastTerminationState is not being set because requiring scratch now results in a "successful" pod?
Right
State: v1.ContainerState{
	Terminated: &corev1.ContainerStateTerminated{
		ExitCode: 1,
		Message: "I went poof",
Why drop the message and reason? If I follow the code correctly they'll still get set
Since the pod doesn't error with scratch space needed, the LastTerminationState field won't be populated, just the Terminated field. AFAIK this field only contains a single terminated state, so both bound and running conditions would be fetched from the scratch space termination state.
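For readers following along, a minimal sketch of the check this implies; `scratchSpaceNeededMessage` and `scratchSpaceRequested` are invented names, not the actual CDI identifiers. The point is that the controller now inspects State.Terminated on the current (successful) attempt instead of LastTerminationState:

```go
package sketch

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// Assumed marker; the real string/constant CDI matches on may differ.
const scratchSpaceNeededMessage = "scratch space required"

// scratchSpaceRequested sketches the check implied by this change: a
// scratch-space-needed run now ends with exit code 0, so the controller looks
// at State.Terminated on the current attempt (not LastTerminationState) and
// matches the termination message instead of a non-zero exit code.
func scratchSpaceRequested(pod *corev1.Pod) bool {
	if len(pod.Status.ContainerStatuses) == 0 {
		return false
	}
	term := pod.Status.ContainerStatuses[0].State.Terminated
	return term != nil && term.ExitCode == 0 &&
		strings.Contains(term.Message, scratchSpaceNeededMessage)
}
```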
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: akalenyu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/hold
Actually, I think there is a collision between this and #3101. If the importer wants to signal that scratch space is required, IMO it would be cleaner to do that with the proposed communication struct instead of matching the termination message string.
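For illustration, a rough sketch of what such a structured handoff could look like; the struct, its fields, and the writer function are hypothetical and not the shape actually proposed in #3101/#3103:

```go
package sketch

import (
	"encoding/json"
	"os"
)

// importerTermination is a hypothetical structured importer-to-controller
// handoff; the field names are invented for illustration.
type importerTermination struct {
	ScratchSpaceRequired bool   `json:"scratchSpaceRequired,omitempty"`
	Message              string `json:"message,omitempty"`
}

// writeTerminationMessage marshals the struct into the container's
// termination-log file so the controller can unmarshal it instead of
// matching a raw string.
func writeTerminationMessage(path string, t importerTermination) error {
	data, err := json.Marshal(t)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0644)
}
```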
…ode when scratch space is required
The restart policy on failure along with manual pod deletion caused some issues after the importer exited with scratch space needed. This commit sets the exit code to 0 when exiting for scratch space required so we manually delete the pod and avoid the described race condition.
Signed-off-by: Alvaro Romero <alromero@redhat.com>
a2bc9a5 to d8fa3d1 (Compare)
/unhold
@akalenyu Fixed the issue commented above: returning 0 as in a normal termination clashed with cleanup functions that assume the imported file will be there during regular termination. Using
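A minimal sketch of the exit-path split being described, assuming invented names (`exitImporter`, `cleanupAfterSuccessfulImport`) and a plain string marker rather than whatever the importer actually writes:

```go
package main

import "os"

// cleanupAfterSuccessfulImport stands in for the cleanup mentioned above;
// in the real importer it assumes the imported file is already in place.
func cleanupAfterSuccessfulImport() {}

// exitImporter sketches the split: a scratch-space-needed exit writes its
// marker and returns 0 without running the regular-success cleanup, so
// RestartPolicyOnFailure has nothing to retry.
func exitImporter(scratchSpaceNeeded bool) {
	if scratchSpaceNeeded {
		_ = os.WriteFile("/dev/termination-log", []byte("scratch space required"), 0644)
		os.Exit(0)
	}
	cleanupAfterSuccessfulImport()
	os.Exit(0)
}

func main() {
	exitImporter(false)
}
```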
Ah, I see... how come this didn't fail any of the e2e tests?
Since the failing function is in
/retest-required
Hey @0xFelix, yeah it makes sense to eventually integrate this into #3101, probably using a specific field (
Alright, makes sense if you want to backport this.
Added a new commit (3db5b72) to address a flake caused by the new behavior: the test relied on the assumption that deleting the image from the http server would always cause the DV to restart at least once. However, with the new faster recovery time in the importer, a race condition happened where the file was deleted with the download already started, which just caused the polling to keep retrying without failing. Since we recreate the file fast enough, there was no time for the DV to error, so no restart was needed. This test could also rely on false positives, since the importer pod failing for scratch space always caused the DV to restart. To fix this I deleted the part of the test that checks for DV restarts to be >= 1, and to make sure that everything is working as expected I added an md5sum check at the end.
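As a rough illustration of that checksum check (not the actual test code; the function name, path, and expected sum are placeholders):

```go
package sketch

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// verifyImportedImage is a rough sketch of the kind of md5 check added to
// the test; the path and expected sum would come from the test fixtures.
func verifyImportedImage(path, expectedMD5 string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	if sum := hex.EncodeToString(h.Sum(nil)); sum != expectedMD5 {
		return fmt.Errorf("md5 mismatch: got %s, want %s", sum, expectedMD5)
	}
	return nil
}
```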
Test [test_id:1990] relied on the assumption that deleting the file from an http server would always cause the DV to restart. The old scratch space required mechanism always caused restarts on the DV, masking some false positives: this doesn't happen in all cases, since the polling from the server can keep retrying without failing if the file is restored fast enough.
This commit adapts the test to work with faster importer recoveries and adds an md5sum check to make sure the import ends up being successful despite removing the file.
Signed-off-by: Alvaro Romero <alromero@redhat.com>
3db5b72 to aa7cade (Compare)
Makes sense
/test pull-containerized-data-importer-fossa
/test pull-containerized-data-importer-e2e-nfs
/test pull-containerized-data-importer-e2e-nfs
/test pull-containerized-data-importer-fossa
/lgtm
/test pull-containerized-data-importer-e2e-hpp-latest
There have been some conversations around the possibility of not backporting this. I'll leave it like this until we make a decision.
Yeah, not a fan of backporting this; it's a subtle change, so unless someone is desperately asking for a backport I wouldn't do it.
What this PR does / why we need it:
This pull request sets the exit code when scratch space is required to 0 so that, when deleting the importer pod after it exits for scratchSpaceNeeded, we don't have to deal with unwanted automatic restarts (RestartPolicyOnFailure). That conflicting behavior caused some issues during termination while the pod was already restarting after scratchSpaceNeeded.
Importer pod lifecycle for a regular DV before the fix:
# k get pod importer-cirros-dv-source -woyaml | grep exitCode
selecting docker as container runtime
    exitCode: 42
    exitCode: 42
    exitCode: 42
    exitCode: 42
    exitCode: 2
    exitCode: 42
    exitCode: 2
    exitCode: 42
    exitCode: 2
    exitCode: 0
    exitCode: 0
After the fix:
# k get pod importer-cirros-dv-source -woyaml | grep exitCode
selecting docker as container runtime
    exitCode: 0
    exitCode: 0
    exitCode: 0
    exitCode: 0
    exitCode: 0
    exitCode: 0
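On the controller side, the flow this enables looks roughly like the sketch below; this is not the actual CDI reconciler, and `recreateForScratchSpace` is an invented name:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recreateForScratchSpace sketches the controller-side flow the PR relies on:
// with the pod now exiting 0, there is no kubelet-driven restart to race
// against, so the controller can delete the importer pod and bring it back
// with scratch space attached on its own schedule.
func recreateForScratchSpace(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	if err := c.Delete(ctx, pod); err != nil {
		return err
	}
	// A real controller would now create a new importer pod with a scratch
	// PVC mounted; elided here.
	return nil
}
```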
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=2240963 (follow-up)
Special notes for your reviewer:
I think this is a better solution than #3044 and #3060. Let me know what you think.
Release note: