fix: skip clear message when node transition from pending to fail. Fixes #13200 #13201

Merged
merged 3 commits into from
Oct 3, 2024

Conversation

tczhao

@tczhao tczhao commented Jun 17, 2024

Fixes #13200

Motivation

Allow retry to use the pod message when a pod transitions from pending to failed.

Normally we have:

timestamp0, status: pending, reason: ""
timestamp1, status: fail,    reason: "e.g. containerd issue"

But for PodInitializing, the transitions are the following:

timestamp0,                    status: pending, reason: ""
timestamp1,                    status: pending, reason: "PodInitializing"
timestamp2(immediately after), status: fail,    reason: ""

Modifications

This PR fixes the issue: we no longer overwrite the message with "" when a pod transitions from the pending phase to the failed phase.

timestamp0,                    status: pending, reason: ""
timestamp1,                    status: pending, reason: "PodInitializing"
timestamp2(immediately after), status: fail,    reason: "PodInitializing"
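
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `NodePhase`, `nodeStatus`, and `applyPodUpdate` are hypothetical, simplified stand-ins rather than the actual controller types or functions.

```go
package main

import "fmt"

// NodePhase and nodeStatus are hypothetical, simplified stand-ins
// used only for this sketch; they are not the controller's real types.
type NodePhase string

const (
	Pending NodePhase = "Pending"
	Failed  NodePhase = "Failed"
)

type nodeStatus struct {
	Phase   NodePhase
	Message string
}

// applyPodUpdate merges a newly observed phase and message into the node.
// When the node goes from Pending to Failed and the new message is empty,
// the previous message (e.g. "PodInitializing") is kept instead of cleared.
func applyPodUpdate(node nodeStatus, newPhase NodePhase, newMessage string) nodeStatus {
	pendingToFailed := node.Phase == Pending && newPhase == Failed
	if !(pendingToFailed && newMessage == "") {
		node.Message = newMessage
	}
	node.Phase = newPhase
	return node
}

func main() {
	n := nodeStatus{Phase: Pending}                   // timestamp0
	n = applyPodUpdate(n, Pending, "PodInitializing") // timestamp1
	n = applyPodUpdate(n, Failed, "")                 // timestamp2
	fmt.Printf("%s %q\n", n.Phase, n.Message)         // prints: Failed "PodInitializing"
}
```

A non-empty failure message still replaces the old one in this sketch, so the "normal" transition (pending with "" to fail with a real reason) is unaffected.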

Verification

Added a unit test; it failed without the change.
Applied the change; the test now passes.
Also released to our production environment for a week: pods were able to retry on all PodInitializing messages when that message is configured in TRANSIENT_ERROR_PATTERN, and we saw no other issues.
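
For context, here is a hedged sketch of why the preserved message matters for retries. It assumes TRANSIENT_ERROR_PATTERN is treated as a regular expression matched against the failure message; `matchesTransientPattern` is a hypothetical helper for illustration, not the controller's function.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// matchesTransientPattern is a hypothetical helper: it reports whether
// message matches the given regular expression. An empty or invalid
// pattern never matches.
func matchesTransientPattern(pattern, message string) bool {
	if pattern == "" {
		return false
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false
	}
	return re.MatchString(message)
}

func main() {
	// Assumption: the pattern comes from the TRANSIENT_ERROR_PATTERN env var.
	pattern := os.Getenv("TRANSIENT_ERROR_PATTERN")

	// Before the fix the failed node carried an empty message, which a
	// pattern such as "PodInitializing" cannot match.
	fmt.Println(matchesTransientPattern(pattern, ""))
	// After the fix the message is preserved and can be matched.
	fmt.Println(matchesTransientPattern(pattern, "PodInitializing"))
}
```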

Signed-off-by: Tianchu Zhao <evantczhao@gmail.com>
@tczhao tczhao marked this pull request as ready for review June 17, 2024 15:54
@agilgur5 agilgur5 added the area/controller Controller issues, panics label Jun 17, 2024
@agilgur5

I'm wondering if this might partially help with #11354 and #12572 too 🤔

@tczhao

tczhao commented Jun 18, 2024

I'm wondering if this might partially help with #11354 and #12572 too 🤔

I think so, as a workaround: use Message instead of ExitCode.

Signed-off-by: Tianchu Zhao <evantczhao@gmail.com>
@tczhao

tczhao commented Jun 18, 2024

/retest

@agilgur5 agilgur5 left a comment

Makes sense to me.

@juliev0

juliev0 commented Aug 23, 2024

I see this has been approved. @tczhao do you want to fix the merge conflict, and then it can be merged?

Signed-off-by: Tianchu Zhao <evantczhao@gmail.com>
@tczhao tczhao enabled auto-merge (squash) October 3, 2024 12:31
@tczhao tczhao merged commit f1fbe09 into main Oct 3, 2024
28 checks passed
@tczhao tczhao deleted the 13200-node-message branch October 3, 2024 13:02
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Oct 3, 2024
isubasinghe pushed a commit that referenced this pull request Oct 30, 2024
Development

Successfully merging this pull request may close these issues.

Pod doesn't retry when using "PodInitializing" in TRANSIENT_ERROR_PATTERN
3 participants