Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry using different pod when node shutdown. #6572

Closed
wants to merge 1 commit into from

Conversation

yuzp1996
Copy link
Contributor

When node shutdown then the retry pod will always use the same pod because it can not recongize that the pod can not work anymore and k8s can not delete it before the node recovered and it will cause retry is actually not working.

I fix it by check the DeletionTimestamp to know if the pod is actually not work any more.

Fix #6558

Changes

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

When the node shutdown then the retry pod will always be the same pod
because it can not recognise that the pod can not work anymore and
k8s can not delete the pod before the node recovers and it causes the retry is actually not to work.

I fix it by checking the DeletionTimestamp to know if the pod is actually
not work any more.

Fix tektoncd#6558

Signed-off-by: yuzhipeng <zpyu@alauda.io>
@tekton-robot tekton-robot added the release-note-none Denotes a PR that doesnt merit a release note. label Apr 22, 2023
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign imjasonh after the PR has been reviewed.
You can assign the PR to them by writing /assign @imjasonh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot
Copy link
Collaborator

Hi @yuzp1996. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 22, 2023
@lbernick
Copy link
Member

thanks @yuzp1996! Could you please add tests for this functionality?
I'm surprised that this change is all that's needed to retry on a new node. I would think the taskrun controller would need to handle creating a new pod and updating the pod name in the taskrun status. Could you share some more details on how you tested out this bug fix?

@pritidesai you might be interested in this PR.

@JeromeJu
Copy link
Member

/ok-to-test

@tekton-robot tekton-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 24, 2023
@tekton-robot
Copy link
Collaborator

@yuzp1996: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-tekton-pipeline-beta-integration-tests bf509a7 link true /test pull-tekton-pipeline-beta-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@XinruZhang
Copy link
Member

/assign

@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2023
@tekton-robot
Copy link
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 22, 2023
@jerop jerop added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 20, 2023
@jerop
Copy link
Member

jerop commented Sep 20, 2023

@yuzp1996 please feel free to reopen this PR when you pick up this work 🙏🏾

/close

(cleaning up old PRs)

@tekton-robot
Copy link
Collaborator

@jerop: Closed this PR.

In response to this:

@yuzp1996 please feel free to reopen this PR when you pick up this work 🙏🏾

/close

(cleaning up old PRs)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesnt merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

If node that runs the taskrun pod shutdown then retry will not work as expected
6 participants