Recreate pod on TaskRun's pod deletion #758
Conversation
/ok-to-test
Nice, thanks for catching and fixing this @dicarlo2 !!
I have a request: now that the logic to get a TaskRun's associated pod is getting more complicated, can we move it into a different package with its own unit tests? (one that doesn't depend on the reconciler) ❤️
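A minimal sketch of what such a standalone, reconciler-free helper might look like. All names here (`getPod`, `podLister`, `fakeLister`, `errPodNotFound`) are assumptions for illustration, not the actual Tekton code; the real implementation would use the Kubernetes client libraries and `apierrors.IsNotFound`.

```go
package main

import (
	"errors"
	"fmt"
)

// errPodNotFound stands in for the Kubernetes IsNotFound API error.
var errPodNotFound = errors.New("pod not found")

// podLister is a hypothetical abstraction over the informer cache,
// so the pod-lookup logic can be unit-tested without the reconciler.
type podLister interface {
	Get(name string) (string, error)
}

// getPod resolves a TaskRun's pod by name, distinguishing "missing"
// from other errors so the caller can decide whether to recreate it.
func getPod(lister podLister, podName string) (pod string, found bool, err error) {
	p, err := lister.Get(podName)
	if errors.Is(err, errPodNotFound) {
		return "", false, nil
	}
	if err != nil {
		return "", false, err
	}
	return p, true, nil
}

// fakeLister is a test double backed by a map.
type fakeLister map[string]string

func (f fakeLister) Get(name string) (string, error) {
	if p, ok := f[name]; ok {
		return p, nil
	}
	return "", errPodNotFound
}

func main() {
	lister := fakeLister{"taskrun-abc-pod": "running"}

	if _, found, _ := getPod(lister, "taskrun-abc-pod"); found {
		fmt.Println("pod exists")
	}
	if _, found, _ := getPod(lister, "deleted-pod"); !found {
		fmt.Println("pod missing: reconciler may recreate it")
	}
}
```

Because `podLister` is an interface, the lookup logic can be exercised with a `fakeLister` in unit tests without spinning up any reconciler machinery.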
A TaskRun's pod may be deleted either manually by the user or due to system constraints (e.g. node recreation). This change modifies the TaskRun reconciliation logic to recreate pods that are not found.
niiiice, looks great! 😎 thanks @dicarlo2 ❤️ !! /lgtm
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bobcatfish, dicarlo2. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Should this require explicit opt-in from the user? Some tasks will not be idempotent, and I can imagine a scenario where automatically re-running them when they fail due to underlying platform issues would be surprising to users. If I understand correctly, this change also makes tasks work on preemptible VMs, since the controller will recreate and restart a task if its underlying node gets preempted. That sounds really compelling for a cheap CI solution where unit tests are likely idempotent, and I'd love to see a demo/guide for setting that up. But it still seems like something users should have to explicitly opt in to, rather than assuming all tasks can handle that gracefully. WDYT @dicarlo2 ?
Yes, you're right, it should require opt-in. I'm happy to submit a PR for it. The only question I have is how we would like the user to configure it in the context of #658. Is it the same option? Is it considered a retry? IIRC, argo workflows use two separate options, one to enable retrying system failures (argo, kubernetes, etc.) and one for retrying user failures, which at first is a bit confusing and adds to the cognitive overhead of configuring argo, so I'm not sure if we want to follow that approach here or not.
@dicarlo2 I don't think it should be considered as a "retry" if it's retrying because of platform issues. An idempotent taskrun that gets really unlucky could be preempted dozens of times, and should only be "retried" in terms of task failure once or twice. It would be confusing if those both counted toward the same retry limit. I think the option to enable this should be phrased as something like WDYT?
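The distinction argued here can be sketched with two separate counters: a bounded, user-facing retry budget for task failures, and an unbounded tally of platform-driven pod recreations that never consumes that budget. All type and field names below (`retryPolicy`, `runState`, `onPodLost`, `onTaskFailure`) are hypothetical illustrations, not any actual Tekton API.

```go
package main

import "fmt"

// retryPolicy is a hypothetical opt-in config for user-facing retries.
type retryPolicy struct {
	MaxTaskRetries int // budget for retrying failed task executions
}

// runState tracks the two kinds of "retry" separately.
type runState struct {
	taskRetries    int // incremented on task failure
	podRecreations int // incremented on pod loss; never touches the budget
}

// onPodLost records a platform-level pod loss (deletion, preemption).
// It deliberately does not consume the user's retry budget.
func (s *runState) onPodLost() { s.podRecreations++ }

// onTaskFailure consumes one retry; it returns false once the
// budget is exhausted.
func (s *runState) onTaskFailure(p retryPolicy) bool {
	if s.taskRetries >= p.MaxTaskRetries {
		return false
	}
	s.taskRetries++
	return true
}

func main() {
	policy := retryPolicy{MaxTaskRetries: 2}
	var s runState

	// An unlucky run preempted a dozen times is still eligible for
	// its full retry budget on a real task failure.
	for i := 0; i < 12; i++ {
		s.onPodLost()
	}
	fmt.Println(s.onTaskFailure(policy)) // true: first real retry
	fmt.Println(s.podRecreations)        // 12: preemptions tracked separately
}
```

Keeping the counters separate is what prevents the confusing situation described above, where dozens of preemptions would silently exhaust a one-or-two-attempt retry limit.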
@imjasonh SGTM, I'll get a PR up shortly. |
Changes
A TaskRun's pod may be deleted either manually by the user or due to system constraints (e.g. node recreation). This change modifies the TaskRun reconciliation logic to recreate pods that are not found.
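The core reconciliation decision described above can be sketched as follows. This is a simplified, hypothetical model (`taskRun`, `shouldRecreatePod` are illustrative names, not the controller's real types, which operate on Kubernetes API objects):

```go
package main

import "fmt"

// taskRun is a simplified stand-in for the TaskRun resource.
type taskRun struct {
	Name string
	Done bool // a terminal TaskRun must not get a new pod
}

// shouldRecreatePod returns true when the TaskRun is still running
// but its pod can no longer be found, whether it was deleted
// manually or lost to node recreation/preemption.
func shouldRecreatePod(tr taskRun, podFound bool) bool {
	return !tr.Done && !podFound
}

func main() {
	running := taskRun{Name: "build-and-push", Done: false}
	finished := taskRun{Name: "old-run", Done: true}

	fmt.Println(shouldRecreatePod(running, false))  // true: recreate the pod
	fmt.Println(shouldRecreatePod(running, true))   // false: pod still exists
	fmt.Println(shouldRecreatePod(finished, false)) // false: run already done
}
```

The guard on `Done` matters: without it, the reconciler would spin up fresh pods for TaskRuns that have already completed.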
Fixes #618
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you
review them:
See the contribution guide
for more details.
Release Notes