Recreate pod on TaskRun's pod deletion #758
Conversation
/ok-to-test
Nice, thanks for catching and fixing this @dicarlo2 !!
I have a request: now that the logic to get a TaskRun's associated pod is getting more complicated, can we move it into a different package with its own unit tests? (one that doesn't depend on the reconciler) ❤️
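A minimal sketch of what such a standalone, reconciler-free helper might look like. All names here (`getPod`, `podLister`, `fakeLister`, `errPodNotFound`) are assumptions for illustration, not the actual Tekton code; the real implementation would use the Kubernetes client libraries and `apierrors.IsNotFound`.

```go
package main

import (
	"errors"
	"fmt"
)

// errPodNotFound stands in for the Kubernetes IsNotFound API error.
var errPodNotFound = errors.New("pod not found")

// podLister is a hypothetical abstraction over the informer cache,
// so the pod-lookup logic can be unit-tested without the reconciler.
type podLister interface {
	Get(name string) (string, error)
}

// getPod resolves a TaskRun's pod by name, distinguishing "missing"
// from other errors so the caller can decide whether to recreate it.
func getPod(lister podLister, podName string) (pod string, found bool, err error) {
	p, err := lister.Get(podName)
	if errors.Is(err, errPodNotFound) {
		return "", false, nil
	}
	if err != nil {
		return "", false, err
	}
	return p, true, nil
}

// fakeLister is a test double backed by a map.
type fakeLister map[string]string

func (f fakeLister) Get(name string) (string, error) {
	if p, ok := f[name]; ok {
		return p, nil
	}
	return "", errPodNotFound
}

func main() {
	lister := fakeLister{"taskrun-abc-pod": "running"}

	if _, found, _ := getPod(lister, "taskrun-abc-pod"); found {
		fmt.Println("pod exists")
	}
	if _, found, _ := getPod(lister, "deleted-pod"); !found {
		fmt.Println("pod missing: reconciler may recreate it")
	}
}
```

Because `podLister` is an interface, the lookup logic can be exercised with a `fakeLister` in unit tests without spinning up any reconciler machinery.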
A TaskRun's pod may be deleted either manually by the user or due to system constraints (e.g. node recreation). This change modifies the TaskRun reconciliation logic to recreate pods that are not found.
niiiice, looks great! 😎 thanks @dicarlo2 ❤️ !! /lgtm
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bobcatfish, dicarlo2. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Should this require explicit opt-in from the user? Some tasks will not be idempotent, and I can imagine a scenario where automatically re-running them when they fail due to underlying platform issues would be surprising to users. If I understand correctly, this change also makes tasks work on preemptible VMs, since the controller will recreate and restart a task if its underlying node gets preempted. That sounds really compelling for a cheap CI solution where unit tests are likely idempotent, and I'd love to see a demo/guide for setting that up. But it still seems like something users should have to explicitly opt in to, rather than assuming all tasks can handle that gracefully. WDYT @dicarlo2 ?
Yes, you're right, it should require opt-in. I'm happy to submit a PR for it. The only question I have is how we would like the user to configure it in the context of #658. Is it the same option? Is it considered a retry? IIRC, argo workflows use two separate options, one to enable retrying system failures (argo, kubernetes, etc.) and one for retrying user failures, which at first is a bit confusing and adds to the cognitive overhead of configuring argo, so I'm not sure if we want to follow that approach here or not.
@dicarlo2 I don't think it should be considered as a "retry" if it's retrying because of platform issues. An idempotent taskrun that gets really unlucky could be preempted dozens of times, and should only be "retried" in terms of task failure once or twice. It would be confusing if those both counted toward the same retry limit. I think the option to enable this should be phrased as something like WDYT?
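The distinction argued here can be sketched with two separate counters: a bounded, user-facing retry budget for task failures, and an unbounded tally of platform-driven pod recreations that never consumes that budget. All type and field names below (`retryPolicy`, `runState`, `onPodLost`, `onTaskFailure`) are hypothetical illustrations, not any actual Tekton API.

```go
package main

import "fmt"

// retryPolicy is a hypothetical opt-in config for user-facing retries.
type retryPolicy struct {
	MaxTaskRetries int // budget for retrying failed task executions
}

// runState tracks the two kinds of "retry" separately.
type runState struct {
	taskRetries    int // incremented on task failure
	podRecreations int // incremented on pod loss; never touches the budget
}

// onPodLost records a platform-level pod loss (deletion, preemption).
// It deliberately does not consume the user's retry budget.
func (s *runState) onPodLost() { s.podRecreations++ }

// onTaskFailure consumes one retry; it returns false once the
// budget is exhausted.
func (s *runState) onTaskFailure(p retryPolicy) bool {
	if s.taskRetries >= p.MaxTaskRetries {
		return false
	}
	s.taskRetries++
	return true
}

func main() {
	policy := retryPolicy{MaxTaskRetries: 2}
	var s runState

	// An unlucky run preempted a dozen times is still eligible for
	// its full retry budget on a real task failure.
	for i := 0; i < 12; i++ {
		s.onPodLost()
	}
	fmt.Println(s.onTaskFailure(policy)) // true: first real retry
	fmt.Println(s.podRecreations)        // 12: preemptions tracked separately
}
```

Keeping the counters separate is what prevents the confusing situation described above, where dozens of preemptions would silently exhaust a one-or-two-attempt retry limit.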
@imjasonh SGTM, I'll get a PR up shortly. |
Changes
A TaskRun's pod may be deleted either manually by the user or due to system constraints (e.g. node recreation). This change modifies the TaskRun reconciliation logic to recreate pods that are not found.
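The core reconciliation decision described above can be sketched as follows. This is a simplified, hypothetical model (`taskRun`, `shouldRecreatePod` are illustrative names, not the controller's real types, which operate on Kubernetes API objects):

```go
package main

import "fmt"

// taskRun is a simplified stand-in for the TaskRun resource.
type taskRun struct {
	Name string
	Done bool // a terminal TaskRun must not get a new pod
}

// shouldRecreatePod returns true when the TaskRun is still running
// but its pod can no longer be found, whether it was deleted
// manually or lost to node recreation/preemption.
func shouldRecreatePod(tr taskRun, podFound bool) bool {
	return !tr.Done && !podFound
}

func main() {
	running := taskRun{Name: "build-and-push", Done: false}
	finished := taskRun{Name: "old-run", Done: true}

	fmt.Println(shouldRecreatePod(running, false))  // true: recreate the pod
	fmt.Println(shouldRecreatePod(running, true))   // false: pod still exists
	fmt.Println(shouldRecreatePod(finished, false)) // false: run already done
}
```

The guard on `Done` matters: without it, the reconciler would spin up fresh pods for TaskRuns that have already completed.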
Fixes #618
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you
review them:
See the contribution guide
for more details.
Release Notes