Cancel TaskRuns using entrypoint binary #3238

imjasonh · 2020-09-15T14:34:52Z

Today

When a TaskRun is cancelled, the TaskRun controller deletes the TaskRun's underlying Pod. This stops execution ~immediately, but also leads Kubernetes to reap the Pod's logs.

Feature Request

In #2559, we're discussing enforcing TaskRun-level timeouts in the entrypoint binary, so that timed-out TaskRun Pods don't get deleted and their logs aren't lost. Instead of the controller deleting the Pod, the entrypoint binary that runs each step would stop executing, fail any running step, and not run any subsequent steps.

If we end up doing that, we could also enforce cancellation in the entrypoint binary, which would let us keep Pods and logs around for cancelled TaskRuns too.

To accomplish this, the entrypoint binary could take a new flag, -cancel_file, pointing to a file in a Downward API volume populated from a Pod annotation. This is similar to how we signal the first step to start only after all sidecars are ready. In this model, when a TaskRun is cancelled, the TaskRun controller would annotate the Pod with, for example, "cancelled=true". That would update the contents of the projected file; the entrypoint binary would see the change and stop executing the currently running step.

This behavior change should be guarded by a feature flag (opt-in at first) since some users might depend on the current behavior. This also gives us an opportunity to compare behavior and timing of cancellation between the two implementations.

/kind feature


bobcatfish · 2020-09-15T17:56:01Z

Makes sense to me to revisit this!

@imjasonh are there any other options that are worth evaluating for this? For example, we could send a signal that the entrypoint binary could catch - iirc the only reason we haven't relied more on signals is b/c we haven't been able to rely on how some arbitrary process will handle it; we can control the entrypoint binary tho 🤔

I think @sbwsg has looked into this as well in the context of the initial cancellation feature

imjasonh · 2020-09-15T18:00:32Z

Can we send signals to containers in the pod from the controller? I didn't think that was an option, so we've relied on file-existence checks backed by Downward API volumes instead.

We've also discussed having a sidecar in the Pod that can accept RPCs from the controller, but that's a much larger change, and ultimately sort of orthogonal to how entrypoint stops the user's entrypoint execution.

tekton-robot · 2021-01-03T16:16:07Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

vdemeester · 2021-01-05T15:08:52Z

/remove-lifecycle stale
/lifecycle frozen

Putting this in the "frozen" box as this is something that is worth exploring 🙃

The cancellation of taskruns is now done through the entrypoint binary through a new flag called 'cancel_file'. This removes the need for deleting the pods to cancel a taskrun, allowing examination of the logs on the pods from cancelled taskruns. Part of work on issue tektoncd#3238 Signed-off-by: Arash Deshmeh <adeshmeh@ca.ibm.com>

chengjoey · 2023-01-29T14:06:28Z

Is this still a work in progress? I would love to implement it.

vdemeester · 2023-02-03T09:59:17Z

@chengjoey I didn't have time to keep the PR up-to-date, etc., so yes, please go ahead 🙏🏼

chengjoey · 2023-04-07T14:09:42Z

/assign

through a new flag called 'stop_on_cancel'. This removes the need for deleting the pods to cancel a taskrun, allowing examination of the logs on the pods from cancelled taskruns. Part of work on issue tektoncd#3238 Signed-off-by: chengjoey <zchengjoey@gmail.com>


paulweb515 · 2024-10-21T14:32:07Z

Hi, how close is this?

AlanGreene · 2024-10-21T14:50:32Z

I think this can be enabled via the keep-pod-on-cancel feature flag since v0.53, but is still disabled by default.

Not sure what work is outstanding @chengjoey @JeromeJu

tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 15, 2020

bobcatfish added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Sep 15, 2020

ghost added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 5, 2020

tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2021

tekton-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 5, 2021

dibyom added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 9, 2021

imjasonh mentioned this issue Apr 22, 2021

Add /bin/kill so distroless works with TektonCD/Argo Workflows GoogleContainerTools/distroless#724

Closed

afrittoli mentioned this issue Jun 17, 2021

PipelineRun timeouts delete TaskRun related pods #4035

Open

adshmh mentioned this issue Feb 24, 2022

Cancel taskrun using entrypoint binary #4618

Closed

5 tasks

vdemeester mentioned this issue Aug 31, 2022

[Carry #3238] Cancel taskrun using entrypoint binary #5401

Closed

7 tasks

xchapter7x added this to Tekton Community Roadmap Sep 20, 2022

xchapter7x moved this to Todo in Tekton Community Roadmap Sep 20, 2022

tekton-robot assigned chengjoey Apr 7, 2023

chengjoey mentioned this issue Apr 8, 2023

feat/Cancel taskrun using entrypoint binary #6511

Merged

7 tasks

lbernick mentioned this issue Jul 12, 2023

TaskRun with InvalidImageName runs forever #6105

Closed

chengjoey mentioned this issue Oct 23, 2024

misc: promote keep-pod-on-cancel to default #8343

Open

8 tasks
