Enforce TaskRun timeout using entrypoint binary #2559
Comments
/kind feature
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@bobcatfish: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
/reopen I think this is still a worthwhile change, and might even make a good-first-issue. I think it would still be worth enforcing timeouts in the reconciler, to catch cases where a pod sat unschedulable until its timeout elapsed, and we could just not create the pod in that case.
/remove-lifecycle rotten
I'm interested in taking this if no one is working on it.
Sounds good! There's going to be some overlap with @Peaorl's work in #3087, which you should just be aware of, but otherwise this seems fairly straightforward. One thing we should make sure to do is guard this behavior change with a feature flag defaulting to false, in case the change in behavior takes anybody by surprise. In general I don't expect it to cause any problems though.
/assign @ywluogg
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Problem
Today, TaskRun timeouts are enforced by the Tekton controller, which schedules a goroutine to check in the future whether the TaskRun has completed -- if it hasn't, the controller stops execution by deleting the TaskRun's underlying Pod and updates the TaskRun's status to indicate a timeout.
This means the Pod's logs are garbage-collected more quickly than in the case of successful or failed TaskRuns, where the Pod stays around on the cluster, potentially making timeouts harder to debug than necessary.
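For illustration, here is a minimal Go sketch of the controller-side pattern described above. This is not the actual Tekton implementation; trackTaskRun, isDone, deletePod, and markTimedOut are hypothetical stand-ins for the controller's real client calls.

```go
// Package timeoutsketch illustrates, in simplified form, the controller-side
// timeout pattern described above; it is not the actual Tekton code.
package timeoutsketch

import "time"

// trackTaskRun schedules a future check for one TaskRun: if it has not
// finished by the time the timer fires, its Pod is deleted and its status is
// marked as timed out. isDone, deletePod, and markTimedOut are hypothetical
// callbacks standing in for the controller's real client calls.
func trackTaskRun(taskRunName string, timeout time.Duration,
	isDone func(string) bool, deletePod, markTimedOut func(string)) *time.Timer {

	return time.AfterFunc(timeout, func() {
		if isDone(taskRunName) {
			return // finished before the deadline; nothing to enforce
		}
		// Deleting the Pod is what causes its logs to disappear sooner
		// than for ordinary successes and failures.
		deletePod(taskRunName)
		markTimedOut(taskRunName)
	})
}
```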
Proposal
Instead, we could use the entrypoint binary we inject to order steps, and have it also take a flag describing the time after which execution should fail with a timeout. The controller would pass this in, and the entrypoint binary would execute the underlying command with a context that times out after that duration (time.Until(deadline)) and report the timeout.
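A minimal sketch of what the entrypoint side could look like, assuming a hypothetical -deadline flag; the flag name, its format, and the wiring here are illustrative rather than the actual entrypoint implementation:

```go
package main

import (
	"context"
	"flag"
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	// Hypothetical flag: an RFC 3339 deadline passed in by the controller.
	deadlineStr := flag.String("deadline", "", "time after which the step fails with a timeout")
	flag.Parse()

	ctx := context.Background()
	if *deadlineStr != "" {
		deadline, err := time.Parse(time.RFC3339, *deadlineStr)
		if err != nil {
			log.Fatalf("invalid -deadline: %v", err)
		}
		// Same effect as a timeout of time.Until(deadline).
		var cancel context.CancelFunc
		ctx, cancel = context.WithDeadline(ctx, deadline)
		defer cancel()
	}

	args := flag.Args()
	if len(args) == 0 {
		log.Fatal("no step command to run")
	}
	// Run the wrapped step command; it is killed when the context expires.
	cmd := exec.CommandContext(ctx, args[0], args[1:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			log.Fatalf("step timed out (deadline %s)", *deadlineStr)
		}
		log.Fatalf("step failed: %v", err)
	}
}
```

The controller would still be responsible for computing the deadline from the TaskRun's timeout and passing it to each step's entrypoint invocation.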
Considerations
This moves timeout enforcement into each executing Pod, so a timeout wouldn't be enforced until the Pod actually starts executing -- if a TaskRun's Pod is never scheduled, its timeout would never fire, unlike today. We could perhaps keep controller-side timeout enforcement just for that scheduling window.
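A rough sketch of that hybrid check, assuming the reconciler can tell whether the Pod has started running; the package, function, and parameter names here are hypothetical:

```go
// Package hybrid sketches the split suggested above: once the Pod is running,
// the entrypoint binary owns the timeout; the controller only covers the
// window before the Pod starts. Names are illustrative, not Tekton's real API.
package hybrid

import "time"

// shouldTimeOutBeforeStart would be called from the reconciler: it reports
// whether a TaskRun whose Pod never started running should be marked as
// timed out by the controller.
func shouldTimeOutBeforeStart(podStarted bool, taskRunStart time.Time, timeout time.Duration) bool {
	if podStarted {
		return false // the entrypoint binary enforces the timeout from here on
	}
	return time.Since(taskRunStart) >= timeout
}
```

Under this split, a Pod that has started is allowed to fail on its own when the entrypoint times out, so its logs remain available for debugging.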
Related to #1690, which proposes using the entrypoint binary to enforce per-step timeouts.