Enforce TaskRun timeout using entrypoint binary #2559

Closed
imjasonh opened this issue May 6, 2020 · 17 comments
Assignees
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@imjasonh
Member

imjasonh commented May 6, 2020

Problem

Today, the Tekton controller enforces TaskRun timeouts by scheduling a goroutine that checks, once the timeout has elapsed, whether the TaskRun has completed. If it hasn't, the controller stops execution by deleting the TaskRun's underlying Pod and updates the TaskRun's status to indicate the timeout.

Because the Pod is deleted, its logs are GCed more quickly than for successful or failed TaskRuns, where the Pod stays around on the cluster. This potentially makes debugging timeouts harder than necessary.

Proposal

Instead, we could use the entrypoint binary, which we already inject to order steps, and have it also take a flag describing the time after which execution should fail with a timeout. The controller would pass this flag in, and the entrypoint binary would execute the underlying command with a context that expires at that deadline (i.e. after time.Until(deadline)), reporting a timeout if it does.

Considerations

This moves timeout enforcement into each executing Pod, so timeouts wouldn't be enforced until the Pod actually starts executing. If a TaskRun's Pod is never scheduled, its timeout would never be enforced, unlike today. We could keep controller-side timeout enforcement for scheduling timeouts, maybe?

Related to #1690 which proposes using the entrypoint binary to enforce per-step timeouts.

@vdemeester
Member

/kind feature

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 6, 2020
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 14, 2020
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Aug 14, 2020
@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bobcatfish
Collaborator

/reopen
/remove-lifecycle rotten

@tekton-robot
Collaborator

@bobcatfish: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot reopened this Aug 14, 2020
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 13, 2020
@imjasonh
Member Author

/reopen
/remove-lifecycle-rotten

I think this is still a worthwhile change, and might even make a good-first-issue.

I think it would still be worth enforcing timeouts in the reconciler, to catch cases where a pod sat unschedulable until its timeout elapsed, and we could just not create the pod in that case.

@imjasonh imjasonh added the good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. label Sep 13, 2020
@imjasonh
Member Author

/remove-lifecycle rotten

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 14, 2020
@ywluogg
Contributor

ywluogg commented Sep 15, 2020

I'm interested in taking this if no one is working on it.

@imjasonh
Member Author

I'm interested in taking this if no one is working on it.

Sounds good! There's going to be some overlap with @Peaorl's work in #3087, which you should just be aware of, but otherwise this seems fairly straightforward.

One thing we should make sure to do is guard this behavior change with a feature flag defaulting to false, in case the change in behavior takes anybody by surprise. In general I don't expect it to cause any problems though.
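
One way such a guard could look, as a sketch only. The key name `enable-entrypoint-timeout` and the idea of reading it from a string map are assumptions for illustration, not existing Tekton configuration.

```go
package main

import (
	"fmt"
	"strconv"
)

// entrypointTimeoutKey is a hypothetical feature flag name.
const entrypointTimeoutKey = "enable-entrypoint-timeout"

// entrypointTimeoutEnabled reports whether the new behavior is switched on.
// A missing or unparseable value falls back to false, so existing
// installations keep the current controller-side enforcement.
func entrypointTimeoutEnabled(cfg map[string]string) bool {
	v, ok := cfg[entrypointTimeoutKey]
	if !ok {
		return false
	}
	enabled, err := strconv.ParseBool(v)
	if err != nil {
		return false
	}
	return enabled
}

func main() {
	fmt.Println(entrypointTimeoutEnabled(map[string]string{}))
	fmt.Println(entrypointTimeoutEnabled(map[string]string{entrypointTimeoutKey: "true"}))
}
```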

@imjasonh
Member Author

/assign @ywluogg

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 15, 2020
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 14, 2021
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

5 participants