Enforce TaskRun timeout using entrypoint binary #2559
Comments
/kind feature
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@bobcatfish: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
/reopen I think this is still a worthwhile change, and might even make a good-first-issue. I think it would still be worth enforcing timeouts in the reconciler, to catch cases where a pod sat unschedulable until its timeout elapsed, and we could just not create the pod in that case.
/remove-lifecycle rotten
I'm interested in taking this if no one is working on it.
Sounds good! There's going to be some overlap with @Peaorl's work in #3087, which you should just be aware of, but otherwise this seems fairly straightforward. One thing we should make sure to do is guard this behavior change with a feature flag defaulting to false, in case the change in behavior takes anybody by surprise. In general I don't expect it to cause any problems though.
/assign @ywluogg
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Problem
Today, TaskRun timeouts are enforced by the Tekton controller, which schedules a goroutine to check in the future whether the TaskRun has completed -- if it hasn't, the controller stops execution by deleting the TaskRun's underlying Pod and updates the TaskRun's status to indicate a timeout.
This means the Pod's logs are garbage-collected more quickly than in the case of successful or failed TaskRuns, where the Pod stays around on the cluster, potentially making timeouts harder to debug than necessary.
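For illustration, here is a minimal Go sketch of the controller-side pattern described above. This is not the actual Tekton implementation; trackTaskRun, isDone, deletePod, and markTimedOut are hypothetical stand-ins for the controller's real client calls.

```go
// Package timeoutsketch illustrates, in simplified form, the controller-side
// timeout pattern described above; it is not the actual Tekton code.
package timeoutsketch

import "time"

// trackTaskRun schedules a future check for one TaskRun: if it has not
// finished by the time the timer fires, its Pod is deleted and its status is
// marked as timed out. isDone, deletePod, and markTimedOut are hypothetical
// callbacks standing in for the controller's real client calls.
func trackTaskRun(taskRunName string, timeout time.Duration,
	isDone func(string) bool, deletePod, markTimedOut func(string)) *time.Timer {

	return time.AfterFunc(timeout, func() {
		if isDone(taskRunName) {
			return // finished before the deadline; nothing to enforce
		}
		// Deleting the Pod is what causes its logs to disappear sooner
		// than for ordinary successes and failures.
		deletePod(taskRunName)
		markTimedOut(taskRunName)
	})
}
```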
Proposal
Instead, we could use the entrypoint binary we inject to order steps, and have it also take a flag describing the time after which execution should fail with a timeout. The controller would pass this in, and the entrypoint binary would execute the underlying command with a context that times out after that duration (time.Until(deadline)) and report the timeout.
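A minimal sketch of what the entrypoint side could look like, assuming a hypothetical -deadline flag; the flag name, its format, and the wiring here are illustrative rather than the actual entrypoint implementation:

```go
package main

import (
	"context"
	"flag"
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	// Hypothetical flag: an RFC 3339 deadline passed in by the controller.
	deadlineStr := flag.String("deadline", "", "time after which the step fails with a timeout")
	flag.Parse()

	ctx := context.Background()
	if *deadlineStr != "" {
		deadline, err := time.Parse(time.RFC3339, *deadlineStr)
		if err != nil {
			log.Fatalf("invalid -deadline: %v", err)
		}
		// Same effect as a timeout of time.Until(deadline).
		var cancel context.CancelFunc
		ctx, cancel = context.WithDeadline(ctx, deadline)
		defer cancel()
	}

	args := flag.Args()
	if len(args) == 0 {
		log.Fatal("no step command to run")
	}
	// Run the wrapped step command; it is killed when the context expires.
	cmd := exec.CommandContext(ctx, args[0], args[1:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			log.Fatalf("step timed out (deadline %s)", *deadlineStr)
		}
		log.Fatalf("step failed: %v", err)
	}
}
```

The controller would still be responsible for computing the deadline from the TaskRun's timeout and passing it to each step's entrypoint invocation.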
Considerations
This moves timeout enforcement into each executing Pod, so a timeout wouldn't be enforced until the Pod actually starts executing -- if a TaskRun's Pod is never scheduled, its timeout would never fire, unlike today. We could perhaps keep controller-side timeout enforcement just for that scheduling window.
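A rough sketch of that hybrid check, assuming the reconciler can tell whether the Pod has started running; the package, function, and parameter names here are hypothetical:

```go
// Package hybrid sketches the split suggested above: once the Pod is running,
// the entrypoint binary owns the timeout; the controller only covers the
// window before the Pod starts. Names are illustrative, not Tekton's real API.
package hybrid

import "time"

// shouldTimeOutBeforeStart would be called from the reconciler: it reports
// whether a TaskRun whose Pod never started running should be marked as
// timed out by the controller.
func shouldTimeOutBeforeStart(podStarted bool, taskRunStart time.Time, timeout time.Duration) bool {
	if podStarted {
		return false // the entrypoint binary enforces the timeout from here on
	}
	return time.Since(taskRunStart) >= timeout
}
```

Under this split, a Pod that has started is allowed to fail on its own when the entrypoint times out, so its logs remain available for debugging.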
Related to #1690, which proposes using the entrypoint binary to enforce per-step timeouts.