task time exceeded `timeouts.tasks` when task retried #4071
Hi @ornew, thanks for the issue. The task timeout is processed at runtime from the pipelinerun start time. If you retry only a task then, taking the pipelinerun start time into consideration, your task has already timed out. Today the coded logic is to give the task 1s to run. I am actually not sure why we give that second...
@souleb Thanks for checking this issue :) When a TaskRun retries, the Timeout is not recalculated but the StartTime is reset. It looks to me like the retried task can pass through HasTimedOut.
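For illustration, here is a minimal Go sketch of the failure mode being described (this is not the actual Tekton source; `trStatus` and `hasTimedOut` are simplified stand-ins for the real status type and timeout check):

```go
package main

import (
	"fmt"
	"time"
)

// trStatus is a simplified stand-in for a TaskRun's status fields.
type trStatus struct {
	StartTime time.Time     // reset on every retry
	Timeout   time.Duration // declared taskrun timeout, fixed at creation
}

// hasTimedOut compares elapsed time against the declared timeout,
// measured from StartTime. Because a retry resets StartTime, a retried
// attempt "passes through" this check even when earlier attempts
// already consumed the whole timeout.
func hasTimedOut(s trStatus, now time.Time) bool {
	return now.Sub(s.StartTime) >= s.Timeout
}

func main() {
	s := trStatus{
		StartTime: time.Now().Add(-15 * time.Second), // first attempt started 15s ago
		Timeout:   10 * time.Second,
	}
	fmt.Println(hasTimedOut(s, time.Now())) // true: the first attempt timed out

	// On retry, the controller resets StartTime, so the check passes again.
	s.StartTime = time.Now()
	fmt.Println(hasTimedOut(s, time.Now())) // false: the retry escapes the timeout
}
```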
One more thing: the following output is the status of the third attempt, which also failed due to a timeout. The first and second attempts ran in the same pod. Also, the second attempt ended with exit code 255 instead of a timeout; it did not run until the timeout.

```yaml
completionTime: '2021-07-09T12:54:11Z'
conditions:
  - lastTransitionTime: '2021-07-09T12:54:11Z'
    message: TaskRun "please-say-bye-62xd2-hi-wwzpm" failed to finish within "10s"
    reason: TaskRunTimeout
    status: 'False'
    type: Succeeded
podName: please-say-bye-62xd2-hi-wwzpm-pod-2tc4n
retriesStatus:
  - completionTime: '2021-07-09T12:54:00Z'
    conditions:
      - lastTransitionTime: '2021-07-09T12:54:00Z'
        message: TaskRun "please-say-bye-62xd2-hi-wwzpm" failed to finish within "10s"
        reason: TaskRunTimeout
        status: 'False'
        type: Succeeded
    podName: please-say-bye-62xd2-hi-wwzpm-pod-8s5nz
    startTime: '2021-07-09T12:53:50Z'
    steps:
      - container: step-hi
        imageID: >-
          docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
        name: hi
        terminated:
          exitCode: 1
          finishedAt: '2021-07-09T12:54:00Z'
          reason: TaskRunTimeout
          startedAt: '2021-07-09T12:53:53Z'
    taskSpec:
      steps:
        - image: 'alpine:3.12'
          name: hi
          resources: {}
          script: |
            echo 'hi'
            #exit 1
            sleep 30
  - completionTime: '2021-07-09T12:54:01Z'
    conditions:
      - lastTransitionTime: '2021-07-09T12:54:01Z'
        message: >
          "step-hi" exited with code 255 (image:
          "docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e");
          for logs run: kubectl -n default logs
          please-say-bye-62xd2-hi-wwzpm-pod-8s5nz -c step-hi
        reason: Failed
        status: 'False'
        type: Succeeded
    podName: please-say-bye-62xd2-hi-wwzpm-pod-8s5nz
    startTime: '2021-07-09T12:54:00Z'
    steps:
      - container: step-hi
        imageID: >-
          docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
        name: hi
        terminated:
          containerID: >-
            containerd://f86b959de7ee03bbf4e2b9df638d5d937981a863d4f124a74dcb3580535fcef9
          exitCode: 255
          finishedAt: '2021-07-09T12:54:00Z'
          reason: Error
          startedAt: '2021-07-09T12:53:54Z'
    taskSpec:
      steps:
        - image: 'alpine:3.12'
          name: hi
          resources: {}
          script: |
            echo 'hi'
            #exit 1
            sleep 30
startTime: '2021-07-09T12:54:01Z'
steps:
  - container: step-hi
    imageID: >-
      docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
    name: hi
    terminated:
      exitCode: 1
      finishedAt: '2021-07-09T12:54:11Z'
      reason: TaskRunTimeout
      startedAt: '2021-07-09T12:54:04Z'
taskSpec:
  steps:
    - image: 'alpine:3.12'
      name: hi
      resources: {}
      script: |
        echo 'hi'
        #exit 1
        sleep 30
```
This is my speculation (I'm not sure about the details): see the timeout check around pipeline/pkg/reconciler/taskrun/taskrun.go, line 417 at commit 0e9d9e6.
What you describe is basically what happens. Also, you are right that the timeout is not recalculated on retry: the check takes the declared taskrun timeout as-is. Does it make more sense?
Yes, I saw what was happening. I raised this because it is not the expected behavior, and I wanted to check whether it was intended. The problem for me is that Tekton still doesn't guarantee the execution of `finally`. I think it's counterintuitive that the task exceeds the time it was given, and that the retried task finally runs for only one second.
If you look at the related TEP, TEP-46, you will see what you are talking about in the Alternatives section. There has been a lot of discussion around whether the execution of `finally` should be guaranteed.
The real problem for us is that `finally` execution is not guaranteed. But I'm not saying that the timeout should guarantee the execution of `finally`; I would like to know whether there is a proper, intentional reason why `finally` does not run. TEP-46 said:
I'm sorry I haven't been able to follow all the discussions. I checked the TEPs in advance as much as possible, but I cannot find why this behavior is allowed. For example, our use case corresponds to the dogfooding scenario shown in the TEP.
A simplification of the pipeline that actually caused the problem would look like this:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: our-use-case-example-
spec:
  timeouts:
    # We are explicitly setting timeouts
    pipeline: 5h
    tasks: 4h
    finally: 1h
  pipelineSpec:
    tasks:
      - name: create-resources
        retries: 2
        taskSpec:
          steps:
            - name: create-resources
              script: |
                # Allocate a lot of cluster resources.
                # For example:
                # - run AWS EMR clusters for processing big data
                # - run many pods for load testing before deployment
      - name: heavy-workload
        retries: 2
        taskSpec:
          steps:
            - name: workload
              script: |
                # This may be retried due to an external system...
    finally:
      - name: cleanup-all-resources
        taskSpec:
          steps:
            - name: cleanup
              script: |
                # Cleanup all resources. We expect this to run.
                # We give the tasks 4 hours and the pipeline 5 hours, so there is an hour of grace.
                # However, if a retry occurs, this task is now almost always given only 1 second.
                # Because the task timeout is not recalculated, the overall task execution time can exceed 4 hours.
                # In this case, the tasks will run for up to 5 hours (why? we specified timeouts.tasks as 4 hours).
                # It does not behave intuitively with respect to the given definition.
                # As a result, some resources will not be released.
```

I've shown why the task runs for more than 4 hours and explained the actual problem caused by `finally` not running. Of course, we are considering the possibility that ...
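To make the arithmetic in those comments concrete, here is a small illustrative Go sketch of the buggy behavior described above (the values mirror this example; the names are mine, not Tekton's):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	pipelineTimeout := 5 * time.Hour
	tasksTimeout := 4 * time.Hour
	finallyTimeout := 1 * time.Hour

	// Buggy behavior: each retry resets the TaskRun StartTime, so
	// tasksTimeout is never enforced across attempts; the only hard
	// stop left is the pipeline timeout.
	effectiveTasksLimit := pipelineTimeout // should be tasksTimeout

	// finally was allocated 1h, but the overrunning tasks can consume
	// the entire pipeline window, leaving finally with no time at all.
	timeLeftForFinally := pipelineTimeout - effectiveTasksLimit

	fmt.Printf("declared: tasks=%v finally=%v\n", tasksTimeout, finallyTimeout)
	fmt.Printf("effective: tasks=%v finally=%v\n", effectiveTasksLimit, timeLeftForFinally)
}
```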
Issues go stale after 90d of inactivity.

/lifecycle stale

Send feedback to tektoncd/plumbing.
/assign lbernick
/remove-lifecycle stale

@lbernick and I are looking into it
Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timed out before executing for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated. @ornew described the problem in the issue report:

"The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`."

This problem is caused by the interaction between retries and timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful in recording the `StartTime` of each retry in the `RetriesStatus`. However, using the reset time as the start time of the `TaskRun` when checking whether it has timed out is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timed out. We do this by checking the start time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry; however, it would then be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is information that's already available in the status.

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <lbernick@google.com>
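For illustration, a minimal Go sketch of the idea behind this change, assuming simplified types (`attempt`, `taskRunStatus`, and `effectiveStartTime` are illustrative stand-ins, not the actual Tekton reconciler code):

```go
package main

import (
	"fmt"
	"time"
)

type attempt struct {
	StartTime time.Time
}

type taskRunStatus struct {
	StartTime     time.Time // reset on each retry
	RetriesStatus []attempt // one entry per failed attempt
}

// effectiveStartTime returns the start time of the very first attempt:
// the earliest StartTime recorded in RetriesStatus, falling back to the
// current attempt's StartTime when there have been no retries.
func effectiveStartTime(s taskRunStatus) time.Time {
	start := s.StartTime
	for _, a := range s.RetriesStatus {
		if a.StartTime.Before(start) {
			start = a.StartTime
		}
	}
	return start
}

func main() {
	first := time.Now().Add(-15 * time.Second)
	s := taskRunStatus{
		StartTime:     time.Now(), // reset by the latest retry
		RetriesStatus: []attempt{{StartTime: first}},
	}
	timeout := 10 * time.Second
	// The timeout check now measures from the first attempt, so the
	// retried TaskRun is correctly considered timed out.
	fmt.Println(time.Since(effectiveStartTime(s)) >= timeout) // true
}
```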
Just want to give an update with a summary of the work already done and what still needs to be done:
Unfortunately Jerop and I don't have the bandwidth to prioritize this issue right now.

/unassign
@lbernick: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed.

In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @vsinghai
I'm returning to this issue to see if it has been resolved by #5134 (FYI @abayer). @ornew, I'm curious why you say in your original comment that the finally task should not be timed out if `timeouts.pipeline` is exceeded. If you'd like to allow the finally tasks to run indefinitely but have the tasks section time out after some time, I think you need to specify ... There's still a bug, though, with the retried taskrun not being timed out when `timeouts.tasks` has already elapsed.
In this example, each attempt (there are 3) sleeps for 10s and fails, the finally task runs, and the pipelinerun fails. Instead, I would expect that the first attempt fails and retries, the taskrun is canceled before the second attempt completes, the finally task runs, and the pipelinerun fails.
I agree with you that the finally task should time out if `timeouts.pipeline` is exceeded. This issue originally reported that `timeouts.tasks` does not work as the user expects: tasks are retried past `timeouts.tasks` until `timeouts.pipeline` is exceeded. When I said "finally should not timeout" here, I meant the user's expected behavior in the context of `timeouts.pipeline` being consumed by the incorrect enforcement of `timeouts.tasks`.
Got it, that makes sense, thanks!
I think this should have been fixed in #5807: #5807 (comment)
Expected Behavior

If the `timeouts.tasks` time is exceeded, the task will not be retried. `finally` should always be executed.

Actual Behavior

If the task is retried, the task time exceeds `timeouts.tasks`. In addition, if this causes the pipeline execution time to exceed `timeouts.pipeline`, `finally` is forcibly timed out.

Steps to Reproduce the Problem

Additional Info