
task time exceeded timeouts.tasks when task retried #4071

Closed
ornew opened this issue Jul 2, 2021 · 17 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@ornew

ornew commented Jul 2, 2021

Expected Behavior

If the timeouts.tasks time is exceeded, the task will not be retried.

finally should always be executed.

Actual Behavior

If the task is retried, the total task time exceeds timeouts.tasks.

In addition, if this causes the pipeline execution time to exceed timeouts.pipeline, finally is forcibly timed out.

Steps to Reproduce the Problem

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: please-say-bye
spec:
  tasks:
    - name: hi
      retries: 2
      taskSpec:
        steps:
          - name: hi
            image: alpine:3.12
            script: |
              echo 'hi'
              sleep 10
  finally:
    - name: bye
      taskSpec:
        steps:
          - name: bye
            image: alpine:3.12
            script: |
              echo 'bye'
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: please-say-bye-
spec:
  timeouts:
    pipeline: 10s
    tasks: 5s
    finally: 5s
  pipelineRef:
    name: please-say-bye
❯ tkn -n pipelines tr ls --label tekton.dev/pipeline=please-say-bye

NAME                             STARTED          DURATION    STATUS
please-say-bye-tzjdt-bye-pft4x   25 seconds ago   1 second    Failed(TaskRunTimeout)
please-say-bye-tzjdt-hi-qqlgs    30 seconds ago   5 seconds   Failed(TaskRunTimeout)

❯ tkn -n pipelines pr desc please-say-bye-tzjdt
Name:              please-say-bye-tzjdt
Namespace:         pipelines
Pipeline Ref:      please-say-bye
Service Account:   default
Labels:
 tekton.dev/pipeline=please-say-bye

🌡️  Status

STARTED        DURATION     STATUS
1 minute ago   14 seconds   Failed

💌 Message

Tasks Completed: 2 (Failed: 2, Cancelled 0), Skipped: 0 (TaskRun "please-say-bye-tzjdt-bye-pft4x" failed to finish within "1s")

📦 Resources

 No resources

⚓ Params

 No params

📝 Results

 No results

📂 Workspaces

 No workspaces

🗂  Taskruns

 NAME                               TASK NAME   STARTED        DURATION    STATUS
 ∙ please-say-bye-tzjdt-bye-pft4x   bye         1 minute ago   1 second    Failed(TaskRunTimeout)
 ∙ please-say-bye-tzjdt-hi-qqlgs    hi          1 minute ago   5 seconds   Failed(TaskRunTimeout)

⏭️  Skipped Tasks

 No Skipped Tasks

(Screenshot: 2021-07-02 17:59:45)

Additional Info

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.4-eks-6b7464", GitCommit:"6b746440c04cb81db4426842b4ae65c3f7035e53", GitTreeState:"clean", BuildDate:"2021-03-19T19:33:03Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
Client version: 0.19.0
Pipeline version: v0.25.0
Triggers version: v0.14.0
Dashboard version: v0.17.0
@ornew ornew added the kind/bug Categorizes issue or PR as related to a bug. label Jul 2, 2021
@vdemeester
Member

@ornew thanks for the issue.
cc @souleb @pritidesai @sbwsg

@souleb
Contributor

souleb commented Jul 2, 2021

Hi @ornew, thanks for the issue. The task timeout is processed at runtime from the pipelinerun start time. If a task is retried, then, taking the pipelinerun start time into consideration, your task is already timed out. Today the coded logic is to give the task 1s to run. I am actually not sure why we give that second...
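
A minimal sketch of the allocation logic described above (illustrative names only, not the actual Tekton controller code): the budget for a newly created or retried TaskRun is whatever remains of timeouts.tasks measured from the PipelineRun's start time, floored at one second, which is why the retried task in the reproducer is reported as failing to finish within "1s":

package main

import (
	"fmt"
	"time"
)

// remainingTaskTimeout is an illustrative stand-in for the controller logic
// described above: a (re)created TaskRun gets whatever is left of
// timeouts.tasks, measured from the PipelineRun's start time, and if that
// budget is already spent it still gets a one-second window.
func remainingTaskTimeout(prStart time.Time, tasksTimeout time.Duration, now time.Time) time.Duration {
	remaining := tasksTimeout - now.Sub(prStart)
	if remaining <= 0 {
		return time.Second // behind schedule: still allocate 1s
	}
	return remaining
}

func main() {
	// Numbers from the reproducer: timeouts.tasks is 5s and the first
	// attempt consumed all of it, so the retry is allocated only 1s.
	prStart := time.Now().Add(-5 * time.Second)
	fmt.Println(remainingTaskTimeout(prStart, 5*time.Second, time.Now()))
}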

@ornew
Author

ornew commented Jul 9, 2021

@souleb Thanks for checking this issue :)

When a TaskRun retries, the Timeout is not recalculated but the StartTime is reset. It looks to me like the retried task can pass through HasTimedOut.

return c.PipelineClientSet.TektonV1beta1().TaskRuns(pr.Namespace).UpdateStatus(ctx, tr, metav1.UpdateOptions{})


tr.Status.StartTime = nil

runtime := time.Since(tr.Status.StartTime.Time)
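
A small sketch of the interaction described here, with made-up types rather than the real Tekton API: a timeout check that only looks at Status.StartTime is bypassed once a retry resets that field.

package main

import (
	"fmt"
	"time"
)

// Illustrative types only; not the real Tekton TaskRun API.
type taskRunStatus struct {
	StartTime *time.Time
}

// hasTimedOut mirrors the shape of the check referenced above: the runtime
// is measured from Status.StartTime alone.
func hasTimedOut(status taskRunStatus, timeout time.Duration, now time.Time) bool {
	if status.StartTime == nil {
		return false // no start time recorded yet, so not considered timed out
	}
	return now.Sub(*status.StartTime) >= timeout
}

func main() {
	firstStart := time.Now().Add(-12 * time.Second)
	status := taskRunStatus{StartTime: &firstStart}
	fmt.Println(hasTimedOut(status, 10*time.Second, time.Now())) // true: first attempt is past its timeout

	// On retry, the controller archives the old attempt and resets StartTime,
	// so the next reconcile measures from a nil/fresh start time and the
	// retried TaskRun slips through the check.
	status.StartTime = nil
	fmt.Println(hasTimedOut(status, 10*time.Second, time.Now())) // false
}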

One more thing: the following output is the status of the third attempt, which failed due to a timeout. The first and second attempts ran in the same pod. Also, the second attempt ended with exit code 255 instead of a timeout; it did not run until the timeout.

completionTime: '2021-07-09T12:54:11Z'
conditions:
  - lastTransitionTime: '2021-07-09T12:54:11Z'
    message: TaskRun "please-say-bye-62xd2-hi-wwzpm" failed to finish within "10s"
    reason: TaskRunTimeout
    status: 'False'
    type: Succeeded
podName: please-say-bye-62xd2-hi-wwzpm-pod-2tc4n
retriesStatus:
  - completionTime: '2021-07-09T12:54:00Z'
    conditions:
      - lastTransitionTime: '2021-07-09T12:54:00Z'
        message: TaskRun "please-say-bye-62xd2-hi-wwzpm" failed to finish within "10s"
        reason: TaskRunTimeout
        status: 'False'
        type: Succeeded
    podName: please-say-bye-62xd2-hi-wwzpm-pod-8s5nz
    startTime: '2021-07-09T12:53:50Z'
    steps:
      - container: step-hi
        imageID: >-
          docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
        name: hi
        terminated:
          exitCode: 1
          finishedAt: '2021-07-09T12:54:00Z'
          reason: TaskRunTimeout
          startedAt: '2021-07-09T12:53:53Z'
    taskSpec:
      steps:
        - image: 'alpine:3.12'
          name: hi
          resources: {}
          script: |
            echo 'hi'
            #exit 1
            sleep 30
  - completionTime: '2021-07-09T12:54:01Z'
    conditions:
      - lastTransitionTime: '2021-07-09T12:54:01Z'
        message: >
          "step-hi" exited with code 255 (image:
          "docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e");
          for logs run: kubectl -n default logs
          please-say-bye-62xd2-hi-wwzpm-pod-8s5nz -c step-hi
        reason: Failed
        status: 'False'
        type: Succeeded
    podName: please-say-bye-62xd2-hi-wwzpm-pod-8s5nz
    startTime: '2021-07-09T12:54:00Z'
    steps:
      - container: step-hi
        imageID: >-
          docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
        name: hi
        terminated:
          containerID: >-
            containerd://f86b959de7ee03bbf4e2b9df638d5d937981a863d4f124a74dcb3580535fcef9
          exitCode: 255
          finishedAt: '2021-07-09T12:54:00Z'
          reason: Error
          startedAt: '2021-07-09T12:53:54Z'
    taskSpec:
      steps:
        - image: 'alpine:3.12'
          name: hi
          resources: {}
          script: |
            echo 'hi'
            #exit 1
            sleep 30
startTime: '2021-07-09T12:54:01Z'
steps:
  - container: step-hi
    imageID: >-
      docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
    name: hi
    terminated:
      exitCode: 1
      finishedAt: '2021-07-09T12:54:11Z'
      reason: TaskRunTimeout
      startedAt: '2021-07-09T12:54:04Z'
taskSpec:
  steps:
    - image: 'alpine:3.12'
      name: hi
      resources: {}
      script: |
        echo 'hi'
        #exit 1
        sleep 30

Some debug output:

2021-07-09T14:26:36.180Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:94   MARK_DEBUG pod_name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk, timeout=&Duration{Duration:10s,} retry=0 has_started=true is_done=true is_cancelled=false has_timed_out=true       {"commit": "0e9d9e6", "knative.dev/controller": "gh.neting.cc.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "2e4788be-323f-4e61-8b52-5f47433e13a2", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.190Z        debug   tekton-pipelines-controller     pipelinerun/pipelinerun.go:731  MARK_DEBUG pod_name=, timeout=&Duration{Duration:10s,} retry=1 new_timeout=&{10000000000}       {"commit": "0e9d9e6", "knative.dev/controller": "gh.neting.cc.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler", "knative.dev/kind": "tekton.dev.PipelineRun", "knative.dev/traceid": "6f6a60c3-b5dc-4020-97ef-8dfadce1eccb", "knative.dev/key": "default/please-say-bye-s666f"}
2021-07-09T14:26:36.221Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:94   MARK_DEBUG pod_name=, timeout=&Duration{Duration:10s,} retry=1 has_started=false is_done=false is_cancelled=false has_timed_out=false   {"commit": "0e9d9e6", "knative.dev/controller": "gh.neting.cc.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "eea9dc61-2f72-4718-910a-d6a82fc1d6e5", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.239Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:420  MARK_DEBUG DidTaskRunFail name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk phase=Running    {"commit": "0e9d9e6", "knative.dev/controller": "gh.neting.cc.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "eea9dc61-2f72-4718-910a-d6a82fc1d6e5", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.345Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:94   MARK_DEBUG pod_name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk, timeout=&Duration{Duration:10s,} retry=1 has_started=true is_done=false is_cancelled=false has_timed_out=false     {"commit": "0e9d9e6", "knative.dev/controller": "gh.neting.cc.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "c820bcf9-5bae-4526-bffe-8f0d41289baf", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}

This is my speculation (I'm not sure about the details):

  • The TaskRun status is reset when it is retried
  • A TaskRun reconcile occurs
  • The TaskRun passes through HasTimedOut because there is no status yet
  • The controller gets the pod before its status has changed, because the timeout handling is delayed (what is the delay???), and reuses it
  • However, the pod soon exits with 255

@souleb
Contributor

souleb commented Jul 16, 2021

What you describe is basically what happens.

Also, you are right that the Timeout is not recalculated. But the time that is allocated to your taskrun is. And this allocated-time calculation takes into account several parameters, including the pipelinerun and taskrun Timeout and StartTime.

func getFinallyTaskRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun, rprt *resources.ResolvedPipelineRunTask) *metav1.Duration {

func getTaskRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun, rprt *resources.ResolvedPipelineRunTask) *metav1.Duration {

It takes the declared taskrun Timeout, adds it to the StartTime, and checks whether we are behind schedule. If we are behind schedule, it still allocates 1s to run.

Does that make more sense?
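
To make that calculation concrete, here is a sketch under the same assumptions as the earlier one (illustrative names, not the actual getTaskRunTimeout implementation): the allocated time is the smaller of the task's declared timeout and whatever remains of the pipeline-level budget counted from the PipelineRun's StartTime, and a task that is already behind schedule still gets one second.

package main

import (
	"fmt"
	"time"
)

// allocatedTimeout is an illustrative sketch of the calculation described
// above, not the actual getTaskRunTimeout implementation.
func allocatedTimeout(prStart time.Time, pipelineBudget, declaredTaskTimeout time.Duration, now time.Time) time.Duration {
	deadline := prStart.Add(pipelineBudget)
	if !now.Before(deadline) {
		return time.Second // already behind schedule: still allocate 1s
	}
	remaining := deadline.Sub(now)
	if declaredTaskTimeout > 0 && declaredTaskTimeout < remaining {
		return declaredTaskTimeout
	}
	return remaining
}

func main() {
	// 10s of pipeline budget, a 5s declared task timeout, and 7s already
	// elapsed leave roughly 3s for this attempt.
	prStart := time.Now().Add(-7 * time.Second)
	fmt.Println(allocatedTimeout(prStart, 10*time.Second, 5*time.Second, time.Now()))
}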

@ornew
Author

ornew commented Aug 7, 2021

Yes, I had already seen what was happening. I opened this issue because it is not the expected behavior, and I wanted to check whether it was intended.

The problem for me is that Tekton still doesn't guarantee the execution of finally at all. This is a really serious problem for operating in production: the Tekton installation we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with the existing timeout, and I expected the implementation of the new timeouts to be reliable. But it still doesn't work.

I think it's counterintuitive that the task exceeds the time it was given and that finally then runs for only one second.
Could you reconsider prioritizing finally task execution over tasks (and their retries)? I think it's worth it.

@souleb
Contributor

souleb commented Aug 7, 2021

If you look at the related TEP, TEP-46, you will see what you are talking about covered in the alternatives section. There has been a lot of discussion around whether the execution of finally tasks must be guaranteed. If I remember well, I think we agreed that the new timeouts would give everyone enough flexibility to ensure that their tasks run as wanted. Would you mind sharing the production use cases where that is not the case? Also, @pritidesai, you may want to add something here.

@ornew
Author

ornew commented Aug 8, 2021

The real problem for us is that finally execution is not guaranteed.

But I'm not saying that the timeout should guarantee the execution of finally. I would like to find out whether there is a proper, intentional reason why finally is not running here.

TEP-46 says:

The users have an ability to specify the timeout for the tasks section.

In practice, tasks run beyond timeouts.tasks when retries occur. Does this behavior really match the TEP description and intuition?

I'm sorry I haven't been able to follow all the discussions. I checked the TEPs in advance as much as possible, but I cannot find why task execution is allowed to exceed timeouts.tasks. If there is a good reason for it to be allowed, it should be documented in the TEP.

For example, our use case corresponds to the dogfooding scenario described in the TEP.

Cleanup cluster resources after finishing (with success/failure) integration tests (Dogfooding Scenario)

A simplification of the pipeline that actually caused the problem would look like this:

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: our-use-case-example-
spec:
  timeouts:
    # We are explicitly setting timeouts
    pipeline: 5h
    tasks: 4h
    finally: 1h
  pipelineSpec:
    tasks:
      - name: create-resources
        retries: 2
        taskSpec:
          steps:
            - name: create-resources
              script: |
                # Allocate a lot of cluster resources.
                # For example:
                # - run AWS EMR clusters for processing big data
                # - run many pods for load testing before deployment
      - name: heavy-workload
        retries: 2
        taskSpec:
          steps:
            - name: workload
              script: |
                # This may be retried due to an external system...
    finally:
      - name: cleanup-all-resources
        taskSpec:
          steps:
            - name: cleanup
              script: |
                # Cleanup all resources. We expect this to be run.
                # We give the task 4 hours and the pipeline 5 hours. There is an hour grace period.
                # However, if a retry occurs, this is now almost always given only 1 second.
                # By not recalculating the task timeout, the overall task execution time can exceed 4 hours.
                # In this case, the task will run for up to 5 hours. (why? We have specified timeouts.tasks as 4 hours.)
                # It does not behave intuitively with respect to the given definition.
                # As a result, some resources will not be released.

I've explained why the task runs for more than 4 hours, and the actual problem caused by finally not running.

Of course, we do account for the possibility that finally will not be executed. In reality, we run external monitoring that releases resources, but these resources and that monitoring are generally very expensive. It is costly enough that we expect the timeout to be honored as specified, regardless of retries.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 6, 2021
@lbernick
Member

lbernick commented Nov 8, 2021

/assign lbernick
/assign jerop

@jerop
Member

jerop commented Nov 8, 2021

/remove-lifecycle stale

@lbernick and I are looking into it

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 8, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 9, 2021
Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timed out before they have run for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally` timeout has been allocated.

As described by @ornew in the issue report:
"The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`."

This issue is caused by the interaction between retries and timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful so we can record the `StartTime` of each retry in `RetriesStatus`. However, using the reset time as the start time of the `TaskRun` when checking whether it has timed out is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timed out. We do this by checking the start time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry, but then it would be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is information that's already available in the status.

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <lbernick@google.com>
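
A sketch of the fix this commit message describes, using stand-in types rather than the real Tekton status structs: the timeout check measures from the earliest start time across the current attempt and the archived retries.

package main

import (
	"fmt"
	"time"
)

// Stand-in types for the status fields mentioned in the commit message;
// not the real Tekton API.
type attempt struct {
	StartTime *time.Time
}

type taskRunStatus struct {
	StartTime     *time.Time
	RetriesStatus []attempt
}

// effectiveStartTime returns the earliest start time across the current
// attempt and all archived retries: the "actual start time" that the
// commit message proposes using for the timeout check.
func effectiveStartTime(s taskRunStatus) *time.Time {
	earliest := s.StartTime
	for _, a := range s.RetriesStatus {
		if a.StartTime != nil && (earliest == nil || a.StartTime.Before(*earliest)) {
			earliest = a.StartTime
		}
	}
	return earliest
}

func hasTimedOut(s taskRunStatus, timeout time.Duration, now time.Time) bool {
	start := effectiveStartTime(s)
	return start != nil && now.Sub(*start) >= timeout
}

func main() {
	first := time.Now().Add(-15 * time.Second) // original attempt
	second := time.Now().Add(-3 * time.Second) // retry reset StartTime 3s ago
	s := taskRunStatus{StartTime: &second, RetriesStatus: []attempt{{StartTime: &first}}}
	// With a 10s timeout the TaskRun is timed out, because time is measured
	// from the first attempt rather than from the reset StartTime.
	fmt.Println(hasTimedOut(s, 10*time.Second, time.Now())) // true
}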
@jerop jerop added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 9, 2021
@jerop jerop added this to the Pipelines v0.30 milestone Nov 9, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 9, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
@lbernick
Member

lbernick commented Apr 7, 2022

Just want to update with a summary of the work already done and that needs to be done:

  • We've updated timeouts.tasks and timeouts.finally to account for time already spent during the pipelinerun.
  • We need to update the pipelinerun controller to set a new timeout in the taskrun spec on every attempt. (We currently just update the status, including the start time.) Leaving the TaskRun's start time alone when creating a new retry will not work, as the timeout must be recalculated each time by the pipelinerun controller based on how much time has passed during execution of the pipelinerun's tasks, and the TaskRun's timeout reflects the minimum of that number and the timeout set for the pipeline task itself.
  • We need to update the function that sets timeouts for TaskRuns contained in pipelines to correctly handle the situation where a pipelinerun has already timed out, but a new task is being created anyway. Currently, it will create a new taskrun with a timeout of 1s, but no new taskruns should be created, and an error should be returned. If this is not done, we will keep creating retries with a 1s timeout even after pipeline-level timeouts have passed (see the sketch after this list).
  • Once this is complete, we need to correctly account for time spent during retries for timeouts.finally (addressed in Account for finally TaskRun retries in PR timeouts #4508).
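
A sketch of the recalculation described in the list above, with illustrative names rather than the real controller functions: on each attempt the timeout written into the new TaskRun spec is recomputed from the time already spent in the PipelineRun and capped by the pipeline task's own timeout, and if timeouts.tasks is already exhausted the controller returns an error instead of creating another 1s TaskRun.

package main

import (
	"errors"
	"fmt"
	"time"
)

// timeoutForAttempt is an illustrative sketch of the behavior proposed in
// the list above, not the actual Tekton implementation.
func timeoutForAttempt(prStart time.Time, tasksBudget, pipelineTaskTimeout time.Duration, now time.Time) (time.Duration, error) {
	remaining := tasksBudget - now.Sub(prStart)
	if remaining <= 0 {
		// Proposed behavior: do not create another TaskRun with a 1s window.
		return 0, errors.New("timeouts.tasks already exceeded; no further TaskRuns should be created")
	}
	if pipelineTaskTimeout > 0 && pipelineTaskTimeout < remaining {
		return pipelineTaskTimeout, nil
	}
	return remaining, nil
}

func main() {
	// For illustration: a 4h tasks budget, a hypothetical 2h per-task
	// timeout, and 3h already elapsed leave 1h for this attempt.
	prStart := time.Now().Add(-3 * time.Hour)
	if d, err := timeoutForAttempt(prStart, 4*time.Hour, 2*time.Hour, time.Now()); err == nil {
		fmt.Println(d)
	}
}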

Unfortunately Jerop and I don't have the bandwidth to prioritize this issue right now.

/unassign
/unassign @jerop
/help
/lifecycle frozen

@tekton-robot
Collaborator

@lbernick:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


@tekton-robot tekton-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Apr 7, 2022
@pritidesai pritidesai removed this from the Pipelines v0.35 milestone May 3, 2022
@vsinghai

/assign @vsinghai

@lbernick
Member

lbernick commented Sep 8, 2022

I'm returning to this issue to see if it has been resolved by #5134 (FYI @abayer).

@ornew, I'm curious why you say in your original comment that the finally task should not be timed out if timeouts.pipeline is exceeded. I think not running the finally task is intended behavior, as timeouts.pipeline should refer to the entire time the pipeline is running. I tried using your example posted in the original comment, and it results in the pipelinerun being timed out after 10 seconds and finally tasks not being run, which I believe is the correct behavior, as the pipelinerun should stop running after 10s.

If you'd like to allow the finally tasks to run indefinitely but have the tasks section time out after some time, I think you need to specify timeouts.tasks = 5s and timeouts.pipeline = 0 (no timeout). (Unfortunately this doesn't work -- I filed #5459, but it should be easily fixable.)

There's still a bug, though, with the retried taskrun not being timed out when timeouts.tasks is reached. Here's a reproducer:

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: please-say-bye-again
spec:
  tasks:
    - name: hi
      retries: 2
      taskSpec:
        steps:
          - name: hi
            image: alpine:3.12
            script: |
              echo 'hi'
              sleep 10
              exit 1
  finally:
    - name: bye
      taskSpec:
        steps:
          - name: bye
            image: alpine:3.12
            script: |
              echo 'bye'
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: please-say-bye-again-
spec:
  timeouts:
    pipeline: 1m
    tasks: 19s
  pipelineRef:
    name: please-say-bye-again

In this example, each attempt (there are 3) sleeps for 10s and fails, the finally task is run, and the pipelinerun fails. I would expect that the first attempt fails and retries, the taskrun is canceled before the second attempt completes, the finally task runs, and the pipelinerun fails.

@ornew
Author

ornew commented Sep 14, 2022

@lbernick

I'm curious why you say in your original comment that the finally task should not be timed out if timeouts.pipeline is exceeded.

I agree with you that the finally task should be timed out if timeouts.pipeline is exceeded.

This issue originally reported that timeouts.tasks does not work as the user expects: tasks keep being retried after timeouts.tasks has been exceeded, until timeouts.pipeline is exceeded.

When I said "finally should not time out" here, I meant the behavior the user expects in the situation where timeouts.pipeline is consumed as a result of timeouts.tasks being handled incorrectly.

@lbernick
Member

got it, that makes sense, thanks!

@lbernick
Member

I think this should have been fixed in #5807: #5807 (comment)

@github-project-automation github-project-automation bot moved this from In Progress to Done in Pipelines V1 Jan 18, 2023