
task time exceeded timeouts.tasks when task retried #4071

Closed
ornew opened this issue Jul 2, 2021 · 17 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@ornew

ornew commented Jul 2, 2021

Expected Behavior

If the timeouts.tasks time is exceeded, the task will not be retried.

finally should always be executed.

Actual Behavior

If the task is retried, the total task time exceeds timeouts.tasks.

In addition, if this causes the pipeline execution time to exceed timeouts.pipeline, finally is forcibly timed out.

Steps to Reproduce the Problem

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: please-say-bye
spec:
  tasks:
    - name: hi
      retries: 2
      taskSpec:
        steps:
          - name: hi
            image: alpine:3.12
            script: |
              echo 'hi'
              sleep 10
  finally:
    - name: bye
      taskSpec:
        steps:
          - name: bye
            image: alpine:3.12
            script: |
              echo 'bye'
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: please-say-bye-
spec:
  timeouts:
    pipeline: 10s
    tasks: 5s
    finally: 5s
  pipelineRef:
    name: please-say-bye
❯ tkn -n pipelines tr ls --label tekton.dev/pipeline=please-say-bye

NAME                             STARTED          DURATION    STATUS
please-say-bye-tzjdt-bye-pft4x   25 seconds ago   1 second    Failed(TaskRunTimeout)
please-say-bye-tzjdt-hi-qqlgs    30 seconds ago   5 seconds   Failed(TaskRunTimeout)

❯ tkn -n pipelines pr desc please-say-bye-tzjdt
Name:              please-say-bye-tzjdt
Namespace:         pipelines
Pipeline Ref:      please-say-bye
Service Account:   default
Labels:
 tekton.dev/pipeline=please-say-bye

🌡️  Status

STARTED        DURATION     STATUS
1 minute ago   14 seconds   Failed

💌 Message

Tasks Completed: 2 (Failed: 2, Cancelled 0), Skipped: 0 (TaskRun "please-say-bye-tzjdt-bye-pft4x" failed to finish within "1s")

📦 Resources

 No resources

⚓ Params

 No params

📝 Results

 No results

📂 Workspaces

 No workspaces

🗂  Taskruns

 NAME                               TASK NAME   STARTED        DURATION    STATUS
 ∙ please-say-bye-tzjdt-bye-pft4x   bye         1 minute ago   1 second    Failed(TaskRunTimeout)
 ∙ please-say-bye-tzjdt-hi-qqlgs    hi          1 minute ago   5 seconds   Failed(TaskRunTimeout)

⏭️  Skipped Tasks

 No Skipped Tasks

(Screenshot: 2021-07-02 17:59:45)

Additional Info

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.4-eks-6b7464", GitCommit:"6b746440c04cb81db4426842b4ae65c3f7035e53", GitTreeState:"clean", BuildDate:"2021-03-19T19:33:03Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
Client version: 0.19.0
Pipeline version: v0.25.0
Triggers version: v0.14.0
Dashboard version: v0.17.0
@ornew ornew added the kind/bug Categorizes issue or PR as related to a bug. label Jul 2, 2021
@vdemeester
Member

@ornew thanks for the issue.
cc @souleb @pritidesai @sbwsg

@souleb
Contributor

souleb commented Jul 2, 2021

Hi @ornew, thanks for the issue. The task timeout is processed at runtime from the pipelinerun start time. If a task is retried, then, taking the pipelinerun start time into consideration, your task is already timed out. Today the coded logic is to give the task 1s to run. I am actually not sure why we give that second...
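
A minimal sketch of the allocation logic described above (illustrative names only, not the actual Tekton controller code): the budget for a newly created or retried TaskRun is whatever remains of timeouts.tasks measured from the PipelineRun's start time, floored at one second, which is why the retried task in the reproducer is reported as failing to finish within "1s":

package main

import (
	"fmt"
	"time"
)

// remainingTaskTimeout is an illustrative stand-in for the controller logic
// described above: a (re)created TaskRun gets whatever is left of
// timeouts.tasks, measured from the PipelineRun's start time, and if that
// budget is already spent it still gets a one-second window.
func remainingTaskTimeout(prStart time.Time, tasksTimeout time.Duration, now time.Time) time.Duration {
	remaining := tasksTimeout - now.Sub(prStart)
	if remaining <= 0 {
		return time.Second // behind schedule: still allocate 1s
	}
	return remaining
}

func main() {
	// Numbers from the reproducer: timeouts.tasks is 5s and the first
	// attempt consumed all of it, so the retry is allocated only 1s.
	prStart := time.Now().Add(-5 * time.Second)
	fmt.Println(remainingTaskTimeout(prStart, 5*time.Second, time.Now()))
}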

@ornew
Author

ornew commented Jul 9, 2021

@souleb Thanks for checking this issue :)

When a TaskRun retries, the Timeout is not recalculated but the StartTime is reset. It looks to me like the retried task can pass through HasTimedOut.

return c.PipelineClientSet.TektonV1beta1().TaskRuns(pr.Namespace).UpdateStatus(ctx, tr, metav1.UpdateOptions{})


tr.Status.StartTime = nil

runtime := time.Since(tr.Status.StartTime.Time)
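
A small sketch of the interaction described here, with made-up types rather than the real Tekton API: a timeout check that only looks at Status.StartTime is bypassed once a retry resets that field.

package main

import (
	"fmt"
	"time"
)

// Illustrative types only; not the real Tekton TaskRun API.
type taskRunStatus struct {
	StartTime *time.Time
}

// hasTimedOut mirrors the shape of the check referenced above: the runtime
// is measured from Status.StartTime alone.
func hasTimedOut(status taskRunStatus, timeout time.Duration, now time.Time) bool {
	if status.StartTime == nil {
		return false // no start time recorded yet, so not considered timed out
	}
	return now.Sub(*status.StartTime) >= timeout
}

func main() {
	firstStart := time.Now().Add(-12 * time.Second)
	status := taskRunStatus{StartTime: &firstStart}
	fmt.Println(hasTimedOut(status, 10*time.Second, time.Now())) // true: first attempt is past its timeout

	// On retry, the controller archives the old attempt and resets StartTime,
	// so the next reconcile measures from a nil/fresh start time and the
	// retried TaskRun slips through the check.
	status.StartTime = nil
	fmt.Println(hasTimedOut(status, 10*time.Second, time.Now())) // false
}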

One more thing: the following output is the status of the third attempt, which failed due to a timeout. The first and second attempts ran in the same pod. Also, the second attempt ended with exit code 255 instead of a timeout; it did not run until the timeout.

completionTime: '2021-07-09T12:54:11Z'
conditions:
  - lastTransitionTime: '2021-07-09T12:54:11Z'
    message: TaskRun "please-say-bye-62xd2-hi-wwzpm" failed to finish within "10s"
    reason: TaskRunTimeout
    status: 'False'
    type: Succeeded
podName: please-say-bye-62xd2-hi-wwzpm-pod-2tc4n
retriesStatus:
  - completionTime: '2021-07-09T12:54:00Z'
    conditions:
      - lastTransitionTime: '2021-07-09T12:54:00Z'
        message: TaskRun "please-say-bye-62xd2-hi-wwzpm" failed to finish within "10s"
        reason: TaskRunTimeout
        status: 'False'
        type: Succeeded
    podName: please-say-bye-62xd2-hi-wwzpm-pod-8s5nz
    startTime: '2021-07-09T12:53:50Z'
    steps:
      - container: step-hi
        imageID: >-
          docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
        name: hi
        terminated:
          exitCode: 1
          finishedAt: '2021-07-09T12:54:00Z'
          reason: TaskRunTimeout
          startedAt: '2021-07-09T12:53:53Z'
    taskSpec:
      steps:
        - image: 'alpine:3.12'
          name: hi
          resources: {}
          script: |
            echo 'hi'
            #exit 1
            sleep 30
  - completionTime: '2021-07-09T12:54:01Z'
    conditions:
      - lastTransitionTime: '2021-07-09T12:54:01Z'
        message: >
          "step-hi" exited with code 255 (image:
          "docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e");
          for logs run: kubectl -n default logs
          please-say-bye-62xd2-hi-wwzpm-pod-8s5nz -c step-hi
        reason: Failed
        status: 'False'
        type: Succeeded
    podName: please-say-bye-62xd2-hi-wwzpm-pod-8s5nz
    startTime: '2021-07-09T12:54:00Z'
    steps:
      - container: step-hi
        imageID: >-
          docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
        name: hi
        terminated:
          containerID: >-
            containerd://f86b959de7ee03bbf4e2b9df638d5d937981a863d4f124a74dcb3580535fcef9
          exitCode: 255
          finishedAt: '2021-07-09T12:54:00Z'
          reason: Error
          startedAt: '2021-07-09T12:53:54Z'
    taskSpec:
      steps:
        - image: 'alpine:3.12'
          name: hi
          resources: {}
          script: |
            echo 'hi'
            #exit 1
            sleep 30
startTime: '2021-07-09T12:54:01Z'
steps:
  - container: step-hi
    imageID: >-
      docker.io/library/alpine@sha256:87703314048c40236c6d674424159ee862e2b96ce1c37c62d877e21ed27a387e
    name: hi
    terminated:
      exitCode: 1
      finishedAt: '2021-07-09T12:54:11Z'
      reason: TaskRunTimeout
      startedAt: '2021-07-09T12:54:04Z'
taskSpec:
  steps:
    - image: 'alpine:3.12'
      name: hi
      resources: {}
      script: |
        echo 'hi'
        #exit 1
        sleep 30

Some debug output:

2021-07-09T14:26:36.180Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:94   MARK_DEBUG pod_name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk, timeout=&Duration{Duration:10s,} retry=0 has_started=true is_done=true is_cancelled=false has_timed_out=true       {"commit": "0e9d9e6", "knative.dev/controller": "gh.neting.cc.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "2e4788be-323f-4e61-8b52-5f47433e13a2", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.190Z        debug   tekton-pipelines-controller     pipelinerun/pipelinerun.go:731  MARK_DEBUG pod_name=, timeout=&Duration{Duration:10s,} retry=1 new_timeout=&{10000000000}       {"commit": "0e9d9e6", "knative.dev/controller": "gh.neting.cc.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler", "knative.dev/kind": "tekton.dev.PipelineRun", "knative.dev/traceid": "6f6a60c3-b5dc-4020-97ef-8dfadce1eccb", "knative.dev/key": "default/please-say-bye-s666f"}
2021-07-09T14:26:36.221Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:94   MARK_DEBUG pod_name=, timeout=&Duration{Duration:10s,} retry=1 has_started=false is_done=false is_cancelled=false has_timed_out=false   {"commit": "0e9d9e6", "knative.dev/controller": "gh.neting.cc.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "eea9dc61-2f72-4718-910a-d6a82fc1d6e5", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.239Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:420  MARK_DEBUG DidTaskRunFail name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk phase=Running    {"commit": "0e9d9e6", "knative.dev/controller": "gh.neting.cc.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "eea9dc61-2f72-4718-910a-d6a82fc1d6e5", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}
2021-07-09T14:26:36.345Z        debug   tekton-pipelines-controller     taskrun/taskrun.go:94   MARK_DEBUG pod_name=please-say-bye-s666f-hi-f5rg9-pod-7jgnk, timeout=&Duration{Duration:10s,} retry=1 has_started=true is_done=false is_cancelled=false has_timed_out=false     {"commit": "0e9d9e6", "knative.dev/controller": "gh.neting.cc.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler", "knative.dev/kind": "tekton.dev.TaskRun", "knative.dev/traceid": "c820bcf9-5bae-4526-bffe-8f0d41289baf", "knative.dev/key": "default/please-say-bye-s666f-hi-f5rg9"}

This is my speculation (I'm not sure about the details):

  • The TaskRun status is reset when it is retried
  • A TaskRun reconcile occurs
  • The TaskRun passes through HasTimedOut because there is no status yet
  • The controller gets the pod before its status has changed, because the timeout handling is delayed (what is the delay???), and reuses it
  • However, the pod soon exits with 255

@souleb
Contributor

souleb commented Jul 16, 2021

What you describe is basically what happens.

Also, you are right that the Timeout is not recalculated. But the time that is allocated to your taskrun is. And this allocated-time calculation takes into account several parameters, including the pipelinerun and taskrun Timeout and StartTime.

func getFinallyTaskRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun, rprt *resources.ResolvedPipelineRunTask) *metav1.Duration {

func getTaskRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun, rprt *resources.ResolvedPipelineRunTask) *metav1.Duration {

It takes the declared taskrun Timeout, adds it to the StartTime, and checks whether we are behind schedule. If we are behind schedule, it still allocates 1s to run.

Does that make more sense?
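
To make that calculation concrete, here is a sketch under the same assumptions as the earlier one (illustrative names, not the actual getTaskRunTimeout implementation): the allocated time is the smaller of the task's declared timeout and whatever remains of the pipeline-level budget counted from the PipelineRun's StartTime, and a task that is already behind schedule still gets one second.

package main

import (
	"fmt"
	"time"
)

// allocatedTimeout is an illustrative sketch of the calculation described
// above, not the actual getTaskRunTimeout implementation.
func allocatedTimeout(prStart time.Time, pipelineBudget, declaredTaskTimeout time.Duration, now time.Time) time.Duration {
	deadline := prStart.Add(pipelineBudget)
	if !now.Before(deadline) {
		return time.Second // already behind schedule: still allocate 1s
	}
	remaining := deadline.Sub(now)
	if declaredTaskTimeout > 0 && declaredTaskTimeout < remaining {
		return declaredTaskTimeout
	}
	return remaining
}

func main() {
	// 10s of pipeline budget, a 5s declared task timeout, and 7s already
	// elapsed leave roughly 3s for this attempt.
	prStart := time.Now().Add(-7 * time.Second)
	fmt.Println(allocatedTimeout(prStart, 10*time.Second, 5*time.Second, time.Now()))
}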

@ornew
Author

ornew commented Aug 7, 2021

Yes, I had already seen what was happening. I opened this issue because it is not the expected behavior, and I wanted to check whether it was intended.

The problem for me is that Tekton still doesn't guarantee the execution of finally at all. This is a really serious problem for operating in production: the Tekton installation we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with the existing timeout, and I expected the implementation of the new timeouts to be reliable. But it still doesn't work.

I think it's counterintuitive that the task exceeds the time it was given and that finally then runs for only one second.
Could you reconsider prioritizing finally task execution over tasks (and their retries)? I think it's worth it.

@souleb
Contributor

souleb commented Aug 7, 2021

If you look at the related TEP, TEP-46, you will see what you are talking about covered in the alternatives section. There has been a lot of discussion around whether the execution of finally tasks must be guaranteed. If I remember well, I think we agreed that the new timeouts would give everyone enough flexibility to ensure that their tasks run as wanted. Would you mind sharing the production use cases where that is not the case? Also, @pritidesai, you may want to add something here.

@ornew
Author

ornew commented Aug 8, 2021

The real problem for us is that finally execution is not guaranteed.

But I'm not saying that the timeout should guarantee the execution of finally. I would like to find out whether there is a proper, intentional reason why finally is not running here.

TEP-46 says:

The users have an ability to specify the timeout for the tasks section.

In practice, tasks run beyond timeouts.tasks when retries occur. Does this behavior really match the TEP description and intuition?

I'm sorry I haven't been able to follow all the discussions. I checked the TEPs in advance as much as possible, but I cannot find why task execution is allowed to exceed timeouts.tasks. If there is a good reason for it to be allowed, it should be documented in the TEP.

For example, our use case corresponds to the dogfooding scenario described in the TEP.

Cleanup cluster resources after finishing (with success/failure) integration tests (Dogfooding Scenario)

A simplification of the pipeline that actually caused the problem would look like this:

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: our-use-case-example-
spec:
  timeouts:
    # We are explicitly setting timeouts
    pipeline: 5h
    tasks: 4h
    finally: 1h
  pipelineSpec:
    tasks:
      - name: create-resources
        retries: 2
        taskSpec:
          steps:
            - name: create-resources
              script: |
                # Allocate a lot of cluster resources.
                # For example:
                # - run AWS EMR clusters for processing big data
                # - run many pods for load testing before deployment
      - name: heavy-workload
        retries: 2
        taskSpec:
          steps:
            - name: workload
              script: |
                # This may be retried due to an external system...
    finally:
      - name: cleanup-all-resources
        taskSpec:
          steps:
            - name: cleanup
              script: |
                # Cleanup all resources. We expect this to be run.
                # We give the task 4 hours and the pipeline 5 hours. There is an hour grace period.
                # However, if a retry occurs, this is now almost always given only 1 second.
                # By not recalculating the task timeout, the overall task execution time can exceed 4 hours.
                # In this case, the task will run for up to 5 hours. (why? We have specified timeouts.tasks as 4 hours.)
                # It does not behave intuitively with respect to the given definition.
                # As a result, some resources will not be released.

I've explained why the task runs for more than 4 hours, and the actual problem caused by finally not running.

Of course, we do account for the possibility that finally will not be executed. In reality, we run external monitoring that releases resources, but these resources and that monitoring are generally very expensive. It is costly enough that we expect the timeout to be honored as specified, regardless of retries.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 6, 2021
@lbernick
Member

lbernick commented Nov 8, 2021

/assign lbernick
/assign jerop

@jerop
Member

jerop commented Nov 8, 2021

/remove-lifecycle stale

@lbernick and I are looking into it

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 8, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 9, 2021
Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timed out before they have run for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally` timeout has been allocated.

As described by @ornew in the issue report:
"The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`."

This issue is caused by the interaction between retries and timeouts. When a `TaskRun` is retried:
- The status of its failed execution is stored in `RetriesStatus`.
- The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`.

Resetting the `StartTime` is useful so we can record the `StartTime` of each retry in `RetriesStatus`. However, using the reset time as the start time of the `TaskRun` when checking whether it has timed out is incorrect and causes the issues described above.

In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timed out. We do this by checking the start time of previous attempts as well, instead of the current attempt only.

Alternative approaches considered include:
- not resetting the start time of the `TaskRun` upon retry, but then it would be challenging to know the execution times of each retry.
- keeping track of the actual start time in an extra field, but this is information that's already available in the status.

References:
- [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md)
- [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071)
- [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989)

Co-authored-by: Lee Bernick <lbernick@google.com>
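
A sketch of the fix this commit message describes, using stand-in types rather than the real Tekton status structs: the timeout check measures from the earliest start time across the current attempt and the archived retries.

package main

import (
	"fmt"
	"time"
)

// Stand-in types for the status fields mentioned in the commit message;
// not the real Tekton API.
type attempt struct {
	StartTime *time.Time
}

type taskRunStatus struct {
	StartTime     *time.Time
	RetriesStatus []attempt
}

// effectiveStartTime returns the earliest start time across the current
// attempt and all archived retries: the "actual start time" that the
// commit message proposes using for the timeout check.
func effectiveStartTime(s taskRunStatus) *time.Time {
	earliest := s.StartTime
	for _, a := range s.RetriesStatus {
		if a.StartTime != nil && (earliest == nil || a.StartTime.Before(*earliest)) {
			earliest = a.StartTime
		}
	}
	return earliest
}

func hasTimedOut(s taskRunStatus, timeout time.Duration, now time.Time) bool {
	start := effectiveStartTime(s)
	return start != nil && now.Sub(*start) >= timeout
}

func main() {
	first := time.Now().Add(-15 * time.Second) // original attempt
	second := time.Now().Add(-3 * time.Second) // retry reset StartTime 3s ago
	s := taskRunStatus{StartTime: &second, RetriesStatus: []attempt{{StartTime: &first}}}
	// With a 10s timeout the TaskRun is timed out, because time is measured
	// from the first attempt rather than from the reset StartTime.
	fmt.Println(hasTimedOut(s, 10*time.Second, time.Now())) // true
}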
@jerop jerop added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 9, 2021
@jerop jerop added this to the Pipelines v0.30 milestone Nov 9, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 9, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
jerop added a commit to jerop/pipeline that referenced this issue Nov 10, 2021
@lbernick
Member

lbernick commented Apr 7, 2022

Just want to update with a summary of the work already done and that needs to be done:

  • We've updated timeouts.tasks and timeouts.finally to account for time already spent during the pipelinerun.
  • We need to update the pipelinerun controller to set a new timeout in the taskrun spec on every attempt. (We currently just update the status, including the start time.) Leaving the TaskRun's start time alone when creating a new retry will not work, as the timeout must be recalculated each time by the pipelinerun controller based on how much time has passed during execution of the pipelinerun's tasks, and the TaskRun's timeout reflects the minimum of that number and the timeout set for the pipeline task itself.
  • We need to update the function that sets timeouts for TaskRuns contained in pipelines to correctly handle the situation where a pipelinerun has already timed out, but a new task is being created anyway. Currently, it will create a new taskrun with a timeout of 1s, but no new taskruns should be created, and an error should be returned. If this is not done, we will keep creating retries with a 1s timeout even after pipeline-level timeouts have passed (see the sketch after this list).
  • Once this is complete, we need to correctly account for time spent during retries for timeouts.finally (addressed in Account for finally TaskRun retries in PR timeouts #4508).
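
A sketch of the recalculation described in the list above, with illustrative names rather than the real controller functions: on each attempt the timeout written into the new TaskRun spec is recomputed from the time already spent in the PipelineRun and capped by the pipeline task's own timeout, and if timeouts.tasks is already exhausted the controller returns an error instead of creating another 1s TaskRun.

package main

import (
	"errors"
	"fmt"
	"time"
)

// timeoutForAttempt is an illustrative sketch of the behavior proposed in
// the list above, not the actual Tekton implementation.
func timeoutForAttempt(prStart time.Time, tasksBudget, pipelineTaskTimeout time.Duration, now time.Time) (time.Duration, error) {
	remaining := tasksBudget - now.Sub(prStart)
	if remaining <= 0 {
		// Proposed behavior: do not create another TaskRun with a 1s window.
		return 0, errors.New("timeouts.tasks already exceeded; no further TaskRuns should be created")
	}
	if pipelineTaskTimeout > 0 && pipelineTaskTimeout < remaining {
		return pipelineTaskTimeout, nil
	}
	return remaining, nil
}

func main() {
	// For illustration: a 4h tasks budget, a hypothetical 2h per-task
	// timeout, and 3h already elapsed leave 1h for this attempt.
	prStart := time.Now().Add(-3 * time.Hour)
	if d, err := timeoutForAttempt(prStart, 4*time.Hour, 2*time.Hour, time.Now()); err == nil {
		fmt.Println(d)
	}
}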

Unfortunately Jerop and I don't have the bandwidth to prioritize this issue right now.

/unassign
/unassign @jerop
/help
/lifecycle frozen

@tekton-robot
Collaborator

@lbernick:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


@tekton-robot tekton-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Apr 7, 2022
@pritidesai pritidesai removed this from the Pipelines v0.35 milestone May 3, 2022
@vsinghai

/assign @vsinghai

@lbernick
Member

lbernick commented Sep 8, 2022

I'm returning to this issue to see if it has been resolved by #5134 (FYI @abayer).

@ornew, I'm curious why you say in your original comment that the finally task should not be timed out if timeouts.pipeline is exceeded. I think not running the finally task is intended behavior, as timeouts.pipeline should refer to the entire time the pipeline is running. I tried using your example posted in the original comment, and it results in the pipelinerun being timed out after 10 seconds and finally tasks not being run, which I believe is the correct behavior, as the pipelinerun should stop running after 10s.

If you'd like to allow the finally tasks to run indefinitely but have the tasks section time out after some time, I think you need to specify timeouts.tasks = 5s and timeouts.pipeline = 0 (no timeout). (Unfortunately this doesn't work -- I filed #5459, but it should be easily fixable.)

There's still a bug, though, with the retried taskrun not being timed out when timeouts.tasks is reached. Here's a reproducer:

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: please-say-bye-again
spec:
  tasks:
    - name: hi
      retries: 2
      taskSpec:
        steps:
          - name: hi
            image: alpine:3.12
            script: |
              echo 'hi'
              sleep 10
              exit 1
  finally:
    - name: bye
      taskSpec:
        steps:
          - name: bye
            image: alpine:3.12
            script: |
              echo 'bye'
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: please-say-bye-again-
spec:
  timeouts:
    pipeline: 1m
    tasks: 19s
  pipelineRef:
    name: please-say-bye-again

In this example, each attempt (there are 3) sleeps for 10s and fails, the finally task is run, and the pipelinerun fails. I would expect that the first attempt fails and retries, the taskrun is canceled before the second attempt completes, the finally task runs, and the pipelinerun fails.

@ornew
Author

ornew commented Sep 14, 2022

@lbernick

I'm curious why you say in your original comment that the finally task should not be timed out if timeouts.pipeline is exceeded.

I agree with you that the finally task should be timed out if timeouts.pipeline is exceeded.

This issue originally reported that timeouts.tasks does not work as the user expects: tasks keep being retried after timeouts.tasks has been exceeded, until timeouts.pipeline is exceeded.

When I said "finally should not time out" here, I meant the behavior the user expects in the situation where timeouts.pipeline is consumed as a result of timeouts.tasks being handled incorrectly.

@lbernick
Member

got it, that makes sense, thanks!

@lbernick
Member

I think this should have been fixed in #5807: #5807 (comment)

@github-project-automation github-project-automation bot moved this from In Progress to Done in Pipelines V1 Jan 18, 2023