WIP: Ensure that PipelineRuns are marked as timed out if a task timed out due to the PR timeout #5133

abayer · 2022-07-13T16:39:07Z

Changes

We've been seeing sporadic flaky failures for a number of e2e tests for a while now, such as TestPipelineRunTimeout and sidecar-related tests. I recently dug into exactly what differed between a success and a failure, specifically for TestPipelineRunTimeout, the most frequent of those flakes. I was able to determine that sometimes, the TaskRun would be timed out due to the PipelineRun-level timeout, but pr.HasTimedOut would not return true on the next reconciliation of the PipelineRun. This strongly suggests that the TaskRun timeout was calculated to end slightly before the PipelineRun timeout would end, and then the PipelineRun reconciliation happened in the very brief (milliseconds at most) window between the TaskRun completing as timed out and the PipelineRun timeout being reached.

It's not possible for us to make the end of the generated TaskRun timeout exactly match the end of the specified PipelineRun timeout: the TaskRun's StartTime won't be set until the TaskRun has actually been created, so as best as I can tell there'll always be some difference there. So I decided to work around this inherent limitation by instead tracking cases where we've set the TaskRun timeout based on PipelineRun.Status.StartTime + PipelineRun.PipelineTimeout(ctx), i.e., the TaskRun timeout is the time remaining between the PipelineRun's elapsed time so far and the point at which the PipelineRun itself would time out.

Then, if all tasks in a PipelineRun have completed, and at least one of them has timed out and had its timeout set based on that difference, we know that the PipelineRun has timed out, even if pr.HasTimedOut is returning false because we haven't quite yet hit the end of the PipelineRun's timeout duration.
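The check described above could look something like the following minimal sketch. The `taskResult` type, its fields, and `pipelineTimedOut` are hypothetical names invented for illustration; the real reconciler state is more involved.

```go
package main

import "fmt"

// taskResult is a hypothetical summary of one TaskRun in a PipelineRun.
type taskResult struct {
	Done          bool // the TaskRun has completed
	TimedOut      bool // the TaskRun finished because it timed out
	TimeoutFromPR bool // its timeout was derived from the PipelineRun timeout
}

// pipelineTimedOut reports whether the PipelineRun should be treated as
// timed out. hasTimedOut stands in for the result of pr.HasTimedOut.
func pipelineTimedOut(hasTimedOut bool, tasks []taskResult) bool {
	if hasTimedOut {
		return true
	}
	sawDerivedTimeout := false
	for _, t := range tasks {
		if !t.Done {
			// Something is still running, so no conclusion yet.
			return false
		}
		if t.TimedOut && t.TimeoutFromPR {
			sawDerivedTimeout = true
		}
	}
	return sawDerivedTimeout
}

func main() {
	// The race: the TaskRun already timed out, but the PipelineRun's own
	// deadline has not quite been reached, so hasTimedOut is still false.
	tasks := []taskResult{
		{Done: true},
		{Done: true, TimedOut: true, TimeoutFromPR: true},
	}
	fmt.Println(pipelineTimedOut(false, tasks)) // prints "true"
}
```

A TaskRun that hit its own task-level timeout (`TimeoutFromPR` false) does not trigger the fallback, so in this sketch the PipelineRun is only treated as timed out when the PipelineRun-level timeout was the cause.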

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

Has Docs included if any changes are user facing
Has Tests included if any functionality added or changed
Follows the commit message standard
Meets the Tekton contributor standards (including functionality, content, code)
Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

PipelineRuns will always be marked as timed out if any of their tasks timed out due to the timeout set on the PipelineRun itself.

…due to the PR timeout

Fixes tektoncd#5127

Signed-off-by: Andrew Bayer <andrew.bayer@gmail.com>

tekton-robot · 2022-07-13T16:39:10Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from abayer after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

abayer · 2022-07-13T16:41:15Z

I've marked this as WIP because I'm 100% sure that more unit test coverage is needed but wanted to start running e2e tests over and over to see if any of the flakes ever show up.

tekton-robot · 2022-07-13T16:43:13Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	86.3%	86.4%	0.1
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go	94.3%	91.0%	-3.2
pkg/reconciler/pipelinerun/resources/pipelinerunstate.go	97.4%	97.4%	0.0

abayer · 2022-07-13T16:44:01Z

/test check-pr-has-kind-label

tekton-robot · 2022-07-13T16:44:02Z

@abayer: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-tekton-pipeline-alpha-integration-tests
/test pull-tekton-pipeline-build-tests
/test pull-tekton-pipeline-integration-tests
/test tekton-pipeline-unit-tests

The following commands are available to trigger optional jobs:

/test pull-tekton-pipeline-go-coverage
/test pull-tekton-pipeline-kind-alpha-integration-tests
/test pull-tekton-pipeline-kind-alpha-yaml-tests
/test pull-tekton-pipeline-kind-integration-tests
/test pull-tekton-pipeline-kind-yaml-tests

Use /test all to run the following jobs that were automatically triggered:

pull-tekton-pipeline-alpha-integration-tests
pull-tekton-pipeline-build-tests
pull-tekton-pipeline-go-coverage
pull-tekton-pipeline-integration-tests
pull-tekton-pipeline-unit-tests

In response to this:

/test check-pr-has-kind-label

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

abayer · 2022-07-13T17:12:14Z

/test pull-pipeline-kind-k8s-v1-21-e2e

tekton-robot · 2022-07-13T17:12:15Z

@abayer: The specified target(s) for /test were not found.

In response to this:

/test pull-pipeline-kind-k8s-v1-21-e2e


abayer · 2022-07-13T17:15:03Z

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

abayer · 2022-07-13T17:43:13Z

/test pull-pipeline-kind-k8s-v1-21-e2e

tekton-robot · 2022-07-13T17:43:14Z

@abayer: The specified target(s) for /test were not found.

In response to this:

/test pull-pipeline-kind-k8s-v1-21-e2e


abayer · 2022-07-13T17:49:51Z

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

abayer · 2022-07-13T18:25:45Z

~~Boo - I hit TestPipelineRunTimeout failing once in ten local runs of the full e2e test suite, so I'm not sure this actually works.~~ Also @jerop had a waaaaaay better idea which I'm working on now. =)

EDIT: Ah, my local test was screwed up and still using the v0.37.2 images. Sigh. Well, @jerop's idea is still better.

abayer · 2022-07-13T18:43:29Z

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

abayer · 2022-07-13T19:21:53Z

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

abayer · 2022-07-13T19:59:40Z

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

abayer · 2022-07-13T20:38:41Z

/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-alpha-integration-tests

tekton-robot · 2022-07-13T21:12:25Z

@abayer: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-tekton-pipeline-integration-tests	`aca1517`	link	true	`/test pull-tekton-pipeline-integration-tests`

Full PR test history. Your PR dashboard.


abayer · 2022-07-14T15:04:43Z

Closing in favor of #5134.

tekton-robot requested review from afrittoli and pritidesai July 13, 2022 16:39

tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 13, 2022

abayer removed the kind/flake Categorizes issue or PR as related to a flakey test label Jul 13, 2022

abayer mentioned this pull request Jul 13, 2022

PipelineRuns using timeout or timeouts fields sometimes are marked as failed rather than timed out #5127

Closed

abayer mentioned this pull request Jul 13, 2022

Switch PipelineRun timeout -> TaskRun logic to instead signal the TaskRuns to stop #5134

Merged

7 tasks

abayer closed this Jul 14, 2022
