-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Ensure that PipelineRuns are marked as timed out if a task timed out due to the PR timeout #5133
Conversation
…due to the PR timeout Fixes tektoncd#5127 We've been seeing sporadic flaky failures for a number of e2e tests for a while now, such as `TestPipelineRunTimeout` and sidecar-related tests. I recently dug into exactly what differed between a success and a failure, specifically for `TestPipelineRunTimeout`, the most frequent of those flakes. I was able to determine that sometimes, the `TaskRun` would be timed out due to the `PipelineRun`-level timeout, but `pr.HasTimedOut` would not return true on the next reconciliation of the `PipelineRun`. This strongly suggests that the `TaskRun` timeout was calculated to end slightly before the `PipelineRun` timeout would end, and then the `PipelineRun` reconciliation happened in the very brief (milliseconds at most) window between the `TaskRun` completing as timed out and the `PipelineRun` timeout being reached. It's not possible for us to make the end of the generated `TaskRun` timeout exactly match the end of the specified `PipelineRun` timeout, since the `TaskRun`'s `StartTime` won't be set until the `TaskRun` has actually been created, so there'll always be some difference there, as best as I can tell. So I decided to work around this inherent limitation by instead tracking cases where we've set the `TaskRun` timeout based on `PipelineRun.Status.StartTime + PipelineRun.PipelineTimeout(ctx)`, i.e., the `TaskRun` timeout is the difference between elapsed time of the `PipelineRun` and the time at which the `PipelineRun` proper would be timed out. Then, if all tasks in a `PipelineRun` have completed, and at least one of them has timed out and had its timeout set based on that difference, we know that the `PipelineRun` has timed out, even if `pr.HasTimedOut` is returning false because we haven't quite yet hit the end of the `PipelineRun`'s timeout duration. Signed-off-by: Andrew Bayer <andrew.bayer@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I've marked this as WIP because I'm 100% sure that more unit test coverage is needed but wanted to start running e2e tests over and over to see if any of the flakes ever show up. |
The following is the coverage report on the affected files.
|
/test check-pr-has-kind-label |
@abayer: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test pull-pipeline-kind-k8s-v1-21-e2e |
@abayer: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test pull-tekton-pipeline-integration-tests |
/test pull-pipeline-kind-k8s-v1-21-e2e |
@abayer: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test pull-tekton-pipeline-integration-tests |
EDIT: Ah, my local test was screwed up and still using the v0.37.2 images. Sigh. Well, @jerop's idea is still better. |
/test pull-tekton-pipeline-integration-tests |
2 similar comments
/test pull-tekton-pipeline-integration-tests |
/test pull-tekton-pipeline-integration-tests |
/test pull-tekton-pipeline-integration-tests |
@abayer: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Closing in favor of #5134. |
Changes
Fixes #5127
We've been seeing sporadic flaky failures for a number of e2e tests for a while now, such as
TestPipelineRunTimeout
and sidecar-related tests. I recently dug into exactly what differed between a success and a failure, specifically forTestPipelineRunTimeout
, the most frequent of those flakes. I was able to determine that sometimes, theTaskRun
would be timed out due to thePipelineRun
-level timeout, butpr.HasTimedOut
would not return true on the next reconciliation of thePipelineRun
. This strongly suggests that theTaskRun
timeout was calculated to end slightly before thePipelineRun
timeout would end, and then thePipelineRun
reconciliation happened in the very brief (milliseconds at most) window between theTaskRun
completing as timed out and thePipelineRun
timeout being reached.It's not possible for us to make the end of the generated
TaskRun
timeout exactly match the end of the specifiedPipelineRun
timeout, since theTaskRun
'sStartTime
won't be set until theTaskRun
has actually been created, so there'll always be some difference there, as best as I can tell. So I decided to work around this inherent limitation by instead tracking cases where we've set theTaskRun
timeout based onPipelineRun.Status.StartTime + PipelineRun.PipelineTimeout(ctx)
, i.e., theTaskRun
timeout is the difference between elapsed time of thePipelineRun
and the time at which thePipelineRun
proper would be timed out.Then, if all tasks in a
PipelineRun
have completed, and at least one of them has timed out and had its timeout set based on that difference, we know that thePipelineRun
has timed out, even ifpr.HasTimedOut
is returning false because we haven't quite yet hit the end of thePipelineRun
's timeout duration./kind bug
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
functionality, content, code)
/kind <type>
. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tepRelease Notes