TestPipelineRunTimeout is flaky #3460
Every so often this test will fail with:
The setup is a task that sleeps for 10s and a pipeline that has a 5s timeout: pipeline/test/v1alpha1/timeout_test.go Lines 49 to 61 in adc4cfa
This check expects the "Reason" to be "PipelineRunTimeout": pipeline/test/v1alpha1/timeout_test.go Lines 100 to 103 in adc4cfa
However, when the test intermittently fails the status shows the TaskRun timed out, but the PipelineRun Failed:
The irony is that the TaskRun gets its timeout from the PipelineRun here: pipeline/pkg/reconciler/pipelinerun/pipelinerun.go Lines 807 to 816 in 6d8f451
When we transition the PipelineRun to "Failed" we skip this check: pipeline/pkg/reconciler/pipelinerun/resources/pipelinerunstate.go Lines 206 to 218 in 3185d05
... digging into why, it looks at the StartTime in status, and in the cases where we fail the StartTime is less than 5s from the time of completion(!):
The same status update shows us propagating the following for the TaskRun:
Generally this means that any PipelineRun specifying a Timeout may simply show up as Failed due to this race. I think that my $0.02 on the appropriate "fix" would be that the PipelineRun's "StartTime" should never be greater than the "StartTime" of its constituent TaskRuns. This should guarantee that any copied-down timeout never manifests this way. cc @vdemeester @afrittoli @imjasonh @bobcatfish for thoughts 🙏
Sorry, I should mention that the first bit of yaml is actually a dump from a prior run that I lifted from my older bug that this replaces. So don't try to rationalize it with the later "diffs". The complete run for the latest failure (that I debugged) was here: https://github.com/mattmoor/mink/runs/1305814401?check_suite_focus=true#step:17:58
I think that this is derived from stale informer caches and the retry logic for updating status that we have. On a subsequent reconciliation of the PipelineRun (+1s) from the stale uninitialized version, we re-InitializeConditions, which resets StartTime, but a prior reconciliation had already created the TaskRuns (likely what triggered us to be reconciled). The diff for this status update looks like:
Note the empty StartTime in the base, despite this in a prior update:
Occasionally, it is possible for us to be reconciling a PipelineRun and have the status we intend to report reflect an inaccurate StartTime (see issue for details). This corrects for those circumstances by ensuring that the StartTime we report for a PipelineRun is never later than the smallest CreationTimestamp of a child TaskRun. Fixes: tektoncd#3460
TestPipelineRunTimeout test is still flaky
Expected Behavior
TestPipelineRunTimeout consistently passes with the properly observed failure mode.
Actual Behavior
TestPipelineRunTimeout often observes a "Failed" status on the PipelineRun(!!!).
Steps to Reproduce the Problem
Run the test a lot.
I will post my analysis here shortly, as I believe I know WHY this is happening.