TestDAGPipelineRun is flaky #4418
Comments
My proposal for how to fix this would be as follows:
This:
I think this will be both more stable and provide more meaningful coverage.
Some pointers... Here is the test (with a nice ASCII diagram): pipeline/test/v1alpha1/dag_test.go Lines 37 to 47 in 38b9f26
The body of pipeline/test/v1alpha1/dag_test.go Lines 75 to 79 in 38b9f26
Where we check that the tasks start within pipeline/test/v1alpha1/dag_test.go Lines 243 to 249 in 38b9f26
There is also a version of this in
_See also the linked issue for a detailed explanation of the issue this fixes._

This change alters the DAG tests in two meaningful ways:
1. Have the tasks sleep, to actually increase the likelihood of task execution overlap.
2. Use the sleep duration for the minimum delta in start times.

Combined, these changes should guarantee that the tasks *actually* executed in parallel, and the second part also makes this test less flaky on busy clusters where `5s` may not be sufficient for the task to start.

A fun anecdote to note here is that the Kubernetes [SLO for Pod startup latency](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md#definition) is `5s` at `99P`, which means Tekton has effectively zero room for overhead.

Fixes: tektoncd#4418
Expected Behavior
`TestDAGPipelineRun` passes consistently and actually verifies that the tasks that SHOULD run in parallel DO run in parallel.
Actual Behavior
On busy clusters `TestDAGPipelineRun` will frequently fail with a message like:
Steps to Reproduce the Problem
This is an intermittent flake on a fairly busy cluster.
Additional Info
The test as written today doesn't necessarily even verify that `pipeline-task-2-parallel-1` and `pipeline-task-2-parallel-2` run in parallel. If they run in sequence quickly (`<5s`) then the test will pass.