
TestDAGPipelineRun is flaky #4418

Closed
mattmoor opened this issue Dec 14, 2021 · 3 comments · Fixed by #4419
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mattmoor
Member

Expected Behavior

TestDAGPipelineRun passes consistently and actually verifies that the tasks that SHOULD run in parallel DO run in parallel.

Actual Behavior

On busy clusters, TestDAGPipelineRun will frequently fail with a message like:

dag_test.go:248: Expected parallel tasks to execute more or less at the same time, but they were 6s apart

Steps to Reproduce the Problem

This is an intermittent flake on a fairly busy cluster.

Additional Info

The test as written today doesn't necessarily even verify that pipeline-task-2-parallel-1 and pipeline-task-2-parallel-2 run in parallel: if they run in sequence quickly enough (<5s apart), the test still passes, as the sketch below illustrates.
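To make the flaw concrete, here is a minimal, self-contained sketch (the timestamps are hypothetical, not taken from a real run) showing that two strictly sequential TaskRuns can still satisfy the current 5s check:

package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// Hypothetical start times: the second "parallel" TaskRun starts 3s after the
	// first, i.e. only after the first (short) task has already finished.
	s1 := time.Date(2021, 12, 14, 10, 0, 0, 0, time.UTC)
	s2 := s1.Add(3 * time.Second)

	// Same computation the test performs today.
	absDiff := time.Duration(math.Abs(float64(s2.Sub(s1))))
	fmt.Println(absDiff > 5*time.Second) // false: the check passes even though nothing overlapped
}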

@mattmoor mattmoor added the kind/bug Categorizes issue or PR as related to a bug. label Dec 14, 2021
@mattmoor
Member Author

My proposal for how to fix this would be as follows:

  1. Have the task definition executed by pipeline-task-2-parallel-* sleep for XX seconds (probably 30s)
  2. Check that the tasks start within 30s of each other.

This:

  • Gives the test a fairly significant amount of leeway to schedule the second TaskRun even on a congested cluster.
  • Verifies that the tasks actually run in parallel because the second task MUST start before the first is done (b/c of the sleep).

I think this will be both more stable and provide more meaningful coverage; a rough sketch of the adjusted check follows.
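A sketch of what the adjusted check could look like, assuming each parallel task sleeps for a fixed duration; the helper name parallelStartsOverlap and the package are illustrative, not the actual test code:

package dagsketch

import (
	"math"
	"time"
)

// parallelStartsOverlap reports whether two TaskRun start times imply overlapping
// execution, given that each task sleeps for sleepDuration before exiting: if the
// start-time delta is smaller than the sleep, the second task must have started
// while the first was still sleeping.
func parallelStartsOverlap(s1, s2 time.Time, sleepDuration time.Duration) bool {
	absDiff := time.Duration(math.Abs(float64(s2.Sub(s1))))
	return absDiff <= sleepDuration
}

In the test this amounts to replacing the hard-coded 5s tolerance with the 30s sleep duration, so the existing t.Errorf fires only when the second task could not have started before the first finished.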

@mattmoor
Member Author

Some pointers...

Here is the test (with a nice ASCII diagram):

// TestDAGPipelineRun creates a graph of arbitrary Tasks, then looks at the corresponding
// TaskRun start times to ensure they were run in the order intended, which is:
//                                 |
//                         pipeline-task-1
//                        /                \
//   pipeline-task-2-parallel-1    pipeline-task-2-parallel-2
//                        \                /
//                         pipeline-task-3
//                                 |
//                         pipeline-task-4
func TestDAGPipelineRun(t *testing.T) {

The body of echo-task, which is what we run in parallel:

steps:
  - image: busybox
    script: 'echo $(params["text"])'
  - image: busybox
    script: 'ln -s $(resources.inputs.repo.path) $(resources.outputs.repo.path)'

Where we check that the tasks start within 5s:

// Check that the two tasks that can run in parallel did
s1 := taskRuns[1].Status.StartTime.Time
s2 := taskRuns[2].Status.StartTime.Time
absDiff := time.Duration(math.Abs(float64(s2.Sub(s1))))
if absDiff > (time.Second * 5) {
	t.Errorf("Expected parallel tasks to execute more or less at the same time, but they were %v apart", absDiff)
}

@mattmoor
Member Author

There is also a version of this in ./test/dag_test.go (I linked v1alpha1 above)

mattmoor added a commit to mattmoor/pipeline that referenced this issue Dec 14, 2021
_See also the linked issue for a detailed explanation of the issue this fixes._

This change alters the DAG tests in two meaningful ways:
1. Have the tasks sleep, to actually increase the likelihood of task execution overlap,
2. Use the sleep duration for the minimum delta in start times.

These changes, combined, should guarantee that the tasks *actually* executed in parallel,
but the second part also enables this test to be less flaky on busy clusters where
`5s` may not be sufficient for the task to start.

A fun anecdote to note here is that the Kubernetes [SLO for Pod startup
latency](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md#definition)
is `5s` at `99P`, which means Tekton has effectively zero room for overhead.

Fixes: tektoncd#4418
tekton-robot pushed a commit that referenced this issue Dec 14, 2021