
TestDAGPipelineRun is flaky #4418

Closed
mattmoor opened this issue Dec 14, 2021 · 3 comments · Fixed by #4419
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mattmoor
Member

Expected Behavior

TestDAGPipelineRun passes consistently and actually verifies that the tasks that SHOULD run in parallel DO run in parallel.

Actual Behavior

On busy clusters, TestDAGPipelineRun will frequently fail with a message like:

dag_test.go:248: Expected parallel tasks to execute more or less at the same time, but they were 6s apart

Steps to Reproduce the Problem

This is an intermittent flake on a fairly busy cluster.

Additional Info

The test as written today doesn't necessarily even verify that pipeline-task-2-parallel-1 and pipeline-task-2-parallel-2 run in parallel: if they run in sequence quickly enough (<5s apart), the test still passes, as the sketch below illustrates.
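To make the flaw concrete, here is a minimal, self-contained sketch (the timestamps are hypothetical, not taken from a real run) showing that two strictly sequential TaskRuns can still satisfy the current 5s check:

package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// Hypothetical start times: the second "parallel" TaskRun starts 3s after the
	// first, i.e. only after the first (short) task has already finished.
	s1 := time.Date(2021, 12, 14, 10, 0, 0, 0, time.UTC)
	s2 := s1.Add(3 * time.Second)

	// Same computation the test performs today.
	absDiff := time.Duration(math.Abs(float64(s2.Sub(s1))))
	fmt.Println(absDiff > 5*time.Second) // false: the check passes even though nothing overlapped
}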

@mattmoor mattmoor added the kind/bug Categorizes issue or PR as related to a bug. label Dec 14, 2021
@mattmoor
Member Author

My proposal for how to fix this would be as follows:

  1. Have the task definition executed by pipeline-task-2-parallel-* sleep for XX seconds (probably 30s)
  2. Check that the tasks start within 30s of each other.

This:

  • Gives the test a fairly significant amount of leeway to schedule the second TaskRun even on a congested cluster.
  • Verifies that the tasks actually run in parallel because the second task MUST start before the first is done (b/c of the sleep).

I think this will be both more stable and provide more meaningful coverage; a rough sketch of the adjusted check follows.
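A sketch of what the adjusted check could look like, assuming each parallel task sleeps for a fixed duration; the helper name parallelStartsOverlap and the package are illustrative, not the actual test code:

package dagsketch

import (
	"math"
	"time"
)

// parallelStartsOverlap reports whether two TaskRun start times imply overlapping
// execution, given that each task sleeps for sleepDuration before exiting: if the
// start-time delta is smaller than the sleep, the second task must have started
// while the first was still sleeping.
func parallelStartsOverlap(s1, s2 time.Time, sleepDuration time.Duration) bool {
	absDiff := time.Duration(math.Abs(float64(s2.Sub(s1))))
	return absDiff <= sleepDuration
}

In the test this amounts to replacing the hard-coded 5s tolerance with the 30s sleep duration, so the existing t.Errorf fires only when the second task could not have started before the first finished.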

@mattmoor
Member Author

Some pointers...

Here is the test (with a nice ASCII diagram):

// TestDAGPipelineRun creates a graph of arbitrary Tasks, then looks at the corresponding
// TaskRun start times to ensure they were run in the order intended, which is:
//                                 |
//                         pipeline-task-1
//                        /                \
//   pipeline-task-2-parallel-1    pipeline-task-2-parallel-2
//                        \                /
//                         pipeline-task-3
//                                 |
//                         pipeline-task-4
func TestDAGPipelineRun(t *testing.T) {

The body of echo-task, which is what we run in parallel:

steps:
  - image: busybox
    script: 'echo $(params["text"])'
  - image: busybox
    script: 'ln -s $(resources.inputs.repo.path) $(resources.outputs.repo.path)'

Where we check that the tasks start within 5s:

// Check that the two tasks that can run in parallel did
s1 := taskRuns[1].Status.StartTime.Time
s2 := taskRuns[2].Status.StartTime.Time
absDiff := time.Duration(math.Abs(float64(s2.Sub(s1))))
if absDiff > (time.Second * 5) {
	t.Errorf("Expected parallel tasks to execute more or less at the same time, but they were %v apart", absDiff)
}

@mattmoor
Member Author

There is also a version of this in ./test/dag_test.go (I linked v1alpha1 above)

mattmoor added a commit to mattmoor/pipeline that referenced this issue Dec 14, 2021
_See also the linked issue for a detailed explanation of the issue this fixes._

This change alters the DAG tests in two meaningful ways:
1. Have the tasks sleep, to actually increase the likelihood of task execution overlap,
2. Use the sleep duration for the minimum delta in start times.

These changes, combined, should guarantee that the tasks *actually* executed in parallel,
but the second part also enables this test to be less flaky on busy clusters where
`5s` may not be sufficient for the task to start.

A fun anecdote to note here is that the Kubernetes [SLO for Pod startup
latency](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md#definition)
is `5s` at `99P`, which means Tekton has effectively zero room for overhead.

Fixes: tektoncd#4418
tekton-robot pushed a commit that referenced this issue Dec 14, 2021