Fix TestDAGPipelineRun flakiness #4419
Conversation
Paging @dlorenc on the CLA bit 😅
awesome... tabs vs. spaces 🤦
_See also the linked issue for a detailed explanation of the issue this fixes._

This change alters the DAG tests in two meaningful ways:
1. Have the tasks sleep, to actually increase the likelihood of task execution overlap.
2. Use the sleep duration for the minimum delta in start times.

These changes combined should guarantee that the tasks *actually* executed in parallel, but the second part also makes this test less flaky on busy clusters where `5s` may not be sufficient for the task to start. A fun anecdote to note here is that the Kubernetes [SLO for Pod startup latency](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md#definition) is `5s` at `99P`, which means Tekton has effectively zero room for overhead.

Fixes: tektoncd#4418
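To make the second change concrete, here is a minimal sketch of the kind of start-time check the description implies. The helper name and the sleep value are illustrative assumptions, not the actual code in `test/v1alpha1/dag_test.go`:

```go
package dag_test

import (
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative value; the review discussion below settles on a smaller number.
const sleepDuration = 15 * time.Second

// assertRanInParallel is a hypothetical helper. If every task sleeps for
// sleepDuration, two tasks that truly overlapped must have started within
// sleepDuration of each other; otherwise the earlier one would already
// have finished before the later one began.
func assertRanInParallel(t *testing.T, first, second metav1.Time) {
	t.Helper()
	delta := second.Time.Sub(first.Time)
	if delta < 0 {
		delta = -delta
	}
	if delta >= sleepDuration {
		t.Fatalf("tasks started %v apart, want less than %v", delta, sleepDuration)
	}
}
```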
/test check-pr-has-kind-label
/lgtm
test/v1alpha1/dag_test.go
Outdated
@@ -34,6 +35,8 @@ import (
	knativetest "knative.dev/pkg/test"
)

const sleepDuration = 30 * time.Second
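For context, one way a constant like this could be kept in sync with both the task's sleep and the assertion window; the helper name and shape are illustrative assumptions, not the test's actual fixture:

```go
package dag_test

import (
	"fmt"
	"time"
)

const sleepDuration = 30 * time.Second

// sleepScript is a hypothetical helper that renders the shell command a
// task step would run. Deriving it from sleepDuration ties the time each
// task sleeps and the maximum allowed start-time delta to one constant.
func sleepScript() string {
	return fmt.Sprintf("sleep %d", int(sleepDuration.Seconds()))
}
```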
30s feels like a long time to wait in a test?
Waiting for a rerun of the e2e tests is a lot longer 😉
I'm happy to lower this, but not sure what you are comfortable with. 5s is the 99P scheduling latency on K8s when there's available capacity, and running these tests with `t.Parallel()` on KinD, things get busy quickly. I've seen 9s apart in recent memory, but think I've seen up to 13s.
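(For anyone skimming: `t.Parallel()` is the standard Go testing call that lets tests in a package run concurrently, which is what makes the cluster busy. A trivial illustration with a made-up test name:)

```go
package dag_test

import "testing"

// TestExampleDAG is a made-up name. Calling t.Parallel() marks the test to
// run concurrently with other parallel tests in the package, so many pods
// can be scheduled at once and startup latency on a busy KinD cluster can
// stretch well past 5s.
func TestExampleDAG(t *testing.T) {
	t.Parallel()
	// ... create the PipelineRun and poll for completion here ...
}
```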
(to be clear, the `9s` and `13s` are failures with what's at HEAD)
I reduced this to 15s, which is larger than any gap I can recall seeing this fail with.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dlorenc
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment
/test check-pr-has-kind-label
/kind bug
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
Release Notes