
yaml tests seem to be consistently timing out #2540

Closed
bobcatfish opened this issue May 4, 2020 · 12 comments
Labels: area/testing, kind/bug

@bobcatfish (Collaborator)

Expected Behavior

"yaml tests" should only fail if something is actually wrong

Actual Behavior

All of the runs for #2531 have failed:

https://tekton-releases.appspot.com/builds/tekton-prow/pr-logs/pull/tektoncd_pipeline/2531/pull-tekton-pipeline-integration-tests/

[screenshot of the failing test runs]

And looking at recent runs across PRs, it seems most are failing too:
https://tekton-releases.appspot.com/builds/tekton-prow/pr-logs/directory/pull-tekton-pipeline-integration-tests

Steps to Reproduce the Problem

Not sure what's going on yet

Additional Info

I can't decipher 1..60 sleep 10 for the life of me:

```bash
function validate_run() {
  local tests_finished=0
  # Poll up to 60 times with a 10s sleep between attempts, i.e. a 10 minute timeout.
  for i in {1..60}; do
    local finished="$(kubectl get $1.tekton.dev --output=jsonpath='{.items[*].status.conditions[*].status}')"
    # Once no run reports an "Unknown" status any more, everything has finished.
    if [[ ! "$finished" == *"Unknown"* ]]; then
      tests_finished=1
      break
    fi
    # Otherwise wait 10s and poll again.
    sleep 10
  done
  return ${tests_finished}
}
```

@bobcatfish self-assigned this May 4, 2020
@jlpettersson (Member)

I added the example pipelinerun-with-parallel-tasks-using-pvc.yaml in #2521 a few days ago.

Things look worse after that, but I don't really understand what is causing it. Maybe the volumes take time and there are some timeouts? I find it hard to see which test is causing trouble.

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue May 5, 2020
In tektoncd#2540 we are seeing that some yaml tests are timing out, but it's
hard to see what yaml tests are failing. This commit moves the logic out
of bash and into individual go tests - now we will run an individual go
test for each yaml example, completing all v1alpha1 before all v1beta1
and cleaning up in between. The output will still be challenging to read
since it will be interleaved, however the failures should at least
be associated with a specific yaml file.

This also makes it easier to run all tests locally, though if you
interrupt the tests you end up with your cluster in a bad state and it
might be good to update these to execute each example in a separate
namespace (in which case we could run all of v1alpha1 and v1beta1 at the
same time as well!)
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue May 5, 2020
@vdemeester (Member)

I added the example pipelinerun-with-parallel-tasks-using-pvc.yaml in #2521 a few days ago.

Things look worse after that, but I don't really understand what is causing it. Maybe the volumes take time and there are some timeouts? I find it hard to see which test is causing trouble.

Yeah, that's my guess 😓

I can't decipher 1..60 sleep 10 for the life of me:

It's going to do 60 loops of 10s each to check the status of the pipelineruns (or taskruns), meaning it times out after 10 minutes.

@vdemeester (Member)

/kind bug
/area testing

@tekton-robot added the kind/bug and area/testing labels May 5, 2020
@vdemeester (Member)

There are a few ways to fix this:

  • the quick one: add more time (i.e. do 90 loops instead of 60)
  • the longer one: migrate those tests to Go

#2541 does the latter.

@vdemeester (Member)

I've bumped the timeout in #2534 (90 loops instead of 60). It should fix the CI while #2541 gets worked on.
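
For reference, a minimal sketch of what that quick fix amounts to, assuming the loop bound is the only thing that changes in validate_run:

```bash
function validate_run() {
  local tests_finished=0
  # 90 polls with a 10s sleep gives a 15 minute ceiling instead of the previous 10 minutes.
  for i in {1..90}; do
    local finished="$(kubectl get $1.tekton.dev --output=jsonpath='{.items[*].status.conditions[*].status}')"
    if [[ ! "$finished" == *"Unknown"* ]]; then
      tests_finished=1
      break
    fi
    sleep 10
  done
  return ${tests_finished}
}
```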

@jlpettersson (Member)

@vdemeester did it work better?

If it is a regional cluster and the PVCs are zonal, the two parallel tasks may be executed in different zones, and the third task that mounts both PVCs is deadlocked, since it can't mount two zonal PVCs from different zones in one pod. I propose that I remove the example, since it depends so much on what kind of storage and cluster is used. The intention was to document PVC access modes, but it is not strictly necessary to have an example.
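
One way to check whether that is what is happening (the pod name below is a placeholder, and the exact zone label key depends on the Kubernetes version):

```bash
# List the PVCs in the test namespace and the PVs they are bound to.
kubectl get pvc --output=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumeName}{"\n"}{end}'

# A zonal PV carries a required nodeAffinity term on the zone topology label
# (topology.kubernetes.io/zone, or failure-domain.beta.kubernetes.io/zone on older clusters).
# If the two PVs ended up in different zones, no node can satisfy both and the third task's pod never schedules.
kubectl get pv --output=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeAffinity.required.nodeSelectorTerms}{"\n"}{end}'

# The stuck pod should show a "volume node affinity conflict" event from the scheduler.
kubectl describe pod <stuck-pod-name>
```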

@vdemeester (Member)

@vdemeester did it work better?

Not entirely sure. There are fewer failures, but I still see some.

If it is a regional cluster and the PVCs are zonal, the two parallel tasks may be executed in different zones, and the third task that mounts both PVCs is deadlocked, since it can't mount two zonal PVCs from different zones in one pod. I propose that I remove the example, since it depends so much on what kind of storage and cluster is used. The intention was to document PVC access modes, but it is not strictly necessary to have an example.

Yeah, having it in a no-ci folder would work

@ghost commented May 5, 2020

It does appear this might have been related. Just spotted this in one of our release clusters:

[screenshot from one of the release clusters]

And drilling down it does appear to be related to volume / node affinity.

@jlpettersson (Member) commented May 5, 2020

@sbwsg thanks. It was exactly that task I was worried about. But that example does not provide much value, and it needs to be adapted to whatever environment it runs in. So I think it is best to remove it.

But a similar problem may occur for other pipelines that use the same PVC in more than one task. We could move those to the no-ci folder as @vdemeester suggested.
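
Something like the following, assuming the examples live under examples/v1beta1/pipelineruns/ and that the yaml test harness skips anything in a no-ci/ subfolder (both of those are assumptions on my part):

```bash
# Assumed paths; adjust to wherever the PVC-sharing examples actually live.
cd examples/v1beta1/pipelineruns
mkdir -p no-ci
git mv pipelinerun-with-parallel-tasks-using-pvc.yaml no-ci/
```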

I apologize for the flaky tests the last few days.

@ghost commented May 5, 2020

But a similar problem may occur for other pipelines that use the same PVC in more than one task.

Yeah this might be a good area we can add docs around at some point. I wonder how much of it is platform specific and how much Tekton can describe in a cross-platform way.

I apologize for the flaky tests the last few days.

No worries, thanks for making the PR to resolve it, and for all the contributions around Workspaces! We were bound to hit this issue eventually.

@jlpettersson (Member) commented May 5, 2020

I am curious if we can use some kind of pod affinity to get tasks co-located on the same node.

Possibly co-locate all pods belonging to a single PipelineRun, so they can perfectly well use the same PVC as a workspace and perfectly well execute in parallel (this is essentially what any single-node CI/CD system does).

We would still be a distributed system, with different PipelineRuns possibly scheduled to different nodes. Using different PVCs is "easier" for fan-out, but not for fan-in (e.g. git-clone and then parallel tasks using the same files).
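
As a rough illustration of the idea (not a tested configuration): Tekton labels the pods it creates for a PipelineRun with tekton.dev/pipelineRun, and the PipelineRun podTemplate accepts an affinity block, so something along these lines could ask the scheduler to keep all of a run's pods on one node. The resource names here are hypothetical.

```bash
# Illustrative sketch only: a PipelineRun whose podTemplate requests pod affinity on the
# tekton.dev/pipelineRun label, so every TaskRun pod of this run should land on the same node.
# The names (parallel-pvc-run, parallel-pipeline) are made up for the example.
kubectl apply -f - <<'EOF'
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: parallel-pvc-run
spec:
  pipelineRef:
    name: parallel-pipeline
  podTemplate:
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              tekton.dev/pipelineRun: parallel-pvc-run
          topologyKey: kubernetes.io/hostname
EOF
```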

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue May 21, 2020
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue May 21, 2020
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue May 21, 2020
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue May 21, 2020
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue May 21, 2020
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue May 26, 2020
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue May 26, 2020
@bobcatfish (Collaborator, Author)

I don't think we've seen any evidence of this since @jlpettersson's fixes, closing!
