ClusterClass self-hosted cluster tests are very flaky #9522
Comments
/triage accepted @kubernetes-sigs/cluster-api-release-team
Taking a look.
This is the thread where I have been posting my analysis so far.
/assign @willie-yao
These tests fail with three different errors, as described in the issue.
These failures only seem to happen on the main branch, and only on the ClusterClass versions of the tests. We're looking for changes that could be responsible in the diff between main and the branches where the tests still pass.
From the first few days after merging #9570, it looks like problem 1 and problem 2 have stopped occurring. Toward the end of the week, if we don't get any more of those failures, we should consider them solved. IMO that should be enough to remove the release-blocking label. The third failure still needs to be investigated.
I still see one of these flakes. Moreover, there are almost consistent flakes on some of the other jobs. Are we sure we want to drop the "release-blocking" tag from this issue?
@nawazkh you're right - I think the state is still pretty unstable, so we have to keep release-blocking for now. We have some additional flakes to work through. Let's keep release-blocking on this until those are in a better state.
Current state:
After spending some time on it I've come up with a theory for what's causing this failure. The additional time spent waiting for CAPI to recover from the ungraceful shutdown of the MachinePool is the root cause of this issue AFAICT. I think this is a bug that can be fixed by finding a safer way to remove the Docker containers that the MachinePools are running on. We could try to only delete the infrastructure after the Node has already been deleted. We should consider removing MachinePools from this test after the RC next week to unblock the release. We can duplicate the jobs to keep some coverage so folks can work on fixing those tests. I'm not sure if there's an explanation leading from this for the other failures. @nawazkh @kubernetes-sigs/cluster-api-release-team @willie-yao (seeing as you implemented the MP e2e testing): WDYT about removing MPs from this test for this release?
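For illustration, here is a minimal sketch of that deletion ordering, assuming a hypothetical cleanup helper; none of the names below come from the actual CAPD code.

```go
// Sketch only: illustrates the ordering suggested above, i.e. delete the
// MachinePool's infrastructure (its Docker container) only once the
// corresponding Node is gone from the workload cluster. The helper names
// here are hypothetical and not the actual CAPD implementation.
package machinepoolcleanup

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteInfraAfterNodeGone checks whether the Node backing a MachinePool
// instance is still registered in the workload cluster. Only when the Node
// has been deleted does it remove the container, avoiding the ungraceful
// shutdown described above.
func deleteInfraAfterNodeGone(
	ctx context.Context,
	workloadClient client.Client,
	nodeName string,
	removeContainer func(ctx context.Context, name string) error, // hypothetical container cleanup
) error {
	node := &corev1.Node{}
	err := workloadClient.Get(ctx, client.ObjectKey{Name: nodeName}, node)
	switch {
	case err == nil:
		// The Node is still registered: requeue rather than pulling the
		// container out from under the cluster.
		return fmt.Errorf("node %s still exists, retry deletion later", nodeName)
	case apierrors.IsNotFound(err):
		// The Node is gone, so deleting the infrastructure is now safe.
		return removeContainer(ctx, nodeName)
	default:
		return err
	}
}
```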
-1 on removing tests or disabling them. If we need to, we'll delay the release indefinitely until either the tests are fixed or the related code (not the tests) is reverted/removed.
/priority critical-urgent
I'm +1 on removing this as a last resort, but I think ideally we should be fixing this ahead of the release. Just to be clear: is this only happening on the ClusterClass self-hosted test? And is it because of ClusterClass, or because of the existence of MachinePools in general? I think the regular self-hosted test without ClusterClass doesn't include MachinePools. If the bug is due only to MachinePools and not ClusterClass, I'm more inclined to remove the test temporarily since it is a pre-existing bug.
#8842 will fix this.
This should now be fixed. Let's observe the CI signal until next Tuesday (11/21) to confirm we no longer see the flakes.
Looks like it is green after the fix, but I will leave it up to the CI team lead and members to confirm. cc @nawazkh
This looks fixed on the main branch, though I am seeing a new flake.
Thanks. Let's create a separate issue for the new flake (which is different from the ones described here) if needed and close the current one. Feel free to re-open in the future if you encounter the same flakes. /close
@furkatgofurov7: Closing this issue.
The ClusterClass self-hosted tests have become very flaky to the extent that they regularly fail four times in a row. They fail across all flavors of our e2e-full test jobs (full, dualstack, and mink8s).
These failures only seem to happen on the main branch. We're looking for changes that could be responsible in the diff between main and the branches where these tests still pass.
These tests fail with three different errors, as described below. Though there are three different errors, it's likely that some change on the main branch caused either self-hosted clusters, or the self-hosted tests themselves, to work differently in a way that exposed these issues. We should investigate both the cause of each specific error and the underlying issue that made them more prevalent on the main branch.
The errors are:
1. `etcdImageTagCondition: expected 3 pods, got 4`
2. `old nodes remain`
3. `error adding delete-for-move annotation`
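For context on failure 1, below is a minimal sketch of the kind of pod-count check that can produce it. It uses a plain controller-runtime client with hypothetical names and is not the actual cluster-api e2e framework code.

```go
// Sketch only: a hypothetical check of the kind behind failure 1, counting
// etcd pods in the workload cluster and comparing against the expected
// number of control plane machines.
package e2esketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// expectEtcdPodCount lists kubeadm's static etcd pods in kube-system and
// fails when a stale pod from a not-yet-removed old node is still present,
// which surfaces as "expected 3 pods, got 4".
func expectEtcdPodCount(ctx context.Context, c client.Client, expected int) error {
	pods := &corev1.PodList{}
	if err := c.List(ctx, pods,
		client.InNamespace("kube-system"),
		client.MatchingLabels{"component": "etcd"},
	); err != nil {
		return fmt.Errorf("listing etcd pods: %w", err)
	}
	if got := len(pods.Items); got != expected {
		return fmt.Errorf("expected %d pods, got %d", expected, got)
	}
	return nil
}
```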