ClusterClass self-hosted cluster tests are very flaky #9522
Comments
/triage accepted @kubernetes-sigs/cluster-api-release-team
Taking a look.
This is the thread where I have been posting my analysis so far.
/assign @willie-yao
These tests fail with three different errors, as described in the issue.
These failures only seem to happen on the main branch, and only on the ClusterClass versions of the tests. We're looking for changes that could be responsible in the diff between main and the branches where the tests still pass.
From the first few days after merging #9570, it looks like problem 1 and problem 2 have stopped occurring. Toward the end of the week, if we don't get any more of those failures, we should consider them solved. IMO that should be enough to remove the release-blocking label. The third failure still needs to be investigated.
I still see one of these flakes. Moreover, there are almost consistent flakes on some of the other jobs. Are we sure we want to drop the "release-blocking" tag from this issue?
@nawazkh you're right - I think the state is still pretty unstable, so we have to keep release-blocking for now. We have some additional flakes to work through. Let's keep release-blocking on this until those are in a better state.
Current state:
After spending some time on it I've come up with a theory for what's causing this failure. The additional time spent waiting for CAPI to recover from the ungraceful shutdown of the MachinePool is the root cause of this issue AFAICT. I think this is a bug that can be fixed by finding a safer way to remove the Docker containers that the MachinePools are running on. We could try to only delete the infrastructure after the Node has already been deleted. We should consider removing MachinePools from this test after the RC next week to unblock the release. We can duplicate the jobs to keep some coverage so folks can work on fixing those tests. I'm not sure if there's an explanation leading from this for the other failures. @nawazkh @kubernetes-sigs/cluster-api-release-team @willie-yao (seeing as you implemented the MP e2e testing): WDYT about removing MPs from this test for this release?
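For illustration, here is a minimal sketch of that deletion ordering, assuming a hypothetical cleanup helper; none of the names below come from the actual CAPD code.

```go
// Sketch only: illustrates the ordering suggested above, i.e. delete the
// MachinePool's infrastructure (its Docker container) only once the
// corresponding Node is gone from the workload cluster. The helper names
// here are hypothetical and not the actual CAPD implementation.
package machinepoolcleanup

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteInfraAfterNodeGone checks whether the Node backing a MachinePool
// instance is still registered in the workload cluster. Only when the Node
// has been deleted does it remove the container, avoiding the ungraceful
// shutdown described above.
func deleteInfraAfterNodeGone(
	ctx context.Context,
	workloadClient client.Client,
	nodeName string,
	removeContainer func(ctx context.Context, name string) error, // hypothetical container cleanup
) error {
	node := &corev1.Node{}
	err := workloadClient.Get(ctx, client.ObjectKey{Name: nodeName}, node)
	switch {
	case err == nil:
		// The Node is still registered: requeue rather than pulling the
		// container out from under the cluster.
		return fmt.Errorf("node %s still exists, retry deletion later", nodeName)
	case apierrors.IsNotFound(err):
		// The Node is gone, so deleting the infrastructure is now safe.
		return removeContainer(ctx, nodeName)
	default:
		return err
	}
}
```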
-1 on removing tests or disabling them. If we need to, we'll delay the release indefinitely until either the tests are fixed or the related code (not the tests) is reverted/removed.
/priority critical-urgent
I'm +1 on removing this as a last resort, but I think ideally we should be fixing this ahead of the release. Just to be clear: is this only happening on the ClusterClass self-hosted test? And is it because of ClusterClass, or because of the existence of MachinePools in general? I think the regular self-hosted test without ClusterClass doesn't include MachinePools. If the bug is due only to MachinePools and not ClusterClass, I'm more inclined to remove the test temporarily since it is a pre-existing bug.
#8842 will fix this.
This should now be fixed. Let's observe the CI signal until next Tuesday (11/21) to confirm we no longer see the flakes.
Looks like it is green after the fix, but I will leave it up to the CI team lead and members to confirm. cc @nawazkh
This looks fixed on the main branch, though I am seeing a new flake.
Thanks. Let's create a separate issue for the new flake (which is different from the ones described here) if needed and close the current one. Feel free to re-open in the future if you encounter the same flakes. /close
@furkatgofurov7: Closing this issue.
The ClusterClass self-hosted tests have become very flaky to the extent that they regularly fail four times in a row. They fail across all flavors of our e2e-full test jobs (full, dualstack, and mink8s).
These failures only seem to happen on the main branch. We're looking for changes that could be responsible in the diff between main and the branches where these tests still pass.
These tests fail with three different errors, as described below. Though there are three different errors, it's likely that some change on the main branch caused either self-hosted clusters, or the self-hosted tests themselves, to work differently in a way that exposed these issues. We should investigate both the cause of each specific error and the underlying issue that made them more prevalent on the main branch.
The errors are:
1. `etcdImageTagCondition: expected 3 pods, got 4`
2. `old nodes remain`
3. `error adding delete-for-move annotation`
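For context on failure 1, below is a minimal sketch of the kind of pod-count check that can produce it. It uses a plain controller-runtime client with hypothetical names and is not the actual cluster-api e2e framework code.

```go
// Sketch only: a hypothetical check of the kind behind failure 1, counting
// etcd pods in the workload cluster and comparing against the expected
// number of control plane machines.
package e2esketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// expectEtcdPodCount lists kubeadm's static etcd pods in kube-system and
// fails when a stale pod from a not-yet-removed old node is still present,
// which surfaces as "expected 3 pods, got 4".
func expectEtcdPodCount(ctx context.Context, c client.Client, expected int) error {
	pods := &corev1.PodList{}
	if err := c.List(ctx, pods,
		client.InNamespace("kube-system"),
		client.MatchingLabels{"component": "etcd"},
	); err != nil {
		return fmt.Errorf("listing etcd pods: %w", err)
	}
	if got := len(pods.Items); got != expected {
		return fmt.Errorf("expected %d pods, got %d", expected, got)
	}
	return nil
}
```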