Fix flaky tests #767
Conversation
The test doesn't wait for all nodes to become ready; however, it removes one out of 3 nodes and expects 2 nodes to be ready. Sometimes this assertion fails, as one of the nodes never actually reaches the "ready" state, so there remains just one ready node instead of two. We'll fix this flaky test by first ensuring that all cluster nodes are ready before proceeding to remove nodes.
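A minimal sketch of the fix, assuming a harness instance whose `exec()` behaves like `subprocess.run`; the helper name and polling interval are illustrative, not the actual test code:

```python
import time


def wait_until_nodes_ready(instance, expected: int, timeout_s: int = 300):
    """Poll `k8s kubectl get nodes` until `expected` nodes report Ready."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = instance.exec(
            ["k8s", "kubectl", "get", "nodes", "--no-headers"],
            capture_output=True,
            text=True,
        )
        # The STATUS column is the second field of each row.
        statuses = [
            line.split()[1] for line in result.stdout.splitlines() if line.strip()
        ]
        if statuses.count("Ready") == expected:
            return
        time.sleep(5)
    raise TimeoutError(f"expected {expected} Ready nodes within {timeout_s}s")
```

The test would call this with the full node count (3) before removing a node, and then assert on 2 ready nodes afterwards.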
k8sd sometimes fails when requested to remove nodes immediately after bootstrapping the cluster. It seems that it takes a little longer for trust store changes to be propagated to all nodes, which should probably be fixed on the microcluster side. For now, we'll perform some retries.

failed to POST /k8sd/cluster/remove: failed to delete cluster member k8s-integration-c1aee0-2: No truststore entry found for node with name "k8s-integration-c1aee0-2"

While at it, we're adding some flake8 rules to the ignore list since the linter cannot properly handle formatted strings.
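A hedged sketch of the retry workaround; `instance.exec` and the retry/delay parameters are assumptions, and the real tests may use a shared retry helper:

```python
import subprocess
import time


def remove_node_with_retries(instance, node_name: str, retries: int = 5, delay_s: int = 10):
    for attempt in range(1, retries + 1):
        try:
            instance.exec(["k8s", "remove-node", node_name])
            return
        except subprocess.CalledProcessError:
            # Trust store changes may not have propagated to all nodes yet,
            # which surfaces as "No truststore entry found for node ...".
            if attempt == retries:
                raise
            time.sleep(delay_s)
```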
"session_instance" is a fixture that allows sharing the same k8s environment for multiple tests, aiming to reduce test duration. The problem is that some tests interfere with each other and fail. We could try to improve the test cleanup, ensuring that the shared environment is brought back to a "clean" state when the test finishes, however we'd most probably continue to have flaky tests. To address this, we're reducing the scope of "session_instance" to function scope. Since it provides a single node cluster with all features enabled, we'll rename it to "aio_instance". Isolating the tests may also allow us to run them in parallel, which should reduce the overall test duration (note that there are some concerns around Cilium though).
After dropping the session-scoped instance, the GitHub Actions runners are running out of disk space.
GitHub Actions runners have plenty of space, though they are a bit bloated with tools that are useful in the most common scenarios. It makes sense to have them preinstalled overall, so they don't have to be installed over and over, but it is detrimental in cases such as this. However, there is a solution for such cases: remove the bloat: https://github.com/canonical/k8s-workflows/blob/main/.github/workflows/build_rocks.yaml#L200-L209. I've used something like this as well; you can easily free 20-30 GB. If you need more, the action has more options.
Force-pushed from ba8585e to 63c1802
We dropped the "session_instance" fixture, always using isolated environments to prevent the tests from interfering with each other. The problem is that we're now running out of disk space, for which reason we'll use large self-hosted runners. Most probably we were already close to the node capacity, so this problem would've surfaced sooner or later anyway. We tried to clean up disk space using the same helper as the Rockcraft jobs, but it did not work; we sometimes lose node connectivity: https://github.com/canonical/k8s-workflows/blob/main/.github/workflows/build_rocks.yaml#L200-L209

The hosted runner: GitHub Actions 61 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
We'll drop the aio_instance fixture (formerly called session_instance) altogether, explicitly requesting the bootstrap-all.yaml config instead.
We dropped the "session_instance" fixture, explicitly using the "instances" fixture with the "bootstrap-all.yaml" config. The "session_instance" fixture waited for the k8s services to be ready; the "instances" fixture does not, so we'll have to update the tests accordingly. Some tests are currently failing because of this:

Error from server (Forbidden): pods "busybox" is forbidden: error looking up service account default/default: serviceaccount "default" not found
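One way the affected tests could wait, sketched under the assumption that `instance.exec` raises `subprocess.CalledProcessError` on failure; the helper name and timeout are illustrative:

```python
import subprocess
import time


def wait_for_default_serviceaccount(instance, timeout_s: int = 120):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            instance.exec(
                ["k8s", "kubectl", "-n", "default", "get", "serviceaccount", "default"]
            )
            return
        except subprocess.CalledProcessError:
            time.sleep(2)
    raise TimeoutError("the default service account never became available")
```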
This commit ensures that the PVC loopback devices are cleaned up when removing the k8s snap. The issue came up after multiple integration test runs, which were leaking loopback devices. At some point, there were no more loopback devices available and the tests started to experience mount failures. This also raises another problem: there's a small number of loopback devices by default, which may affect the number of PVCs that we can use.
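The actual cleanup lives in the snap remove hook; purely to illustrate the idea, here is a Python sketch that detaches loop devices whose backing files live under the snap's data directory (the prefix is an assumption):

```python
import json
import subprocess


def detach_leaked_loop_devices(backing_prefix: str = "/var/snap/k8s/"):
    # `losetup --json --list` reports every loop device with its backing file.
    out = subprocess.run(
        ["losetup", "--json", "--list"],
        capture_output=True, text=True, check=True,
    ).stdout
    for dev in json.loads(out or "{}").get("loopdevices", []):
        if (dev.get("back-file") or "").startswith(backing_prefix):
            subprocess.run(["losetup", "--detach", dev["name"]], check=True)
```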
At the moment, the "instances" fixture removes the test instances without uninstalling the k8s snap. In the case of LXD instances, this can leak certain resources on the host, for example loopback devices. We'll update the fixture to remove the k8s snap as part of the instance cleanup procedure.
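A sketch of the updated cleanup ordering; the harness API names (`exec`, `delete_instance`) are illustrative:

```python
def cleanup_instance(harness, instance_id: str):
    # Remove the snap first so host-side resources (e.g. loopback devices
    # created for PVCs) are released, then delete the instance itself.
    harness.exec(instance_id, ["snap", "remove", "k8s", "--purge"])
    harness.delete_instance(instance_id)
```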
Force-pushed from 63c1802 to 583bc9a
Right now, the LXD instances cannot communicate with each other. This might be because of Docker, so we'll try to apply the well-known iptables workaround.
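For reference, the workaround amounts to re-accepting forwarded traffic on the runner host, since Docker sets the iptables FORWARD chain policy to DROP, which also drops bridged traffic between LXD instances. A sketch, run as a host-setup step:

```python
import subprocess

# Restore the FORWARD policy that Docker flipped to DROP, so bridged
# LXD traffic is forwarded again.
subprocess.run(["sudo", "iptables", "-P", "FORWARD", "ACCEPT"], check=True)
```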
We're now removing the k8s snap before cleaning up the harness instances; however, it sometimes fails with:

error: snap "k8s" has "service-control" change in progress

For this reason, we'll add some retries. At the same time, the "strict" job fails while waiting for some of the processes to go away. We'll try increasing the timeouts, although this might be related to the "strict" job issues mentioned by an inline comment.
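A sketch of the retry loop, assuming `instance.exec` accepts `check=False` and returns an object with a `returncode`; the retry counts are illustrative:

```python
import time


def remove_k8s_snap(instance, retries: int = 10, delay_s: int = 5):
    for attempt in range(1, retries + 1):
        proc = instance.exec(["snap", "remove", "k8s", "--purge"], check=False)
        if proc.returncode == 0:
            return
        # e.g. 'error: snap "k8s" has "service-control" change in progress'
        if attempt == retries:
            raise RuntimeError("could not remove the k8s snap")
        time.sleep(delay_s)
```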
Force-pushed from de6e072 to 6894910
After switching to self-hosted runners, some tests time out while waiting for the k8s nodes to be ready. It could be that the self-hosted runners have reduced disk or network bandwidth compared to the GitHub runners. We'll try increasing the timeouts.
LGTM, thanks!
There might be some problems with the cleanup in "strict" mode. There's an inline note on this along with some checks that are commented out: k8s-snap/tests/integration/tests/test_cleanup.py, lines 22 to 55 in 851387c
We still have tests that time out waiting for k8s nodes to be ready even after doubling the timeouts; I'll have to take a closer look.
Force-pushed from 6d06a31 to a9b2cc3
The Trivy jobs are currently affected by rate limits and fail to download the image. We'll use an alternate registry to solve this issue.
test_gateway and test_ingress are failing after switching from "session_instance". The reason is that we need to apply the load balancer configuration; otherwise, the gateway won't receive an IP address.
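A sketch of the extra setup; the exact `k8s` CLI invocations and the CIDR are assumptions:

```python
def configure_load_balancer(instance, cidr: str = "10.0.0.0/24"):
    # Without an address pool, the gateway never receives an external IP.
    instance.exec(["k8s", "set", f"load-balancer.cidrs={cidr}"])
    instance.exec(["k8s", "enable", "load-balancer"])
```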
Leaked containerd mounts prevent the k8s snap from being removed, for which reason we'll update the snap remove hook to clean them up.
In "strict" mode, "snap remove k8s --purge" is unable to list and remove remaining shim processes, being denied by Apparmor. We'll solve this by adding the "system-observe" plug.
The tests may fail to remove the snap for various reasons. In order to avoid leaking harness instances that could potentially exhaust the host resources, we'll clean up the harness instances even if we were unable to remove the snap.
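A sketch of the defensive cleanup; the helper and harness names are illustrative:

```python
import logging

LOG = logging.getLogger(__name__)


def cleanup_instance(harness, instance_id: str):
    try:
        harness.exec(instance_id, ["snap", "remove", "k8s", "--purge"])
    except Exception:
        LOG.exception("failed to remove the k8s snap; cleaning up anyway")
    finally:
        # Always delete the instance so we don't leak harness resources.
        harness.delete_instance(instance_id)
```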
In "strict" mode, the snap remove hook is unable to: * terminate processes * remove network namespaces * list mounts This happens despite having the necessary snap plugs, which indicates a possible snap issue. The netns assertions are already commented out. For the time being, we'll also comment out the code that checks if the shim processes where terminated since this is expected to fail in "strict". This issue came up as we started removing the k8s snap after *every* test.
Force-pushed from a9b2cc3 to 8f335e3
LGTM
This PR fixes a few integration tests that occasionally fail:
test_skip_services_stop_on_remove
The test doesn't wait for all nodes to become ready; however, it removes one out of 3 nodes and expects 2 nodes to be ready.
Sometimes this assertion fails, as one of the nodes never actually reaches the "ready" state, so there remains just one ready node instead of two.
We'll fix this flaky test by first ensuring that all cluster nodes are ready before proceeding to remove nodes.
node removal fails immediately after bootstrapping the cluster
k8sd sometimes fails when requested to remove nodes immediately after bootstrapping the cluster. It seems that it takes a little
longer for trust store changes to be propagated to all nodes, which should probably be fixed on the microcluster side.
For now, we'll perform some retries.
While at it, we're adding some flake8 rules to the ignore list since the linter cannot properly handle formatted strings.
reduce the scope of session_instance
session_instance is a fixture that allows sharing the same k8s environment for multiple tests, aiming to reduce test duration. The problem is that some tests interfere with each other and fail.
We could try to improve the test cleanup, ensuring that the shared environment is brought back to a "clean" state when the test finishes; however, we'd most probably continue to have flaky tests.
To address this, we're reducing the scope of session_instance to function scope. Since it provides a single-node cluster with all features enabled, we'll rename it to "aio_instance". Isolating the tests may also allow us to run them in parallel, which should reduce the overall test duration (note that there are some concerns around Cilium though).
leaked disk loopback devices
The tests currently leak PVC loopback devices and eventually fail when there are no available loopback devices. We'll ensure that the loopback devices are cleaned up when removing the snap. At the same time, we'll update the test fixture to remove the snap before cleaning up the instance.
Trivy failures due to API rate limiting
The Trivy jobs often fail, being unable to download images due to rate limiting. To solve this, we'll use a different registry.
Leaked containerd mounts prevent the k8s-snap from being removed
We're updating the k8s-snap remove hook to clean up remaining containerd mounts.
Increase timeouts
In some cases, the k8s nodes do not come online in time, for which reason we'll increase the timeouts.
shim processes leaked in "strict" mode
In "strict" mode, the snap remove hook is unable to:
This happens despite having the necessary snap plugs, which indicates a possible snap issue.
The netns assertions are already commented out. For the time being, we'll also comment out the code that checks if the shim processes were terminated, since this is expected to fail in "strict".
This issue came up as we started removing the k8s snap after every test.