Fix flaky tests #767

Merged: 17 commits from flaky_tests into canonical:main, Nov 13, 2024

Conversation

@petrutlucian94 (Contributor) commented Oct 31, 2024

This PR fixes a few integration tests that occasionally fail:

test_skip_services_stop_on_remove

The test doesn't wait for all nodes to become ready; it removes one out of 3 nodes and then expects 2 nodes to be ready.

Sometimes this assertion fails because one of the nodes never actually reaches the "ready" state, leaving just one ready node instead of two.

We'll fix this flaky test by first ensuring that all cluster nodes are ready before proceeding to remove nodes.
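
A minimal sketch of the fix, assuming a wait_until_k8s_ready helper alongside the test utilities used in this PR (names are illustrative):

    # Make sure every node reports Ready before removing one, so the
    # post-removal assertion starts from a known state.
    util.wait_until_k8s_ready(cluster_node, instances)
    cluster_node.exec(["k8s", "remove-node", removed_node.id])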

node removal fails immediately after bootstrapping the cluster

k8sd sometimes fails when requested to remove nodes immediately after bootstrapping the cluster. It seems that it takes a little longer for trust store changes to be propagated to all nodes, which should probably be fixed on the microcluster side.

For now, we'll perform some retries.

    failed to POST /k8sd/cluster/remove: failed to delete cluster member
    k8s-integration-c1aee0-2: No truststore entry found for node with name
    "k8s-integration-c1aee0-2"

While at it, we're adding some flake8 rules to the ignore list since the linter cannot properly handle formatted strings.

reduce the scope of session_instance

session_instance is a fixture that allows sharing the same k8s environment for multiple tests, aiming to reduce test duration.

The problem is that some tests interfere with each other and fail.

We could try to improve the test cleanup, ensuring that the shared environment is brought back to a "clean" state when the test finishes; however, we'd most probably continue to have flaky tests.

To address this, we're reducing the scope of session_instance to function scope. Since it provides a single node cluster with all features enabled, we'll rename it to "aio_instance".

Isolating the tests may also allow us to run them in parallel, which should reduce the overall test duration (note that there are some concerns around Cilium though).
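
A minimal sketch of the scope change (fixture bodies elided; only the decorator and name change):

    import pytest

    # Before: one instance shared across the whole test session.
    # @pytest.fixture(scope="session")
    # def session_instance(): ...

    # After: a fresh all-in-one instance per test function; same setup,
    # renamed to reflect "all features enabled" rather than "shared".
    @pytest.fixture(scope="function")
    def aio_instance():
        ...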

leaked disk loopback devices

The tests currently leak PVC loopback devices and eventually fail when there are no available loopback devices. We'll ensure that the loopback devices are cleaned up when removing the snap. At the same time, we'll update the test fixture to remove the snap before cleaning up the instance.
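
Expressed in Python for illustration, the cleanup might look like this (the back-file path filter is an assumption):

    import json
    import subprocess

    # Detach loopback devices whose backing files live under the snap's
    # data directory, so repeated test runs don't exhaust /dev/loop*.
    out = subprocess.run(
        ["losetup", "--json", "--list"], capture_output=True, check=True
    )
    loops = json.loads(out.stdout or b'{"loopdevices": []}')
    for dev in loops.get("loopdevices", []):
        if "/var/snap/k8s/" in (dev.get("back-file") or ""):
            subprocess.run(["losetup", "--detach", dev["name"]], check=False)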

Trivy failures due to API rate limiting

The Trivy jobs often fail, being unable to download images due to rate limiting. To solve this, we'll use a different registry.

Leaked containerd mounts prevent the k8s-snap from being removed

We're updating the k8s-snap remove hook to clean up remaining containerd mounts.
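
A rough Python illustration of that cleanup (mount prefixes taken from the checks shown later in this PR):

    import subprocess

    # Unmount leftover kubelet/containerd mounts, deepest paths first so
    # nested mounts are released before their parents.
    with open("/proc/self/mounts") as f:
        mountpoints = [line.split()[1] for line in f]

    leftovers = [
        m for m in mountpoints
        if m.startswith(("/var/lib/kubelet/pods", "/run/containerd/io.containerd"))
    ]
    for target in sorted(leftovers, reverse=True):
        subprocess.run(["umount", target], check=False)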

Increase timeouts

In some cases, the k8s nodes do not come online in time, for which reason we'll increase the timeouts.

shim processes leaked in "strict" mode

In "strict" mode, the snap remove hook is unable to:

  • terminate processes
  • remove network namespaces
  • list mounts

This happens despite having the necessary snap plugs, which indicates a possible snap issue.

The netns assertions are already commented out. For the time being, we'll also comment out the code that checks if the shim processes were terminated, since this is expected to fail in "strict".

This issue came up as we started removing the k8s snap after every test.

@petrutlucian94 requested a review from a team as a code owner on October 31, 2024 09:31
@petrutlucian94 (Contributor, Author) commented Nov 1, 2024

After dropping the session-scoped instance, the GitHub Actions workers are running out of disk space.

@claudiubelu (Contributor) commented:

> After dropping the session-scoped instance, the GitHub Actions workers are running out of disk space.

GitHub Actions runners have plenty of space, though they come bloated with tools that are useful in the most common scenarios. Having them preinstalled makes sense overall, since they don't have to be installed over and over, but it is detrimental in cases such as this.

However, there is a solution for such cases: remove the bloat, as done in https://github.com/canonical/k8s-workflows/blob/main/.github/workflows/build_rocks.yaml#L200-L209. I've used something like this as well; you can easily free up 20-30 GB, and the action has more options if you need more.

@petrutlucian94 force-pushed the flaky_tests branch 7 times, most recently from ba8585e to 63c1802 on November 6, 2024 12:48
We dropped the "session_instance" fixture, always using isolated
environments to prevent the test from interfering with each other.

The problem is that we're now running out of disk space, for which
reason we'll use large self-hosted runners. Most probably we were
already close to the node capacity, so this problem would've
surfaced sooner or later anyway.

We tried to clean up disk space using the same helper as the Rockcraft jobs, but it did not work; we sometimes lose node connectivity:

  https://github.com/canonical/k8s-workflows/blob/main/.github/workflows/build_rocks.yaml#L200-L209

  The hosted runner: GitHub Actions 61 lost communication with the
  server. Anything in your workflow that terminates the runner
  process, starves it for CPU/Memory, or blocks its network access
  can cause this error.
We'll drop the aio_instance fixture (formerly called session_instance) altogether, explicitly requesting bootstrap-all.yaml instead.
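
For illustration, a test would now request the config explicitly, roughly like this (the bootstrap_config marker and config.MANIFESTS_DIR follow the pattern of this test suite, but the exact names here are assumptions):

    import pytest
    from test_util import config

    # Request the all-features bootstrap config per test instead of
    # relying on a shared fixture.
    @pytest.mark.bootstrap_config(
        (config.MANIFESTS_DIR / "bootstrap-all.yaml").read_text()
    )
    def test_smoke(instances):
        ...
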
We dropped the "session_instance" fixture, explicitly using
the "instances" fixture with the "bootstrap-all.yaml" config.

"session_instances" waited for the k8s services to be ready, the
"instances" fixture does not, so we'll have to update the tests
accordingly. Some tests are currently failing because of this:

    Error from server (Forbidden): pods "busybox" is forbidden:
    error looking up service account default/default: serviceaccount
    "default" not found
This commit ensures that the PVC loopback devices are cleaned up
when removing the k8s snap.

The issue came up after multiple integration test runs, which were
leaking loopback devices. At some point, there were no more loopback
devices available and the tests started to experience mount failures.

This also raises another problem: there's a small number of loopback
devices by default, which may affect the number of PVCs that we can
use.
At the moment, the "instances" fixture removes the test instances
without uninstalling the k8s snap. In case of LXD instances, this
can leak certain resources on the host, for example loopback
devices.

We'll update the fixture to remove the k8s snap as part of the
instance cleanup procedure.
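
A sketch of the new teardown order (fixture and harness names are illustrative, not the exact code):

    import pytest

    @pytest.fixture
    def instances(h):
        created = []
        yield created
        for instance in created:
            # Remove the snap first so host-side resources (e.g.
            # loopback devices) are released...
            instance.exec(["snap", "remove", "k8s", "--purge"])
            # ...then delete the instance itself.
            h.delete_instance(instance.id)
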
Right now, the LXD instances cannot communicate with each other.

This might be because of Docker, so we'll try to apply the well
known iptables workaround.
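
The workaround in question, sketched in Python for illustration (the lxdbr0 bridge name is an assumption):

    import subprocess

    # Docker sets the FORWARD chain policy to DROP, which silently drops
    # traffic between LXD instances; explicitly allow the LXD bridge
    # traffic via the DOCKER-USER chain.
    subprocess.run(
        ["iptables", "-I", "DOCKER-USER", "-i", "lxdbr0", "-j", "ACCEPT"],
        check=True,
    )
    subprocess.run(
        ["iptables", "-I", "DOCKER-USER", "-o", "lxdbr0", "-m", "conntrack",
         "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
        check=True,
    )
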
We're now removing the k8s-snap before cleaning up the harness instances; however, it sometimes fails with:

  error: snap "k8s" has "service-control" change in progress

For this reason, we'll add some retries.
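
A sketch of the retry, using the same stubbornly helper as the tests:

    # snapd may still have a "service-control" change in flight; keep
    # retrying the removal until it goes through.
    util.stubbornly(retries=10, delay_s=5).on(instance).exec(
        ["snap", "remove", "k8s", "--purge"]
    )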

At the same time, the "strict" job fails while waiting for some of
the processes to go away. We'll try increasing the timeouts,
although this might be related to the "strict" job issues mentioned
by an inline comment.
After switching to self-hosted runners, some tests time out while
waiting for the k8s nodes to be ready.

It could be that the self-hosted runners have less disk or network bandwidth than the GitHub-hosted runners. We'll try increasing the timeouts.
@bschimke95 (Contributor) left a comment:

LGTM, thanks!

@petrutlucian94 (Contributor, Author) commented:

There might be some problems with the k8s snap removal on "strict". It times out while waiting for some processes to stop and I see that the k8s services are still running.

There's an inline note on this along with some checks that are commented out:

LOG.info("Waiting for shims to go away...")
util.stubbornly(retries=5, delay_s=5).on(instance).until(
lambda p: all(
x not in p.stdout.decode()
for x in ["containerd-shim", "cilium", "coredns", "/pause"]
)
).exec(["ps", "-fea"])
LOG.info("Waiting for kubelet and containerd mounts to go away...")
util.stubbornly(retries=5, delay_s=5).on(instance).until(
lambda p: all(
x not in p.stdout.decode()
for x in ["/var/lib/kubelet/pods", "/run/containerd/io.containerd"]
)
).exec(["mount"])
# NOTE(neoaggelos): Temporarily disable this as it fails on strict.
# For details, `snap changes` then `snap change $remove_k8s_snap_change`.
# Example output follows:
#
# 2024-02-23T14:10:42Z ERROR ignoring failure in hook "remove":
# -----
# ...
# ip netns delete cni-UUID1
# Cannot remove namespace file "/run/netns/cni-UUID1": Device or resource busy
# ip netns delete cni-UUID2
# Cannot remove namespace file "/run/netns/cni-UUID2": Device or resource busy
# ip netns delete cni-UUID3
# Cannot remove namespace file "/run/netns/cni-UUID3": Device or resource busy
# LOG.info("Waiting for CNI network namespaces to go away...")
# util.stubbornly(retries=5, delay_s=5).on(instance).until(
# lambda p: "cni-" not in p.stdout.decode()
# ).exec(["ip", "netns", "list"])

@petrutlucian94 (Contributor, Author) commented:

We still have tests that time out waiting for k8s nodes to be ready even after doubling the timeouts, I'll have to take a closer look.

    INFO     test_util.util:util.py:66 Attempt 59/60 failed. Error: Failed to meet condition
    INFO     test_util.util:util.py:69 Retrying in 5 seconds...
    DEBUG    test_util.harness.lxd:lxd.py:220 Execute command ['k8s', 'kubectl', 'get', 'node', 'k8s-integration-8358f1-13', '--no-headers'] in instance k8s-integration-863dd6-12
    DEBUG    test_util.util:util.py:36 Execute command lxc shell k8s-integration-863dd6-12 -- bash -c 'k8s kubectl get node k8s-integration-8358f1-13 --no-headers' (kwargs={'capture_output': True, 'check': True})

The Trivy jobs are currently affected by rate limits and fail to
download the image. We'll use an alternate registry to solve this
issue.
test_gateway and test_ingress are failing after switching from
"session_instance". The reason is that we need to apply the
load balancer configuration, otherwise the gateway won't receive
an IP address.
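
Roughly what the tests now apply before asserting on the gateway address (the CIDR value is illustrative):

    # Configure and enable the load balancer so gateway/ingress services
    # can be assigned an external IP.
    instance.exec(["k8s", "set", "load-balancer.cidrs=172.16.100.0/24"])
    instance.exec(["k8s", "enable", "load-balancer"])
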
Leaked containerd mounts prevent the k8s snap from being removed, for which reason we'll update the snap remove hook to clean them up.
In "strict" mode, "snap remove k8s --purge" is unable to list
and remove remaining shim processes, being denied by Apparmor.

We'll solve this by adding the "system-observe" plug.
The tests may fail to remove the snap for various reasons. In order to avoid leaking harness instances that could potentially exhaust the host resources, we'll clean up the harness instances even if we were unable to remove the snap.
In "strict" mode, the snap remove hook is unable to:

* terminate processes
* remove network namespaces
* list mounts

This happens despite having the necessary snap plugs, which
indicates a possible snap issue.

The netns assertions are already commented out. For the time being,
we'll also comment out the code that checks if the shim processes
where terminated since this is expected to fail in "strict".

This issue came up as we started removing the k8s snap after *every*
test.

@berkayoz (Member) left a comment:

LGTM

@berkayoz berkayoz merged commit 4291ca3 into canonical:main Nov 13, 2024
17 of 19 checks passed
evilnick pushed a commit to evilnick/k8s-snap that referenced this pull request Nov 14, 2024