Fix flaky tests #767

Merged: 17 commits from flaky_tests into canonical:main, Nov 13, 2024

Conversation

@petrutlucian94 (Contributor) commented Oct 31, 2024

This PR fixes a few integration tests that occasionally fail:

test_skip_services_stop_on_remove

The test doesn't wait for all nodes to become ready; it removes one out of 3 nodes and then expects 2 nodes to be ready.

Sometimes this assertion fails because one of the nodes never actually reaches the "ready" state, leaving just one ready node instead of two.

We'll fix this flaky test by first ensuring that all cluster nodes are ready before proceeding to remove nodes.
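
A minimal sketch of the fix, assuming a wait_until_k8s_ready helper alongside the test utilities used in this PR (names are illustrative):

    # Make sure every node reports Ready before removing one, so the
    # post-removal assertion starts from a known state.
    util.wait_until_k8s_ready(cluster_node, instances)
    cluster_node.exec(["k8s", "remove-node", removed_node.id])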

node removal fails immediately after bootstrapping the cluster

k8sd sometimes fails when requested to remove nodes immediately after bootstrapping the cluster. It seems that it takes a little longer for trust store changes to be propagated to all nodes, which should probably be fixed on the microcluster side.

For now, we'll perform some retries.

    failed to POST /k8sd/cluster/remove: failed to delete cluster member
    k8s-integration-c1aee0-2: No truststore entry found for node with name
    "k8s-integration-c1aee0-2"

While at it, we're adding some flake8 rules to the ignore list since the linter cannot properly handle formatted strings.

reduce the scope of session_instance

session_instance is a fixture that allows sharing the same k8s environment for multiple tests, aiming to reduce test duration.

The problem is that some tests interfere with each other and fail.

We could try to improve the test cleanup, ensuring that the shared environment is brought back to a "clean" state when the test finishes; however, we'd most probably continue to have flaky tests.

To address this, we're reducing the scope of session_instance to function scope. Since it provides a single node cluster with all features enabled, we'll rename it to "aio_instance".

Isolating the tests may also allow us to run them in parallel, which should reduce the overall test duration (note that there are some concerns around Cilium though).
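
A minimal sketch of the scope change (fixture bodies elided; only the decorator and name change):

    import pytest

    # Before: one instance shared across the whole test session.
    # @pytest.fixture(scope="session")
    # def session_instance(): ...

    # After: a fresh all-in-one instance per test function; same setup,
    # renamed to reflect "all features enabled" rather than "shared".
    @pytest.fixture(scope="function")
    def aio_instance():
        ...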

leaked disk loopback devices

The tests currently leak PVC loopback devices and eventually fail when there are no available loopback devices. We'll ensure that the loopback devices are cleaned up when removing the snap. At the same time, we'll update the test fixture to remove the snap before cleaning up the instance.
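
Expressed in Python for illustration, the cleanup might look like this (the back-file path filter is an assumption):

    import json
    import subprocess

    # Detach loopback devices whose backing files live under the snap's
    # data directory, so repeated test runs don't exhaust /dev/loop*.
    out = subprocess.run(
        ["losetup", "--json", "--list"], capture_output=True, check=True
    )
    loops = json.loads(out.stdout or b'{"loopdevices": []}')
    for dev in loops.get("loopdevices", []):
        if "/var/snap/k8s/" in (dev.get("back-file") or ""):
            subprocess.run(["losetup", "--detach", dev["name"]], check=False)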

Trivy failures due to API rate limiting

The Trivy jobs often fail, being unable to download images due to rate limiting. To solve this, we'll use a different registry.

Leaked containerd mounts prevent the k8s-snap from being removed

We're updating the k8s-snap remove hook to clean up remaining containerd mounts.
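
A rough Python illustration of that cleanup (mount prefixes taken from the checks shown later in this PR):

    import subprocess

    # Unmount leftover kubelet/containerd mounts, deepest paths first so
    # nested mounts are released before their parents.
    with open("/proc/self/mounts") as f:
        mountpoints = [line.split()[1] for line in f]

    leftovers = [
        m for m in mountpoints
        if m.startswith(("/var/lib/kubelet/pods", "/run/containerd/io.containerd"))
    ]
    for target in sorted(leftovers, reverse=True):
        subprocess.run(["umount", target], check=False)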

Increase timeouts

In some cases, the k8s nodes do not come online in time, for which reason we'll increase the timeouts.

shim processes leaked in "strict" mode

In "strict" mode, the snap remove hook is unable to:

  • terminate processes
  • remove network namespaces
  • list mounts

This happens despite having the necessary snap plugs, which indicates a possible snap issue.

The netns assertions are already commented out. For the time being, we'll also comment out the code that checks if the shim processes were terminated, since this is expected to fail in "strict".

This issue came up as we started removing the k8s snap after every test.

@petrutlucian94 requested a review from a team as a code owner on October 31, 2024 09:31
@petrutlucian94 (Contributor, Author) commented Nov 1, 2024

After dropping the session-scoped instance, the GitHub Actions workers are running out of disk space.

@claudiubelu (Contributor) commented:

> After dropping the session-scoped instance, the GitHub Actions workers are running out of disk space.

GitHub Actions runners have plenty of space, though they come bloated with tools that are useful in the most common scenarios. Having them preinstalled makes sense overall, since they don't have to be installed over and over, but it is detrimental in cases such as this.

However, there is a solution for such cases: remove the bloat, as done in https://github.com/canonical/k8s-workflows/blob/main/.github/workflows/build_rocks.yaml#L200-L209. I've used something like this as well; you can easily free up 20-30 GB, and the action has more options if you need more.

@petrutlucian94 force-pushed the flaky_tests branch 7 times, most recently from ba8585e to 63c1802 on November 6, 2024 12:48
We dropped the "session_instance" fixture, always using isolated
environments to prevent the test from interfering with each other.

The problem is that we're now running out of disk space, for which
reason we'll use large self-hosted runners. Most probably we were
already close to the node capacity, so this problem would've
surfaced sooner or later anyway.

We tried to clean up disk space using the same helper as the Rockcraft jobs, but it did not work; we sometimes lose node connectivity:

  https://github.com/canonical/k8s-workflows/blob/main/.github/workflows/build_rocks.yaml#L200-L209

  The hosted runner: GitHub Actions 61 lost communication with the
  server. Anything in your workflow that terminates the runner
  process, starves it for CPU/Memory, or blocks its network access
  can cause this error.
We'll drop the aio_instance fixture (formerly called session_instance) altogether, explicitly requesting bootstrap-all.yaml instead.
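
For illustration, a test would now request the config explicitly, roughly like this (the bootstrap_config marker and config.MANIFESTS_DIR follow the pattern of this test suite, but the exact names here are assumptions):

    import pytest
    from test_util import config

    # Request the all-features bootstrap config per test instead of
    # relying on a shared fixture.
    @pytest.mark.bootstrap_config(
        (config.MANIFESTS_DIR / "bootstrap-all.yaml").read_text()
    )
    def test_smoke(instances):
        ...
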
We dropped the "session_instance" fixture, explicitly using
the "instances" fixture with the "bootstrap-all.yaml" config.

"session_instances" waited for the k8s services to be ready, the
"instances" fixture does not, so we'll have to update the tests
accordingly. Some tests are currently failing because of this:

    Error from server (Forbidden): pods "busybox" is forbidden:
    error looking up service account default/default: serviceaccount
    "default" not found
This commit ensures that the PVC loopback devices are cleaned up
when removing the k8s snap.

The issue came up after multiple integration test runs, which were
leaking loopback devices. At some point, there were no more loopback
devices available and the tests started to experience mount failures.

This also raises another problem: there's a small number of loopback
devices by default, which may affect the number of PVCs that we can
use.
At the moment, the "instances" fixture removes the test instances
without uninstalling the k8s snap. In case of LXD instances, this
can leak certain resources on the host, for example loopback
devices.

We'll update the fixture to remove the k8s snap as part of the
instance cleanup procedure.
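
A sketch of the new teardown order (fixture and harness names are illustrative, not the exact code):

    import pytest

    @pytest.fixture
    def instances(h):
        created = []
        yield created
        for instance in created:
            # Remove the snap first so host-side resources (e.g.
            # loopback devices) are released...
            instance.exec(["snap", "remove", "k8s", "--purge"])
            # ...then delete the instance itself.
            h.delete_instance(instance.id)
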
Right now, the LXD instances cannot communicate with each other.

This might be because of Docker, so we'll try to apply the well
known iptables workaround.
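
The workaround in question, sketched in Python for illustration (the lxdbr0 bridge name is an assumption):

    import subprocess

    # Docker sets the FORWARD chain policy to DROP, which silently drops
    # traffic between LXD instances; explicitly allow the LXD bridge
    # traffic via the DOCKER-USER chain.
    subprocess.run(
        ["iptables", "-I", "DOCKER-USER", "-i", "lxdbr0", "-j", "ACCEPT"],
        check=True,
    )
    subprocess.run(
        ["iptables", "-I", "DOCKER-USER", "-o", "lxdbr0", "-m", "conntrack",
         "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
        check=True,
    )
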
We're now removing the k8s-snap before cleaning up the harness instances; however, it sometimes fails with:

  error: snap "k8s" has "service-control" change in progress

For this reason, we'll add some retries.
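
A sketch of the retry, using the same stubbornly helper as the tests:

    # snapd may still have a "service-control" change in flight; keep
    # retrying the removal until it goes through.
    util.stubbornly(retries=10, delay_s=5).on(instance).exec(
        ["snap", "remove", "k8s", "--purge"]
    )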

At the same time, the "strict" job fails while waiting for some of
the processes to go away. We'll try increasing the timeouts,
although this might be related to the "strict" job issues mentioned
by an inline comment.
After switching to self-hosted runners, some tests time out while
waiting for the k8s nodes to be ready.

It could be that the self-hosted runners have less disk or network bandwidth than the GitHub-hosted runners. We'll try increasing the timeouts.
@bschimke95 (Contributor) left a comment:

LGTM, thanks!

@petrutlucian94 (Contributor, Author) commented:

There might be some problems with the k8s snap removal on "strict". It times out while waiting for some processes to stop and I see that the k8s services are still running.

There's an inline note on this along with some checks that are commented out:

LOG.info("Waiting for shims to go away...")
util.stubbornly(retries=5, delay_s=5).on(instance).until(
lambda p: all(
x not in p.stdout.decode()
for x in ["containerd-shim", "cilium", "coredns", "/pause"]
)
).exec(["ps", "-fea"])
LOG.info("Waiting for kubelet and containerd mounts to go away...")
util.stubbornly(retries=5, delay_s=5).on(instance).until(
lambda p: all(
x not in p.stdout.decode()
for x in ["/var/lib/kubelet/pods", "/run/containerd/io.containerd"]
)
).exec(["mount"])
# NOTE(neoaggelos): Temporarily disable this as it fails on strict.
# For details, `snap changes` then `snap change $remove_k8s_snap_change`.
# Example output follows:
#
# 2024-02-23T14:10:42Z ERROR ignoring failure in hook "remove":
# -----
# ...
# ip netns delete cni-UUID1
# Cannot remove namespace file "/run/netns/cni-UUID1": Device or resource busy
# ip netns delete cni-UUID2
# Cannot remove namespace file "/run/netns/cni-UUID2": Device or resource busy
# ip netns delete cni-UUID3
# Cannot remove namespace file "/run/netns/cni-UUID3": Device or resource busy
# LOG.info("Waiting for CNI network namespaces to go away...")
# util.stubbornly(retries=5, delay_s=5).on(instance).until(
# lambda p: "cni-" not in p.stdout.decode()
# ).exec(["ip", "netns", "list"])

@petrutlucian94 (Contributor, Author) commented:

We still have tests that time out waiting for k8s nodes to be ready even after doubling the timeouts, I'll have to take a closer look.

    INFO     test_util.util:util.py:66 Attempt 59/60 failed. Error: Failed to meet condition
    INFO     test_util.util:util.py:69 Retrying in 5 seconds...
    DEBUG    test_util.harness.lxd:lxd.py:220 Execute command ['k8s', 'kubectl', 'get', 'node', 'k8s-integration-8358f1-13', '--no-headers'] in instance k8s-integration-863dd6-12
    DEBUG    test_util.util:util.py:36 Execute command lxc shell k8s-integration-863dd6-12 -- bash -c 'k8s kubectl get node k8s-integration-8358f1-13 --no-headers' (kwargs={'capture_output': True, 'check': True})

The Trivy jobs are currently affected by rate limits and fail to
download the image. We'll use an alternate registry to solve this
issue.
test_gateway and test_ingress are failing after switching from
"session_instance". The reason is that we need to apply the
load balancer configuration, otherwise the gateway won't receive
an IP address.
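
Roughly what the tests now apply before asserting on the gateway address (the CIDR value is illustrative):

    # Configure and enable the load balancer so gateway/ingress services
    # can be assigned an external IP.
    instance.exec(["k8s", "set", "load-balancer.cidrs=172.16.100.0/24"])
    instance.exec(["k8s", "enable", "load-balancer"])
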
Leaked containerd mounts prevent the k8s snap from being removed, for which reason we'll update the snap remove hook to clean them up.
In "strict" mode, "snap remove k8s --purge" is unable to list
and remove remaining shim processes, being denied by Apparmor.

We'll solve this by adding the "system-observe" plug.
The tests may fail to remove the snap for various reasons. In order to avoid leaking harness instances that could potentially exhaust the host resources, we'll clean up the harness instances even if we were unable to remove the snap.
In "strict" mode, the snap remove hook is unable to:

* terminate processes
* remove network namespaces
* list mounts

This happens despite having the necessary snap plugs, which
indicates a possible snap issue.

The netns assertions are already commented out. For the time being,
we'll also comment out the code that checks if the shim processes
where terminated since this is expected to fail in "strict".

This issue came up as we started removing the k8s snap after *every*
test.

@berkayoz (Member) left a comment:

LGTM

@berkayoz berkayoz merged commit 4291ca3 into canonical:main Nov 13, 2024
17 of 19 checks passed
evilnick pushed a commit to evilnick/k8s-snap that referenced this pull request Nov 14, 2024