Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in testCreateDownloadAnOS() on tumbleweed, disable broken SELinux on rawhide #1724

Merged
merged 3 commits into from
Jul 16, 2024

Conversation

martinpitt
Copy link
Member

This test almost always fails on tumbleweed, but never ever locally. I even tried to reproduce the exact series of nondestructive tests:

TEST_OS=opensuse-tumbleweed test/common/run-tests -tv TestMachinesStoragePools.testStoragePoolsDeletion TestMachinesSettings.testBootOrder TestMachinesNetworks.testNetworkState TestMachinesNICs.testNICDelete TestMachinesHostDevs.testHostDevicesList TestMachinesFilesystems.testBasicSessionConnection TestMachinesDisks.testDetachDisk TestMachinesCreate.testDisabledCreate TestMachinesCreate.testCreateImportDisk TestMachinesCreate.testCreateDownloadAnOS

The weather report also shows that this fails exclusively on tumbleweed, in more than 77% of runs.

Turning debugging to 11.

I'll give this a little bit of investigation, but only so much -- if I can't figure it out quickly, I'll disable the test for now.

@Nykseli @Lunarequest @SludgeGirl FYI

@Lunarequest
Copy link
Contributor

I actually just ran into this on a machine in our lab. Consistently crashed 3 times before the vm started

@martinpitt
Copy link
Member Author

martinpitt commented Jul 15, 2024

So here we have debug output, with a broken test log. Locally, I have a working log. I'll do a deep dive after lunch.

@martinpitt
Copy link
Member Author

martinpitt commented Jul 15, 2024

In the working case, we see the call to Destroy and Undefine followed fairly quickly by the DomainEvent [5,1] and [1,0] signals, which decode to "stopped" and "undefined". In the failing case, these exist as well, but there are two extra events in between:

debug: signal on /org/libvirt/QEMU: org.libvirt.Connect.DomainEvent(["/org/libvirt/QEMU/domain/_9c732434_c139_4745_8df4_dd54596e5486",5,1])
debug: handle DomainEvent on system: ignoring event AgentEvent
debug: signal on /org/libvirt/QEMU: org.libvirt.Connect.DomainEvent(["/org/libvirt/QEMU/domain/_9c732434_c139_4745_8df4_dd54596e5486",4,0])
debug: signal on /org/libvirt/QEMU: org.libvirt.Connect.DomainEvent(["/org/libvirt/QEMU/domain/_9c732434_c139_4745_8df4_dd54596e5486",2,0])
debug: libvirt call: system /org/libvirt/QEMU/domain/_9c732434_c139_4745_8df4_dd54596e5486 org.libvirt.Domain GetXMLDesc [1] {"timeout":30000,"type":"u"}
debug: signal on /org/libvirt/QEMU: org.libvirt.Connect.DomainEvent(["/org/libvirt/QEMU/domain/_9c732434_c139_4745_8df4_dd54596e5486",1,0])

which mean (4) "Resumed" and (2) "Started". So these supersede the previous "Stopped" event. This explains the screenshot where the VM is indeed shown as "Running".

So that's a typical race condition which is entirely timing dependent. Testing a fix in commit 6399fdb which waits for the initial state, but it failed.

Testing with a sleep in db28cac, just to make sure that this is really it (not a real solution of course). But this is still not enough.

I also tried a "proper" wait:

--- test/check-machines-create
+++ test/check-machines-create
@@ -1342,7 +1342,9 @@ vnc_password= "{vnc_passwd}"
                     user_login = virt_install_cmd_out.split("user-login=", 1)[1].split(",")[0].rstrip()
                     self.assertIn(user_login, self.user_login)
 
-            time.sleep(5)
+            # wait for virt-install to finish
+            if self.create_and_run:
+                self.machine.execute(f"while {virt_install_cmd}; do sleep 1; done", timeout=300)
 
         def fill(self):
             b = self.browser

With the default 60s timeout that fails (times out). With 300s timeout it eventually goes away. This is super annoying, but let's see how far it gets: commit 3cf37a4

@martinpitt
Copy link
Member Author

OK, succeeded . There's something wrong with virt-install on tumbleweed, but I'm running out of steam/time and domain knowledge here. Let's put that in.

…ed()

Similar to what we do in other scenarios, make sure that the VM is in
the expected state after creating the VM.

This does not fix any flake, it's just a good additional check.
@Nykseli
Copy link
Contributor

Nykseli commented Jul 15, 2024

Thanks @martinpitt for taking a look and figuring it out!

I should have time tomorrow to take a better look at this. Hopefully I get lucky and find what's going on with virt-install 🙂

@martinpitt
Copy link
Member Author

Awesome -- this now fails everywhere except tumbleweed. So I'll restrict the workaround for now.

this failure is unrelated and also very high on the weather report. But one thing at a time..

Wait for `virt-install` to finish, so that it doesn't race with the UI
shutting down and deleting the VM.

This takes *very* long on current OpenSUSE Tumbleweed unfortunately, but
that's better than almost always failing. virt-install doesn't end at
all anywhere else, so for now we have to restrict this hack to
tumbleweed.
@martinpitt
Copy link
Member Author

Meh, this rawhide failure is entirely unrelated and new, also happened in fedora-selinux/selinux-policy#2235:

error: Failed to get libvirt version from the dbus API: {"problem":null,"name":"org.libvirt.Error","message":"Failed to connect socket to '/var/run/libvirt/virtqemud-sock': Permission denied"}

This feels like the same as in https://issues.redhat.com/browse/RHEL-46893 - let's discuss in the selinux-policy PR.

@martinpitt
Copy link
Member Author

Filed https://bugzilla.redhat.com/show_bug.cgi?id=2297965 . I added another commit to disable SELinux in rawhide, similar to our already existing RHEL 10 hack. sigh

@martinpitt martinpitt marked this pull request as ready for review July 15, 2024 15:37
@martinpitt martinpitt changed the title TEST: Debug testCreateDownloadAnOS flake on opensuse-tumbleweed Fix race condition in testCreateDownloadAnOS() on tumbleweed, disable broken SELinux on rawhide Jul 15, 2024
@mvollmer
Copy link
Member

Awesome -- this now fails everywhere except tumbleweed.

@martinpitt, do you have an idea why?

Copy link
Member

@mvollmer mvollmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't argue with the results. :-)

@mvollmer mvollmer merged commit 1e3ffd5 into cockpit-project:main Jul 16, 2024
27 checks passed
@martinpitt martinpitt deleted the flake branch July 16, 2024 05:32
@martinpitt
Copy link
Member Author

@martinpitt, do you have an idea why?

Unfortunately not. See my debugging notes above for everything else I've tried. The exact inner workings of virt-install are still mysterious to me -- sometimes it keeps running forever, sometimes it finishes, I can't predict it 😢

So this is indeed just a "throw hands into the air" hack, and I'm sure it'll bite back some day.

@Nykseli
Copy link
Contributor

Nykseli commented Jul 19, 2024

I tried to look into it but wasn't able to figure out anything either.
I guess I'll have to leave it be for now and get back to it if and when it starts breaking again 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants