fail to assign pci device at start of usbvm #1544

Closed
Rudd-O opened this issue Dec 26, 2015 · 47 comments
Labels
C: Xen diagnosed Technical diagnosis has been performed (see issue comments). P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. R: cannot reproduce Resolution: Attempts to replicate the problem have not been reliably successful enough to proceed. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@Rudd-O

Rudd-O commented Dec 26, 2015

I'm experiencing a weird error starting a usbvm:

[user@dom0 ~]$ qvm-start usbvm
--> Creating volatile image: /var/lib/qubes/appvms/usbvm/volatile.img...
--> Loading the VM (type = AppVM)...
Traceback (most recent call last):
  File "/usr/bin/qvm-start", line 125, in <module>
    main()
  File "/usr/bin/qvm-start", line 109, in main
    xid = vm.start(verbose=options.verbose, preparing_dvm=options.preparing_dvm, start_guid=not options.noguid, notify_function=tray_notify_generic if options.tray else None)
  File "/usr/lib64/python2.7/site-packages/qubes/modules/000QubesVm.py", line 1849, in start
    nd.dettach()
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 5249, in dettach
    if ret == -1: raise libvirtError ('virNodeDeviceDettach() failed')
libvirt.libvirtError: Requested operation is not valid: PCI device 0000:00:1a.0 is in use by driver xenlight, domain usbvm

Restarting libvirtd only aggravates the issue:

[user@dom0 ~]$ qvm-start usbvm
--> Creating volatile image: /var/lib/qubes/appvms/usbvm/volatile.img...
--> Loading the VM (type = AppVM)...
Traceback (most recent call last):
  File "/usr/bin/qvm-start", line 125, in <module>
    main()
  File "/usr/bin/qvm-start", line 109, in main
    xid = vm.start(verbose=options.verbose, preparing_dvm=options.preparing_dvm, start_guid=not options.noguid, notify_function=tray_notify_generic if options.tray else None)
  File "/usr/lib64/python2.7/site-packages/qubes/modules/000QubesVm.py", line 1857, in start
    self.libvirt_domain.createWithFlags(libvirt.VIR_DOMAIN_START_PAUSED)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1059, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirt.libvirtError: internal error: libxenlight failed to create new domain 'usbvm'

Weird errors in libxl log:

2015-12-26 x:31:47 TZ libxl: error: libxl_pci.c:1000:do_pci_add: xc_assign_device failed: Operation not permitted
2015-12-26 x:31:47 TZ libxl: error: libxl_create.c:1422:domcreate_attach_pci: libxl_device_pci_add failed: -3

The hypervisor log says:

(XEN) [VT-D] It's disallowed to assign 0000:00:1a.0 with shared RMRR at dbe9a000 for Dom19.
(XEN) XEN_DOMCTL_assign_device: assign 0000:00:1a.0 to dom19 failed (-1)
@Rudd-O
Author

Rudd-O commented Dec 26, 2015

What's that about the RMRR?

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

Appears to be some new shit:

http://www.gossamer-threads.com/lists/xen/devel/391684

@marmarek
Member

The VM has never been started.

Not even using autostart at boot?

What's that about the RMRR?
Appears to be some new shit:
http://www.gossamer-threads.com/lists/xen/devel/391684

We have a way to set rdm_policy=relaxed, bundled with pci_strictreset=false - it should be set by default salt formula for sys-usb, exactly for this reason.
My understanding is that those devices in fact share some resources, so they can't be safely isolated from each other. And Xen doesn't support group assignment (at least for now), so it doesn't know that you are going to assign all such devices to the same VM (which should be safe).
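
For readers creating a USB VM by hand, pci_strictreset is a per-VM property. What follows is a minimal sketch of setting it from dom0, assuming the Qubes 3.x `qvm-prefs -s <vm> <property> <value>` syntax. The helper names and the dry-run default are illustrative and not part of Qubes; rdm_policy is not covered here.

```python
import subprocess


def strictreset_cmd(vm_name, enabled=False):
    """Build the argv that sets pci_strictreset on a VM.

    Assumes the Qubes 3.x form: qvm-prefs -s <vm> <property> <value>.
    """
    value = "true" if enabled else "false"
    return ["qvm-prefs", "-s", vm_name, "pci_strictreset", value]


def apply_strictreset(vm_name, enabled=False, dry_run=True):
    """Print the command in dry-run mode; otherwise run it (dom0 only)."""
    cmd = strictreset_cmd(vm_name, enabled)
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.call(cmd)


# Dry run: only prints the command that would be executed.
apply_strictreset("usbvm")
```

Relaxing the strict reset requirement lowers isolation guarantees, so it is probably best limited to VMs that hold every device sharing the resource.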

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

Yes, the VM had autostart at boot and the systemd service had failed for this reason.

How do I determine which devices share the RMRR? I couldn't find anything in my logs.

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

Holy shit, setting pci_strictreset to False actually let me start that stupid VM!

@marmarek
Member

I don't know, but I guess it is the other USB controller (or USB2.0/USB3.0). If you assign both/all of them to the same VM, you'll see the same address in the Xen log (assuming you set pci_strictreset=False first, otherwise VM start will fail at the first device...). Yes, it's a kinda ugly way to determine that...
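
The log-reading approach described above can be sketched as a small script that groups PCI devices by the RMRR address Xen reports for them. It assumes the "risky" message has the same layout as the "disallowed" one quoted earlier in this issue; the second sample line is hypothetical and only illustrates two devices sharing an address.

```python
import re
from collections import defaultdict

# Matches hypervisor log lines such as:
#   (XEN) [VT-D] It's disallowed to assign 0000:00:1a.0 with shared RMRR at dbe9a000 for Dom19.
# The "risky" wording is ASSUMED to follow the same layout.
RMRR_LINE = re.compile(
    r"It's (?:disallowed|risky) to assign "
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) "
    r"with shared RMRR at (?P<addr>[0-9a-fA-F]+)"
)


def group_by_rmrr(log_text):
    """Map each RMRR address to the sorted list of devices reported with it."""
    groups = defaultdict(set)
    for line in log_text.splitlines():
        match = RMRR_LINE.search(line)
        if match:
            groups[match.group("addr").lower()].add(match.group("bdf").lower())
    return {addr: sorted(devs) for addr, devs in groups.items()}


# First line is from this issue; the second is a HYPOTHETICAL example.
SAMPLE_LOG = """\
(XEN) [VT-D] It's disallowed to assign 0000:00:1a.0 with shared RMRR at dbe9a000 for Dom19.
(XEN) [VT-D] It's risky to assign 0000:00:1d.0 with shared RMRR at dbe9a000 for Dom19.
"""

for addr, devices in sorted(group_by_rmrr(SAMPLE_LOG).items()):
    print(addr, " ".join(devices))
```

In practice the input would be the hypervisor log (for example the output of xl dmesg) captured after a start attempt with all candidate devices assigned.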

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

Wait, spoke too soon. The VM never ran qrexec-daemon.

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

It now says in the hypervisor log "It's risky to assign blah blah".

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

libxl log:

<date> libxl: error: libxl_device.c:1215:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/24/0 not ready

@marmarek
Member

Did it crash just after startup (state "c" in xl list)? If so, there is probably not enough contiguous memory available (take a look at #1038 for details). You can try to free some with xl mem-set 0 <some-number-in-MB> to reduce dom0 memory drastically, for example down to 1500. Sometimes it helps. Otherwise, reboot...
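
The state check mentioned above can be sketched as a parser over xl list output that reports domains whose state flags include "c". The sample table is hypothetical and assumes the usual six-character state column (flags r, b, p, s, c, d); the function name is illustrative.

```python
import re

# The state column of `xl list` is assumed to be six characters drawn from
# the flags r, b, p, s, c, d, with "-" for unset flags (e.g. "----c-").
STATE = re.compile(r"^[rbpscd-]{6}$")


def crashed_domains(xl_list_output):
    """Return the names of domains whose state flags include 'c' (crashed)."""
    crashed = []
    for line in xl_list_output.splitlines():
        fields = line.split()
        # Expected columns: Name, ID, Mem, VCPUs, State, Time(s)
        if len(fields) < 6:
            continue
        state = fields[-2]
        if STATE.match(state) and "c" in state:
            crashed.append(fields[0])
    return crashed


# HYPOTHETICAL sample output, for illustration only.
SAMPLE = """\
Name                                        ID   Mem VCPUs      State   Time(s)
dom0                                         0  4000     4     r-----    1234.5
usbvm                                       24   300     1     ----c-       0.3
"""

print(crashed_domains(SAMPLE))
```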

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

Yes. It's state crashed. I just checked.

Assigning all USB devices to the same VM worked to fix the problem.

This sucks. Now I don't have my mouse.

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

Note that assigning all USB PCI devices did NOT help start the VM. Even with pci_strictreset set to false. It just killed my mouse.

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

I will try rebooting now. BRB.

@marmarek
Member

The VM crash at startup is generally a problem with starting a VM with PCI devices after some system uptime, since memory is heavily fragmented by then. It is independent of the previous problem (which is solved with pci_strictreset).

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

Alright, excellent. My USB VM has now received the assignment of the two USB PCI devices I intended to isolate (the Bluetooth and camera devices). I still keep the ability to use my mouse. This is GREAT.

Thanks for the pci_strictreset trick.

Improvement: it really should be somehow autodetected whether it is necessary or not.

@marmarek
Member

It is set for USB VM by salt formula by default.

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

Yes, that's true, but it would not be the default for a manually created USB VM, which was my case, and I bet it is the case for many users. A smart default lower in the stack would reduce the support load.

@marmarek
Member

The proper solution would be to have PCI group assignment supported by Xen. This way it would detect whether it is really risky to assign particular devices to the VM.

@Rudd-O
Author

Rudd-O commented Dec 26, 2015

Agreed.

@marmarek marmarek added T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. C: Xen labels Jan 6, 2016
@marmarek marmarek added this to the Far in the future milestone Jan 6, 2016
@marmarek
Member

marmarek commented Jan 6, 2016

Summary:

  • Xen missing feature of PCI group device assignment
  • libvirt bug in tracking which PCI device is used where (libvirt.libvirtError: Requested operation is not valid: PCI device 0000:00:1a.0 is in use by driver xenlight, domain usbvm while starting usbvm)

@Rudd-O
Author

Rudd-O commented Jan 16, 2016

Quick update: I assigned two of my USB PCI devices (out of three) to a USB VM. That caused hangs and reboots about once a day. They stopped happening as soon as I decided never to power on that VM again. I have yet to try adding all three USB PCI devices to the USB VM (doing so would disable all USB ports on this machine). We'll see if that causes hangs.

@andrewclausen

The pci_strictreset option didn't have any effect for me. (Exactly the same error messages, etc.)

@Rudd-O
Author

Rudd-O commented Mar 15, 2016

I found my problem. It was a mouse whose receiver stopped working properly, and started causing lockups and hard reboots whenever it was plugged, irrespective of which VMs it was assigned to. The mouse is now in the trash. PCI strict reset did work for starting the VM though.

@marmarek
Member

There is a strange issue related to some Logitech receivers: #1689 . I can confirm it indeed happens, but no idea why. I'd rather blame some kernel driver, not the device itself.

@nothingmuch

nothingmuch commented Oct 23, 2016

I am getting this error too, with pci_strictreset set to false, on a clean install of R3.2 on a Lenovo Yoga 12 which previously had R3.1 working with a usbvm. Disabling USB3 in the BIOS seemed to work; upgrading the BIOS as mentioned in this thread https://groups.google.com/forum/#!msg/qubes-users/Z6bEMZTjiz4/FbV6T-l_AQAJ did not seem to make a difference.

@Rudd-O
Author

Rudd-O commented Oct 23, 2016

The Logitech device issue is no longer a problem in modern kernels.

@nothingmuch

I'm seeing this with no external USB devices connected.

@marmarek
Member

Probably the well-known memory fragmentation issue: a PV VM with a PCI device needs a few megabytes of physically contiguous memory for DMA purposes. You can free some by taking it away from dom0: xl mem-set 0 <some-memory-size-in-MB>, where the size is smaller than the current one (for example 500MB smaller). If it does not help, try shutting down some VMs. If still nothing, reboot...

@Rudd-O
Author

Rudd-O commented Oct 28, 2016

Let me try. But if this works, this really should be documented somewhere!

@Rudd-O
Author

Rudd-O commented Oct 28, 2016

Nope, it did not work at all.

@marmarek
Member

Ok, let's try harder: touch /var/run/qubes/do-not-membalance. Then try xl mem-set and qvm-start again. And if it doesn't work, repeat (just one more time).
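
The sequence above can be sketched as a function that lists the dom0 commands in order, without running them. The step size (500 MB), the floor (1500 MB) and the two attempts are taken from the examples in this thread; the function name and parameters are illustrative, and the explicit "m" suffix on the xl mem-set size is a choice made here to avoid relying on the default unit.

```python
def retry_plan(vm_name, dom0_mb, step_mb=500, floor_mb=1500, attempts=2):
    """Return the workaround commands (argv lists) in the order to run them.

    dom0_mb is the current dom0 memory target in MB. The plan only lists
    commands; stop as soon as one qvm-start succeeds.
    """
    # Stop the memory balancer from undoing the manual dom0 shrink.
    plan = [["touch", "/var/run/qubes/do-not-membalance"]]
    target = dom0_mb
    for _ in range(attempts):
        # Shrink dom0 by one step, but never below the floor.
        target = max(floor_mb, target - step_mb)
        plan.append(["xl", "mem-set", "0", "%dm" % target])
        plan.append(["qvm-start", vm_name])
    return plan


for cmd in retry_plan("usbvm", dom0_mb=4000):
    print(" ".join(cmd))
```

The thread does not say when to remove the do-not-membalance file afterwards, so that step is left out of the sketch.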

@Rudd-O
Author

Rudd-O commented Oct 30, 2016

On 10/28/2016 10:40 PM, Marek Marczykowski-Górecki wrote:

Ok, lets try harder: |touch /var/run/qubes/do-not-membalance|. Then
try again |xl mem-set| and |qvm-start|. And if it doesn't work, repeat
(just one more time).

A reboot fixed it.

Rudd-O
http://rudd-o.com/

@xloem

xloem commented Jan 1, 2017

Same experience. I needed to touch /var/run/qubes/do-not-membalance to get xl mem-set to do anything at all. I kept dropping the dom0 RAM in 512MB increments, and qvm-start kept failing, until the system stopped responding. Then things worked after reboot.

Maybe some file to review to determine memory fragmentation, and where the VM memory is getting allocated, for next time? Or some way to determine what made the VM crash?

@andrewdavidwong
Member

This bug report has seen no activity in a very long time, and it is not assigned to any current release milestone. It looks like it was left open by mistake, so I'm closing it now. However, if anyone is still affected by this bug on a currently-supported release, please leave a comment, and we'll be happy to reopen this. Thank you.

@brendanhoar

brendanhoar commented May 15, 2022

R4.0 kernel-latest=5.17.7 current-testing:

Just ran into the "failed to get contiguous memory for dma from xen" in sys-net-dm after shutting down all networking VMs and trying to start them again.

Several retries failed.

I saved all my work, shut down everything but dom0 and sys-net started w/o issue.

Pretty sure this happened one other time recently as well, can't prove it though.

Next time it happens I'll try the xl mem-set 0 (smaller size) approach.

B

@xloem

xloem commented May 15, 2022

I'm no longer using Qubes, but it looks like a workable next step here would be to take the effort to find what device file (and possibly kernel parameters) display the physical memory mapping of the system. Then this can be reviewed on next occurrence to verify that the instance is memory fragmentation, see if shrinking dom0 resolves it, and possibly discern a minimum contiguous block needed by the device.

It's likely possible to remap memory to resolve this confidently, but might need implementation by a dev.

@andrewdavidwong andrewdavidwong added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels May 15, 2022
@andrewdavidwong andrewdavidwong modified the milestones: Release TBD, Release 4.0 updates May 15, 2022
@DemiMarie DemiMarie modified the milestones: Release 4.0 updates, Release TBD Mar 14, 2023
@DemiMarie DemiMarie added S: blocked Status: blocked. Work on this issue is currently blocked. diagnosed Technical diagnosis has been performed (see issue comments). and removed needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Mar 14, 2023
@andrewdavidwong andrewdavidwong modified the milestones: Release TBD, Release 4.1 updates Apr 6, 2023
@andrewdavidwong
Member

Just a reminder that, for bug reports, the milestone designates the earliest supported release in which the bug is known to exist, not when we plan to fix it.

@Rudd-O
Author

Rudd-O commented Apr 19, 2023

This is from 2015 and I have not been able to repro this since.

@andrewdavidwong
Member

This is from 2015 and I have not been able to repro this since.

Closing as "cannot reproduce" (we were unable to reproduce this issue). If anyone believes this is a mistake, or if anyone can reproduce the issue, please leave a comment, and we'll be happy to reopen this. Thank you.

@andrewdavidwong andrewdavidwong closed this as not planned Won't fix, can't repro, duplicate, stale Apr 19, 2023
@andrewdavidwong andrewdavidwong added R: cannot reproduce Resolution: Attempts to replicate the problem have not been reliably successful enough to proceed. and removed S: blocked Status: blocked. Work on this issue is currently blocked. labels Apr 19, 2023
@andrewdavidwong andrewdavidwong removed this from the Release 4.1 updates milestone Aug 25, 2023

8 participants