PVM guest kernel panic after restore from snapshot #2
Comments
Thank you for your report! I apologize for the delayed response. I am currently working on resolving a similar migration issue in QEMU. Based on the vmm.log, it appears that the restoration of MSR_PVM_VCPU_STRUCT has encountered a failure.
The following diff stores the MSR value even when kvm_gpc_activate() fails, so the GPC can be refreshed once the user memory region is added:

diff --git a/arch/x86/kvm/pvm/pvm.c b/arch/x86/kvm/pvm/pvm.c
index d76f731d0b0d..f71290816e5f 100644
--- a/arch/x86/kvm/pvm/pvm.c
+++ b/arch/x86/kvm/pvm/pvm.c
@@ -1149,12 +1149,21 @@ static int pvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_PVM_VCPU_STRUCT:
 		if (!PAGE_ALIGNED(data))
 			return 1;
+		/*
+		 * During the VM restore process, if the VMM restores MSRs
+		 * before adding the user memory region, it can result in a
+		 * failure in kvm_gpc_activate() because no memslot has been
+		 * added yet. As a consequence, the VM will panic after the VM
+		 * restore since the GPC is not active. However, if we store
+		 * the value even if kvm_gpc_activate() fails later when the
+		 * GPC is active, it can be refreshed by the addition of the
+		 * user memory region before the VM entry.
+		 */
+		pvm->msr_vcpu_struct = data;
 		if (!data)
 			kvm_gpc_deactivate(&pvm->pvcs_gpc);
 		else if (kvm_gpc_activate(&pvm->pvcs_gpc, data, PAGE_SIZE))
 			return 1;
-
-		pvm->msr_vcpu_struct = data;
 		break;
 	case MSR_PVM_SUPERVISOR_RSP:

Similar to the failed … However, I suspect that this issue may be specific to Cloud Hypervisor rather than KVM. In the case of the failed …
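For reference, here is a minimal userspace sketch of the restore ordering described in the comment above, written against the raw KVM ioctl API. It is not Cloud Hypervisor's actual restore code, and the MSR index and value below are placeholders rather than the real MSR_PVM_VCPU_STRUCT definition; the point is only that the saved MSRs are replayed before any user memory region exists, so kvm_gpc_activate() has no memslot to work with.

/*
 * Sketch of the problematic restore ordering at the KVM ioctl level
 * (error handling trimmed; MSR index and value are placeholders).
 */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);
	int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);

	/* Step 1: replay the saved MSR state before any memslot exists. */
	struct kvm_msrs *msrs = calloc(1, sizeof(*msrs) + sizeof(struct kvm_msr_entry));
	msrs->nmsrs = 1;
	msrs->entries[0].index = 0;	/* placeholder for MSR_PVM_VCPU_STRUCT */
	msrs->entries[0].data = 0x1000;	/* saved page-aligned GPA of the PVCS */
	ioctl(vcpu, KVM_SET_MSRS, msrs);
	/*
	 * Without the patch, KVM rejects this write because kvm_gpc_activate()
	 * finds no memslot; with the patch the value is stored and the GPC is
	 * refreshed once the memslot below is added.
	 */

	/* Step 2: only now register the guest memory region. */
	void *mem = mmap(NULL, 1 << 21, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.guest_phys_addr = 0,
		.memory_size = 1 << 21,
		.userspace_addr = (unsigned long)mem,
	};
	ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);
	return 0;
}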
We've gotten snapshot/resume to work in the @loopholelabs Firecracker fork with a simple patch that includes the PVM MSRs in the snapshot - it should be the same procedure for other hypervisors, esp. Cloud Hypervisor: loopholelabs/firecracker@e266c12
Hi @pojntfx, thank you for trying and testing PVM with Firecracker. Yes, we encountered the same issue with QEMU snapshots: it doesn't save or restore the PVM MSRs. However, Cloud Hypervisor does the right thing by using the supported MSR index list acquired from KVM to perform the MSR save/restore. I believe this is the correct approach for VMMs: utilize the information provided by KVM.
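For illustration, below is a hedged sketch of that approach against the raw KVM API: ask KVM for its supported MSR index list and read all of those MSRs from the vCPU, so newly added MSRs such as the PVM ones are picked up without hard-coding. The kvm_fd and vcpu_fd are assumed to come from the usual open("/dev/kvm"), KVM_CREATE_VM, KVM_CREATE_VCPU sequence, error handling is trimmed, and this is not Cloud Hypervisor's actual implementation.

/*
 * Sketch: save every MSR that KVM itself reports as supported
 * (KVM_GET_MSR_INDEX_LIST), so PVM MSRs are included automatically.
 */
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static struct kvm_msrs *save_all_msrs(int kvm_fd, int vcpu_fd)
{
	/* Probe the number of indices, then fetch the full list. */
	struct kvm_msr_list probe = { .nmsrs = 0 };
	ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, &probe);	/* fails with E2BIG and fills in nmsrs */

	struct kvm_msr_list *list =
		calloc(1, sizeof(*list) + probe.nmsrs * sizeof(__u32));
	list->nmsrs = probe.nmsrs;
	ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, list);

	/*
	 * Read every reported MSR from the vCPU; this is the state a VMM
	 * would serialize into its snapshot.
	 */
	struct kvm_msrs *state =
		calloc(1, sizeof(*state) + list->nmsrs * sizeof(struct kvm_msr_entry));
	state->nmsrs = list->nmsrs;
	for (__u32 i = 0; i < list->nmsrs; i++)
		state->entries[i].index = list->indices[i];
	ioctl(vcpu_fd, KVM_GET_MSRS, state);

	free(list);
	return state;
}

On restore, the same entries would be replayed with KVM_SET_MSRS, which is roughly what the loopholelabs Firecracker patch above does for the hard-coded PVM MSRs.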
@bysui Thanks for the fix! I have verified it in my test environment: after updating the host kernel with the pvm-fix branch, I could restore the VM successfully. Sorry for missing the failure message in the cloud-hypervisor log file :)
…onally

During the VM restore process, if the VMM (e.g., Cloud Hypervisor) restores MSRs before adding the user memory region, it can result in a failure in kvm_gpc_activate() because no memslot has been added yet. As a consequence, the VM will panic after the VM restore since the GPC is not active. However, if we store the value even if kvm_gpc_activate() fails later when the GPC is active, it can be refreshed by the addition of the user memory region before the VM entry.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: #2
I just re-tested this issue with 6a03a61 on Firecracker. Without our patch, Firecracker still fails with the same error (since it doesn't store the MSRs). Would the clean fix for this be to port Cloud Hypervisor's MSR handling logic to Firecracker instead of hard-coding these additional MSRs to save like we currently do?
Yes, I believe this is the reasonable long-term approach in case new MSRs are added in the future. It is the responsibility of the VMM to use the information provided by KVM to do the right thing.
Description
Cloud Hypervisor supports taking a snapshot of a running virtual machine and then restoring the virtual machine from the snapshot file. We tried this feature with a virtual machine running a PVM guest kernel on a PVM host kernel.
After we restore the virtual machine from the snapshot file, the virtual machine panics with the following exception.
[ 36.865628] kernel BUG at fs/buffer.c:1309!
[ 36.866930] invalid opcode: 0000 [#1] SMP PTI
[ 36.868247] CPU: 0 PID: 274 Comm: systemd-rc-loca Not tainted 6.7.0-rc6-virt-pvm-guest+ #55
[ 36.870556] Hardware name: Cloud Hypervisor cloud-hypervisor, BIOS 0
[ 36.872259] RIP: 0010:__find_get_block+0x1f2/0x2c0
[ 36.873659] Code: 5f c3 31 db e8 ff 7e 62 00 90 48 85 db 0f 84 6f fe ff ff 48 8b 7b 10 e8 9c 74 f3 ff 48 89 d8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 4c 89 ff e8 b4 59 f3 ff e9 ee fe ff ff 3e ff 43 60 e9 d2 fe
[ 36.878668] RSP: 0018:ffffd2000084fa78 EFLAGS: 00010046
[ 36.880195] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000108c48
[ 36.882268] RDX: 0000000000001000 RSI: 000000000000402c RDI: ffffc9800308e580
[ 36.884237] RBP: ffffc9800308e580 R08: 0000000000004021 R09: 0000000000105cfb
[ 36.886105] R10: ffffd2000084fab8 R11: 0000000000000000 R12: 0000000000000000
[ 36.887960] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 36.889867] FS: 00007f1f91c3f900(0000) GS:fffff0003ec00000(0000) knlGS:0000000000000000
[ 36.891976] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 36.893598] CR2: 000055c387859065 CR3: 00000000041c8003 CR4: 0000000000070ef0
[ 36.895722] Call Trace:
[ 36.896521]
[ 36.897192] ? __die_body+0x15/0x50
[ 36.898212] ? die+0x33/0x50
[ 36.899034] ? do_trap+0x100/0x110
[ 36.900007] ? __find_get_block+0x1f2/0x2c0
[ 36.901217] ? do_error_trap+0x65/0x80
[ 36.902302] ? __find_get_block+0x1f2/0x2c0
[ 36.903483] ? exc_invalid_op+0x49/0x60
[ 36.904621] ? __find_get_block+0x1f2/0x2c0
[ 36.905741] ? pvm_kernel_exception_entry+0x4b/0x100
[ 36.907046] ? __find_get_block+0x1f2/0x2c0
[ 36.908215] ? ext4_es_lookup_extent+0x101/0x150
[ 36.909450] ? __find_get_block+0xf/0x2c0
[ 36.910525] bdev_getblk+0x20/0x220
[ 36.911551] ext4_getblk+0xc2/0x2c0
[ 36.912612] ext4_bread_batch+0x4b/0x150
[ 36.913776] __ext4_find_entry+0x150/0x420
[ 36.914995] ? __d_alloc+0x11c/0x1c0
[ 36.916009] ? d_alloc_parallel+0xab/0x360
[ 36.917146] ext4_lookup+0x7d/0x1d0
[ 36.918119] __lookup_slow+0x8a/0x130
[ 36.919245] ? __legitimize_path.isra.46+0x27/0x60
[ 36.920616] walk_component+0x7e/0x160
[ 36.921686] path_lookupat.isra.53+0x62/0x130
[ 36.922943] filename_lookup.part.71+0xbe/0x180
[ 36.924297] ? strncpy_from_user+0x96/0x110
[ 36.925544] user_path_at_empty+0x4c/0x50
[ 36.926715] do_faccessat+0xf1/0x2f0
[ 36.927765] do_syscall_64+0x4d/0xf0
[ 36.928889] entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 36.930371] RIP: 0033:0x7f1f9263a94b
[ 36.931399] Code: 77 05 c3 0f 1f 40 00 48 8b 15 e1 54 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa b8 15 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 b1 54 10 00 f7 d8
[ 36.936338] RSP: 002b:00007ffd6caff888 EFLAGS: 00000202 ORIG_RAX: 0000000000000015
[ 36.938320] RAX: ffffffffffffffda RBX: 000055c387859065 RCX: 00007f1f9263a94b
[ 36.940191] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000055c387859065
[ 36.942067] RBP: 00007ffd6caff9b8 R08: 000055c3897c54d0 R09: 00000000ffffffff
[ 36.944034] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffd6caff9b8
[ 36.946046] R13: 000055c387858240 R14: 000055c38785ad18 R15: 00007f1f92a48040
[ 36.948088]
[ 36.948759] ---[ end trace 0000000000000000 ]---
[ 36.950000] RIP: 0010:__find_get_block+0x1f2/0x2c0
[ 36.951291] Code: 5f c3 31 db e8 ff 7e 62 00 90 48 85 db 0f 84 6f fe ff ff 48 8b 7b 10 e8 9c 74 f3 ff 48 89 d8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 4c 89 ff e8 b4 59 f3 ff e9 ee fe ff ff 3e ff 43 60 e9 d2 fe
[ 36.956169] RSP: 0018:ffffd2000084fa78 EFLAGS: 00010046
[ 36.957689] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000108c48
[ 36.959772] RDX: 0000000000001000 RSI: 000000000000402c RDI: ffffc9800308e580
[ 36.961827] RBP: ffffc9800308e580 R08: 0000000000004021 R09: 0000000000105cfb
[ 36.963903] R10: ffffd2000084fab8 R11: 0000000000000000 R12: 0000000000000000
[ 36.965982] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 36.968078] FS: 00007f1f91c3f900(0000) GS:fffff0003ec00000(0000) knlGS:0000000000000000
[ 36.970375] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 36.972021] CR2: 000055c387859065 CR3: 00000000041c8003 CR4: 0000000000070ef0
[ 36.974080] note: systemd-rc-loca[274] exited with irqs disabled
[ 36.999660] ------------[ cut here ]------------
[ 37.000909] kernel BUG at fs/buffer.c:1309!
[ 37.002140] invalid opcode: 0000 [#2] SMP PTI
[ 37.003425] CPU: 0 PID: 276 Comm: systemd-system- Tainted: G D 6.7.0-rc6-virt-pvm-guest+ #55
[ 37.006174] Hardware name: Cloud Hypervisor cloud-hypervisor, BIOS 0
[ 37.008054] RIP: 0010:__find_get_block+0x1f2/0x2c0
[ 37.009464] Code: 5f c3 31 db e8 ff 7e 62 00 90 48 85 db 0f 84 6f fe ff ff 48 8b 7b 10 e8 9c 74 f3 ff 48 89 d8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 4c 89 ff e8 b4 59 f3 ff e9 ee fe ff ff 3e ff 43 60 e9 d2 fe
[ 37.014586] RSP: 0018:ffffd2000085fa78 EFLAGS: 00010046
[ 37.016040] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000108c48
[ 37.018085] RDX: 0000000000001000 RSI: 0000000000004021 RDI: ffffc9800308e580
[ 37.020095] RBP: ffffc9800308e580 R08: 0000000000004021 R09: 0000000000105cfb
[ 37.022142] R10: ffffc98006ccc6c0 R11: 0000000000000005 R12: 0000000000000000
[ 37.024102] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 37.026007] FS: 00007faf42db1900(0000) GS:fffff0003ec00000(0000) knlGS:0000000000000000
[ 37.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 37.029963] CR2: 000055a2ec223000 CR3: 00000000070c4001 CR4: 0000000000070ef0
[ 37.032008] Call Trace:
[ 37.032772]
[ 37.033443] ? __die_body+0x15/0x50
[ 37.034477] ? die+0x33/0x50
[ 37.035376] ? do_trap+0x100/0x110
[ 37.036339] ? __find_get_block+0x1f2/0x2c0
[ 37.037559] ? do_error_trap+0x65/0x80
[ 37.038631] ? __find_get_block+0x1f2/0x2c0
[ 37.039871] ? exc_invalid_op+0x49/0x60
[ 37.040949] ? __find_get_block+0x1f2/0x2c0
[ 37.042109] ? pvm_kernel_exception_entry+0x4b/0x100
[ 37.043444] ? __find_get_block+0x1f2/0x2c0
[ 37.044622] ? ext4_es_lookup_extent+0x101/0x150
[ 37.046005] ? __find_get_block+0xf/0x2c0
[ 37.047195] bdev_getblk+0x20/0x220
[ 37.048154] ext4_getblk+0xc2/0x2c0
[ 37.049200] ext4_bread_batch+0x4b/0x150
[ 37.050279] __ext4_find_entry+0x150/0x420
[ 37.051403] ? __d_alloc+0x11c/0x1c0
[ 37.052397] ? d_alloc_parallel+0xab/0x360
[ 37.053540] ext4_lookup+0x7d/0x1d0
[ 37.054579] __lookup_slow+0x8a/0x130
[ 37.055688] ? __legitimize_path.isra.46+0x27/0x60
[ 37.057056] walk_component+0x7e/0x160
[ 37.058181] path_lookupat.isra.53+0x62/0x130
[ 37.059414] filename_lookup.part.71+0xbe/0x180
[ 37.060693] ? strncpy_from_user+0x96/0x110
[ 37.061889] user_path_at_empty+0x4c/0x50
[ 37.063081] do_faccessat+0xf1/0x2f0
[ 37.064191] do_syscall_64+0x4d/0xf0
[ 37.065208] entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 37.066659] RIP: 0033:0x7faf437acaf4
[ 37.067715] Code: 89 cd 41 54 41 89 d4 55 53 48 81 ec a8 00 00 00 64 48 8b 04 25 28 00 00 00 48 89 84 24 98 00 00 00 31 c0 b8 b7 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 40 01 00 00 41 89 c0 85 c0 0f 84 15 01 00
[ 37.072853] RSP: 002b:00007ffe493719a0 EFLAGS: 00000246 ORIG_RAX: 00000000000001b7
[ 37.074871] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007faf437acaf4
[ 37.076815] RDX: 0000000000000000 RSI: 000055a2ec223000 RDI: 00000000ffffff9c
[ 37.078732] RBP: 00007ffe49371aa0 R08: 000055a2ede694d0 R09: 00000000ffffffff
[ 37.080761] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000000
[ 37.082720] R13: 0000000000000100 R14: 000055a2ec224d08 R15: 00007faf43bba040
[ 37.084608]
[ 37.085240] ---[ end trace 0000000000000000 ]---
[ 37.086555] RIP: 0010:__find_get_block+0x1f2/0x2c0
[ 37.087948] Code: 5f c3 31 db e8 ff 7e 62 00 90 48 85 db 0f 84 6f fe ff ff 48 8b 7b 10 e8 9c 74 f3 ff 48 89 d8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 4c 89 ff e8 b4 59 f3 ff e9 ee fe ff ff 3e ff 43 60 e9 d2 fe
[ 37.093082] RSP: 0018:ffffd2000084fa78 EFLAGS: 00010046
[ 37.094588] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000108c48
[ 37.096597] RDX: 0000000000001000 RSI: 000000000000402c RDI: ffffc9800308e580
[ 37.098620] RBP: ffffc9800308e580 R08: 0000000000004021 R09: 0000000000105cfb
[ 37.100562] R10: ffffd2000084fab8 R11: 0000000000000000 R12: 0000000000000000
[ 37.102429] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 37.104331] FS: 00007faf42db1900(0000) GS:fffff0003ec00000(0000) knlGS:0000000000000000
[ 37.106629] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 37.108329] CR2: 000055a2ec223000 CR3: 00000000070c4001 CR4: 0000000000070ef0
[ 37.110371] note: systemd-system-[276] exited with irqs disabled
Steps to reproduce
Build the PVM host kernel and PVM guest kernel following pvm-get-started-with-kata.md
Guest VM resources from the guide:
cloud-hypervisor v37
VM image from the guide
Start VM and Create snapshot
#Start VM
cloud-hypervisor.v37 \
--api-socket ch.sock \
--log-file vmm.log \
--cpus boot=1,max_phys_bits=43 \
--kernel vmlinux.virt-pvm-guest \
--cmdline 'console=ttyS0 root=/dev/vda1 rw clocksource=kvm-clock pti=off' \
--memory size=1G,hugepages=off,shared=false,prefault=off \
--disk id=disk_0,path=ubuntu-22.04-pvm-kata.raw \
-v --console off --serial tty
#Pause VM in another shell
curl --unix-socket ch.sock -i -X PUT 'http://localhost/api/v1/vm.pause'
#Generate snapshot
curl --unix-socket ch.sock -i \
-X PUT 'http://localhost/api/v1/vm.snapshot' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"destination_url":"file:///tmp/snapshot"
}'
#Shutdown the paused VM
curl --unix-socket ch.sock -i -X PUT 'http://localhost/api/v1/vmm.shutdown'
Restore VM
#Restore VM from snapshot
cloud-hypervisor.v37 \
--api-socket ch.sock \
--restore source_url=file:///tmp/snapshot \
-v --log-file vmm.log
#Resume the VM in another shell
curl --unix-socket ch.sock -i -X PUT 'http://localhost/api/v1/vm.resume'