Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] support for resetting virtio devices #3074

Open
3 tasks done
cperciva opened this issue Jul 31, 2022 · 5 comments · May be fixed by #4389
Open
3 tasks done

[Feature Request] support for resetting virtio devices #3074

cperciva opened this issue Jul 31, 2022 · 5 comments · May be fixed by #4389
Labels
Priority: Low Indicates that an issue or pull request should be resolved behind issues or pull requests labelled ` Status: Parked Indicates that an issues or pull request will be revisited later

Comments

@cperciva
Copy link
Contributor

Feature Request

During the FreeBSD boot process, it resets network devices. (Why? Good question.) When running in Firecrackers, this means resetting the vtnet device.

Describe the desired solution

When a status of 0 is written to the virtio device, it should reset.

It looks like the MMIO code in Firecracker gets this right, but it calls a _reset routine for the device and none of the devices have implemented it. As a result, the device is currently entering FAILED state instead.

Describe possible alternatives

FreeBSD's virtio driver doesn't actually notice that the device is in FAILED state, and Firecracker's devices continue to work despite being marked as FAILED -- so everything ends up working except for Firecracker logging warnings of ack virtio features in invalid state 0x8f and update virtio queue in invalid state 0x8f. But this (arguably buggy) behaviour of ignoring that the device has failed could change in the future, so it would be good to avoid having it marked as failed in the first place.

Additional context

Checks

  • Have you searched the Firecracker Issues database for similar requests?
  • Have you read all the existing relevant Firecracker documentation?
  • Have you read and understood Firecracker's core tenets?
@alsrdn
Copy link

alsrdn commented Aug 8, 2022

@cperciva can you share your kernel and rootfs so we can reproduce the issue?

@cperciva
Copy link
Contributor Author

cperciva commented Aug 8, 2022

@alsrdn Happy to share, but FreeBSD relies on PVH booting (see #3041) so it doesn't work on mainline Firecracker yet. Unless you want to work on virtio reset urgently I'd suggest waiting until I can get PVH support merged to Firecracker (which is next on my to-do list after I finish the outstanding issues on FreeBSD's side).

@alsrdn
Copy link

alsrdn commented Aug 9, 2022

Ah yes, that makes sense. Better to wait then.

@mattschlebusch mattschlebusch added the Priority: Low Indicates that an issue or pull request should be resolved behind issues or pull requests labelled ` label May 9, 2023
@pb8o
Copy link
Contributor

pb8o commented Sep 18, 2023

Waiting on #4073

@iMilnb
Copy link

iMilnb commented Jan 11, 2024

I can confirm the same behavior occurs also in NetBSD when bringing up the VirtIO NIC.

@acj acj linked a pull request Jan 22, 2024 that will close this issue
9 tasks
roypat added a commit to roypat/firecracker that referenced this issue Jul 3, 2024
When the guest driver sets a virtio devices status to `DRIVER_OK`, we
proceed with calling `VirtioDevice::activate`. However, our MMIO
transport layer assumes that this activation cannot go wrong, and calls
`.expect(...)` on the result. For most devices, this is fine, as the
activate method doesn't do much besides writing to an event_fd (and I
can't think of a scenario in which this could go wrong). However, our
vhost-user-blk device has some non-trivial logic inside of its
`activate` method, which includes communication with the
vhost-user-backend via a unix socket. If this unix socket gets closed
early, this causes `activate` to return an error, and thus consequently
a panic in the MMIO code.

The virtio spec, in Section 2.2, has the following to say [1]:

> The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
  that a reset is needed. If DRIVER_OK is set, after it sets
  DEVICE_NEEDS_RESET, the device MUST send a device configuration
  change notification to the driver.

So the spec-conform way of handling an activation error is setting
the `DEVICE_NEEDS_RESET` flag in the device_status field (which is what
this commit does).

This will fix the panic, however it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable). This is because Firecracker does not actually
implement device reset, see also firecracker-microvm#3074. Thus, the device will simply be
"dead" to the guest. But at least Firecracker won't crash anymore.

[1]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.pdf

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
roypat added a commit to roypat/firecracker that referenced this issue Jul 3, 2024
When the guest driver sets a virtio devices status to `DRIVER_OK`, we
proceed with calling `VirtioDevice::activate`. However, our MMIO
transport layer assumes that this activation cannot go wrong, and calls
`.expect(...)` on the result. For most devices, this is fine, as the
activate method doesn't do much besides writing to an event_fd (and I
can't think of a scenario in which this could go wrong). However, our
vhost-user-blk device has some non-trivial logic inside of its
`activate` method, which includes communication with the
vhost-user-backend via a unix socket. If this unix socket gets closed
early, this causes `activate` to return an error, and thus consequently
a panic in the MMIO code.

The virtio spec, in Section 2.2, has the following to say [1]:

> The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
  that a reset is needed. If DRIVER_OK is set, after it sets
  DEVICE_NEEDS_RESET, the device MUST send a device configuration
  change notification to the driver.

So the spec-conform way of handling an activation error is setting
the `DEVICE_NEEDS_RESET` flag in the device_status field (which is what
this commit does).

This will fix the panic, however it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable). This is because Firecracker does not actually
implement device reset, see also firecracker-microvm#3074. Thus, the device will simply be
"dead" to the guest: All operations where we currently simply abort and
do nothing if the device is in the FAILED state will do the same in the
DEVICE_NEEDS_RESET state.

[1]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.pdf

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
roypat added a commit that referenced this issue Jul 4, 2024
When the guest driver sets a virtio devices status to `DRIVER_OK`, we
proceed with calling `VirtioDevice::activate`. However, our MMIO
transport layer assumes that this activation cannot go wrong, and calls
`.expect(...)` on the result. For most devices, this is fine, as the
activate method doesn't do much besides writing to an event_fd (and I
can't think of a scenario in which this could go wrong). However, our
vhost-user-blk device has some non-trivial logic inside of its
`activate` method, which includes communication with the
vhost-user-backend via a unix socket. If this unix socket gets closed
early, this causes `activate` to return an error, and thus consequently
a panic in the MMIO code.

The virtio spec, in Section 2.2, has the following to say [1]:

> The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
  that a reset is needed. If DRIVER_OK is set, after it sets
  DEVICE_NEEDS_RESET, the device MUST send a device configuration
  change notification to the driver.

So the spec-conform way of handling an activation error is setting
the `DEVICE_NEEDS_RESET` flag in the device_status field (which is what
this commit does).

This will fix the panic, however it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable). This is because Firecracker does not actually
implement device reset, see also #3074. Thus, the device will simply be
"dead" to the guest: All operations where we currently simply abort and
do nothing if the device is in the FAILED state will do the same in the
DEVICE_NEEDS_RESET state.

[1]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.pdf

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
ghost pushed a commit to JamesC1305/firecracker that referenced this issue Aug 14, 2024
When the guest driver sets a virtio devices status to `DRIVER_OK`, we
proceed with calling `VirtioDevice::activate`. However, our MMIO
transport layer assumes that this activation cannot go wrong, and calls
`.expect(...)` on the result. For most devices, this is fine, as the
activate method doesn't do much besides writing to an event_fd (and I
can't think of a scenario in which this could go wrong). However, our
vhost-user-blk device has some non-trivial logic inside of its
`activate` method, which includes communication with the
vhost-user-backend via a unix socket. If this unix socket gets closed
early, this causes `activate` to return an error, and thus consequently
a panic in the MMIO code.

The virtio spec, in Section 2.2, has the following to say [1]:

> The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
  that a reset is needed. If DRIVER_OK is set, after it sets
  DEVICE_NEEDS_RESET, the device MUST send a device configuration
  change notification to the driver.

So the spec-conform way of handling an activation error is setting
the `DEVICE_NEEDS_RESET` flag in the device_status field (which is what
this commit does).

This will fix the panic, however it will most certainly still not result
in correct device operations (e.g. a vhost-user-backend dying will still
be unrecoverable). This is because Firecracker does not actually
implement device reset, see also firecracker-microvm#3074. Thus, the device will simply be
"dead" to the guest: All operations where we currently simply abort and
do nothing if the device is in the FAILED state will do the same in the
DEVICE_NEEDS_RESET state.

[1]: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.pdf

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
@ShadowCurse ShadowCurse added the Status: Parked Indicates that an issues or pull request will be revisited later label Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: Low Indicates that an issue or pull request should be resolved behind issues or pull requests labelled ` Status: Parked Indicates that an issues or pull request will be revisited later
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants