
Add support for running inside a VM with vfio-noiommu #272

Merged · 4 commits · Nov 18, 2020

Conversation

@atyronesmith (Contributor) commented on Oct 7, 2020

This PR is required for openshift/sriov-network-operator#369.

This PR enables the mapping of a no-iommu vfio device into a container.

The associated system configuration enables no-iommu mode for the vfio-pci driver.

This PR is required to allow Kubernetes to run in a virtual environment that does not provide a virtualized iommu. The initial target environment is OpenStack; OpenStack Nova does not currently provide a virtualized iommu. Support for vfio-noiommu was added to Linux a few years ago with the caveats outlined below:

```
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system. There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines. However, there are still those users
that want userspace drivers even under those conditions. The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has. In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver. This
should make it very clear that this mode is not safe. Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode. Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported. This patch includes no-iommu support for the vfio-pci bus
driver only.
```

The QEMU machine type currently used for OpenStack is i440fx. The machine type should be switched to Q35 in the future; the Q35 machine provides a virtualized iommu implementation and will make this patch obsolete.

In summary, the use of noiommu introduces security concerns, but the performance
boost provided by vfio-pci can be justified in many cases.
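For context, a minimal sketch of the host-side setup this relies on, assuming a kernel built with CONFIG_VFIO_NOIOMMU; the PCI address 0000:00:06.0 is only an example (it matches the device mentioned later in this thread), and exact steps vary by distribution:

```sh
# Enable the unsafe no-iommu mode on the vfio module (this taints the kernel).
modprobe vfio enable_unsafe_noiommu_mode=1
modprobe vfio-pci
# Alternatively, toggle the parameter at runtime before binding devices:
echo 1 > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode

# Bind the passthrough VF (example address) to vfio-pci.
echo vfio-pci > /sys/bus/pci/devices/0000:00:06.0/driver_override
echo 0000:00:06.0 > /sys/bus/pci/devices/0000:00:06.0/driver/unbind  # if currently bound
echo 0000:00:06.0 > /sys/bus/pci/drivers_probe

# With no-iommu the group node appears as /dev/vfio/noiommu-$GROUP
# instead of /dev/vfio/$GROUP.
ls /dev/vfio/
```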

@zshi-redhat (Collaborator) commented:

/lgtm

@zshi-redhat (Collaborator) commented:

/cc @ahalim-intel @adrianchiris @martinkennelly @moshe010 PTAL

@adrianchiris (Contributor) left a comment:

Overall the code looks OK to me. Small comment posted.

@atyronesmith changed the title from "WIP - Add support for running inside a VM with vfio-noiommu" to "Add support for running inside a VM with vfio-noiommu" on Oct 23, 2020
@moshe010 (Contributor) left a comment:

1. Can you update the README for the VM with vfio-noiommu case?
2. According to your PR description, the pod should have SYS_RAWIO in its capabilities or be privileged. Maybe you can add examples like we have for dpdk or rdma, https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/tree/master/docs/dpdk (see the sketch after this list).
3. It is not explicitly stated in the PR, but my understanding is that the use case is running a DPDK pod in the VM. Am I right? If that is the case, I think we need to update the docs/dpdk/README file with that use case.
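As a concrete illustration of point 2, a sketch of what such an example might look like, assuming CAP_IPC_LOCK and CAP_SYS_RAWIO are sufficient (whether a fully privileged pod is needed instead is raised later in this thread); the image and resource name are placeholders reused from elsewhere in the conversation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: testpmd
spec:
  containers:
  - name: testpmd
    image: <DPDK testpmd image>   # placeholder image, as in the docs under review
    securityContext:
      capabilities:
        add: ["IPC_LOCK", "SYS_RAWIO"]   # capabilities under discussion; privileged: true may be required instead
    resources:
      requests:
        intel.com/intel_sriov_netdevice: '1'
      limits:
        intel.com/intel_sriov_netdevice: '1'
```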

@martinkennelly (Member) commented:

I second the three points made by @moshe010, especially the addition of documentation. The PR looks good. I am currently testing it locally. Thank you for this.

Commits pushed to the PR:

- Fixed link path
- Check for missing name
- fix name missing
- fix err name
- Add dual return
- Additional Documentation
@atyronesmith (Contributor, Author) commented:

@moshe010 @martinkennelly Docs pushed.


With `vfio-pci`, an application must run in a privileged Pod with the **IPC_LOCK** and **CAP_SYS_RAWIO** capabilities.

# Example deployment
Comment (Member):

Since this feature does not work with the SR-IOV CNI, can you add an example CNI to use and state that this feature doesn't work with the SR-IOV CNI? Thank you.

Comment (Member):

I tried with host-device, but that also failed. I am getting:

```
failed to find host device: no net directory under pci device 0000:00:06.0
```

Is there a patch outstanding on a CNI to fix this for vfio-pci devices?

Comment (Contributor):

So at this point no CNI is expected to be used, if I understood correctly.

@martinkennelly (Member) commented:

LGTM overall, besides nits and the CNI question.

@atyronesmith (Contributor, Author) commented:

@martinkennelly Apologies, I forgot to document the SR-IOV CNI situation. I will include an example that I have been using for testing. Basically, just remove the annotations...

```yaml
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net1
```

and keep the resources...

```yaml
    resources:
      requests:
        intel.com/intel_sriov_netdevice: '1'
      limits:
        intel.com/intel_sriov_netdevice: '1'
```

unchanged.


Excerpt from the docs under review:

# Pod Usage

Unfortunately, the SR-IOV CNI is not compatible with the noiommu feature. To use the noiommu feature, remove the […]
Comment (Contributor):

This section kind of hints that the SR-IOV CNI should support this, which I am not sure is the case.

Suggestion: just have the pod example, then a note:

> Note: When consuming a VFIO device in a virtual environment, a secondary network is not required, as network configuration for the underlying VF should be performed at the hypervisor level.

Excerpt from README.md under review (outdated):

### Virtual environments with no iommu
Comment (Contributor):

This seems to me a rather specialized use case; IMO we should move this section under docs/dpdk/README-virt.

We can have a general description here and point to docs/dpdk/README-virt. We should also explain that not all selectors are applicable, e.g.:

> **Virtual environments**
>
> The SR-IOV network device plugin supports running in a virtualized environment; however, not all device selectors are applicable, as the VFs are passed through to the VM without any association to their respective PF. Hence any device selector that relies on the association between a VF and its PF will not work.
>
> The following selectors will not work in a virtualized environment: pfNames, rootDevices.
>
> **Virtual environments with no iommu**
>
> The SR-IOV network device plugin supports allocating VFIO devices in a virtualized environment without a virtualized iommu. For more information refer to {add link}.
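To make the selector point concrete, a sketch of a resource definition that avoids the PF-based selectors and should therefore be usable inside a VM; the vendor/device IDs are illustrative only, and the ConfigMap name and JSON layout follow the plugin's usual example format:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [
        {
          "resourceName": "intel_sriov_netdevice",
          "selectors": {
            "vendors": ["8086"],
            "devices": ["154c"],
            "drivers": ["vfio-pci"]
          }
        }
      ]
    }
```

Note that no pfNames or rootDevices selectors appear here, since inside the VM the VF has no visible PF to match against.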

Comment (Contributor):

P.S. I realize we are missing some documentation on consuming SR-IOV resources in a virtualized environment in general; however, that's not related to this PR.

Excerpt from the pod example under review:

```yaml
  - name: testpmd
    image: <DPDK testpmd image>
    securityContext:
      privileged: true
```

Comment (Contributor):

So in the example pod, should it be privileged, or will it generally work with CAP_IPC_LOCK and CAP_SYS_RAWIO? I'd prefer to list the minimum capabilities.


Excerpt from the docs under review:

# Usage

_When consuming a VFIO device in a virtual environment, a secondary network is not required as network configuration for the underlying VF should be performed at the hypervisor level._

Comment (Contributor):

Nit: the formatting here should use a > to emphasize it's a note/comment, similarly to how it is done in other parts of the docs, e.g. L40 in this file:

> When consuming a VFIO device in a virtual environment, a secondary network is not required as network configuration for the underlying VF should be performed at the hypervisor level

@zshi-redhat (Collaborator) commented:

@adrianchiris could you take a look at the new doc update?

@adrianchiris (Contributor) left a comment:

LGTM, thank you!

@martinkennelly (Member) left a comment:

LGTM! Thanks!
