
added support for multiple resource endpoints #26

Merged
ahalimx86 merged 5 commits into master from dev/multiendpoints on Nov 9, 2018

Conversation

ahalimx86 (Collaborator):

These changes add the following new features to sriov-device-plugin:

- Handles SR-IOV capable and non-capable devices (NICs and accelerators alike)
- Supports devices with both kernel and userspace (UIO and VFIO) drivers
- Supports PFs bound to a DPDK driver to meet certain use cases
- Allows grouping multiple PCI devices into one aggregated resource pool
- Can represent each PF as a separately addressable resource pool to K8s
- User-configurable resourceName
- Detects kubelet restarts and automatically re-registers
- Detects link status (for Linux network devices) and updates the health of the associated VFs accordingly
- Extensible to support new device types with minimal effort, if not already supported

Change-Id: I314e1ea75e647e747bcfdb2f447e593adbcd14c5
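To make the idea of multiple, separately addressable resource pools concrete, here is a small hypothetical Go sketch. The type and field names (ResourceConfig, DeviceType, RootDevices, resourcePool) are assumptions loosely based on the snippets quoted later in this review, not the plugin's actual API.

```go
package main

import "fmt"

// ResourceConfig describes one user-defined resource pool (field names assumed).
type ResourceConfig struct {
	ResourceName string   // e.g. "sriov_net_a", advertised to kubelet with a vendor prefix
	DeviceType   string   // "vfio", "uio" or "netdevice", as in the factory switch quoted below
	RootDevices  []string // PF PCI addresses whose VFs form this pool (assumed field)
}

// resourcePool groups the devices discovered for one ResourceConfig.
type resourcePool struct {
	config  *ResourceConfig
	devices map[string]string // VF PCI address -> health ("Healthy"/"Unhealthy")
}

func newResourcePool(rc *ResourceConfig) *resourcePool {
	return &resourcePool{config: rc, devices: map[string]string{}}
}

func main() {
	// Two pools: kernel-driver VFs from one PF, vfio-bound VFs from another.
	configs := []ResourceConfig{
		{ResourceName: "sriov_net_a", DeviceType: "netdevice", RootDevices: []string{"0000:02:00.0"}},
		{ResourceName: "sriov_dpdk_b", DeviceType: "vfio", RootDevices: []string{"0000:02:00.1"}},
	}
	for i := range configs {
		p := newResourcePool(&configs[i])
		// Each pool would be registered with kubelet as its own device-plugin endpoint.
		fmt.Printf("pool %q (type %s) -> separate endpoint\n", p.config.ResourceName, p.config.DeviceType)
	}
}
```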

rkamudhan (Member):

@zshi-redhat @tomo @dougbtv @fepan Please review the multiple-endpoint features. It is a really good implementation by @ahalim-intel.

zshi-redhat (Collaborator) left a comment:

Thanks for the PR, Abdul!
Adding a few inline comments from reviewing the code; I may come back and add more after testing.

readOnly: true
- name: log
Collaborator:

Why is this removed?

Collaborator (Author):

The pod spec was changed during testing. Will add it back.

deployments/kernel-net-demo/pod-tc1.yaml (two outdated review threads, resolved)
return
}
default:
_, err := os.Lstat(sockPath)
Collaborator:

Shall we consider moving to kubelet plugin watcher mode instead of watching the sriov endpoint file?

Collaborator (Author):

Definitely! This will reduce the burden of watching the kubelet. Let's see if we can merge your kubelet plugin watcher PR with this implementation; this was implemented long before kubelet plugin watcher support existed.
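For context, here is a minimal sketch of the socket-watching approach this thread discusses eventually replacing with the kubelet plugin watcher: watch the kubelet device-plugin socket and re-register when it is recreated. The paths and the reRegister helper are assumptions for illustration, not the PR's code; it uses the fsnotify library.

```go
package main

import (
	"log"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

const kubeletSockDir = "/var/lib/kubelet/device-plugins"
const kubeletSock = kubeletSockDir + "/kubelet.sock"

// reRegister stands in for the plugin's re-registration logic (assumed helper).
func reRegister() { log.Println("kubelet restarted: re-registering all endpoints") }

func main() {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	// Watch the directory rather than the socket itself, so the watch survives
	// the socket being removed and recreated on a kubelet restart.
	if err := w.Add(kubeletSockDir); err != nil {
		log.Fatal(err)
	}
	for {
		select {
		case ev := <-w.Events:
			if filepath.Clean(ev.Name) == kubeletSock && ev.Op&fsnotify.Create != 0 {
				reRegister()
			}
		case err := <-w.Errors:
			log.Println("watch error:", err)
		}
	}
}
```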

pkg/utils/utils.go (review thread, resolved)
type resourcePool struct {
config *types.ResourceConfig
devices map[string]*pluginapi.Device
bindScript string
Collaborator:

How will this bindScript be used? I didn't find an example for the existing pools (vfio, uio, generic, netDevice); is it for binding an interface to a userspace driver, like the dpdk_bind script?

Collaborator (Author):

Initially I was thinking of doing the driver binding/unbinding with the dpdk-devbind.py script where needed, but that may add extra complexity and issues. It's better to leave driver binding outside the scope of the device plugin, so the plugin can stay as simple as possible. I left the binding code here in case we need it in the future; otherwise it can be removed.
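Purely to illustrate the option being weighed (and deliberately kept out of the plugin), a sketch of shelling out to dpdk-devbind.py from Go; the script location and the PCI address are example assumptions.

```go
package main

import (
	"log"
	"os/exec"
)

// bindToVfio binds a PCI device to vfio-pci using the DPDK helper script.
// This is NOT part of the device plugin; it only illustrates the option
// discussed above. Requires root and dpdk-devbind.py on PATH.
func bindToVfio(pciAddr string) error {
	cmd := exec.Command("dpdk-devbind.py", "--bind=vfio-pci", pciAddr)
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Printf("dpdk-devbind.py output: %s", out)
	}
	return err
}

func main() {
	// Example VF address.
	if err := bindToVfio("0000:02:10.0"); err != nil {
		log.Fatal(err)
	}
}
```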

return newVfioResourcePool(rc)
case "uio":
return newUioResourcePool(rc)
case "netdevice":
Collaborator:

Nit: the name 'netdevice' reminds me of the OVS 'netdev' datapath_type, but it's intended for kernel net devices.

Collaborator (Author):

Can you suggest an alternative?

Collaborator:

Perhaps ethdevice or kerneldevice.

func (rf *resourceFactory) GetResourcePool(rc *types.ResourceConfig) types.ResourcePool {
glog.Infof("Resource pool type: %s", rc.DeviceType)
switch rc.DeviceType {
case "vfio":
Collaborator:

With 'vfio' or 'uio' DeviceType specified in the config file, do we assume that the DPDK port has already been bound to a userspace driver before launching the SR-IOV device plugin? Does that imply that the SR-IOV CNI will not support binding the DPDK driver?

Collaborator (Author):

With 'vfio' or 'uio' DeviceType specified in the config file, do we assume that the DPDK port has already been bound to a userspace driver before launching the SR-IOV device plugin? Does that imply that the SR-IOV CNI will not support binding the DPDK driver?

That is correct. The admin binds VFs to the desired drivers (kernel network driver, igb_uio, or vfio-pci) and then defines the resource config accordingly, so the device plugin exports the correct device files. If a VF is bound to a userspace driver then SR-IOV CNI has no job to do (so we can just skip it for now); it is up to the application to configure the port.
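To illustrate the admin workflow described in this reply (binding VFs to the desired driver before the plugin runs), here is a hedged Go sketch using the standard sysfs driver_override mechanism. The PCI address and driver name are example values; this is not code from the PR.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// bindDriver pins a driver for a PCI device via driver_override and asks the
// PCI bus to re-probe it. Requires root; example values only.
func bindDriver(pciAddr, driver string) error {
	devPath := fmt.Sprintf("/sys/bus/pci/devices/%s", pciAddr)

	// Unbind from the current driver, if any.
	if _, err := os.Stat(devPath + "/driver"); err == nil {
		if err := os.WriteFile(devPath+"/driver/unbind", []byte(pciAddr), 0200); err != nil {
			return err
		}
	}
	// Pin the desired driver, then trigger a probe so the kernel binds it.
	if err := os.WriteFile(devPath+"/driver_override", []byte(driver), 0200); err != nil {
		return err
	}
	return os.WriteFile("/sys/bus/pci/drivers_probe", []byte(pciAddr), 0200)
}

func main() {
	if err := bindDriver("0000:02:10.0", "vfio-pci"); err != nil {
		log.Fatal(err)
	}
}
```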

pkg/resources/factory.go (review thread, resolved)

func (rs *resourceServer) triggerUpdate() {
rp := rs.resourcPool
if rs.checkIntervals > 0 {
Collaborator:

Is it possible to capture the notification of a PF link-state change and then trigger the update accordingly? With checkIntervals, there may be a short period when the link state has changed but kubelet doesn't have the updated state and still allocates unhealthy devices to pods.

ahalimx86 (Collaborator, Author), Oct 18, 2018:

We could have a file watcher that watches operstate. Let's open an issue on this later, once this PR is merged.

Collaborator:

Yep, something like a watcher or a blocking wait, to detect the notification immediately when the state changes.

Collaborator (Author):

Cool, please submit a PR with an implementation if you can; that would be great. Thanks.
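For reference, a minimal sketch of the interval-based link check being discussed, with the watcher idea left as the proposed follow-up: read /sys/class/net/<PF>/operstate and flip the pool's VF health when it changes. The PF name, poll interval, and markVFs helper are made up for illustration and are not the PR's code.

```go
package main

import (
	"log"
	"os"
	"strings"
	"time"
)

// pfLinkUp reports whether the PF's operstate is "up".
func pfLinkUp(pfName string) bool {
	data, err := os.ReadFile("/sys/class/net/" + pfName + "/operstate")
	if err != nil {
		return false
	}
	return strings.TrimSpace(string(data)) == "up"
}

// markVFs stands in for updating VF health and pushing a ListAndWatch update.
func markVFs(healthy bool) {
	log.Printf("marking pool VFs healthy=%v", healthy)
}

func main() {
	const pf = "enp2s0f0" // example PF name
	last := pfLinkUp(pf)
	markVFs(last)
	for range time.Tick(2 * time.Second) {
		if up := pfLinkUp(pf); up != last {
			last = up
			markVFs(up)
		}
	}
}
```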

ahalimx86 (Collaborator, Author):

Thanks for the review @zshi-redhat, new changes are on the way!
Also planning to switch the build script to a Makefile in the future.

README.md Outdated
This plugin was tested in following environment:
- CentOS 7 - Kernel 4.13.7-1.el7.elrepo.x86_64
- Go - version go1.10.3 linux/amd64
- Kubernetes v1.12.0-alpha
Collaborator:

Is there any special consideration about which kube releases we will support or be compatible with going forward?
For a feature like the kubelet plugin watcher, which is Beta in 1.12, switching to it in the DP might cause incompatibility with previous kube releases, or require extra configuration (enabling feature gates) to work with them.

Collaborator (Author):

It really depends on K8s device plugin feature support; our plugin should work with any k8s release that supports device plugins (without any dramatic changes, of course). The kubelet plugin watcher is still backward compatible, and later we should definitely switch to the plugin watcher mechanism.
Here we are simply mentioning the versions we have tested this implementation with.

booxter added a commit to booxter/kubevirt that referenced this pull request Oct 19, 2018
The type extracts PCI IDs from SRIOV-VF-PCI-ADDR environment variable set by
SR-IOV CNI plugin; and configures hostdev libvirt devices that corresponds to
the extracted IDs.

The change mounts additional host mounts for /sys and /dev to allow
containerized libvirt / qemu to plug pci devices using vfio kernel interface.

Note that at this moment, SR-IOV enabled VMIs run their virt-launcher pods
privileged to allow qemu open /dev/vfio/NN devices. We don't know in advance
the name of the device until we create and start the pod, at which point SR-IOV
DP allocates a PCI ID to the pod that can be mapped to its IOMMU group number
and hence /dev/vfio/NN device.

In the future, SR-IOV DP will register the /dev/vfio/NN device with device
cgroup, at which point we will be able to drop the privileged mode. (Additional
capabilities like SYS_RESOURCE and SYS_RAWIO are still needed.) This work is
tracked in: k8snetworkplumbingwg/sriov-network-device-plugin#26
booxter added a commit to booxter/kubevirt that referenced this pull request Oct 19, 2018 (same commit message as above)
booxter added a commit to booxter/kubevirt that referenced this pull request Oct 23, 2018 (same commit message as above)
ahalimx86 force-pushed the dev/multiendpoints branch 2 times, most recently from a8bd563 to 8326564 on October 30, 2018 16:19
booxter added a commit to booxter/kubevirt that referenced this pull request Oct 30, 2018 (same commit message as above)
Commits pushed in this branch:

- added support for multiple resource endpoints (Change-Id: I314e1ea75e647e747bcfdb2f447e593adbcd14c5): the feature list from the PR description above
- changes based on review (Change-Id: I802e3d8bf2893f4975483d68da459f49c9f7bb80)
- corrected network annotation for pod-tc4.yaml (Change-Id: I46fc648f1b400bb55f207d5a24cf91571dbd1a87)
- fixed updated device list on health change (Change-Id: Ia10c53209872cd691d4c83fe8cbfdc4e91650693): updated devices were appended instead of building a new list; this is corrected
for _, container := range rqt.ContainerRequests {
containerResp := new(pluginapi.ContainerAllocateResponse)
containerResp.Devices = rs.resourcePool.GetDeviceSpecs(rs.resourcePool.GetDeviceFiles(), container.DevicesIDs)
containerResp.Envs = rs.resourcePool.GetEnvs(rs.resourcePool.GetResourceName(), container.DevicesIDs)
Collaborator:

Previously the env var had the hard-coded name 'SRIOV-VF-PCI-ADDR', which an application in the Pod could use to find out which device IDs were allocated. With the above change, the application doesn't know which env name to query for the device IDs, since the resourceName is unknown to the Pod. The Pod could try to get the resourceName by querying the network CRD attached to it, but I don't think that is a good option, as it basically requires each pod to access the cluster's network CRDs; also, the resourceName (sriov) in the env only contains the suffix of the actual full resourceName (intel.com/sriov).
Perhaps we can keep the hard-coded env name as it is and make resourceName+DeviceID a dict value of the env.

Collaborator (Author):

If the network attachment CRD is known to the Pod, then the associated resourceName is also attainable by looking at it. Maybe it's something for a discussion.

Perhaps we can keep the hard-coded env name as it is and make resourceName+DeviceID a dict value of the env.

Can you give an example of how that should look?

Collaborator:

If the network attachment CRD is known to the Pod, then the associated resourceName is also attainable by looking at it. Maybe it's something for a discussion.

The resourceName returned in "containerResp.Envs" only contains the suffix (e.g. sriov); in multiple-endpoints mode, how does the Pod get to know which prefix is used (assuming multiple prefixes are advertised)?

Perhaps we can keep the hard-coded env name as it is and make resourceName+DeviceID a dict value of the env.

Can you give an example of how that should look?

For example:
Env['SRIOV-VF-PCI-ADDR'] = [{"resourceName": "intel.com/sriov", "deviceIDs": "00:01.0 00:01.1"}, {"resourceName": "example.com/sriov", "deviceIDs": "00:02.0 00:02.1"}]

Collaborator (Author):

I guess we need good documentation and conventions to follow in that case.

For example:
Env['SRIOV-VF-PCI-ADDR'] = [{"resourceName": "intel.com/sriov", "deviceIDs": "00:01.0 00:01.1"}, {"resourceName": "example.com/sriov", "deviceIDs": "00:02.0 00:02.1"}]

It's good to have a well-known environment variable, but now we run into another problem: this env variable will be overwritten by the most recent device allocate request. Consider a Pod with two VF requests from two different resource pools, sriovA and sriovB.

Say the sriovA allocate request comes in first, so the env variable is set to:
SRIOV-VF-PCI-ADDR=[{"resourceName": "intel.com/sriovA", "deviceIDs": "00:01.0"}]

Then the sriovB request comes in and we expect to see:
SRIOV-VF-PCI-ADDR=[{"resourceName": "intel.com/sriovA", "deviceIDs": "00:01.0"},{"resourceName": "intel.com/sriovB", "deviceIDs": "02:00.1"}]
Right?

But what we will actually get is:
SRIOV-VF-PCI-ADDR=[{"resourceName": "intel.com/sriovB", "deviceIDs": "02:00.1"}]

That's because there is no way of knowing whether an env var already has a value; when we set it, we simply set a new value. That's why it makes sense to make the env variable name unique per resourceName.

Collaborator:

@ahalim-intel you're right, I didn't think of that overwrite problem. Then the question remains: how can the application running in the Pod get the PCI addresses allocated to it? This is a real question from the kubevirt community, where they use the current env['SRIOV-VF-PCI-ADDR'] to get the PCI address and pass it through to the VM.
I'm not sure it's a good design for the Pod application to query the API and parse the network CRD in order to retrieve the resourceName; that sounds like too much overhead for the application. Can you think of a better way for the application to get the resourceName?

Collaborator (Author):

I'm open to better alternatives.

Contributor:

@ahalim-intel at the very least, can we use full resource names (with the intel.com/ prefix) for the variable names? That will at least help us not to assume the prefix is intel.com/, or that a prefix is present and needs to be dealt with at all. It will also allow other DPs to mimic the Intel DP's behavior if they need to, without modifications to pod apps.

Contributor:

One thing we can do on the kubevirt side is to precalculate the interface-to-resourceName mapping when creating a pod and pass it out of band (an environment variable with a well-known name?) into the pod. The pod then extracts the mapping and uses it to decide which SR-IOV-DP-managed environment variable to read to extract a PCI ID for each VM interface.

We could drop the intel.com/ prefix and construct the PCI ID variable name ourselves, but it would be slightly easier for us (and more compatible with any other device-management solutions that may pop up) if we keep the full resource name in the variables. I understand that environment variables can't contain slashes, so perhaps we can replace them with a different character (underscore?).

Contributor:

Alternatively, Multus could pass the network-to-resourceName mapping into the pod (TODO: check whether the CNI API supports it). Then we would have the mapping defined in all pods, not just kubevirt virt-launcher pods.

Contributor:

I was looking at the CNI API description today, specifically the Result format (https://github.com/containernetworking/cni/blob/master/SPEC.md#result), and there seems to be no official back-channel from CNI to the pod to pass environment variables or anything like that, so Multus doesn't seem to be a good place for the mapping calculation, as I originally suggested. On the other side, the SR-IOV DP is not the right place either, since it doesn't deal with CRDs.

But I've heard there are plans to add a new mutating hook for the SR-IOV DP that would fill in requests/limits based on linked network CRDs. The same hook could also calculate and pass the mapping, via an environment variable or something else.

booxter added a commit to booxter/kubevirt that referenced this pull request Nov 1, 2018 (same commit message as the Oct 19, 2018 commit above)
dankenigsberg:

Mapping each PCI device to its network sounds useful for other purposes, but I'm not sure why it is relevant here. I'd be content if, for every device attached to the Pod, I got a different env var named SRIOV-VF-PCI-ADDR-<deviceid> or SRIOV-VF-PCI-ADDR-<meaningless-index>.

A pod application can easily iterate over all SRIOV-VF-PCI-ADDR-* and do its thing with them.

dankenigsberg:

P.S. The choice of env var name should allow allocation of two devices of the same resourceName: a Pod may very well request access to vlan1 and vlan2 on sriovA.

ahalimx86 (Collaborator, Author):

@dankenigsberg @booxter Thanks for your comments.

There are plans to add the allocated device ID to the network status as part of the CNI Result, so this information will be available from the Pod status. But for applications to get that info, they would need to talk to the API server; an env variable is a convenient way of getting this information.

It would be useful to explore the possibility of using the same hook to translate this resourceName env to a well-known name. Then again, we run into the same problem discussed above with multiple resources.

For now, I think we could include a common prefix, such as:
PCIDEVICE_<resourceName>=<one or more PCI addresses with a known delimiter>

dankenigsberg:

I think that PCIDEVICE_<resourceName>=<one or more PCI addresses with a known delimiter> has the same "overwrite" problem.
I find having multiple PCIDEVICE_<resourceName>_<PCI address>=<more info about device> vars simpler to write in the DP and to iterate over in the Pod. What do you think about them, @ahalim-intel?

ahalimx86 (Collaborator, Author):

I think that PCIDEVICE_<resourceName>=<one or more PCI addresses with a known delimiter> has the same "overwrite" problem.
I find having multiple PCIDEVICE_<resourceName>_<PCI address>=<more info about device> vars simpler to write in the DP and to iterate over in the Pod. What do you think about them, @ahalim-intel?

Actually, we don't have the overwrite problem in the current implementation, because the variable name (the resourceName) is unique. For multiple resource requests with the same resourceName, the PCI addresses are appended to the same variable's value.
The only change we need is to add a well-known prefix (PCIDEVICE_) to the variable name, which will still be unique per resource pool.
Say we have two resource requests from sriov_a. With this approach we get one variable with two values:
PCIDEVICE_SRIOV_A=00:01.0;02:00.1

With the one you described we would get two variables:
PCIDEVICE_SRIOV_A_00:01.0=<extra info>
PCIDEVICE_SRIOV_A_02:00.0=<extra info>

Also, ':' and '.' are most likely not allowed in variable names, so those need to be taken care of too.
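A small sketch of the PCIDEVICE_<resourceName> convention proposed in this thread: one variable per resource pool, with multiple devices joined by a delimiter and illegal characters replaced. The sanitization rule shown is an assumption for illustration, not the merged implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var invalidEnvChars = regexp.MustCompile(`[^A-Z0-9_]`)

// envNameFor turns a resource name such as "intel.com/sriov_a" into an
// environment-variable name such as "PCIDEVICE_INTEL_COM_SRIOV_A".
func envNameFor(resourceName string) string {
	return "PCIDEVICE_" + invalidEnvChars.ReplaceAllString(strings.ToUpper(resourceName), "_")
}

func main() {
	// One env var per resource pool, so requests from different pools never
	// overwrite each other; multiple devices from the same pool are appended.
	allocations := map[string][]string{
		"intel.com/sriov_a": {"0000:02:01.0", "0000:02:00.1"},
		"intel.com/sriov_b": {"0000:03:02.0"},
	}
	for res, addrs := range allocations {
		fmt.Printf("%s=%s\n", envNameFor(res), strings.Join(addrs, ";"))
	}
}
```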

booxter (Contributor) commented Nov 5, 2018:

@dankenigsberg @ahalim-intel the issue with a <common_prefix>_ naming scheme is that you don't know which Multus network each PCI ID belongs to (and with heterogeneous nodes that becomes important).

Having the resourceName encoded in the variable names is definitely the way to go. In the KubeVirt context, we still miss the link between the Multus networkName and the resourceName. While we can calculate it and pass it into the pod in KubeVirt, I was wondering whether implementing it in a broader context, for all pods, would be more helpful. Sadly, the CNI API doesn't seem to provide a way to communicate such a mapping back to the pod. I think we could have a mutating hook that calculates the mapping and passes it through a variable on the pod.

I will write a PoC showing the solution for KubeVirt, and we can discuss from there whether something similar should be implemented in a more general context.

I will write a PoC showing the solution for KubeVirt, and we can discuss from there if something similar should be implemented in more general context.

dankenigsberg:

I understand why matching a network device to the network it connects to is interesting, but it is just as interesting for non-SR-IOV networks. Hence I don't think it should delay this PR, or merit a complex solution such as calling the kube API from inside each Pod.

I don't understand the need for a KubeVirt PoC. For KubeVirt, you can match a PCI device based on the MAC address that sriov-cni set on it. But I'll wait and see.

booxter (Contributor) commented Nov 5, 2018:

@dankenigsberg I don't think we should delay the PR (except for extending the per-pool variable names to the full resource name, with the intel.com/ prefix). Yes, the mapping between networks and resource names is useful beyond this particular DP. Which component would serve it to pods? Multus?

@dankenigsberg please explain your solution for the KubeVirt-specific concern. How do we know the MAC address allocated by CNI to an interface?

rkamudhan (Member) commented Nov 5, 2018:

@ALL

Guys, I believe the PR by itself is bug-free. The PCI address introduction is important for applications like DPDK or KubeVirt. @ahalim-intel introduced a way to expose the PCI address as an env variable based on the resource name. Can we agree on one strategy and improve the solution in the upcoming PRs?

booxter (Contributor) commented Nov 5, 2018:

@rkamudhan AFAIK we still use a shortened resourceName (without the intel.com prefix) as the variable name. If we'd like it to be extendable to other potential solutions, it would be nice to make the prefix part of the variable name, e.g. INTEL_COM_sriov_vfio=... instead of the current sriov_vfio=... Otherwise we are left to assume the intel.com/ prefix in annotations and cut it off before accessing the variable.

minor fixes and typos (Change-Id: Ieda50d0eff71df7869dcec4533d3c16bd54fd275)
ahalimx86 merged commit cd8941f into master on Nov 9, 2018
ahalimx86 added a commit that referenced this pull request Nov 9, 2018
* added support for multiple resource endpoints

(feature list as in the PR description above)

Change-Id: I314e1ea75e647e747bcfdb2f447e593adbcd14c5

* changes based on review

Change-Id: I802e3d8bf2893f4975483d68da459f49c9f7bb80

* corrected network annotation for pod-tc4.yaml

Change-Id: I46fc648f1b400bb55f207d5a24cf91571dbd1a87

* fixed updated device list on health change

updated devices were appended instead of building a new list. This is corrected.

Change-Id: Ia10c53209872cd691d4c83fe8cbfdc4e91650693

* minor fixes and typos

Change-Id: Ieda50d0eff71df7869dcec4533d3c16bd54fd275
booxter added a commit to booxter/kubevirt that referenced this pull request Nov 9, 2018 (same commit message as the Oct 19, 2018 commit above)
booxter added a commit to booxter/kubevirt that referenced this pull request Nov 12, 2018 (same commit message as the Oct 19, 2018 commit above)
booxter added a commit to booxter/kubevirt that referenced this pull request Nov 12, 2018
TODO: extract resourceName from CRD.
TODO: keep it backwards compatible with previous DP version.
TODO: tests.

This is to align kubevirt with support for multiple allocation pools in SR-IOV
DP: k8snetworkplumbingwg/sriov-network-device-plugin#26
booxter added a commit to booxter/kubevirt that referenced this pull request Nov 12, 2018
TODO: grant access to networks to virt-controller.
TODO: keep it backwards compatible with previous DP version.
TODO: handle errors from network client.
TODO: fix tests.

Consider:
- cache networks.

This is to align kubevirt with support for multiple allocation pools in SR-IOV
DP: k8snetworkplumbingwg/sriov-network-device-plugin#26
booxter added a commit to booxter/kubevirt that referenced this pull request Nov 13, 2018
This patch adopts changes from the latest SR-IOV device plugin master [1] that
allow us to attach a VMI to multiple SR-IOV Multus networks. (Before the
change, devices allocated for multiple networks could end up in different boot
order than configured.)

Among other things, this patch adds read access to network attachment
definitions to virt-controller role. This is to allow the controller to fetch
network CRD objects and determine their respective resource names. For this
same reason, it extends KubevirtClient object to include NetworkClient().

We do *not* maintain backwards compatibility with older SR-IOV device plugin
releases. This is as intended because SR-IOV support is still considered
experimental, and there is no good reason for us to support the device plugin
release that does not support multiple networks.

Updated documentation to reflect the fact that kubevirt is now compatible with
device plugin master (and *not* compatible with v1.0).

todo: fix tests

[1] k8snetworkplumbingwg/sriov-network-device-plugin#26
booxter added a commit to booxter/kubevirt that referenced this pull request Nov 14, 2018 (same commit message as the Nov 13, 2018 commit above)
booxter added a commit to booxter/kubevirt that referenced this pull request Nov 14, 2018 (same commit message as the Nov 13, 2018 commit above)
booxter added a commit to booxter/kubevirt that referenced this pull request Nov 15, 2018 (same commit message as the Nov 13, 2018 commit above)
booxter added a commit to booxter/kubevirt that referenced this pull request Nov 23, 2018 (same commit message as the Nov 13, 2018 commit above)
greg-bock pushed a commit to greg-bock/kubevirt that referenced this pull request Nov 26, 2018 (same commit message as the Nov 13, 2018 commit above)
greg-bock pushed a commit to greg-bock/kubevirt that referenced this pull request Nov 26, 2018 (same commit message as the Nov 13, 2018 commit above)
booxter added a commit to booxter/kubevirt that referenced this pull request Nov 29, 2018 (same commit message as the Nov 13, 2018 commit above)
greg-bock pushed a commit to greg-bock/kubevirt that referenced this pull request Nov 30, 2018 (same commit message as the Nov 13, 2018 commit above)
greg-bock pushed a commit to greg-bock/kubevirt that referenced this pull request Nov 30, 2018 (same commit message as the Nov 13, 2018 commit above)
greg-bock pushed a commit to greg-bock/kubevirt that referenced this pull request Nov 30, 2018 (same commit message as the Nov 13, 2018 commit above)
booxter added a commit to booxter/kubevirt that referenced this pull request Dec 3, 2018 (same commit message as the Nov 13, 2018 commit above)
greg-bock pushed a commit to greg-bock/kubevirt that referenced this pull request Dec 3, 2018 (same commit message as the Nov 13, 2018 commit above)
greg-bock pushed a commit to greg-bock/kubevirt that referenced this pull request Dec 3, 2018 (same commit message as the Nov 13, 2018 commit above)
greg-bock pushed a commit to greg-bock/kubevirt that referenced this pull request Dec 3, 2018 (same commit message as the Nov 13, 2018 commit above)
greg-bock pushed a commit to greg-bock/kubevirt that referenced this pull request Dec 3, 2018 (same commit message as the Nov 13, 2018 commit above)
greg-bock pushed a commit to greg-bock/kubevirt that referenced this pull request Dec 3, 2018 (same commit message as the Nov 13, 2018 commit above)
booxter added a commit to booxter/kubevirt that referenced this pull request Dec 4, 2018 (same commit message as the Nov 13, 2018 commit above)
yossisegev pushed a commit to yossisegev/kubevirt that referenced this pull request Jan 3, 2019 (same commit message as the Oct 19, 2018 commit above)
yossisegev pushed a commit to yossisegev/kubevirt that referenced this pull request Jan 3, 2019 (same commit message as the Nov 13, 2018 commit above)
5 participants