
CDI implementation #489

Merged
merged 1 commit into k8snetworkplumbingwg:master from cdi on Oct 1, 2023
Conversation

e0ne
Collaborator

@e0ne e0ne commented Jun 4, 2023

No description provided.

@e0ne e0ne force-pushed the cdi branch 2 times, most recently from bde3b1a to a417105 Compare June 6, 2023 19:25
@e0ne e0ne marked this pull request as ready for review June 12, 2023 06:47
@e0ne e0ne changed the title WIP. CDI implementation CDI implementation Jun 19, 2023
@coveralls
Collaborator

coveralls commented Jun 20, 2023

Pull Request Test Coverage Report for Build 5680170820

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 42 of 120 (35.0%) changed or added relevant lines in 9 files are covered.
  • 96 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-2.4%) to 75.845%

Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/factory/factory.go | 0 | 1 | 0.0% |
| cmd/sriovdp/main.go | 0 | 3 | 0.0% |
| pkg/accelerator/accelResourcePool.go | 0 | 3 | 0.0% |
| pkg/auxnetdevice/auxNetResourcePool.go | 0 | 3 | 0.0% |
| pkg/resources/pool_stub.go | 0 | 3 | 0.0% |
| pkg/resources/server.go | 21 | 35 | 60.0% |
| pkg/cdi/cdi.go | 13 | 64 | 20.31% |

Files with Coverage Reduction:

| File | New Missed Lines | % |
|---|---|---|
| cmd/sriovdp/main.go | 6 | 0.0% |
| pkg/utils/testing.go | 8 | 56.6% |
| pkg/factory/factory.go | 17 | 85.52% |
| cmd/sriovdp/manager.go | 27 | 80.89% |
| pkg/resources/server.go | 38 | 77.44% |

Totals:

- Change from base Build 5611817480: -2.4%
- Covered Lines: 1975
- Relevant Lines: 2604

💛 - Coveralls

Contributor

@adrianchiris adrianchiris left a comment


please note that device plugin daemonset would need to mount cdi dir
(/var/run/cdi) IIRC

@e0ne e0ne force-pushed the cdi branch 2 times, most recently from b8caf78 to cefd97b Compare June 26, 2023 20:00
cmd/sriovdp/main.go (outdated thread, resolved)
cmd/sriovdp/manager.go (outdated thread, resolved)
rPool, err := rm.rFactory.GetResourcePool(rc, filteredDevices)
if err != nil {
glog.Errorf("initServers(): error creating ResourcePool with config %+v: %q", rc, err)
return err
}

if rm.useCdi {
err = cdi.CreateCDISpec(rm.resourcePrefix, filteredDevices, rPool)
Contributor

I see you regenerated it because the logic is in device plugin and you don't want to duplicate it

name: default-cdi
- mountPath: /var/run/cdi
name: dynamic-cdi
- mountPath: /host//etc/pcidp/
Contributor

remove extra /

pkg/cdi/cdi.go Outdated
func CreateCDISpec(resourcePrefix string, filteredDevices []types.HostDevice, rPool types.ResourcePool) error {
cdiDevices := make([]cdiSpecs.Device, 0)
cdiSpec := cdiSpecs.Spec{
Version: "0.5.0",

pkg/cdi/cdi.go Outdated
deviceNode := cdiSpecs.DeviceNode{
Path: spec.ContainerPath,
HostPath: spec.HostPath,
Permissions: "rwm",
Contributor

did you check if the m in Permissions is needed?

Collaborator Author

it isn't needed for VFs. I'll check with SFs and create a separate PR to fix it across the whole project
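For context, the generated spec file would contain device entries roughly like the following (the kind, device name, and paths are hypothetical; the "m" being discussed is the mknod bit in the cgroup-style "rwm" permissions string):

```json
{
  "cdiVersion": "0.5.0",
  "kind": "example.com/net",
  "devices": [
    {
      "name": "0000:00:00.2",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/vfio/63", "hostPath": "/dev/vfio/63", "permissions": "rwm" }
        ]
      }
    }
  ]
}
```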

Contributor

Did you check with and without RDMA CM?

Collaborator Author

yes, I checked both for VFs

- mountPath: /etc/cdi/
name: default-cdi
- mountPath: /var/run/cdi
Contributor

we need just /var/run/cdi IMO; we don't deal with statically configured files, right?

Contributor

we should align daemonsets as well

@@ -4,6 +4,7 @@ go 1.20

require (
github.com/Mellanox/rdmamap v1.1.0
github.com/container-orchestrated-devices/container-device-interface v0.5.4
Contributor

reminder to check if we prefer to:

  1. bump this to v0.6.0 and require containerd 1.7.5 or cri-o 1.28, or
  2. keep it as is.

Collaborator Author

@@ -324,6 +365,9 @@ func (rs *resourceServer) cleanUp() error {
if err := rs.resourcePool.CleanDeviceInfoFile(rs.resourceNamePrefix); err != nil {
errors = append(errors, err.Error())
}
if err := rs.cdi.CleanupSpecs(); err != nil {
Contributor

i think we should:

  1. clean up spec files which are related only to this resource server here
  2. in the resource manager, clean up any remaining CDI spec files before starting the servers, in case of an ungraceful exit of a previous device plugin instance.

thoughts?

Collaborator Author

IMO, it's better to move this logic to the resource manager only: it will clean up all spec files before the resource servers start, so there is no race condition in this case and all outdated specs will be removed. With such behaviour we don't need to filter for orphaned spec files if the config changed during a device plugin restart.

// Impl implements CDI interface
type Impl struct {
}

Contributor

@adrianchiris adrianchiris Sep 11, 2023

nit : define a New() method

func New() CDI {
    return &impl{}
}

and use it in resource server

also consider making Impl struct private

Collaborator Author

done


// CreateCDISpecForPool creates CDI spec file with specified devices
func (c *Impl) CreateCDISpecForPool(resourcePrefix string, rPool types.ResourcePool) error {
err := c.CleanupSpecs()
Contributor

this will delete spec files of other resource servers. see other comments related to where/how we should clean up.

Collaborator Author

please look at #489 (comment)

func (c *Impl) CleanupSpecs() error {
for _, dir := range cdi.GetRegistry().GetSpecDirectories() {
specs, err := filepath.Glob(filepath.Join(dir, cdiSpecPrefix+"*"))
if err != nil {
Contributor

@adrianchiris adrianchiris Sep 11, 2023

will this one fail if dir doesn't exist?

if yes, then for such an error we need to skip; else disregard my comment.

Contributor

Glob returns the names of all files matching pattern or nil if there is no matching file. The syntax of patterns is the same as in Match. The pattern may describe hierarchical names such as /usr/*/bin/ed (assuming the Separator is '/').
Glob ignores file system errors such as I/O errors reading directories. The only possible returned error is ErrBadPattern, when pattern is malformed.

https://pkg.go.dev/path/filepath#Glob

annoKey := "cdi.k8s.io/example.com_net"
annoVal := "example.com/net=0000:00:00.2"
Expect(annotations[annoKey]).To(Equal(annoVal))
})
Contributor

need to add test for cleanup method as well

Collaborator Author

done

Contributor

@adrianchiris adrianchiris left a comment

final comments on my side; once addressed, I'm LGTM.

Collaborator

@SchSeba SchSeba left a comment

nice work!
I left some comments

func flagInit(cp *cliParams) {
flag.StringVar(&cp.configFile, "config-file", defaultConfig,
"JSON device pool config file location")
flag.StringVar(&cp.resourcePrefix, "resource-prefix", "intel.com",
"resource name prefix used for K8s extended resource")
"resource name prefix used for K8s extended re"+
Collaborator

nit: why a new line here?

Collaborator Author

fixed

@@ -43,7 +46,7 @@ func main() {

glog.Infof("resource manager reading configs")
if err := rm.readConfig(); err != nil {
glog.Errorf("error getting resources from file %v", err)
glog.Error("error getting resources from file", err)
Collaborator

I think here we should have the errorf

Collaborator Author

done

@@ -101,12 +109,16 @@ func (rm *resourceManager) readConfig() error {
}

func (rm *resourceManager) initServers() error {
err := rm.cleanupCDISpecs()
if err != nil {
glog.Infof("Unable to delete CDI specs: %s", err)
Collaborator

I think it should be error here if we return an error

Collaborator Author

done

func (rm *resourceManager) cleanupCDISpecs() error {
if rm.cliParams.useCdi {
if err := rm.cdi.CleanupSpecs(); err != nil {
return fmt.Errorf("unable to delete CDI specs: %s", err)
Collaborator

I think it should be %v here

Collaborator Author

fixed

@@ -23,6 +23,10 @@ import (
"github.com/k8snetworkplumbingwg/sriov-network-device-plugin/pkg/types"
)

const (
accelPoolType = "net-accel"
Collaborator

why do we add net-?

Collaborator Author

We do it for consistency across all device types.

pkg/cdi/cdi.go Outdated
func (c *impl) CreateCDISpecForPool(resourcePrefix string, rPool types.ResourcePool) error {
err := c.CleanupSpecs()
if err != nil {
glog.Error("can not cleanup old spec files", err)
Collaborator

please add CreateCDISpecForPool(): so we can be consistent in the logs or remove it from the log below

Collaborator Author

done

annoVal := "example.com/net=0000:00:00.1"
Expect(annotations[annoKey]).To(Equal(annoVal))
})
It("should not fail on non-existing device", func() {
Collaborator

are you sure the test is right?

Collaborator Author

You're right. I'll delete this test.

containerResp.Annotations, err = rs.cdi.CreateContainerAnnotations(
container.DevicesIDs, rs.resourceNamePrefix, rs.resourcePool.GetCDIName())
if err != nil {
return nil, fmt.Errorf("cant create container annotation: %s", err)
Collaborator

nit: cant -> can't

Collaborator Author

done

@@ -147,8 +162,12 @@ func (rs *resourceServer) ListAndWatch(empty *pluginapi.Empty, stream pluginapi.
devs = append(devs, dev)
}
resp.Devices = devs
err := rs.updateCDISpec()
if err != nil {
glog.Error("cannot update CDI specs", err)
Collaborator

should be errorf with %v

Collaborator Author

done

@@ -147,8 +162,12 @@ func (rs *resourceServer) ListAndWatch(empty *pluginapi.Empty, stream pluginapi.
devs = append(devs, dev)
}
resp.Devices = devs
err := rs.updateCDISpec()
if err != nil {
Collaborator

nit: you can have this in the same line

Collaborator Author

thanks, done!

pkg/cdi/cdi.go Outdated
func (c *impl) CreateCDISpecForPool(resourcePrefix string, rPool types.ResourcePool) error {
err := c.CleanupSpecs()
if err != nil {
glog.Error("CreateCDISpecForPool(): can not cleanup old spec files", err)
Member

Shouldn't this be glog.Errorf("....%v", err)?

Collaborator Author

fixed

pkg/cdi/cdi.go Outdated
annotations := make(map[string]string, 0)
annoKey, err := cdi.AnnotationKey(resourcePrefix, resourceKind)
if err != nil {
glog.Error("CreateContainerAnnotations(): can't create container annotation", err)
Member

Shouldn't this be glog.Errorf("....%v", err)?
Same here and on line 106

Collaborator Author

fixed

Member

@zeeke zeeke left a comment

Left a few minor comments about error management.
Beside that, LGTM

@e0ne
Collaborator Author

e0ne commented Sep 22, 2023

Left a few minor comments about error management. Beside that, LGTM

thanks, @zeeke

Collaborator

@SchSeba SchSeba left a comment

just some last small nits

can you please also add a section about CDI in the main document that points to the CDI readme?

cmd/sriovdp/manager.go (outdated thread, resolved)
cmd/sriovdp/manager.go (outdated thread, resolved)
@@ -23,6 +23,10 @@ import (
"github.com/k8snetworkplumbingwg/sriov-network-device-plugin/pkg/types"
)

const (
accelPoolType = "net-accel"
Collaborator

should this be just accel as the accelerators are not only network acceleration

Collaborator

for now we decided to leave it with the net-

pkg/resources/pool_stub.go (thread resolved)
@@ -25,6 +25,10 @@ import (
"github.com/k8snetworkplumbingwg/sriov-network-device-plugin/pkg/types"
)

const (
auxPoolType = "net-sf"
Collaborator

NVIDIA use: scalable functions
Intel use: sub functions

so sf is good here :)

Collaborator

@SchSeba SchSeba left a comment

last comment and we can merge this PR!

Great work!

memory: "200Mi"
volumeMounts:
- name: devicesock
mountPath: /var/lib/kubelet/
Collaborator

please update the mounts to be the same as #500

Collaborator Author

done

This commit implements Container Device Interface [1] support.

[1] https://github.com/container-orchestrated-devices/container-device-interface
Collaborator

@SchSeba SchSeba left a comment

Great work!

@SchSeba SchSeba merged commit 2cc723d into k8snetworkplumbingwg:master Oct 1, 2023
10 checks passed
@e0ne
Collaborator Author

e0ne commented Oct 1, 2023

Thanks for the feedback, @SchSeba !
