Incorrect deviceClassWhitelist configuration is provided #729

Closed
fprzewozny opened this issue May 27, 2024 · 3 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

fprzewozny commented May 27, 2024

Hey,
While going through a live system's configuration, I noticed that the gpu-operator-node-feature-discovery-worker-conf ConfigMap contains an incorrect device class whitelist:

apiVersion: v1
data:
 nfd-worker.conf: |-
  sources:
   pci:
    deviceClassWhitelist:
    - "02"
    - "0200"
    - "0207"
    - "0300"
    - "0302"
    deviceLabelFields:
    - vendor
kind: ConfigMap

According to the PCI-SIG specification, base class 03 is Display controller; within it, subclass 00 is VGA-compatible controller and subclass 02 is 3D controller. Base class 02 is Network controller: an entry with no subclass matches any network controller, subclass 00 is Ethernet controller, and subclass 07 is InfiniBand controller.

So the configuration provided with the operator translates to:

deviceClassWhitelist:
- "02"    # Any network controller
- "0200"  # Ethernet controller
- "0207"  # InfiniBand Controller
- "0300"  # VGA-compatible controller
- "0302"  # 3D controller
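
The mapping above can be sketched as a small lookup. This is only an illustration: the tables cover just the class codes named in this issue (not the full PCI class code list), and `describe` is a hypothetical helper, not part of node-feature-discovery.

```python
# Subset of PCI class codes mentioned in this issue (not a complete table).
BASE_CLASSES = {"02": "Network controller", "03": "Display controller"}
SUBCLASSES = {
    "0200": "Ethernet controller",
    "0207": "InfiniBand controller",
    "0300": "VGA-compatible controller",
    "0302": "3D controller",
}


def describe(entry: str) -> str:
    """Hypothetical helper: turn a whitelist entry into a readable name."""
    if len(entry) == 2:
        # A bare base class is read here as "any subclass of that class".
        return f"Any {BASE_CLASSES.get(entry, 'unknown class').lower()}"
    return SUBCLASSES.get(entry, "unknown subclass")


whitelist = ["02", "0200", "0207", "0300", "0302"]
for entry in whitelist:
    print(f'"{entry}": {describe(entry)}')
```

Three of the five entries decode to network devices, which is the point of this report.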

With these filters, gpu-operator-node-feature-discovery appears to be configured to gather both GPU and network data. The network data should instead be gathered by https://github.com/Mellanox/network-operator, which has a similar issue: Mellanox/network-operator#957. In my opinion, deviceClassWhitelist should contain only entries from base class 03 (Display).

The result of this misconfiguration can be observed in the logs of the gpu-operator-node-feature-discovery-worker pods: the worker tries to gather data about both Ethernet and InfiniBand devices. Those devices should be handled by network-operator, not gpu-operator, and should be filtered out by deviceClassWhitelist:

kubectl logs -n gpu-operator gpu-operator-node-feature-discovery-worker-7ndj5 | head -n 5
E0526 21:58:37.810614       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/eno3/speed: invalid argument" attributeName="speed"
E0526 21:58:37.811725       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ens6f0/speed: invalid argument" attributeName="speed"
E0526 21:58:37.811789       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ens6f1/speed: invalid argument" attributeName="speed"
E0526 21:58:37.812141       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ibp154s0v0/speed: invalid argument" attributeName="speed"
E0526 21:58:37.812180       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/ibp154s0v1/speed: invalid argument" attributeName="speed"

This configuration can be found here: https://github.com/NVIDIA/gpu-feature-discovery/blob/main/deployments/helm/gpu-feature-discovery/values.yaml#L84

In my opinion, deviceClassWhitelist for gpu-feature-discovery should contain only the 0300 and 0302 entries.
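
As a rough sketch of that proposal, the snippet below filters the shipped whitelist down to base class 03 and prints the result as a YAML list. `display_only` is a hypothetical helper used only for illustration; the actual change would be an edit to the chart's values file.

```python
# Whitelist currently shipped, as quoted in this issue.
current = ["02", "0200", "0207", "0300", "0302"]

DISPLAY_BASE_CLASS = "03"  # PCI base class for display controllers


def display_only(entries):
    """Hypothetical helper: keep entries whose base class is Display (03)."""
    return [e for e in entries if e[:2] == DISPLAY_BASE_CLASS]


proposed = display_only(current)

# Print the result as a YAML list for the pci source.
print("deviceClassWhitelist:")
for entry in proposed:
    print(f'- "{entry}"')
```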

Thank you,
Franciszek

elezar (Member) commented May 27, 2024

@ArangoGutierrez could you have a look at this?

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions bot added the lifecycle/stale label Aug 26, 2024

This issue was automatically closed due to inactivity.

github-actions bot closed this as not planned Sep 25, 2024