Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't fail permenantly when nvml isn't installed #2732

Merged
merged 1 commit into from
Nov 16, 2020

Conversation

dashpole
Copy link
Collaborator

Reverts a small part of #2419, which changed how the nvidia collector is setup. NVIDIA driver and library installation is often done via daemonset. If cadvisor starts before the library is installed, youdon't get any GPU metrics.

This should fix the problem by returning a manager instead of a noop. Setup will be retried when GetCollector is called.

cc @RenaudWasTaken @iwankgb @bobbypage

I'd like to cherrypick this back to v0.37 as well.

@google-cla google-cla bot added the cla: yes label Nov 13, 2020
@iwankgb
Copy link
Collaborator

iwankgb commented Nov 14, 2020

LGTM

@bobbypage
Copy link
Collaborator

LGTM, thanks for the fix. Agree, let's cherrypick this to v0.37

@liggitt
Copy link
Contributor

liggitt commented Nov 18, 2020

Please make sure this gets in the v0.38.x stream as well for kubernetes 1.20

@liggitt
Copy link
Contributor

liggitt commented Nov 18, 2020

Setup will be retried when GetCollector is called.

Is that true? I see this, which seems to make GetCollector no-op if setup failed:

// GetCollector returns a collector that can fetch NVIDIA gpu metrics for NVIDIA devices
// present in the devices.list file in the given devicesCgroupPath.
func (nm *nvidiaManager) GetCollector(devicesCgroupPath string) (stats.Collector, error) {
	nc := &nvidiaCollector{}

	if !nm.devicesPresent {
		return &stats.NoopCollector{}, nil
	}

@dashpole
Copy link
Collaborator Author

If there are no GPUs on the node, it returns a no-op. We still fail permanently when there are no GPUs on the node. We no longer fail permanently when nvml is missing

@dashpole dashpole deleted the retry_nvml branch November 18, 2020 19:16
@ruiwen-zhao
Copy link

nm.devicesPresent will be true in this case because it is set before initializing NVML

nm.devicesPresent = true

@liggitt
Copy link
Contributor

liggitt commented Nov 18, 2020

Got it, detectDevices was working, initializeNVML was not. Sorry for the noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants