diff --git a/gpu-operator/index.rst b/gpu-operator/index.rst index be4e506ce..0f4ae4909 100644 --- a/gpu-operator/index.rst +++ b/gpu-operator/index.rst @@ -25,6 +25,7 @@ getting-started.rst Platform Support Release Notes + Troubleshooting GPU Driver CRD gpu-driver-upgrades.rst install-gpu-operator-vgpu.rst diff --git a/gpu-operator/troubleshooting.rst b/gpu-operator/troubleshooting.rst new file mode 100644 index 000000000..975a7d2ca --- /dev/null +++ b/gpu-operator/troubleshooting.rst @@ -0,0 +1,250 @@ +.. license-header + SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + SPDX-License-Identifier: Apache-2.0 + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + +.. headings (h1/h2/h3/h4/h5) are # * = - + +####################################### +Troubleshooting the NVIDIA GPU Operator +####################################### + +************************************* +GPU Operator Pods Stuck in Crash Loop +************************************* + +.. rubric:: Issue + :class: h4 + +On large clusters, such as 300 or more nodes, the GPU Operator pods +can get stuck in a crash loop. + +.. rubric:: Observation + :class: h4 + +- The GPU Operator pod is not running: + + .. code-block:: console + + $ kubectl get pod -n gpu-operator -l app=gpu-operator + + *Example Output* + + .. code-block:: output + + NAME READY STATUS RESTARTS AGE + gpu-operator-568c7ff7f6-chg5b 0/1 CrashLoopBackOff 4 (85s ago) 4m42s + +- The node that is running the GPU Operator pod has sufficient resources and the node is ``Ready``: + + .. code-block:: console + + $ kubectl describe node + + *Example Output* + + .. code-block:: output + + Conditions: + Type Status LastHeartbeatTime LastTransitionTime Reason Message + ---- ------ ----------------- ------------------ ------ ------- + MemoryPressure False Tue, 26 Dec 2023 14:01:31 +0000 Tue, 12 Dec 2023 19:47:47 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available + DiskPressure False Tue, 26 Dec 2023 14:01:31 +0000 Thu, 14 Dec 2023 19:15:03 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure + PIDPressure False Tue, 26 Dec 2023 14:01:31 +0000 Tue, 12 Dec 2023 19:47:47 +0000 KubeletHasSufficientPID kubelet has sufficient PID available + Ready True Tue, 26 Dec 2023 14:01:31 +0000 Thu, 14 Dec 2023 19:15:13 +0000 KubeletReady kubelet is posting ready status + +- The logs from the pod include a fatal error: + + .. code-block:: console + + $ kubectl logs -n gpu-operator -l app=gpu-operator + + *Partial Output* + + .. code-block:: output + :emphasize-lines: 1 + + fatal error: concurrent map read and map write + + goroutine 676 [running]: + k8s.io/apimachinery/pkg/runtime.(*Scheme).ObjectKinds(0xc0001fc000, {0x1ea20f0?, 0xc0008b4770}) + /workspace/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:264 +0xce + sigs.k8s.io/controller-runtime/pkg/client/apiutil.GVKForObject({0x1ea20f0?, 0xc0008b4770}, 0xc00133d4e0?) 
+          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/apiutil/apimachinery.go:98 +0x245
+      sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).objectTypeForListObject(0xc0000123c0, {0x1ebe020?, 0xc0008b4770})
+          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:94 +0x87
+      sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).List(0xc0000123c0, {0x1eb5ca8, 0xc000618cd0}, {0x1ebe020, 0xc0008b4770}, {0x2c7cf70, 0x0, 0x0})
+          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:73 +0x71
+      sigs.k8s.io/controller-runtime/pkg/client.(*delegatingReader).List(0xc000c4b480, {0x1eb5ca8, 0xc000618cd0}, {0x1ebe020?, 0xc0008b4770?}, {0x2c7cf70, 0x0, 0x0})
+          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/split.go:140 +0x114
+      github.com/NVIDIA/gpu-operator/controllers.addWatchNewGPUNode.func1({0x199a6a0?, 0xc002873b30?})
+          /workspace/controllers/clusterpolicy_controller.go:228 +0x9a
+      sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue(0x44?, {0x1ebf938, 0xc000158660}, {0x1ecc6c0?, 0xc001c04fc0?}, 0xa8?)
+          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:80 +0x46
+      sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create(0xc000095900?, {{0x1ecc6c0?, 0xc001c04fc0?}}, {0x1ebf938, 0xc000158660})
+          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:57 +0xd2
+      sigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnAdd({{0x1eb66b8, 0xc000012498}, {0x1ebf938, 0xc000158660}, {0xc000b891f0, 0x1, 0x1}}, {0x1bea560?, 0xc001c04fc0})
+          /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/internal/eventsource.go:63 +0x295
+      k8s.io/client-go/tools/cache.(*processorListener).run.func1()
+          /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:818 +0x134
+      k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
+          /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x3e
+      k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006fc738?, {0x1e9eae0, 0xc0014c48a0}, 0x1, 0xc000b1a540)
+          /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xb6
+      k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x1ebcb18?, 0x3b9aca00, 0x0, 0x51?, 0xc0006fc7b0?)
+          /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89
+      k8s.io/apimachinery/pkg/util/wait.Until(...)
+          /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92
+      k8s.io/client-go/tools/cache.(*processorListener).run(0xc000766f80)
+          /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:812 +0x6b
+      k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
+          /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:75 +0x5a
+      created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
+          /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x85
+
+
+
+.. rubric:: Root Cause
+   :class: h4
+
+The memory resource limit for the GPU Operator is too low for the cluster size.
+
+.. rubric:: Action
+   :class: h4
+
+Increase the memory request and limit for the GPU Operator pod:
+
+- Set the memory request to a value that matches the average memory consumption over a large time window, as shown in the measurement sketch after this list.
+- Set the memory limit to match the occasional spikes in memory consumption.
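+
+To choose values that fit your cluster, sample the actual memory consumption of the GPU Operator pod over time
+and compare it against the configured request and limit.
+The following commands are a minimal sketch: they assume that the Kubernetes metrics-server (or an equivalent
+metrics pipeline) is installed so that ``kubectl top`` works, that the Operator was installed from the
+``nvidia`` Helm repository under the release name ``gpu-operator``, and that the chart exposes the resource
+settings under ``operator.resources``.
+Confirm the exact value names with ``helm show values nvidia/gpu-operator`` before using them.
+
+.. code-block:: console
+
+   $ # Sample current memory usage; repeat over time to estimate the average and the peaks.
+   $ kubectl top pod -n gpu-operator -l app=gpu-operator
+
+   $ # Optionally persist the new request and limit through the Helm chart so that a future
+   $ # helm upgrade does not revert a manual kubectl patch.
+   $ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
+       --reuse-values \
+       --set operator.resources.requests.memory=600Mi \
+       --set operator.resources.limits.memory=1400Mi
+
+Use the steady-state reading to size the memory request and the observed peaks to size the memory limit,
+and then apply the change with the following steps.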
+
+#. Increase the memory resource limit for the GPU Operator pod:
+
+   .. code-block:: console
+
+      $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
+          -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'
+
+#. Optional: Increase the memory resource request for the pod:
+
+   .. code-block:: console
+
+      $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
+          -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'
+
+Monitor the GPU Operator pod.
+Increase the memory request and limit again if the pod remains stuck in a crash loop.
+
+
+************************************************
+infoROM is corrupted (nvidia-smi return code 14)
+************************************************
+
+
+.. rubric:: Issue
+   :class: h4
+
+The nvidia-operator-validator pod fails and the nvidia-driver daemonset pods fail as well.
+
+
+.. rubric:: Observation
+   :class: h4
+
+The output from the driver validation container indicates that the infoROM is corrupt:
+
+.. code-block:: console
+
+   $ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation
+
+*Example Output*
+
+.. code-block:: output
+
+   | NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
+   |-------------------------------+----------------------+----------------------+
+   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+   |                               |                      |               MIG M. |
+   |===============================+======================+======================|
+   |   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
+   | N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
+   |                               |                      |                  N/A |
+   +-------------------------------+----------------------+----------------------+
+
+   +-----------------------------------------------------------------------------+
+   | Processes:                                                                  |
+   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
+   |        ID   ID                                                   Usage      |
+   |=============================================================================|
+   |  No running processes found                                                 |
+   +-----------------------------------------------------------------------------+
+   WARNING: infoROM is corrupted at gpu 0000:0B:00.0
+   14
+
+The GPU emits a warning message that its infoROM is corrupted, and ``nvidia-smi`` exits with return code 14.
+The return codes for the ``nvidia-smi`` command are listed below.
+
+.. code-block:: console
+
+   RETURN VALUE
+
+   Return code reflects whether the operation succeeded or failed and what
+   was the reason of failure.
+
+   · Return code 0 - Success
+
+   · Return code 2 - A supplied argument or flag is invalid
+   · Return code 3 - The requested operation is not available on target device
+   · Return code 4 - The current user does not have permission to access this device or perform this operation
+   · Return code 6 - A query to find an object was unsuccessful
+   · Return code 8 - A device's external power cables are not properly attached
+   · Return code 9 - NVIDIA driver is not loaded
+   · Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
+   · Return code 12 - NVML Shared Library couldn't be found or loaded
+   · Return code 13 - Local version of NVML doesn't implement this function
+   · Return code 14 - infoROM is corrupted
+   · Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
+   · Return code 255 - Other error or internal driver error occurred
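+
+To confirm the return code on an affected node, you can run ``nvidia-smi`` in the driver container and
+inspect the exit status. This is a minimal sketch: the pod name ``nvidia-driver-daemonset-xxxxx`` is a
+placeholder for the driver pod that is scheduled on the affected node, and the check relies on
+``kubectl exec`` propagating the exit code of the remote command.
+
+.. code-block:: console
+
+   $ kubectl exec -n gpu-operator nvidia-driver-daemonset-xxxxx -- nvidia-smi
+   $ echo $?
+
+An exit status of ``14`` matches the corrupted infoROM; a healthy GPU returns ``0``.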
+
+
+.. rubric:: Root Cause
+   :class: h4
+
+The ``nvidia-smi`` command must return a success code (return code 0) for the driver-validator container
+to pass and for the GPU Operator to deploy the driver pod on the node.
+Because the infoROM is corrupted, ``nvidia-smi`` returns code 14 and the validation fails.
+
+.. rubric:: Action
+   :class: h4
+
+Replace the faulty GPU.
+
+
+*********************
+EFI + Secure Boot
+*********************
+
+
+.. rubric:: Issue
+   :class: h4
+
+The GPU driver pod fails to deploy.
+
+.. rubric:: Root Cause
+   :class: h4
+
+EFI Secure Boot is currently not supported with the GPU Operator.
+
+.. rubric:: Action
+   :class: h4
+
+Disable EFI Secure Boot on the server.
diff --git a/gpu-operator/troubleshootings.rst b/gpu-operator/troubleshootings.rst
deleted file mode 100644
index 7f2225f0e..000000000
--- a/gpu-operator/troubleshootings.rst
+++ /dev/null
@@ -1,107 +0,0 @@
-
-#####################
-Troubleshootings
-#####################
-
-
-
-************************************************
-infoROM is corrupted (nvidia-smi return code 14)
-************************************************
-
-
-Issue:
-
-nvidia-operator-validator fails and nvidia-driver-daemonsets fails as well.
-
-
-Observation:
-
-Output from kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation:
-
-
-.. code-block:: console
-
-
-    | NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
-    |-------------------------------+----------------------+----------------------+
-    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
-    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
-    |                               |                      |               MIG M. |
-    |===============================+======================+======================|
-    |   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
-    | N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
-    |                               |                      |                  N/A |
-    +-------------------------------+----------------------+----------------------+
-
-    +-----------------------------------------------------------------------------+
-    | Processes:                                                                  |
-    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
-    |        ID   ID                                                   Usage      |
-    |=============================================================================|
-    |  No running processes found                                                 |
-    +-----------------------------------------------------------------------------+
-    WARNING: infoROM is corrupted at gpu 0000:0B:00.0
-    14
-
-The GPU emits some warning messages related to infoROM.
-
-
-Note:
-
-possible return value for nvidia-smi is listed below (reference: `nvidia-smi specification `_):
-
-.. code-block:: console
-
-    RETURN VALUE
-
-        Return code reflects whether the operation succeeded or failed and what
-        was the reason of failure.
- - · Return code 0 - Success - - · Return code 2 - A supplied argument or flag is invalid - · Return code 3 - The requested operation is not available on target device - · Return code 4 - The current user does not have permission to access this device or perform this operation - · Return code 6 - A query to find an object was unsuccessful - · Return code 8 - A device's external power cables are not properly attached - · Return code 9 - NVIDIA driver is not loaded - · Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU - · Return code 12 - NVML Shared Library couldn't be found or loaded - · Return code 13 - Local version of NVML doesn't implement this function - · Return code 14 - infoROM is corrupted - · Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible - · Return code 255 - Other error or internal driver error occurred - - - -Root cause: - -nvidi-smi should return a success code (Return code 0) for driver-validator to pass and GPU operator to successfully deploy driver pod on the node. - - -Action: - -replace the faulty GPU - - - - -********************* -EFI + Secure Boot -********************* - - -Issue: -GPU Driver pod fails to deploy - - -Root cause: -EFI Secure Boot is currently not supported with GPU Operator - -Action: -Disable EFI Secure Boot on the server - - - -