
Commit

Merge branch 'troubleshooting' into 'master'
Add troubleshooting section

See merge request nvidia/cloud-native/cnt-docs!366
mikemckiernan committed Dec 28, 2023
2 parents d5ff16d + b40c5db commit 29a1791
Showing 3 changed files with 251 additions and 107 deletions.
1 change: 1 addition & 0 deletions gpu-operator/index.rst
@@ -25,6 +25,7 @@
getting-started.rst
Platform Support <platform-support.rst>
Release Notes <release-notes.rst>
Troubleshooting <troubleshooting.rst>
GPU Driver CRD <gpu-driver-configuration.rst>
gpu-driver-upgrades.rst
install-gpu-operator-vgpu.rst
250 changes: 250 additions & 0 deletions gpu-operator/troubleshooting.rst
@@ -0,0 +1,250 @@
.. license-header
   SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
   SPDX-License-Identifier: Apache-2.0
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
   http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
.. headings (h1/h2/h3/h4/h5) are # * = -
#######################################
Troubleshooting the NVIDIA GPU Operator
#######################################

*************************************
GPU Operator Pods Stuck in Crash Loop
*************************************

.. rubric:: Issue
   :class: h4

On large clusters, such as clusters with 300 or more nodes, the GPU Operator pods
can get stuck in a crash loop.

.. rubric:: Observation
   :class: h4

- The GPU Operator pod is not running:

  .. code-block:: console

     $ kubectl get pod -n gpu-operator -l app=gpu-operator

  *Example Output*

  .. code-block:: output

     NAME                            READY   STATUS             RESTARTS      AGE
     gpu-operator-568c7ff7f6-chg5b   0/1     CrashLoopBackOff   4 (85s ago)   4m42s

- The node that is running the GPU Operator pod has sufficient resources and the node is ``Ready``:

  .. code-block:: console

     $ kubectl describe node <node-name>

  *Example Output*

  .. code-block:: output

     Conditions:
       Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
       ----             ------  -----------------                 ------------------                ------                       -------
       MemoryPressure   False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
       DiskPressure     False   Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:03 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
       PIDPressure      False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
       Ready            True    Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:13 +0000   KubeletReady                 kubelet is posting ready status

- The logs from the pod include a fatal error:

  .. code-block:: console

     $ kubectl logs -n gpu-operator -l app=gpu-operator

  *Partial Output*

  .. code-block:: output
     :emphasize-lines: 1

     fatal error: concurrent map read and map write
     goroutine 676 [running]:
     k8s.io/apimachinery/pkg/runtime.(*Scheme).ObjectKinds(0xc0001fc000, {0x1ea20f0?, 0xc0008b4770})
     /workspace/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:264 +0xce
     sigs.k8s.io/controller-runtime/pkg/client/apiutil.GVKForObject({0x1ea20f0?, 0xc0008b4770}, 0xc00133d4e0?)
     /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/apiutil/apimachinery.go:98 +0x245
     sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).objectTypeForListObject(0xc0000123c0, {0x1ebe020?, 0xc0008b4770})
     /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:94 +0x87
     sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).List(0xc0000123c0, {0x1eb5ca8, 0xc000618cd0}, {0x1ebe020, 0xc0008b4770}, {0x2c7cf70, 0x0, 0x0})
     /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:73 +0x71
     sigs.k8s.io/controller-runtime/pkg/client.(*delegatingReader).List(0xc000c4b480, {0x1eb5ca8, 0xc000618cd0}, {0x1ebe020?, 0xc0008b4770?}, {0x2c7cf70, 0x0, 0x0})
     /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/split.go:140 +0x114
     github.com/NVIDIA/gpu-operator/controllers.addWatchNewGPUNode.func1({0x199a6a0?, 0xc002873b30?})
     /workspace/controllers/clusterpolicy_controller.go:228 +0x9a
     sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue(0x44?, {0x1ebf938, 0xc000158660}, {0x1ecc6c0?, 0xc001c04fc0?}, 0xa8?)
     /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:80 +0x46
     sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create(0xc000095900?, {{0x1ecc6c0?, 0xc001c04fc0?}}, {0x1ebf938, 0xc000158660})
     /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:57 +0xd2
     sigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnAdd({{0x1eb66b8, 0xc000012498}, {0x1ebf938, 0xc000158660}, {0xc000b891f0, 0x1, 0x1}}, {0x1bea560?, 0xc001c04fc0})
     /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/internal/eventsource.go:63 +0x295
     k8s.io/client-go/tools/cache.(*processorListener).run.func1()
     /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:818 +0x134
     k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
     /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x3e
     k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006fc738?, {0x1e9eae0, 0xc0014c48a0}, 0x1, 0xc000b1a540)
     /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xb6
     k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x1ebcb18?, 0x3b9aca00, 0x0, 0x51?, 0xc0006fc7b0?)
     /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89
     k8s.io/apimachinery/pkg/util/wait.Until(...)
     /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92
     k8s.io/client-go/tools/cache.(*processorListener).run(0xc000766f80)
     /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:812 +0x6b
     k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
     /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:75 +0x5a
     created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
     /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x85


.. rubric:: Root Cause
   :class: h4

The memory resource limit for the GPU Operator is too low for the cluster size.
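
To confirm the root cause, you can check the memory request and limit that are currently set on the Operator deployment.
The following command is a general sketch; the deployment name and namespace reflect a default installation and can differ in your cluster:

.. code-block:: console

   $ kubectl get deployment gpu-operator -n gpu-operator \
       -o jsonpath='{.spec.template.spec.containers[0].resources}'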

.. rubric:: Action
   :class: h4

Increase the memory request and limit for the GPU Operator pod:

- Set the memory request to a value that matches the average memory consumption over a large time window.
- Set the memory limit to accommodate the occasional spikes in memory consumption.

#. Increase the memory resource limit for the GPU Operator pod:

   .. code-block:: console

      $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
          -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'

#. Optional: Increase the memory resource request for the pod:

   .. code-block:: console

      $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
          -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'

Monitor the GPU Operator pod.
Increase the memory request and limit again if the pod remains stuck in a crash loop.
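
One way to monitor the memory consumption of the Operator pod is with ``kubectl top``.
This is a sketch that assumes a metrics server, such as the Kubernetes Metrics Server, is installed in the cluster:

.. code-block:: console

   $ kubectl top pod -n gpu-operator -l app=gpu-operator

Compare the reported memory use with the configured request and limit to decide whether a further increase is needed.
Note that changes made with ``kubectl patch`` might be overwritten the next time the Operator is upgraded, so consider setting the same values through your installation method as well.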


************************************************
infoROM is corrupted (nvidia-smi return code 14)
************************************************


.. rubric:: Issue
   :class: h4

The ``nvidia-operator-validator`` pod fails, and the ``nvidia-driver-daemonset`` pods fail as well.
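
As a first check, you can list the pods in the Operator namespace to see which pods are failing.
This is a general sketch; the namespace reflects a default installation:

.. code-block:: console

   $ kubectl get pods -n gpu-operator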


.. rubric:: Observation
   :class: h4


The output from the driver validation container indicates that the infoROM is corrupt:

.. code-block:: console

   $ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation

*Example Output*

.. code-block:: output

   | NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
   | N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   WARNING: infoROM is corrupted at gpu 0000:0B:00.0
   14

The warning message indicates that the infoROM on the GPU is corrupted, and the ``nvidia-smi`` command exits with return code 14.
The return codes for the ``nvidia-smi`` command are listed below.

.. code-block:: console

   RETURN VALUE
   Return code reflects whether the operation succeeded or failed and what
   was the reason of failure.
   · Return code 0 - Success
   · Return code 2 - A supplied argument or flag is invalid
   · Return code 3 - The requested operation is not available on target device
   · Return code 4 - The current user does not have permission to access this device or perform this operation
   · Return code 6 - A query to find an object was unsuccessful
   · Return code 8 - A device's external power cables are not properly attached
   · Return code 9 - NVIDIA driver is not loaded
   · Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
   · Return code 12 - NVML Shared Library couldn't be found or loaded
   · Return code 13 - Local version of NVML doesn't implement this function
   · Return code 14 - infoROM is corrupted
   · Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
   · Return code 255 - Other error or internal driver error occurred

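
To see the return code on a node, you can run ``nvidia-smi`` and echo the shell exit status.
The following is a sketch that runs the command in the driver daemon set pod; the pod name is a placeholder:

.. code-block:: console

   $ kubectl exec -n gpu-operator nvidia-driver-daemonset-xxxxx -- nvidia-smi; echo $?

A value of ``14`` confirms that the infoROM is corrupted.
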
.. rubric:: Root Cause
   :class: h4

The ``nvidia-smi`` command must return a success code (return code 0) for the driver-validation container to pass and for the GPU Operator to deploy the driver pod on the node.
In this case, the command returns code 14 because the infoROM is corrupted, so the validation fails.

.. rubric:: Action
   :class: h4

Replace the faulty GPU.


*********************
EFI + Secure Boot
*********************


.. rubric:: Issue
   :class: h4

The GPU driver pod fails to deploy on nodes that have EFI Secure Boot enabled.

.. rubric:: Root Cause
   :class: h4

EFI Secure Boot is currently not supported with the GPU Operator.

.. rubric:: Action
   :class: h4

Disable EFI Secure Boot on the server.
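
To confirm that Secure Boot is enabled on a node before you change firmware settings, you can use ``mokutil``.
This is a sketch that assumes the ``mokutil`` utility is installed on the host:

.. code-block:: console

   $ mokutil --sb-state

*Example Output*

.. code-block:: output

   SecureBoot enabled
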
107 changes: 0 additions & 107 deletions gpu-operator/troubleshootings.rst

This file was deleted.
