build: Notify distributors that the '-z now' linker flag is unsupported

The '-z now' flag, which is the opposite of '-z lazy', is unsupported as
an external linker flag [1], because of how the NVIDIA Container Toolkit
stack uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at
runtime [2,3].

The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the
address of a symbol at runtime before using it.  It links against
undefined symbols at build-time available through a CUDA API definition
embedded directly in the CGO code or a copy of nvml.h.  It relies upon
lazily deferring function call resolution to the point when dlopen(3)
is able to load the shared libraries at runtime, instead of doing it
when toolbox(1) is started.

This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
libsubid.so at runtime.

Compare the output of:
  $ nm /path/to/toolbox | grep ' subid_init'

... with that from:
  $ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
          U nvmlGpuInstanceGetComputeInstanceProfileInfoV
  $ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
          U nvmlDeviceGetAccountingPids

Using '-z now' as an external linker flag forces the dynamic linker to
resolve all symbols when toolbox(1) is started, and leads to:
  $ toolbox
  toolbox: symbol lookup error: toolbox: undefined symbol:
      nvmlGpuInstanceGetComputeInstanceProfileInfoV

Fallout from 6e848b2

[1] NVIDIA Container Toolkit commit 1407ace94ab7c150
    NVIDIA/nvidia-container-toolkit@1407ace94ab7c150
    NVIDIA/go-nvml#18
    NVIDIA/nvidia-container-toolkit#49

[2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda

[3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml

#1548
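
For distributors whose default hardening flags include '-z now', a build
script could check the external linker flags before invoking go build.
The following is a minimal sketch and is not part of this commit:
has_z_now is a hypothetical helper, and inspecting LDFLAGS and
CGO_LDFLAGS is an assumption about how a given distribution passes its
flags.  It recognizes only the spellings '-z now', '-Wl,-z,now' and
'-Wl,-z -Wl,now'.

```shell
#!/bin/sh

# Hypothetical helper, not part of Toolbx: succeed if the flags in $1
# request eager binding through '-z now'.
has_z_now() {
    prev=""
    # Treat commas like spaces, so that '-Wl,-z,now' splits into the
    # words '-Wl', '-z' and 'now'.
    for word in $(printf '%s' "$1" | tr ',' ' '); do
        # Skip the '-Wl' prefix that passes options through the compiler
        # driver to the linker.
        if [ "$word" = "-Wl" ]; then
            continue
        fi
        if [ "$prev" = "-z" ] && [ "$word" = "now" ]; then
            return 0
        fi
        prev="$word"
    done
    return 1
}

# Assumption: the distribution supplies its flags through these variables.
for flags in "${LDFLAGS:-}" "${CGO_LDFLAGS:-}"; do
    if has_z_now "$flags"; then
        echo "warning: '-z now' is unsupported as an external linker flag" >&2
    fi
done
```

With LDFLAGS set to '-Wl,-z,relro -Wl,-z,now', the loop prints the
warning to standard error.  Whether a build should then stop or drop the
flag is left to distribution policy.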
containers · Sep 25, 2024 · c46765f
1 parent dd23baa
Showing 1 changed file with 42 additions and 0 deletions.
diff --git a/src/go-build-wrapper b/src/go-build-wrapper
@@ -70,6 +70,48 @@ fi
 
 dynamic_linker="/run/host$dynamic_linker_canonical_dirname/$dynamic_linker_basename"
 
+# Note for distributors:
+#
+# The '-z now' flag, which is the opposite of '-z lazy', is unsupported as an
+# external linker flag [1], because of how the NVIDIA Container Toolkit stack
+# uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at runtime [2,3].
+#
+# The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the address
+# of a symbol at runtime before using it.  It links against undefined symbols
+# at build-time available through a CUDA API definition embedded directly in
+# the CGO code or a copy of nvml.h.  It relies upon lazily deferring function
+# call resolution to the point when dlopen(3) is able to load the shared
+# libraries at runtime, instead of doing it when toolbox(1) is started.
+#
+# This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
+# libsubid.so at runtime.
+#
+# Compare the output of:
+#   $ nm /path/to/toolbox | grep ' subid_init'
+#
+# ... with that from:
+#   $ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
+#           U nvmlGpuInstanceGetComputeInstanceProfileInfoV
+#   $ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
+#           U nvmlDeviceGetAccountingPids
+#
+# Using '-z now' as an external linker flag forces the dynamic linker to
+# resolve all symbols when toolbox(1) is started, and leads to:
+#   $ toolbox
+#   toolbox: symbol lookup error: toolbox: undefined symbol:
+#       nvmlGpuInstanceGetComputeInstanceProfileInfoV
+#
+# [1] NVIDIA Container Toolkit commit 1407ace94ab7c150
+#     https://github.com/NVIDIA/nvidia-container-toolkit/commit/1407ace94ab7c150
+#     https://github.com/NVIDIA/go-nvml/issues/18
+#     https://github.com/NVIDIA/nvidia-container-toolkit/issues/49
+#
+# [2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda
+#
+# [3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
+#     https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
+#     https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml
+
 # shellcheck disable=SC2086
 go build \
         $tags \