H100 Compatibility - PyTorch Issues #281

mathis-lambert · 2023-09-27T12:53:11Z

Hi,

I might be in the wrong place, but this kind of issue has already been raised on the PyTorch repo, so I'm trying here.

Context

I want to fine-tune Llama2-70B-chat-hf with any dataset on an Nvidia H100 instance running CUDA 12.2 v2. To fine-tune it, I chose autotrain-advanced with Python 3.10.

First try

For the first try, I simply created a venv, installed autotrain-advanced, and then ran:

$ autotrain setup

So far, everything went fine.
After that, I ran my train command:

$ autotrain llm --train --project_name test_llm --model meta-llama/Llama-2-70b-chat-hf --data_path knowrohit07/know_sql --use_peft --trainer sft --learning_rate 2e-4 --use_int4 --train_batch_size 24 --token [my_token]

And then I got this error:

/scratch/torchbuild/lib/python3.10/site-packages/torch/cuda/__init__.py:173: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
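
For anyone who wants to reproduce this check without launching a training run, here is a minimal sketch. It compares the device's compute capability with the architectures the installed PyTorch wheel was compiled for. `capability_supported` is a hypothetical helper name, and matching on the major version only is an assumption about how the warning decides compatibility, not a copy of PyTorch's own code. `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are real PyTorch calls and are only used if torch and a GPU are present.

```python
import re
from typing import Iterable, Tuple


def capability_supported(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if any compiled sm_XY arch shares the device's major version.

    Assumption: compatibility is decided on the major version only.
    """
    major, _minor = capability
    compiled = []
    for arch in arch_list:
        match = re.match(r"sm_(\d+)", arch)  # skip entries such as "compute_90"
        if match:
            compiled.append(int(match.group(1)))
    return any(sm // 10 == major for sm in compiled)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)  # e.g. (9, 0) on an H100
        archs = torch.cuda.get_arch_list()         # e.g. ['sm_80', 'sm_86', ...]
        print(cap, archs, capability_supported(cap, archs))
    else:
        print("torch with CUDA is not available; nothing to check")
```

With the arch list from the warning above (sm_37 through sm_86) and an H100 at (9, 0), the helper returns False, which matches what the warning reports.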

What I tried to handle it

I tried many things to handle this PyTorch issue:

Install an older CUDA version: 12.1 (which is supported by the latest PyTorch nightly version)
Build PyTorch from source in a venv, following the process suggested in the PyTorch repo
Build with and without conda/mkl
Build against different CUDA versions

Conclusion

I always get the same warning telling me that the PyTorch version isn't compatible with sm_90 capabilities (H100).
And, as reported by an ML engineer at Nvidia:
pytorch/pytorch#90761 (comment)

I'm going to post this on the PyTorch repo as well, but if someone has hit the same issue and fixed it, I won't say no to a little help.

If you need more context, let me know and I'll provide it.

Many thanks.

mathis-lambert · 2023-09-27T12:53:43Z

Also, my env information:

Collecting environment information...
/scratch/torchbuild/lib/python3.10/site-packages/torch/cuda/__init__.py:173: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.5
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-83-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 535.104.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9334 32-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 1
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
BogoMIPS: 5399.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm arch_capabilities
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1.5 MiB (24 instances)
L1i cache: 1.5 MiB (24 instances)
L2 cache: 12 MiB (24 instances)
L3 cache: 384 MiB (24 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.0
[pip3] optree==0.9.2
[pip3] pytorch-triton==2.1.0+6e4932cda8
[pip3] torch==2.0.1
[pip3] triton==2.0.0
[pip3] triton-nightly==2.1.0.dev20230822000928
[conda] blas 1.0 mkl
[conda] magma-cuda121 2.6.1 1 pytorch
[conda] mkl 2023.1.0 h213fc3f_46343
[conda] mkl-include 2023.1.0 h06a4308_46343
[conda] mkl-service 2.4.0 py311h5eee18b_1
[conda] mkl_fft 1.3.6 py311ha02d727_1
[conda] mkl_random 1.2.2 py311ha02d727_1
[conda] numpy 1.24.3 py311h08b1b3b_1
[conda] numpy-base 1.24.3 py311hf175353_1
[conda] numpydoc 1.5.0 py311h06a4308_0
[conda] pytorch 2.0.1 cpu_py311h6d93b4c_0

abhishekkrthakur · 2023-09-28T13:22:38Z

Does the Dockerfile work, or do you still get the same error?

mathis-lambert · 2023-09-28T14:03:53Z

@abhishekkrthakur I'll try ASAP and keep you posted.
Thanks

mathis-lambert · 2023-09-29T11:49:26Z

Okay, here's what worked for me to use the H100 GPU capabilities:

export TORCH_CUDA_ARCH_LIST="9.0"
Build Magma from source:

git clone --single-branch --branch v2.7.1 --depth 1 https://bitbucket.org/icl/magma.git
cd magma

echo -e "GPU_TARGET = sm_86\nBACKEND = cuda\nFORT = false" > make.inc
make generate

export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib:/usr/local/cuda/targets/x86_64-linux/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
export CUDA_DIR="/usr/local/cuda/12.2"
export CONDA_LIB=${CONDA_PREFIX}/lib

# be careful here; they didn't accept sm_89 so I had to round it down to major version, sm_80
make clean && rm -rf build/

TARGETARCH=amd64 cmake -H. -Bbuild -DUSE_FORTRAN=OFF -DGPU_TARGET="Ampere" -DBUILD_SHARED_LIBS=OFF -DBUILD_STATIC_LIBS=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_C_FLAGS="-fPIC" -DMKLROOT=${CONDA_PREFIX} -DCUDA_NVCC_FLAGS="-Xfatbin;-compress-all;-DHAVE_CUBLAS;-std=c++11;--threads=0;" -GNinja 

sudo mkdir -p /usr/local/magma/include /usr/local/magma/lib/pkgconfig

sudo cmake --build build -j $(nproc) --target install

sudo cp build/include/* /usr/local/magma/include/
sudo cp build/lib/*.so /usr/local/magma/lib/
sudo cp build/lib/*.a /usr/local/magma/lib/
sudo cp build/lib/pkgconfig/*.pc /usr/local/magma/lib/pkgconfig/
sudo cp /usr/local/magma/include/* ${CONDA_PREFIX}/include/
sudo cp /usr/local/magma/lib/*.a ${CONDA_PREFIX}/lib/
sudo cp /usr/local/magma/lib/*.so ${CONDA_PREFIX}/lib/
sudo cp /usr/local/magma/lib/pkgconfig/*.pc ${CONDA_PREFIX}/lib/pkgconfig/

Build PyTorch from source
Build Xformers from source

# (Optional) Makes the build much faster
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
# (this can take dozens of minutes)
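
The TORCH_CUDA_ARCH_LIST value exported at the start of these steps ("9.0") is the H100's compute capability (sm_90) written as major.minor. As a rough sketch, the value can be derived from capability tuples as shown below. `arch_list_value` is a hypothetical helper name, and the semicolon separator is one of the forms commonly used for this variable.

```python
from typing import Iterable, Tuple


def arch_list_value(capabilities: Iterable[Tuple[int, int]]) -> str:
    """Format (major, minor) capability tuples as a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))  # drop duplicates when several GPUs share a capability
    return ";".join(f"{major}.{minor}" for major, minor in unique)


if __name__ == "__main__":
    # An H100 reports compute capability (9, 0), i.e. sm_90.
    print(f'export TORCH_CUDA_ARCH_LIST="{arch_list_value([(9, 0)])}"')
```

On a machine with a working torch install, the tuples could come from `torch.cuda.get_device_capability(i)` for each visible device. Set the variable before building both PyTorch and xformers so that both include sm_90 kernels.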

After that, verify your installation with:

pip show torch
pip show xformers

and check that the version is followed by a git revision suffix, which indicates a build from source.

Then you can run:

python -m torch.utils.collect_env

and check that the CUDA version used to build torch is the same as your current CUDA version.
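
This comparison can be automated. The sketch below reads the two relevant lines of a saved `collect_env` report and compares their major.minor parts. The helper names `parse_cuda_versions` and `same_major_minor` are hypothetical, and the line labels are taken from the env output posted earlier in this thread.

```python
import re
from typing import Optional, Tuple

_VERSION = r"(\d+(?:\.\d+)*)"  # numeric versions only; "N/A" or "Could not collect" yield None


def parse_cuda_versions(report: str) -> Tuple[Optional[str], Optional[str]]:
    """Extract (build CUDA version, runtime CUDA version) from collect_env output."""
    build = re.search(r"^CUDA used to build PyTorch:\s*" + _VERSION, report, re.MULTILINE)
    runtime = re.search(r"^CUDA runtime version:\s*" + _VERSION, report, re.MULTILINE)
    return (
        build.group(1) if build else None,
        runtime.group(1) if runtime else None,
    )


def same_major_minor(build: Optional[str], runtime: Optional[str]) -> bool:
    """Compare only major.minor, so 12.1 matches 12.1.105."""
    if build is None or runtime is None:
        return False
    return build.split(".")[:2] == runtime.split(".")[:2]


if __name__ == "__main__":
    # Lines copied from the env information posted earlier in this thread.
    sample = (
        "PyTorch version: 2.0.1+cu117\n"
        "CUDA used to build PyTorch: 11.7\n"
        "CUDA runtime version: 12.1.105\n"
    )
    build, runtime = parse_cuda_versions(sample)
    print(build, runtime, same_major_minor(build, runtime))
```

For the env posted earlier (built with 11.7, runtime 12.1.105) the check reports a mismatch. To use it on a real machine, redirect the output of `python -m torch.utils.collect_env` to a file and pass the file contents to `parse_cuda_versions`.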

Finally, after you have launched:

autotrain setup

(you may need to reinstall torch and xformers via pip once the setup is done)

If you have any questions, tell me.

mathis-lambert closed this as completed Oct 18, 2023

aldopareja mentioned this issue Oct 20, 2023

JAX and TORCH jax-ml/jax#18032

Closed
