H100 Compatibility - PyTorch Issues #281
Also, my env information:

```
Collecting environment information...
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
OS: Ubuntu 22.04.3 LTS (x86_64)
Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
CPU:
Versions of relevant libraries:
```
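If you want to see exactly what that `incompatible_device_warn` is complaining about, here is a quick sanity check (a minimal sketch, assuming a stock PyTorch install) that compares the GPU's compute capability against the architectures the installed wheel was compiled for:

```bash
# Compare the GPU's compute capability with the arch list torch was built for
python - <<'EOF'
import torch

major, minor = torch.cuda.get_device_capability(0)  # e.g. (9, 0) on an H100
arch_list = torch.cuda.get_arch_list()              # e.g. ['sm_80', 'sm_86']

print(f"GPU is sm_{major}{minor}; torch was built for {arch_list}")
if f"sm_{major}{minor}" not in arch_list:
    print("This PyTorch build does not target this GPU architecture.")
EOF
```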
Does the Dockerfile work, or do you still get the same error?
@abhishekkrthakur I'll try ASAP, and I'll keep you posted.
Okay, here's what worked for me to use the H100 GPU's capabilities:
```bash
# Clone and configure MAGMA v2.7.1
git clone --single-branch --branch v2.7.1 --depth 1 https://bitbucket.org/icl/magma.git
cd magma
echo -e "GPU_TARGET = sm_86\nBACKEND = cuda\nFORT = false" > make.inc
make generate
```
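If you're unsure which `sm_` value your card needs, a recent driver can report the compute capability directly (assuming your `nvidia-smi` supports the `compute_cap` query field, which newer drivers do):

```bash
# Prints e.g. 9.0 for an H100, i.e. sm_90
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```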
```bash
# Point the build at the conda and CUDA libraries
export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib:/usr/local/cuda/targets/x86_64-linux/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
export CUDA_DIR="/usr/local/cuda/12.2"
export CONDA_LIB=${CONDA_PREFIX}/lib

# Be careful here: sm_89 wasn't accepted, so I had to round down to the major version, sm_80
make clean && rm -rf build/
TARGETARCH=amd64 cmake -H. -Bbuild -DUSE_FORTRAN=OFF -DGPU_TARGET="Ampere" -DBUILD_SHARED_LIBS=OFF -DBUILD_STATIC_LIBS=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_C_FLAGS="-fPIC" -DMKLROOT=${CONDA_PREFIX} -DCUDA_NVCC_FLAGS="-Xfatbin;-compress-all;-DHAVE_CUBLAS;-std=c++11;--threads=0;" -GNinja
```
```bash
# Build and install MAGMA system-wide, then mirror it into the conda env
sudo mkdir -p /usr/local/magma/
sudo cmake --build build -j $(nproc) --target install
sudo cp build/include/* /usr/local/magma/include/
sudo cp build/lib/*.so /usr/local/magma/lib/
sudo cp build/lib/*.a /usr/local/magma/lib/
sudo cp build/lib/pkgconfig/*.pc /usr/local/magma/lib/pkgconfig/
sudo cp /usr/local/magma/include/* ${CONDA_PREFIX}/include/
sudo cp /usr/local/magma/lib/*.a ${CONDA_PREFIX}/lib/
sudo cp /usr/local/magma/lib/*.so ${CONDA_PREFIX}/lib/
sudo cp /usr/local/magma/lib/pkgconfig/*.pc ${CONDA_PREFIX}/lib/pkgconfig/
```
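A quick sanity check (the paths just follow from the copies above) that the MAGMA artifacts landed where the next build steps will look for them:

```bash
# Should list libmagma.a / libmagma.so in both locations
ls /usr/local/magma/lib/libmagma* ${CONDA_PREFIX}/lib/libmagma*
```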
```bash
# (Optional) Makes the build much faster
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
# (this can take dozens of minutes)
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
```
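For example, to target the H100 while building on a different box, `TORCH_CUDA_ARCH_LIST` can be exported before the install (9.0 corresponds to sm_90; the value here is my assumption for an H100 target):

```bash
# Build xformers kernels for Hopper (sm_90) regardless of the local GPU
export TORCH_CUDA_ARCH_LIST="9.0"
```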
After that, verify your installation and check that the reported version is suffixed with the git commit. Then run `python -m torch.utils.collect_env` and check that the CUDA version used to build torch is the same as your current CUDA version. Finally, launch `autotrain setup` (you might have to reinstall torch & xformers after the setup via pip). If you have any questions, tell me.
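The exact verification command was lost above; a plausible pair of checks, assuming xformers' standard info module:

```bash
# Version should end with the git commit if the source build was picked up
python -m xformers.info
# The "CUDA used to build PyTorch" line should match your system CUDA (12.2 here)
python -m torch.utils.collect_env
```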
Hi,
I might be in the wrong place, but this kind of issue has already been raised on the PyTorch repo, so I'm trying here.
Context
I want to fine-tune Llama2-70B-chat-hf with an arbitrary dataset on an Nvidia H100 instance running CUDA 12.2 v2. To fine-tune it, I chose autotrain-advanced with Python 3.10.
First try
For the first try, I simply made a venv and installed autotrain-advanced,
then ran:
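The commands themselves didn't survive extraction; a typical sequence for this step (my reconstruction, assuming the standard autotrain-advanced install flow) would be:

```bash
# Hypothetical reconstruction of the elided install commands
python -m venv venv && source venv/bin/activate
pip install autotrain-advanced
autotrain setup
```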
So far, everything has gone successfully...
After that, I ran my train command:
And then, I got this error:
What I tried to handle it
I tried many things to work around this PyTorch issue:
Conclusion
Always the same warning telling me that the PyTorch build isn't compatible with sm_90 capabilities (H100).
And... as reported by an ML engineer at Nvidia:
pytorch/pytorch#90761 (comment)
I'm going to post this on the PyTorch repo as well, but if someone has hit the same issue and fixed it, I won't say no to a little help.
If you need deeper context, let me know and I'll provide it.
Many thanks.