Crash in CUDA 11 #698

Closed
csukuangfj opened this issue Mar 26, 2021 · 8 comments · Fixed by #699
Comments

csukuangfj (Collaborator) commented Mar 26, 2021

The following minimal demo crashes with CUDA 11 + Python 3.9 + PyTorch 1.7.1 + the latest k2 (for both Debug and Release builds).

When run with cuda-memcheck, the process just hangs and never seems to terminate.

When running with cuda-memcheck, no errors are printed after the crash.

The reason for the crash is similar to the one mentioned in #696 (comment)

Note that k2.closure uses torch.nonzero internally.

minus_one_index = torch.nonzero(src_aux_labels == -1, as_tuple=False)

The crash in snowfall also involves torch.nonzero: https://github.com/k2-fsa/snowfall/blob/4a909a3a609d5a3444b14fc40d779f217e1263c1/egs/librispeech/asr/simple_v1/mmi_bigram_train.py#L62

finite_indexes = torch.nonzero(mask).squeeze(1)

The same code does not crash when built with CUDA 10.1.

Crash log

Traceback (most recent call last):
  File "/ceph-fj/open-source/k2/build_debug_11/./ab.py", line 23, in <module>
    main()
  File "/ceph-fj/open-source/k2/build_debug_11/./ab.py", line 19, in main
    ans = k2.closure(fsa)
  File "/root/fangjun/open-source/k2/k2/python/k2/fsa_algo.py", line 677, in closure
    new_value = fix_aux_labels(value, fsa.arcs.row_splits(1), arc_map)
  File "/root/fangjun/open-source/k2/k2/python/k2/fsa_algo.py", line 663, in fix_aux_labels
    minus_one_index[minus_one_index > src_start_state_last_arc_index] += 1
RuntimeError: invalid shape dimension -16711680

Demo

#!/usr/bin/env python3

import torch
import k2


def main():
    device = torch.device('cuda', 0)

    s = '''
        0 1 1 0.1
        1 2 2 0.2
        2 3 -1 0.3
        3
    '''
    fsa = k2.Fsa.from_str(s).to(device).requires_grad_(True)

    fsa.aux_labels = torch.tensor([10, 20, -1], dtype=torch.int32).to(device)
    ans = k2.closure(fsa)


if __name__ == '__main__':
    main()
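For reference, here is a sketch that calls torch.nonzero directly on the same data, without going through k2.closure, to check whether PyTorch alone misbehaves (on a healthy build it should print torch.Size([1, 1])):

#!/usr/bin/env python3

import torch


def main():
    device = torch.device('cuda', 0)
    aux_labels = torch.tensor([10, 20, -1], dtype=torch.int32, device=device)
    # The same kind of call that k2.closure makes internally.
    minus_one_index = torch.nonzero(aux_labels == -1, as_tuple=False)
    print(minus_one_index.shape)  # expected: torch.Size([1, 1])


if __name__ == '__main__':
    main()
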
csukuangfj (Collaborator, Author) commented

If I change

minus_one_index = torch.nonzero(src_aux_labels == -1, as_tuple=False)

to

        print('src shape', src_aux_labels.shape)
        minus_one_index = torch.nonzero(src_aux_labels == -1, as_tuple=False)
        print('minus one shape', minus_one_index.shape)

It prints

src shape torch.Size([3])
minus one shape torch.Size([16843009, 1])

You can see that the reported size, 16843009, is absurdly large for a 3-element input.
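Note that both bogus numbers look like byte-level garbage rather than plausible sizes, which suggests reading from uninitialized or corrupted memory:

print(hex(16843009))                # 0x1010101  (bytes 01 01 01 01)
print(hex(-16711680 & 0xFFFFFFFF))  # 0xff010000 (the "invalid shape dimension" from the traceback)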

danpovey (Collaborator) commented

My suspicion is that this is a build-system issue. You tend to get these kinds of mysterious errors when different compilation units have different ideas about the layouts of C++ objects.

We have a number of .cu files that include, directly or indirectly, torch.h, and from this
https://pytorch.org/tutorials/advanced/cpp_extension.html
it's not clear to me whether you are supposed to include torch.h in CUDA code
(search for torch.h in that page).

Also there may be compilation flags that are mismatched, e.g. we use
-D_GLIBCXX_USE_CXX11_ABI=0 and I'm not sure how Torch was compiled.

csukuangfj (Collaborator, Author) commented Mar 26, 2021

Also there may be compilation flags that are mismatched, e.g. we use
-D_GLIBCXX_USE_CXX11_ABI=0 and I'm not sure how Torch was compiled.

This flag is copied from PyTorch. See

k2/cmake/torch.cmake, lines 13 to 15 (at commit 08a2bfe):

# set the global CMAKE_CXX_FLAGS so that
# k2 uses the same abi flag as PyTorch
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

If k2 used a different ABI flag from the one PyTorch was built with, it would run into trouble at link time: the linking stage would fail.
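For reference, the ABI flag that the installed PyTorch was built with can be queried from Python (torch.compiled_with_cxx11_abi() and torch.__config__.show() are part of PyTorch's public API):

import torch

# True  -> PyTorch was built with -D_GLIBCXX_USE_CXX11_ABI=1
# False -> PyTorch was built with -D_GLIBCXX_USE_CXX11_ABI=0
print(torch.compiled_with_cxx11_abi())

# Dump the full build configuration, including compiler flags.
print(torch.__config__.show())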

csukuangfj (Collaborator, Author) commented

We have a number of .cu files that include, directly or indirectly, torch.h, and from this
https://pytorch.org/tutorials/advanced/cpp_extension.html
it's not clear to me whether you are supposed to include torch.h in CUDA code
(search for torch.h in that page).

Here is what torch.h looks like:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/include/torch/torch.h

#pragma once

#include <torch/all.h>

#ifdef TORCH_API_INCLUDE_EXTENSION_H
#include <torch/extension.h>

#endif // defined(TORCH_API_INCLUDE_EXTENSION_H)

And extension.h:
https://github.com/pytorch/pytorch/blob/master/torch/extension.h

#pragma once

// All pure C++ headers for the C++ frontend.
#include <torch/all.h>
// Python bindings for the C++ frontend (includes Python.h).
#include <torch/python.h>

TORCH_API_INCLUDE_EXTENSION_H is defined in k2 here:

add_definitions(-DTORCH_API_INCLUDE_EXTENSION_H)

Currently, that macro is defined only in k2/python/csrc; it should have been defined in the top-level CMakeLists.txt instead. I am moving it now.

danpovey (Collaborator) commented

Incidentally, when I use PyTorch's setup.py machinery to compile an example torch extension in C++, the following is what the compilation command looks like. This is just FYI; I don't see anything in there that looks important.

/usr/local/cuda/bin/nvcc \
  -I/ceph-fj/fangjun/py39/lib/python3.9/site-packages/torch/include \
  -I/ceph-fj/fangjun/py39/lib/python3.9/site-packages/torch/include/torch/csrc/api/include \
  -I/ceph-fj/fangjun/py39/lib/python3.9/site-packages/torch/include/TH \
  -I/ceph-fj/fangjun/py39/lib/python3.9/site-packages/torch/include/THC \
  -I/usr/local/cuda/include \
  -I/root/fangjun/open-source/pyenv/versions/3.9.0/include/python3.9 \
  -c lltm_cuda_kernel.cu \
  -o build/temp.linux-x86_64-3.9/lltm_cuda_kernel.o \
  -D__CUDA_NO_HALF_OPERATORS__ \
  -D__CUDA_NO_HALF_CONVERSIONS__ \
  -D__CUDA_NO_HALF2_OPERATORS__ \
  --expt-relaxed-constexpr \
  --compiler-options '-fPIC' \
  -DTORCH_API_INCLUDE_EXTENSION_H \
  -DPYBIND11_COMPILER_TYPE="_gcc" \
  -DPYBIND11_STDLIB="_libstdcpp" \
  -DPYBIND11_BUILD_ABI="_cxxabi1011" \
  -DTORCH_EXTENSION_NAME=lltm_cuda \
  -D_GLIBCXX_USE_CXX11_ABI=0 \
  -gencode=arch=compute_70,code=sm_70 \
  -std=c++14

csukuangfj (Collaborator, Author) commented

This is what I got about half a year ago. It seems that only the pybind11-related build options have changed since then.

hello world (cpu)
-----------------

.. code-block:: cpp
  :caption: hello.cc

  #include <torch/extension.h>

  torch::Tensor sigmoid(torch::Tensor z) {
    auto s = torch::sigmoid(z);
    return (1 - s) * s;
  }

  PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("sigmoid", &sigmoid, "sigmoid test");
  }

.. code-block:: python
  :caption: setup.py

  from setuptools import setup, Extension
  from torch.utils import cpp_extension

  setup(name='hello',
        ext_modules=[cpp_extension.CppExtension('hello', ['hello.cc'])],
        cmdclass={'build_ext': cpp_extension.BuildExtension.with_options(use_ninja=False)})

The output of ``python setup.py build``::

    running build
    running build_ext
    building 'hello' extension
    creating build
    creating build/temp.linux-x86_64-3.7
    gcc -pthread -Wno-unused-result -Wsign-compare \
      -DNDEBUG -g -fwrapv -O3 -Wall -fPIC \
      -I/xxx/py37/lib/python3.7/site-packages/torch/include \
      -I/xxx/py37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include \
      -I/xxx/py37/lib/python3.7/site-packages/torch/include/TH \
      -I/xxx/py37/lib/python3.7/site-packages/torch/include/THC \
      -I/xxx/include/python3.7m \
      -c hello.cc \
      -o build/temp.linux-x86_64-3.7/hello.o \
      -DTORCH_API_INCLUDE_EXTENSION_H \
      -DTORCH_EXTENSION_NAME=hello \
      -D_GLIBCXX_USE_CXX11_ABI=0 \
      -std=c++14

    g++ \
      -pthread \
      -shared \
      -L/xxx/lib \
      build/temp.linux-x86_64-3.7/hello.o \
      -L/xxx/py37/lib/python3.7/site-packages/torch/lib \
      -lc10 \
      -ltorch \
      -ltorch_cpu \
      -ltorch_python \
      -o build/lib.linux-x86_64-3.7/hello.cpython-37m-x86_64-linux-gnu.so

hello world (cuda)
------------------

Change ``setup.py``. Replace ``cpp_extension.CppExtension`` with ``cpp_extension.CUDAExtension``.
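A minimal sketch of the modified ``setup.py``, assuming the same ``hello.cc`` source as above:

.. code-block:: python
  :caption: setup.py (CUDA)

  from setuptools import setup
  from torch.utils import cpp_extension

  setup(name='hello',
        ext_modules=[cpp_extension.CUDAExtension('hello', ['hello.cc'])],
        cmdclass={'build_ext': cpp_extension.BuildExtension.with_options(use_ninja=False)})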

The output of ``python setup.py build`` differs from the CPU build mainly in the following flags::

    -I/usr/local/cuda/include


    -L/usr/local/cuda/lib64 \
    -lcudart \
    -lc10_cuda \
    -ltorch_cuda

danpovey (Collaborator) commented

This PyTorch issue has been active recently: pytorch/pytorch#54245. It notes certain problems with thrust and cub in CUDA 11, and the implementation of torch.nonzero
https://github.com/pytorch/pytorch/blob/f6634be4c2b72e0d8da46d5992facb59b55a90bc/aten/src/ATen/native/cuda/Indexing.cu#L861
does use cub (and also thrust, but only when the result has ndim > 1, which is not the case in our minimal example).

In NVIDIA/thrust#1401 it is mentioned that, in certain circumstances, people add a namespace prefix to cub's symbols when two libraries that are loaded at the same time both use cub. I checked the symbols exported by k2 and by torch, and they do declare some identical symbols, e.g. these:

000000000370e618 u _ZGVZN3cub22DeviceCountCachedValueEvE5cache
000000000370ec20 u _ZGVZN3cub26GetPerDeviceAttributeCacheINS_18PtxVersionCacheTagEEERNS_23PerDeviceAttributeCacheEvE5cache

but it's not absolutely clear to me that this is a problem.
It's possible that the issue relates to both torch and us using cub...

csukuangfj added a commit to csukuangfj/k2 that referenced this issue Mar 27, 2021
csukuangfj (Collaborator, Author) commented

but it's not absolutely clear to me that this is a problem.

Yes, that turned out to be exactly the problem.

Fixed in #699

danpovey pushed a commit that referenced this issue Mar 31, 2021

Free CUDA memory in a correct way when PYTORCH_NO_CUDA_MEMORY_CACHING is set (#699)

* Free CUDA memory in a correct way when PYTORCH_NO_CUDA_MEMORY_CACHING is set

* fix a typo.

* add more comments.

* Fix after review.

* Fix typos.

* fix typos.

* Fix crashes in CUDA11 due to CUB.

See #698

* Fix typos.