An interesting bug caused "CUDA error: unspecified launch failure" #6375

StortInter · 2023-09-22T18:42:54Z

🐛 Bug

Using dgl.graph() and dgl.dataloading.GraphDataLoader() with num_workers causes "RuntimeError: CUDA error: unspecified launch failure".

To Reproduce

Steps to reproduce the behavior:

Install the latest pytorch and dgl with cuda.

The installation commands I used:

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install  dgl -f https://data.dgl.ai/wheels/cu118/repo.html
pip install  dglgo -f https://data.dgl.ai/wheels-test/repo.html

Here is my conda environment (only the key components are listed):

# packages in environment at /home/lapluis/miniconda3/envs/dgl:
#
# Name                    Version                   Build                   Channel
cuda-cudart               11.8.89                       0                   nvidia
cuda-cupti                11.8.87                       0                   nvidia
cuda-libraries            11.8.0                        0                   nvidia
cuda-nvrtc                11.8.89                       0                   nvidia
cuda-nvtx                 11.8.86                       0                   nvidia
cuda-runtime              11.8.0                        0                   nvidia
cython                    3.0.2                    pypi_0                   pypi
dgl                       1.1.2+cu118              pypi_0                   pypi
dglgo                     0.0.2                    pypi_0                   pypi
python                    3.11.5          hab00c5b_0_cpython                conda-forge
pytorch                   2.0.1           py3.11_cuda11.8_cudnn8.7.0_0      pytorch
pytorch-cuda              11.8                 h7e8668a_5                   pytorch
torchaudio                2.0.2               py311_cu118                   pytorch
torchdata                 0.6.1           py311h6d97842_1                   conda-forge
torchtriton               2.0.0                     py311                   pytorch
torchvision               0.15.2              py311_cu118                   pytorch

Run the code sample:

import os

import dgl
import torch

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
device = torch.device('cuda:0')


class MyDataset(dgl.data.DGLDataset):
    def process(self):
        pass

    def __init__(self):
        super().__init__('MyDataset')

    def __getitem__(self, idx):
        gh = dgl.graph(([1, 2], [1, 2]))    # comment to resolve error
        return 0

    def __len__(self):
        return 1000


if __name__ == '__main__':
    iter_0 = dgl.dataloading.GraphDataLoader(
        dataset=MyDataset(),
        num_workers=1   # set 0 to resolve error
    )

    for i in iter_0.__iter__():
        i.to(device=device)

    for i in iter_0.__iter__():
        i.to(device=device)

Attention: The error can be avoided by deleting line 38, gh = dgl.graph(([1, 2], [1, 2])), or by setting num_workers to 0.

Expected behavior

I get a CUDA error like this:

(dgl) lapluis@nccserv0:~$ python bug.py
Traceback (most recent call last):
  File "/home/lapluis/bug.py", line 35, in <module>
    i.to(device=device)
RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This code is a simplified version of my training code. When I ran the original code under compute-sanitizer, I got the following:

========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaMemcpyAsync.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x441846]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaMemcpyAsync [0x144a374]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl/libdgl.so
=========     Host Frame:dgl::runtime::CUDADeviceAPI::CopyDataFromTo(void const*, unsigned long, void*, unsigned long, unsigned long, DGLContext, DGLContext, DGLDataType) [0x8b3086]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl/libdgl.so
=========     Host Frame:dgl::runtime::NDArray::CopyFromTo(DGLArray*, DGLArray*) [0x72936d]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl/libdgl.so
=========     Host Frame:dgl::runtime::NDArray::CopyTo(DGLContext const&) const [0x764ed3]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl/libdgl.so
=========     Host Frame:dgl::UnitGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DGLContext const&) [0x872ecf]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl/libdgl.so
=========     Host Frame:dgl::HeteroGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DGLContext const&) [0x771716]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl/libdgl.so
=========     Host Frame:std::_Function_handler<void (dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*), dgl::{lambda(dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*)#47}>::_M_invoke(std::_Any_data const&, dgl::runtime::DGLArgs&&, dgl::runtime::DGLRetValue*&&) [0x780156]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl/libdgl.so
=========     Host Frame:DGLFuncCall [0x70e3f8]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl/libdgl.so
=========     Host Frame:__pyx_f_3dgl_4_ffi_4_cy3_4core_FuncCall(void*, _object*, DGLValue*, int*) [0x1a79f]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl/_ffi/_cy3/core.cpython-311-x86_64-linux-gnu.so
=========     Host Frame:__pyx_pw_3dgl_4_ffi_4_cy3_4core_12FunctionBase_5__call__(_object*, _object*, _object*) [0x1afef]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl/_ffi/_cy3/core.cpython-311-x86_64-linux-gnu.so
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Objects/call.c:214:_PyObject_MakeTpCall [0x1e007b]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/ceval.c:4774:_PyEval_EvalFrameDefault [0x1ec992]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/ceval.c:6439:_PyEval_Vector [0x2a4d36]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/ceval.c:1155:PyEval_EvalCode [0x2a43ef]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/pythonrun.c:1713:run_eval_code_obj [0x2c2f2a]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/pythonrun.c:1734:run_mod [0x2bf343]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/pythonrun.c:1628:pyrun_file [0x2d4300]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/pythonrun.c:440:_PyRun_SimpleFileObject [0x2d3c5e]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/pythonrun.c:79:_PyRun_AnyFileObject [0x2d3a44]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Modules/main.c:680:Py_RunMain [0x2cdbdf]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Modules/main.c:735:Py_BytesMain [0x292f97]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:../sysdeps/nptl/libc_start_call_main.h:74:__libc_start_call_main [0x271ca]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:../csu/libc-start.c:347:__libc_start_main [0x27285]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x292e3d]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaHostAlloc.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x441846]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaHostAlloc [0x51bfc]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/../../../../libcudart.so.11.0
=========     Host Frame:at::cuda::CUDAHostAllocatorWrapper::allocate(unsigned long) const [0xe30ec3]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
=========     Host Frame:at::native::_pin_memory_cuda(at::Tensor const&, c10::optional<c10::Device>) [0xe37494]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
=========     Host Frame:at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___pin_memory(at::Tensor const&, c10::optional<c10::Device>) [0x2b4b562]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
=========     Host Frame:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::Device>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___pin_memory>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::Device> > >, at::Tensor (at::Tensor const&, c10::optional<c10::Device>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x2b4b5f7]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
=========     Host Frame:at::_ops::_pin_memory::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x209e9f1]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:at::(anonymous namespace)::_pin_memory(at::Tensor const&, c10::optional<c10::Device>) [0x26f26bc]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::Device>), &at::(anonymous namespace)::_pin_memory>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::Device> > >, at::Tensor (at::Tensor const&, c10::optional<c10::Device>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x26f2837]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:at::_ops::_pin_memory::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x209e9f1]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::VariableType::(anonymous namespace)::_pin_memory(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x3c005af]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>), &torch::autograd::VariableType::(anonymous namespace)::_pin_memory>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device> > >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x3c0091a]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:at::_ops::_pin_memory::call(at::Tensor const&, c10::optional<c10::Device>) [0x211171c]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:at::native::pin_memory(at::Tensor const&, c10::optional<c10::Device>) [0x1a1d95c]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::Device>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__pin_memory>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::Device> > >, at::Tensor (at::Tensor const&, c10::optional<c10::Device>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x2a842c7]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:at::_ops::pin_memory::call(at::Tensor const&, c10::optional<c10::Device>) [0x211138c]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::THPVariable_pin_memory(_object*, _object*, _object*) [0x4a4d18]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_python.so
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Objects/descrobject.c:366:method_vectorcall_VARARGS_KEYWORDS [0x209388]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Objects/call.c:299:PyObject_Vectorcall [0x1f95bc]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/ceval.c:4774:_PyEval_EvalFrameDefault [0x1ec992]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Objects/call.c:393:_PyFunction_Vectorcall [0x20f121]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/ceval.c:5381:_PyEval_EvalFrameDefault [0x1f04e7]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Include/internal/pycore_call.h:92:_PyObject_VectorcallTstate.lto_priv.4 [0x22fc74]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Objects/classobject.c:67:method_vectorcall [0x22f6a8]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Modules/_threadmodule.c:1093:thread_run [0x30506b]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/thread_pthread.h:243:pythread_wrapper [0x2d05d4]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:./nptl/pthread_create.c:442:start_thread [0x89044]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:../sysdeps/unix/sysv/linux/x86_64/clone3.S:83:clone3 [0x1095fc]
=========                in /lib/x86_64-linux-gnu/libc.so.6
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x441846]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaGetLastError [0x48dd4]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/../../../../libcudart.so.11.0
=========     Host Frame:c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [0x43edd]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libc10_cuda.so
=========     Host Frame:at::cuda::CUDAHostAllocatorWrapper::allocate(unsigned long) const [0xe30ee3]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
=========     Host Frame:at::native::_pin_memory_cuda(at::Tensor const&, c10::optional<c10::Device>) [0xe37494]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
=========     Host Frame:at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___pin_memory(at::Tensor const&, c10::optional<c10::Device>) [0x2b4b562]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
=========     Host Frame:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::Device>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___pin_memory>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::Device> > >, at::Tensor (at::Tensor const&, c10::optional<c10::Device>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x2b4b5f7]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
=========     Host Frame:at::_ops::_pin_memory::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x209e9f1]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:at::(anonymous namespace)::_pin_memory(at::Tensor const&, c10::optional<c10::Device>) [0x26f26bc]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::Device>), &at::(anonymous namespace)::_pin_memory>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::Device> > >, at::Tensor (at::Tensor const&, c10::optional<c10::Device>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x26f2837]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:at::_ops::_pin_memory::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x209e9f1]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::VariableType::(anonymous namespace)::_pin_memory(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x3c005af]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>), &torch::autograd::VariableType::(anonymous namespace)::_pin_memory>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device> > >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x3c0091a]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:at::_ops::_pin_memory::call(at::Tensor const&, c10::optional<c10::Device>) [0x211171c]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:at::native::pin_memory(at::Tensor const&, c10::optional<c10::Device>) [0x1a1d95c]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::Device>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__pin_memory>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::Device> > >, at::Tensor (at::Tensor const&, c10::optional<c10::Device>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::Device>) [0x2a842c7]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:at::_ops::pin_memory::call(at::Tensor const&, c10::optional<c10::Device>) [0x211138c]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::THPVariable_pin_memory(_object*, _object*, _object*) [0x4a4d18]
=========                in /home/lapluis/miniconda3/envs/dgl/lib/python3.11/site-packages/torch/lib/libtorch_python.so
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Objects/descrobject.c:366:method_vectorcall_VARARGS_KEYWORDS [0x209388]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Objects/call.c:299:PyObject_Vectorcall [0x1f95bc]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/ceval.c:4774:_PyEval_EvalFrameDefault [0x1ec992]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Objects/call.c:393:_PyFunction_Vectorcall [0x20f121]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/ceval.c:5381:_PyEval_EvalFrameDefault [0x1f04e7]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Include/internal/pycore_call.h:92:_PyObject_VectorcallTstate.lto_priv.4 [0x22fc74]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Objects/classobject.c:67:method_vectorcall [0x22f6a8]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Modules/_threadmodule.c:1093:thread_run [0x30506b]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:/usr/local/src/conda/python-3.11.5/Python/thread_pthread.h:243:pythread_wrapper [0x2d05d4]
=========                in /home/lapluis/miniconda3/envs/dgl/bin/python
=========     Host Frame:./nptl/pthread_create.c:442:start_thread [0x89044]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:../sysdeps/unix/sysv/linux/x86_64/clone3.S:83:clone3 [0x1095fc]
=========                in /lib/x86_64-linux-gnu/libc.so.6
========= 
========= Target application returned an error
========= ERROR SUMMARY: 3 errors

Environment

DGL Version (e.g., 1.0): 1.1.2
Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.0.1
OS (e.g., Linux): Linux
How you installed DGL (conda, pip, source): pip
Build command you used (if compiling from source): cmake -DBUILD_TYPE=dev -DUSE_CUDA=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda .. (I also tried to install from source)
Python version: 3.11.5
CUDA/cuDNN version (if applicable): CUDA 11.8 & cuDNN 8.9.4.25-1+cuda11.8 (Driver Version: 520.61.05)
GPU models and configuration (e.g. V100): V100
Any other relevant information:

Additional context

I also tried installing from source and from conda, and tried another server (Linux + 3090, Driver Version: 525.125.06), but got the same error.

Then I tried my PC (Windows 11 + 3080, Driver Version: 537.34), with the environment installed through conda. Running 'python main.py' worked, but I got a different error when using ipython and the Python console:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\stort\miniconda3\envs\dgl\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stort\miniconda3\envs\dgl\Lib\multiprocessing\spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'MyDataset' on <module '__main__' (built-in)>
Traceback (most recent call last):
  File "C:\Users\stort\miniconda3\envs\dgl\Lib\site-packages\torch\utils\data\dataloader.py", line 1132, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stort\miniconda3\envs\dgl\Lib\multiprocessing\queues.py", line 114, in get
    raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\stort\miniconda3\envs\dgl\Lib\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\stort\miniconda3\envs\dgl\Lib\site-packages\torch\utils\data\dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
                ^^^^^^^^^^^^^^^^
  File "C:\Users\stort\miniconda3\envs\dgl\Lib\site-packages\torch\utils\data\dataloader.py", line 1294, in _get_data
    success, data = self._try_get_data()
                    ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stort\miniconda3\envs\dgl\Lib\site-packages\torch\utils\data\dataloader.py", line 1145, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 25832) exited unexpectedly

StortInter · 2023-09-23T02:39:59Z

BTW, this error does not occur with an earlier version of dgl. I installed dgl-1.0.2+cu118 (the version installed on my collaborator's PC) on the server and ran the code, and no error occurred.

frozenbugs · 2023-09-25T06:48:32Z

Hi @yaox12, @chang-l, can you help with this issue?

It is a pretty strange error; effectively, the code only moves tensor([0]) to cuda:0.
With num_workers = 1 in the dataloader and gh = dgl.graph(([1, 2], [1, 2])), something goes wrong that breaks the .to(cuda:0) operation.

It crashes even if I change the code to DGL-unrelated code:

t = torch.tensor([[0, 0, 0], [0, 1, 2]])
t.to(device=device)

StortInter · 2023-09-27T11:21:36Z

I also tried 1.1.x (1.1.1 and 1.1.0) and 1.0.x (1.0.4), this error only occurs on 1.1.x.
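The reports so far amount to a simple version gate: 1.1.0, 1.1.1 and 1.1.2 crash, while 1.0.2 and 1.0.4 do not. Below is a minimal sketch of such a check, assuming only those data points. `is_reported_affected` is a hypothetical helper, not a DGL API, and the real set of affected releases may differ from what was tested in this thread.

```python
import re

# Release series reported as crashing in this thread only;
# this is not an official compatibility list.
REPORTED_BAD = {(1, 1)}  # 1.1.0, 1.1.1 and 1.1.2 crashed; 1.0.2 and 1.0.4 did not


def is_reported_affected(version: str) -> bool:
    """Return True if `version` is in a release series reported as crashing."""
    # Drop a local build tag such as "+cu118", then read major.minor.
    match = re.match(r"(\d+)\.(\d+)", version.split("+")[0])
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return major_minor in REPORTED_BAD


print(is_reported_affected("1.1.2+cu118"))
print(is_reported_affected("1.0.2+cu118"))
```

In a real script the argument would be dgl.__version__.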

frozenbugs · 2023-09-28T02:23:40Z

Hi @StortInter

In __getitem__, when I change return 0 to return gh, it does not fail. May I ask why you need to return an integer?
It is a very weird bug that might be due to some incompatibility between dgl and pytorch. If you can post meaningful code that reproduces this issue, we might be able to provide more help.

import os

import dgl
import torch

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
device = torch.device('cuda:0')


class MyDataset(dgl.data.DGLDataset):
    def process(self):
        pass

    def __init__(self):
        super().__init__('MyDataset')

    def __getitem__(self, idx):
        gh = dgl.graph(([1, 2], [1, 2]))    # comment to resolve error
        return gh

    def __len__(self):
        return 1000


if __name__ == '__main__':
    iter_0 = dgl.dataloading.GraphDataLoader(
        dataset=MyDataset(),
        num_workers=1   # set 0 to resolve error
    )

    for i in iter_0.__iter__():
        i.to(device=device)

    for i in iter_0.__iter__():
        i.to(device=device)

StortInter · 2023-10-09T11:54:30Z

Hi @frozenbugs,

Here is the original code of the dataloader:

# -*- coding: utf-8 -*-


import dgl
import numpy as np
from dgl.data import DGLDataset


class GraphDataset_k_nearest(DGLDataset):
    def __init__(self, x, y, k, num_nodes, win_length):
        self.x = x
        self.labels = y
        self.k = k
        self.num_nodes = num_nodes
        self.win_length = win_length

    def __getitem__(self, idx):
        node_features = self.x[idx]

        cor_matrix = np.corrcoef(node_features.T)
        src_node = []
        dst_node = []
        for j in range(cor_matrix.shape[0]):
            dst = cor_matrix[j].argsort()[-self.k:][::-1]
            src_node.extend([j] * len(dst))
            dst_node.extend(dst)

        G = dgl.graph((src_node, dst_node))
        G = dgl.to_bidirected(G)

        features = node_features.reshape(1, node_features.shape[0], node_features.shape[1])
        self.feature = features

        G.ndata['x'] = node_features.reshape(self.num_nodes, self.win_length)
        self.G = G
        return self.G, self.feature, self.labels[idx]

    def __len__(self):
        return len(self.x)

To reproduce, just generate random data with x={Tensor(10000, 420, 128)} and y={Tensor(10000,)}, and set k=128, num_nodes=128 and win_length=420; the real data is too large to share.
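The edge construction in __getitem__ can be tried without DGL or a GPU. The sketch below uses NumPy arrays instead of tensors and much smaller sizes than the real data. `k_nearest_edges` is a hypothetical helper that mirrors the loop above, and the binary labels in `y` are an assumption, since the thread does not say what the labels contain.

```python
import numpy as np


def k_nearest_edges(node_features: np.ndarray, k: int):
    """Link each node to its k most correlated nodes.

    node_features has shape (win_length, num_nodes), as in __getitem__ above.
    Returns two equally long lists: source and destination node ids.
    """
    cor_matrix = np.corrcoef(node_features.T)  # shape (num_nodes, num_nodes)
    src, dst = [], []
    for j in range(cor_matrix.shape[0]):
        nearest = cor_matrix[j].argsort()[-k:][::-1]  # top-k, highest first
        src.extend([j] * len(nearest))
        dst.extend(nearest.tolist())
    return src, dst


# Small stand-in sizes; the thread uses x of shape (10000, 420, 128) with k=128.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 20, 6))  # (samples, win_length, num_nodes)
y = rng.integers(0, 2, size=4)       # hypothetical binary labels
src, dst = k_nearest_edges(x[0], k=3)
print(len(src), len(dst))  # 6 nodes * 3 neighbours each -> 18 18
```

The resulting lists are what dgl.graph((src_node, dst_node)) receives in the dataset above. With k equal to num_nodes, as in the reported settings, every node is linked to every node, including itself.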

wzm2256 · 2023-10-20T15:35:07Z

BTW, this error will not occur if I use the previous version of dgl. I tried to install dgl-1.0.2+cu118 on the server which is the installed version on my cooperator's PC, run the code and nothing happened.

Thanks! You saved my life. I can now run my code.

github-actions · 2023-11-20T01:30:02Z

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

chang-l · 2023-11-28T01:20:46Z

I think this could be due to issue #6561, i.e., the sampling (child) processes initialized new CUDA instances, which is not allowed when processes are created via the fork method. Even though the resulting CUDA error message is cleared afterwards in the code (see the description of #6561), that is not enough, and the CUDA error somehow resurfaces on the device after sampling is done...

Issue #6561 has been fixed by #6568, which is merged into master. I tested with a source build and confirmed that the crash is resolved after applying commit 1b3f14b.
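For anyone who cannot move to a build containing that commit, a common general way to avoid CUDA-after-fork problems is to start worker processes with a method other than fork. The sketch below is not the fix from #6568, and whether it avoids this particular crash was not tested in this thread. `pick_worker_context` is a hypothetical helper built only on the standard library.

```python
import multiprocessing as mp


def pick_worker_context(preferred=("spawn", "forkserver")):
    """Return a multiprocessing context that avoids plain fork when possible."""
    available = mp.get_all_start_methods()
    for method in preferred:
        if method in available:
            return mp.get_context(method)
    # Nothing preferred is available: fall back to the platform default.
    return mp.get_context()


ctx = pick_worker_context()
print(ctx.get_start_method())
```

torch.utils.data.DataLoader accepts a multiprocessing_context argument, so the context could be passed as multiprocessing_context=ctx, assuming GraphDataLoader forwards extra keyword arguments to it. With spawn, the dataset class has to be importable by the worker process, which is why it should live in a script or module rather than an interactive console.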

chang-l · 2023-11-28T01:22:47Z

@frozenbugs can you please help double-check if the commit 1b3f14b can fix this issue?

frozenbugs · 2023-12-15T02:21:26Z

Yes, it is fixed, thanks for your help.

frozenbugs assigned yaox12 and chang-l Sep 25, 2023

frozenbugs self-assigned this Oct 12, 2023

wzm2256 mentioned this issue Oct 19, 2023

dataloader causes RuntimeError: CUDA error: unspecified launch failure #6473

Closed

github-actions bot added the stale-issue label Nov 20, 2023

Rhett-Ying removed the stale-issue label Nov 23, 2023

frozenbugs closed this as completed Dec 15, 2023

lilyminium mentioned this issue Feb 9, 2024

CUDA launch error openforcefield/openff-nagl#81

Closed
