
How to debug CUDNN_STATUS_EXECUTION_FAILED? #1116

Open
vedantroy opened this issue Aug 15, 2024 · 6 comments

@vedantroy

I'm running my code with:

env CUDNN_LOGERR_DBG=1  CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train

and getting errors like:

[rank5]: RuntimeError: /home/ved/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:358 in function fused_attn_arbitrary_seqlen_fwd_impl: cuDNN Error: execute(handle, plan->get_raw_desc(), variant_pack_descriptor.get_ptr()) failed with code: CUDNN_STATUS_EXECUTION_FAILED, and message: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream). For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

I'm using a pretty standard DotProductAttention:

self.te_attn = te.DotProductAttention(
    num_attention_heads=24,
    kv_channels=self.head_dim,  # 128
    qkv_format="thd",  # total tokens, heads, dim per head
    attn_mask_type="padding",
)

and I'm also calling it in a pretty standard way (all the assertions pass):

assert qkv.shape == (total, 3, self.num_heads, self.head_dim)
q, k, v = torch.unbind(qkv, dim=1)

assert q.shape == k.shape == v.shape
assert q.shape == (total, self.num_heads, self.head_dim)
assert cu_seqlens.shape[0] == B + 1

xy: torch.Tensor = self.te_attn(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_kv=cu_seqlens,
    max_seqlen_q=max_seqlen_in_batch,
    max_seqlen_kv=max_seqlen_in_batch,
)

I'm stuck on how to debug this. It seems like something is going wrong when the inputs are read, but I'm not sure. How should I proceed?

@vedantroy
Author

Is there a chance I need to use specific strides? I know my shapes are correct, but it's definitely possible my strides are wrong.
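For reference, a quick way to inspect that (a sketch based on the snippet above; note that torch.unbind along dim=1 returns views, so q/k/v will typically be non-contiguous):

q, k, v = torch.unbind(qkv, dim=1)
# print shape, stride, and contiguity of each unbound view
for name, t in (("q", q), ("k", k), ("v", v)):
    print(name, tuple(t.shape), t.stride(), t.is_contiguous())

# if strides turn out to be the issue, forcing contiguous copies is a cheap experiment
q, k, v = (t.contiguous() for t in (q, k, v))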

@ptrendx
Member

ptrendx commented Aug 16, 2024

@vedantroy Could you post more information about your environment, most importantly the TE, CUDA, and cuDNN versions? Also, could you try the failing case with CUDNN_LOGLEVEL_DBG=3 rather than CUDNN_LOGERR_DBG=1 and post a snippet of the log before the error? It should list the cuDNN call it is trying to execute, including the shapes and strides.
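For example, adapting the command from the first post (the log-file name here is just an example):

env CUDNN_LOGLEVEL_DBG=3 CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train 2> cudnn_log.txt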

@vedantroy
Author

CUDA version:

my-compute-node:~/training/replay$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

cuDNN + Transformer Engine versions:

transformer_engine            1.8.0+3ec998e
nvidia-cudnn-cu12             9.1.0.70

More logs, using the command:

env CUDNN_LOGERR_DBG=3 CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train 2>log.txt

E! CuDNN (v90100 70) function cudnnBackendExecute() called:
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream)
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: plan.getEnginePtr()->execute(vars, handle->streamId)
e! Time: 2024-08-16T23:10:17.196169 (0d+0h+0m+3s since start)
e! Process=2088716; Thread=2088716; GPU=NULL; Handle=NULL; StreamId=NULL.

@vedantroy
Author

OK, further updates: it looks like it's failing on the backward pass only, and if I use only 2 layers in my model instead of 4, it doesn't fail. Is it possible I'm hitting CUDA OOM issues? (That seems unlikely, since I run this model with 48+ layers when using FA2.)
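As a quick sanity check on the OOM theory, a sketch that just reads PyTorch's allocator counters (printed on each rank right before the backward pass):

import torch

# allocator statistics for the current device on this rank
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB, "
      f"reserved: {torch.cuda.memory_reserved() / 2**30:.2f} GiB, "
      f"peak: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")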

@cyanguwa
Collaborator

Hi @vedantroy, I tried to reproduce your config, and it seemed to pass my tests.

Arch: Hopper
Container: nvcr.io/nvidia/pytorch:24.07-py3 (CUDA 12.5.1.007)
TE 1.8: https://github.com/NVIDIA/TransformerEngine/archive/refs/tags/v1.8.zip
cuDNN 9.1: https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.1.0.70_cuda12-archive.tar.xz
tests:
model_configs_layout_thd = {
    #       test:             b,  h, hg,   d,   sq,  skv,   p,             mask,             bias
    "layout_0_1": ModelConfig(1, 24, 24, 128, 128, 128, 0.0, "padding", "no_bias"),
}
pytest -s -v tests/pytorch/fused_attn/test_fused_attn.py::test_dpa_qkv_layout_thd

Could you extract a small reproducer with just the DotProductAttention calls from your application? Then we can have a look at how it differs from my tests.

Thanks,
Charlene

@vedantroy
Author

@cyanguwa -- I'll try to make a minimal reproduction soon (a rough sketch of the shape it might take is below). For now, a few more details:

  • It only happens with FSDP enabled on multiple ranks.
  • It does not happen if I set os.environ["NVTE_FUSED_ATTN"] = "0"
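A rough sketch of the kind of standalone reproducer meant above (sizes are hypothetical; the real version would additionally wrap the module in FSDP across ranks the way the training script does):

import torch
import transformer_engine.pytorch as te

B, max_seqlen, num_heads, head_dim = 2, 128, 24, 128  # hypothetical sizes

attn = te.DotProductAttention(
    num_attention_heads=num_heads,
    kv_channels=head_dim,
    qkv_format="thd",
    attn_mask_type="padding",
)

# two variable-length sequences packed into a single "thd" tensor
seqlens = torch.tensor([100, 128], dtype=torch.int32, device="cuda")
cu_seqlens = torch.zeros(B + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)
total = int(seqlens.sum())

qkv = torch.randn(total, 3, num_heads, head_dim,
                  dtype=torch.bfloat16, device="cuda", requires_grad=True)
q, k, v = torch.unbind(qkv, dim=1)

out = attn(q, k, v,
           cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens,
           max_seqlen_q=max_seqlen, max_seqlen_kv=max_seqlen)
out.sum().backward()  # the failure reportedly shows up only in the backward pass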
