
How to debug CUDNN_STATUS_EXECUTION_FAILED? #1116

Open
vedantroy opened this issue Aug 15, 2024 · 6 comments

@vedantroy

I'm running my code with:

env CUDNN_LOGERR_DBG=1  CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train

and getting errors like:

[rank5]: RuntimeError: /home/ved/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:358 in function fused_attn_arbitrary_seqlen_fwd_impl: cuDNN Error: execute(handle, plan->get_raw_desc(), variant_pack_descriptor.get_ptr()) failed with code: CUDNN_STATUS_EXECUTION_FAILED, and message: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream). For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

I'm using a pretty standard DotProductAttention:

self.te_attn = te.DotProductAttention(
    num_attention_heads=24,
    kv_channels=self.head_dim,  # 128
    qkv_format="thd",  # total tokens, heads, dim per head
    attn_mask_type="padding",
)

and I'm also calling it in a pretty standard way (all the assertions pass):

assert qkv.shape == (total, 3, self.num_heads, self.head_dim)
q, k, v = torch.unbind(qkv, dim=1)

assert q.shape == k.shape == v.shape
assert q.shape == (total, self.num_heads, self.head_dim)
assert cu_seqlens.shape[0] == B + 1

xy: torch.Tensor = self.te_attn(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_kv=cu_seqlens,
    max_seqlen_q=max_seqlen_in_batch,
    max_seqlen_kv=max_seqlen_in_batch,
)

I'm stuck on how to debug this. It seems like something is going wrong when the inputs are read, but I'm not sure. How should I proceed?

@vedantroy
Author

Is there a chance I need to use specific strides? I know my shapes are correct, but it's definitely possible my strides are wrong.
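For reference, a quick way to inspect that (a sketch based on the snippet above; note that torch.unbind along dim=1 returns views, so q/k/v will typically be non-contiguous):

q, k, v = torch.unbind(qkv, dim=1)
# print shape, stride, and contiguity of each unbound view
for name, t in (("q", q), ("k", k), ("v", v)):
    print(name, tuple(t.shape), t.stride(), t.is_contiguous())

# if strides turn out to be the issue, forcing contiguous copies is a cheap experiment
q, k, v = (t.contiguous() for t in (q, k, v))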

@ptrendx
Member

ptrendx commented Aug 16, 2024

@vedantroy Could you post more information about your environment, most importantly the TE, CUDA, and cuDNN versions? Also, could you try the failing case with CUDNN_LOGLEVEL_DBG=3 rather than CUDNN_LOGERR_DBG=1 and post a snippet of the log before the error? It should list the cuDNN call it is trying to execute, including the shapes and strides.
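For example, adapting the command from the first post (the log-file name here is just an example):

env CUDNN_LOGLEVEL_DBG=3 CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train 2> cudnn_log.txt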

@vedantroy
Author

CUDA version:

my-compute-node:~/training/replay$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

cuDNN + Transformer Engine versions:

transformer_engine            1.8.0+3ec998e
nvidia-cudnn-cu12             9.1.0.70

More logs, using the command:

env CUDNN_LOGERR_DBG=3 CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train 2>log.txt

E! CuDNN (v90100 70) function cudnnBackendExecute() called:
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream)
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: plan.getEnginePtr()->execute(vars, handle->streamId)
e! Time: 2024-08-16T23:10:17.196169 (0d+0h+0m+3s since start)
e! Process=2088716; Thread=2088716; GPU=NULL; Handle=NULL; StreamId=NULL.

@vedantroy
Author

OK, further updates: it looks like it's failing on the backward pass only, and if I use only 2 layers in my model instead of 4, it doesn't fail. Is it possible I'm hitting CUDA OOM issues? (That seems unlikely, since I run this model with 48+ layers when using FA2.)
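As a quick sanity check on the OOM theory, a sketch that just reads PyTorch's allocator counters (printed on each rank right before the backward pass):

import torch

# allocator statistics for the current device on this rank
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB, "
      f"reserved: {torch.cuda.memory_reserved() / 2**30:.2f} GiB, "
      f"peak: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")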

@cyanguwa
Collaborator

Hi @vedantroy, I tried to reproduce your config, and it seemed to pass my tests.

Arch: Hopper
Container: nvcr.io/nvidia/pytorch:24.07-py3 (CUDA 12.5.1.007)
TE 1.8: https://github.com/NVIDIA/TransformerEngine/archive/refs/tags/v1.8.zip
cuDNN 9.1: https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.1.0.70_cuda12-archive.tar.xz
tests:
model_configs_layout_thd = {
    #       test:             b,  h, hg,   d,   sq,  skv,   p,             mask,             bias
    "layout_0_1": ModelConfig(1, 24, 24, 128, 128, 128, 0.0, "padding", "no_bias"),
}
pytest -s -v tests/pytorch/fused_attn/test_fused_attn.py::test_dpa_qkv_layout_thd

Could you extract a small reproducer with just the DotProductAttention calls from your application? Then we can have a look at how it differs from my tests.

Thanks,
Charlene

@vedantroy
Author

@cyanguwa -- I'll try to make a minimal reproduction soon (a rough sketch of the shape it might take is below). For now, a few more details:

  • It only happens with FSDP enabled on multiple ranks.
  • It does not happen if I set os.environ["NVTE_FUSED_ATTN"] = "0"
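A rough sketch of the kind of standalone reproducer meant above (sizes are hypothetical; the real version would additionally wrap the module in FSDP across ranks the way the training script does):

import torch
import transformer_engine.pytorch as te

B, max_seqlen, num_heads, head_dim = 2, 128, 24, 128  # hypothetical sizes

attn = te.DotProductAttention(
    num_attention_heads=num_heads,
    kv_channels=head_dim,
    qkv_format="thd",
    attn_mask_type="padding",
)

# two variable-length sequences packed into a single "thd" tensor
seqlens = torch.tensor([100, 128], dtype=torch.int32, device="cuda")
cu_seqlens = torch.zeros(B + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)
total = int(seqlens.sum())

qkv = torch.randn(total, 3, num_heads, head_dim,
                  dtype=torch.bfloat16, device="cuda", requires_grad=True)
q, k, v = torch.unbind(qkv, dim=1)

out = attn(q, k, v,
           cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens,
           max_seqlen_q=max_seqlen, max_seqlen_kv=max_seqlen)
out.sum().backward()  # the failure reportedly shows up only in the backward pass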
