RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #1

Open
Oision-hub opened this issue Mar 28, 2022 · 4 comments

Comments

@Oision-hub

Hi, when I run the demo in Docker, I get the following error.

root@Oision-Legion-R7000P2021H:~/EasyEspnet# python train.py --root_path data/an4/asr1/ --dataset an4
2022-03-28 03:29:05,274 (utils:21) WARNING: Skip DEBUG/INFO messages
2022-03-28 03:29:05,349 (train:179) WARNING: ngpu: 1
2022-03-28 03:29:06,526 (data_load:94) WARNING: #Train Json data/an4/asr1/dump/train_nodev/deltafalse/data.json: 848
2022-03-28 03:29:06,526 (data_load:95) WARNING: #Dev Json data/an4/asr1/dump/train_dev/deltafalse/data.json: 100
2022-03-28 03:29:06,526 (data_load:96) WARNING: #Test Json data/an4/asr1/dump/test/deltafalse/data.json: 130
2022-03-28 03:38:48,454 (train:301) WARNING: Total parameter of the model = 27181116
2022-03-28 03:38:48,455 (train:305) WARNING: Trainable parameter of the model = 27181116
Traceback (most recent call last):
  File "train.py", line 315, in <module>
    train(dataloaders, model, optimizer, save_path)
  File "train.py", line 107, in train
    train_stats = train_epoch(train_loader, model, optimizer)
  File "train.py", line 55, in train_epoch
    loss = model(fbank, seq_lens, tokens).mean() # / self.accum_grad
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/espnet/espnet/nets/pytorch_backend/e2e_asr_transformer.py", line 178, in forward
    hs_pad, hs_mask = self.encoder(xs_pad, src_mask)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/espnet/espnet/nets/pytorch_backend/transformer/encoder.py", line 298, in forward
    xs, masks = self.embed(xs, masks)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/espnet/espnet/nets/pytorch_backend/transformer/subsampling.py", line 75, in forward
    x = self.conv(x)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
@jindongwang
Owner

This looks like an environment issue. Are you using my Docker image? The prompt root@Oision-Legion-R7000P2021H doesn't look like it is inside the container. If you are, what GPU and CUDA version does your machine have?
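
One way to collect that information from inside the container is sketched below. It assumes PyTorch is importable in the environment being checked; the function name `describe_environment` is made up for this example and is not part of EasyEspnet. The CUDA version printed here is the one PyTorch was built against, which can differ from the driver-side version that `nvidia-smi` reports on the host.

```python
import importlib.util


def describe_environment():
    """Collect PyTorch / CUDA / GPU details into a plain dict.

    Hypothetical helper for illustration; not part of EasyEspnet or ESPnet.
    """
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch

    info["torch_version"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["torch_cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["cudnn_version"] = torch.backends.cudnn.version()
        info["gpu_name"] = torch.cuda.get_device_name(0)
        # Compute capability as a (major, minor) tuple.
        info["compute_capability"] = torch.cuda.get_device_capability(0)
        props = torch.cuda.get_device_properties(0)
        info["gpu_total_memory_mb"] = props.total_memory // (1024 * 1024)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue makes it easier to tell whether the PyTorch build in the image matches the GPU and driver on the host.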

@Oision-hub
Author

Yes, it's an environment issue. I'm using your Docker image. I checked the PyTorch version and found that my CUDA version (11.4) is not compatible with the PyTorch version in the image. So I updated PyTorch, and that seems to work.
Training now starts, but then this error appears: RuntimeError: Unable to find a valid cuDNN algorithm to run convolution. I searched for this error online, and it may happen when GPU memory is full. I tried reducing the batch size in data_load.py, but the error persists.

  • GPU: NVIDIA GeForce RTX 3060 Laptop
  • GPU Total Memory: 6144 MB
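
For a rough sense of scale, the parameter count in the training log above (27181116) can be turned into a memory estimate. This is a back-of-the-envelope sketch that assumes 32-bit floats and an Adam-style optimizer holding two extra buffers per parameter; neither is confirmed in this thread. Activations and cuDNN workspace, which depend on batch size and utterance length, are not counted.

```python
NUM_PARAMS = 27_181_116   # "Total parameter of the model" from the training log
BYTES_PER_FLOAT32 = 4     # assumption: fp32 training
GPU_TOTAL_MB = 6144       # GPU total memory reported above


def static_memory_mb(num_params, optimizer_buffers=2):
    """Weights + gradients + optimizer buffers in MiB (activations excluded).

    optimizer_buffers=2 is an assumption (Adam-style first/second moments).
    """
    copies = 1 + 1 + optimizer_buffers  # weights, gradients, optimizer state
    return num_params * BYTES_PER_FLOAT32 * copies / (1024 ** 2)


static_mb = static_memory_mb(NUM_PARAMS)
print(f"static memory: {static_mb:.0f} MiB of {GPU_TOTAL_MB} MiB")
```

Under these assumptions the fixed cost is a small fraction of the 6144 MB available, which would suggest that the remaining pressure comes from activations and convolution workspace rather than from the weights themselves.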

@jindongwang
Owner

It seems that your GPU is not suitable for training speech tasks. Honestly, speech tasks consume a lot of hardware resources, and we run our experiments on Microsoft Azure with large numbers of GPUs. I remember we used 8 V100 GPUs to train it, so I guess your machine cannot handle it. However, our Docker environment can help you set up the ESPnet environment quickly, so you can still run your own experiments.

@Oision-hub
Author

Oh, OK. Thank you!
