transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure #43680

Closed
LukeLIN-web opened this issue Jun 20, 2022 · 10 comments

@LukeLIN-web

Describe the Bug

Codes:
https://www.paddlepaddle.org.cn/documentation/docs/zh/practices/reinforcement_learning/actor_critic_method.html

paddle-gpu==2.3.0, CUDA 10.2, cuDNN 7
pip install gym

It failed during training:

Error: /paddle/paddle/phi/kernels/gpu/multinomial_kernel.cu:67 Assertion `in_data[id] >= 0.0` failed. The input of multinomial distribution should be >= 0, but got nan.
Error: /paddle/paddle/phi/kernels/gpu/multinomial_kernel.cu:67 Assertion `in_data[id] >= 0.0` failed. The input of multinomial distribution should be >= 0, but got nan.
Traceback (most recent call last):
  File "Actor-Critic.py", line 133, in <module>
    trainIters(actor, critic, n_iters=201)
  File "Actor-Critic.py", line 74, in trainIters
    action = dist.sample([1])
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/distribution/categorical.py", line 166, in sample
    self._logits_to_probs(logits), num_samples, True)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/tensor/random.py", line 186, in multinomial
    replacement)
SystemError: (Fatal) Operator multinomial raises an thrust::system::system_error exception.
The exception content is
:transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure. (at /paddle/paddle/fluid/imperative/tracer.cc:307)

Additional Supplementary Information

No response

@paddle-bot-old

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your questions as soon as possible. Please make sure you have provided a clear problem description, reproduction code, environment and version details, and error messages. You may also check the API documentation, FAQ, GitHub Issues, and the AI community for an answer. Have a nice day!

@Liu-xiandong
Member

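Hi, judging from the error message, the input passed to multinomial does not meet the requirements. The code link you gave is Paddle's official example; did you modify anything else, such as the data or the parameters? Please take a closer look at the inputs to that part: https://github.com/PaddlePaddle/Paddle/blob/release/2.3/paddle/phi/kernels/gpu/multinomial_kernel.cu#L64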
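
For anyone checking the inputs as suggested here, the following is a minimal pure-Python sketch of the kind of check the assertion `in_data[id] >= 0.0` implies. It is not Paddle code: `find_invalid_probs` is a hypothetical helper, and the probability list is made-up data. The point is that NaN compares false against every number, so a NaN entry fails a `>= 0.0` check just as a negative value does.

```python
import math


def find_invalid_probs(probs):
    """Return (index, reason) pairs for entries that would fail a `p >= 0.0` check.

    Hypothetical helper for illustration; it only mirrors the condition
    quoted in the error message, not Paddle's actual kernel.
    """
    invalid = []
    for i, p in enumerate(probs):
        # NaN >= 0.0 is False, so NaN and negative values are both rejected.
        if not (p >= 0.0):
            reason = "nan" if math.isnan(p) else "negative"
            invalid.append((i, reason))
    return invalid


# Made-up example input: one NaN entry among otherwise valid probabilities.
probs = [0.2, float("nan"), 0.8]
bad = find_invalid_probs(probs)
if bad:
    print("invalid multinomial input at:", bad)
```

In a real script the same idea would be applied to the tensor passed to the sampler, after converting it to host values.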

@Aganlengzi
Contributor

Error: /paddle/paddle/phi/kernels/gpu/multinomial_kernel.cu:67 Assertion in_data[id] >= 0.0 failed. The input of multinomial distribution should be >= 0, but got nan.
Error: /paddle/paddle/phi/kernels/gpu/multinomial_kernel.cu:67 Assertion in_data[id] >= 0.0 failed. The input of multinomial distribution should be >= 0, but got nan.
Traceback (most recent call last):

@LukeLIN-web Hi, please note that the error shows the input data contains NaN values, which is why the exception is thrown.
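
As a rough illustration of why a NaN reaching the sampler usually points to a problem further upstream, here is a small pure-Python sketch. It assumes the logits are turned into probabilities by a softmax-style normalization, which is an assumption about the pipeline and not something confirmed in this thread; `softmax` and `first_non_finite` are hypothetical helpers. Under that assumption a single NaN logit makes every probability NaN, so it can help to check the network outputs for non-finite values before sampling.

```python
import math


def softmax(logits):
    """Plain softmax over a list of floats (illustrative, not Paddle's implementation)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)  # becomes NaN as soon as any term is NaN
    return [e / total for e in exps]


def first_non_finite(values):
    """Return the index of the first NaN or infinite entry, or None if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None


# Made-up logits with one NaN entry.
logits = [0.5, float("nan"), 1.0]
probs = softmax(logits)
print("all probabilities NaN:", all(math.isnan(p) for p in probs))
print("first non-finite logit index:", first_non_finite(logits))
```

Running a check like `first_non_finite` on the model output at each training step would show at which iteration the values first go bad.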

@LukeLIN-web
Author

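Hi, judging from the error message, the input passed to multinomial does not meet the requirements. The code link you gave is Paddle's official example; did you modify anything else, such as the data or the parameters? Please take a closer look at the inputs to that part: https://github.com/PaddlePaddle/Paddle/blob/release/2.3/paddle/phi/kernels/gpu/multinomial_kernel.cu#L64

I did not modify anything. I copied the code again and still get the same error.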

@LukeLIN-web
Author

Error: /paddle/paddle/phi/kernels/gpu/multinomial_kernel.cu:67 Assertion in_data[id] >= 0.0 failed. The input of multinomial distribution should be >= 0, but got nan.
Error: /paddle/paddle/phi/kernels/gpu/multinomial_kernel.cu:67 Assertion in_data[id] >= 0.0 failed. The input of multinomial distribution should be >= 0, but got nan.
Traceback (most recent call last):

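@LukeLIN-web Hi, please note that the error shows the input data contains NaN values, which is why the exception is thrown.

The input is the source code from https://www.paddlepaddle.org.cn/documentation/docs/zh/practices/reinforcement_learning/actor_critic_method.html, with no changes at all.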

@Liu-xiandong
Member

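Hi, so far I have not been able to reproduce your problem with paddle-gpu==2.3.0, CUDA 10.2, cuDNN 7. Could you provide more information, such as the GPU model, operating system, CPU model, and other hardware details?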
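
One possible way to gather the details requested here is a short standard-library script like the sketch below. `collect_env_info` is a hypothetical helper; it assumes `nvidia-smi` is on the PATH when an NVIDIA GPU is present and falls back to a placeholder string otherwise. It does not query Paddle itself.

```python
import platform
import shutil
import subprocess


def collect_env_info():
    """Collect basic OS, CPU, Python, and GPU details as strings (illustrative helper)."""
    info = {
        "os": f"{platform.system()} {platform.release()}",
        "machine": platform.machine(),
        # platform.processor() can be empty on some systems.
        "cpu": platform.processor() or "unknown",
        "python": platform.python_version(),
        "gpu": "nvidia-smi not found",
    }
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,driver_version",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10, check=False,
            )
            info["gpu"] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.SubprocessError) as exc:
            info["gpu"] = f"nvidia-smi failed: {exc}"
    return info


for key, value in collect_env_info().items():
    print(f"{key}: {value}")
```

The printed lines can be pasted into a reply alongside the framework and CUDA versions.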

@LukeLIN-web
Author

LukeLIN-web commented Jun 21, 2022

It still fails when I run it from a Docker image.
I am using the image from Docker Hub.
After starting the container:
Successfully installed cloudpickle-2.1.0 gym-0.24.1 gym-notices-0.0.7
GPU : Tesla T4,
OS : Linux. 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
CPU : Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz

@Ligoml added status/following-up 跟进中 and removed status/new-issue 新建 labels Jun 21, 2022
@Liu-xiandong
Member

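Hi, having looked at your hardware specs, I cannot determine the cause of the error for now. I suggest trying on different hardware; if you do not have enough hardware resources, you can use Paddle's AI Studio.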

@sunhao

sunhao commented Feb 20, 2023

I ran into the same problem: gpu\multinomial_kernel.cu:56 Assertion in_data[id] >= 0.0 failed. The input of multinomial distribution should be >= 0, but got nan.
get Nvidia's official solution and advice about CUDA Error.] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:259)
CUDA 11.6, paddlepaddle-gpu==2.4.1.post116, RTX 3070.

@paddle-bot paddle-bot bot closed this as completed Feb 27, 2024

paddle-bot bot commented Feb 27, 2024

Since you haven't replied for more than a year, we have closed this issue/pr.
If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.
