An interesting bug caused "CUDA error: unspecified launch failure" #6375
Comments
BTW, this error does not occur with an earlier version of dgl. I installed dgl-1.0.2+cu118 on the server (the version installed on my collaborator's PC), ran the code, and no error occurred.
Hi @yaox12, @chang-l, can you help with this issue? It is a pretty strange error: effectively, the code only moves tensor([0]) to cuda:0. It crashes even if I change the code to dgl-unrelated code:
I also tried 1.1.x (1.1.1 and 1.1.0) and 1.0.x (1.0.4); this error only occurs on 1.1.x.
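The version reports above can be turned into a quick check. This is a minimal sketch based only on what this thread reports (1.1.0 and 1.1.1 crash; 1.0.2+cu118 and 1.0.4 do not); the helper names `dgl_series` and `reported_affected` are hypothetical, and other releases are not covered by the reports.

```python
import re


def dgl_series(version):
    """Return (major, minor) from a version string such as '1.0.2+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def reported_affected(version):
    # Assumption: only the 1.1.x series is flagged, because that is all
    # this thread reports. It says nothing about later releases.
    return dgl_series(version) == (1, 1)


for v in ["1.1.1", "1.1.0", "1.0.4", "1.0.2+cu118"]:
    print(v, "affected" if reported_affected(v) else "not reported as affected")
```

In a real environment the argument would be `dgl.__version__`.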
Hi @StortInter, in __getitem__, when I change return 0 to return gh it does not fail. May I ask why you need to return an integer?

import os
import dgl
import torch

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
device = torch.device('cuda:0')


class MyDataset(dgl.data.DGLDataset):
    def process(self):
        pass

    def __init__(self):
        super().__init__('MyDataset')

    def __getitem__(self, idx):
        gh = dgl.graph(([1, 2], [1, 2]))  # comment to resolve error
        return gh

    def __len__(self):
        return 1000


if __name__ == '__main__':
    iter_0 = dgl.dataloading.GraphDataLoader(
        dataset=MyDataset(),
        num_workers=1  # set 0 to resolve error
    )
    # Iterate over the loader twice, as in the original report
    for i in iter_0.__iter__():
        i.to(device=device)
    for i in iter_0.__iter__():
        i.to(device=device)
Hi @frozenbugs, here is the original code of the dataloader:

# -*- coding: utf-8 -*-
import dgl
import numpy as np
from dgl.data import DGLDataset


class GraphDataset_k_nearest(DGLDataset):
    def __init__(self, x, y, k, num_nodes, win_length):
        self.x = x
        self.labels = y
        self.k = k
        self.num_nodes = num_nodes
        self.win_length = win_length

    def __getitem__(self, idx):
        node_features = self.x[idx]
        # Correlation between nodes (columns of node_features)
        cor_matrix = np.corrcoef(node_features.T)
        src_node = []
        dst_node = []
        for j in range(cor_matrix.shape[0]):
            # Indices of the k largest correlations for node j, highest first
            dst = cor_matrix[j].argsort()[-self.k:][::-1]
            src_node.extend([j] * len(dst))
            dst_node.extend(dst)
        G = dgl.graph((src_node, dst_node))
        G = dgl.to_bidirected(G)
        features = node_features.reshape(1, node_features.shape[0], node_features.shape[1])
        self.feature = features
        G.ndata['x'] = node_features.reshape(self.num_nodes, self.win_length)
        self.G = G
        return self.G, self.feature, self.labels[idx]

    def __len__(self):
        return len(self.x)

Just use
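To make the edge-building loop in the dataset above easier to follow, here is a minimal pure-Python sketch of the same top-k selection. It assumes a small hand-written correlation matrix with distinct values per row; `top_k_edges` is a hypothetical helper name, and the ascending `sorted` call stands in for NumPy's `argsort`.

```python
def top_k_edges(cor_matrix, k):
    """For each row j, link j to the k columns with the largest values."""
    src, dst = [], []
    for j, row in enumerate(cor_matrix):
        # Ascending order of column indices, like argsort()
        order = sorted(range(len(row)), key=lambda i: row[i])
        # Last k entries are the largest; reverse so the highest comes first
        nearest = order[-k:][::-1]
        src.extend([j] * len(nearest))
        dst.extend(nearest)
    return src, dst


# Hypothetical 3-node correlation matrix (diagonal is always 1.0)
cor = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
src, dst = top_k_edges(cor, k=2)
print(src)  # [0, 0, 1, 1, 2, 2]
print(dst)  # [0, 1, 1, 0, 2, 1]
```

Because a node's correlation with itself is 1.0, each node normally selects itself first, so the resulting graph contains self-loops and only k - 1 of the k edges per node point to other nodes.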
Thanks! You saved my life. I can now run my code.
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
I think this could be due to issue #6561, i.e., the sampling (child) processes invoked new CUDA instances (which is not allowed when processes are created via
The issue #6561 has been fixed by #6568 and merged into master. I tested using the source build and confirmed that the crash is resolved after applying commit 1b3f14b.
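How worker processes are created depends on the multiprocessing start method. PyTorch's multiprocessing notes state that the CUDA runtime does not support the fork start method and that spawn or forkserver is required to use CUDA in subprocesses; whether that is exactly the mechanism meant in the comment above is an assumption. The sketch below uses only the standard library to list the start methods a platform offers; the helper names are hypothetical.

```python
import multiprocessing as mp


def describe_start_methods():
    """Return (available methods, currently fixed method or None)."""
    available = mp.get_all_start_methods()
    # allow_none=True avoids fixing the default as a side effect of asking
    current = mp.get_start_method(allow_none=True)
    return available, current


def cuda_compatible_methods(available):
    # Assumption taken from the PyTorch notes: 'fork' is excluded,
    # 'spawn' and 'forkserver' are acceptable for CUDA in subprocesses.
    return [m for m in available if m in ("spawn", "forkserver")]


if __name__ == "__main__":
    available, current = describe_start_methods()
    print("available:", available)
    print("fixed so far:", current)
    print("CUDA-compatible:", cuda_compatible_methods(available))
```

This only inspects the platform; it does not change how GraphDataLoader creates its workers.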
@frozenbugs can you please help double-check whether commit 1b3f14b fixes this issue?
Yes, it is fixed, thanks for your help.
🐛 Bug
Using dgl.graph() and dgl.dataloading.GraphDataLoader() with num_workers causes "RuntimeError: CUDA error: unspecified launch failure".
To Reproduce
Steps to reproduce the behavior:
The installation commands I used:
Here is my conda env (only key components listed):
Attention: The error can be avoided by deleting line 38:
gh = dgl.graph(([1, 2], [1, 2]))
or by setting num_workers to 0.
Expected behavior
Get CUDA error like this:
This code is a simplified version of the training code. I tried to run the original code with compute-sanitizer and got these:
Environment
conda, pip, source): pip
Additional context
I also tried installing from source and with conda, and tried another server (Linux + 3090 (Driver Version: 525.125.06)), but got the same error.
Then I tried to run on my PC (Windows 11 + 3080 (Driver Version: 537.34)), with the env installed using conda. Running 'python main.py' worked fine; however, I got another error using ipython and the python console: