
[DistDGL] Dataloader throws error when sampler is not 0 for torch versions > 1.12 #5731

Open
isratnisa opened this issue May 23, 2023 · 7 comments

Comments

@isratnisa
Collaborator

🐛 Bug

The dataloader fails when the number of samplers is greater than 0 in distributed training with PyTorch versions > 1.12. The same script runs fine with PyTorch 1.12.
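For context, a minimal sketch of the kind of DistDGL setup this report describes (this is not the actual training script; the graph name, partition config, and ip_config paths are placeholders, and the number of sampler subprocesses is controlled by the launch tool's --num_samplers flag):

```python
# Hypothetical minimal sketch, not the reporter's script. Assumes a partitioned
# graph and an ip_config.txt prepared with DGL's partitioning/launch tooling.
import dgl
import torch as th

# Starts the DistDGL client; sampler subprocesses are spawned here when the
# job is launched with --num_samplers > 0 (the setting this issue is about).
dgl.distributed.initialize("ip_config.txt")
th.distributed.init_process_group(backend="gloo")

g = dgl.distributed.DistGraph("my_graph", part_config="my_graph.json")
train_nids = dgl.distributed.node_split(
    g.ndata["train_mask"], g.get_partition_book())

sampler = dgl.dataloading.NeighborSampler([10, 25])
dataloader = dgl.dataloading.DistNodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True)

# With PyTorch 1.12 this loop runs; with PyTorch 1.13/2.0 and num_samplers > 0
# the sampler subprocesses fail with KeyError: 'dataloader-0' (traceback below).
for input_nodes, seeds, blocks in dataloader:
    pass  # forward/backward pass would go here
```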

start training: elapsed time: 5.125, mem (curr: 2.251, peak: 2.251, shared: 0.562,                     global curr: 11.899, global shared: 72.445) GB
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py:1772: UserWarning: You passed find_unused_parameters=true to DistributedDataParallel, `_set_static_graph` will detect unused parameters automatically, so you do not need to set find_unused_parameters=true, just be sure these unused parameters will not change during training loop while calling `_set_static_graph`.
  warnings.warn(
Client [160] waits on 172.31.28.52:52675
Machine (0) group (0) client (13) connect to server successfuly!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 101, in init_process
    collate_fn_dict[dataloader_name](collate_args),
KeyError: 'dataloader-0'
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 114, in init_process
    raise e
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 101, in init_process
    collate_fn_dict[dataloader_name](collate_args),
KeyError: 'dataloader-0'
Client [156] waits on 172.31.28.52:39043

Will add more details.

To Reproduce

Steps to reproduce the behavior:

  1. Distributed training with num_samplers > 0
    (Will add more details; a hedged launch-command sketch follows below.)
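For illustration only (not the actual reproduction command; the workspace path, worker counts, config files, and training script name are placeholders), a typical DistDGL launch where the sampler count is raised above 0 might look like:

```
python3 tools/launch.py \
    --workspace /path/to/workspace \
    --num_trainers 4 \
    --num_samplers 1 \
    --num_servers 1 \
    --part_config my_graph.json \
    --ip_config ip_config.txt \
    "python3 train_dist.py <training args>"
```

Per the title, setting --num_samplers 0 in the same launch reportedly avoids the error.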

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 1.0.0
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.13 or PyTorch 2.0
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable): 11.6
  • GPU models and configuration (e.g. V100): T4
  • Any other relevant information:

Additional context

@Rhett-Ying
Collaborator

Could you add more details about how to reproduce this issue, and share the key part of the DistDataLoader setup?

@Rhett-Ying
Collaborator

Does this issue happen even with num_samplers=1?

@Rhett-Ying
Collaborator

While reproducing this issue, I hit another known issue: #5528 (comment)

@isratnisa
Collaborator Author

isratnisa commented May 25, 2023

@Rhett-Ying I reproduced the issue on GraphStorm: awslabs/graphstorm#199

@chang-l
Collaborator

chang-l commented May 31, 2023

Please check whether this is a duplicate of #5480, which is caused by a bug in PyTorch's ForkingPickler (the stack trace may vary due to file/data races).

