Issues with multiple samplers on torch 1.13 #199
Comments
Related issues: dmlc/dgl#5480, dmlc/dgl#5528
For bug 1, I think it's worth trying the suggestion:

> Hi @isratnisa, to verify whether this is the same issue as dmlc/dgl#5480, can you please try reverting the problematic commit (pytorch/pytorch@b25a1ce) or rebuilding PyTorch from TOT to see if it works?
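For reference, a minimal sketch of one way to test that suggestion by building PyTorch from source with the suspected commit reverted. The clone location and build steps are assumptions (see PyTorch's own build docs), and the revert may need manual conflict resolution:

```
# Hedged sketch: build PyTorch from TOT with pytorch/pytorch@b25a1ce reverted.
# Build prerequisites (CUDA toolchain, cmake, ninja) are assumed to be installed.
git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch
git revert --no-edit b25a1ce      # drop the suspected commit; may conflict
pip install -r requirements.txt
python setup.py develop           # build and install from source
```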
Hi, I am facing the same error as in #199 (comment). I have tried both suggestions.
Run command:
Error:
Hi @isratnisa, I wonder if you are using Docker when getting this error? I tried without Docker using …; in fact, I also tried …
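Not from the thread, but for anyone reproducing this inside Docker: multi-worker/multi-sampler data loading is often constrained by the container's shared-memory and file-descriptor limits. A hedged sketch of raising them (the values and image name are illustrative, not the ones used here):

```
# Illustrative only: raise shared memory and open-file limits for the container.
docker run --ipc=host --shm-size=16g --ulimit nofile=65536:65536 <your-image> bash
```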
I get the following numbers by checking the open file limits. It seems like this is not the root cause, given that those numbers are quite large:
Besides, I have also tried
This seems like a deadlock when saving models, not really a multi-sampler issue, though.
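For completeness, a sketch of how the open file limits mentioned above are typically checked on Linux (the actual numbers from this run are not reproduced here):

```
ulimit -Sn                    # per-process soft limit on open files
ulimit -Hn                    # per-process hard limit on open files
cat /proc/sys/fs/file-max     # system-wide maximum number of file handles
```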
Torch 2.0.1 resolves the issue. Verified with:
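A minimal sketch of the corresponding upgrade and version check (the exact verification run from this comment is not captured here, and any extra index URL for CUDA wheels is omitted):

```
pip install --upgrade torch==2.0.1
python -c "import torch; print(torch.__version__)"   # expect 2.0.1
```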
…377)

Resolves issue #199

Updating the torch version from `torch==1.13` to `torch==2.1.0` in the docker file. Torch versions later than `1.12` had a bug which did not allow us to use `num_samplers` > 0. The bug is resolved in the PyTorch 2.1.0 release. We have verified the solution through the following experiments.

#### Experiment setup:
Dataset: ogbn-mag (partitioned into 2)
DGL versions: '1.0.4+cu117' and '1.1.1+cu113'
Torch versions: '2.1.0+cu118'

### Experiment 1: 1 trainer and 4 samplers
```
python3 -u /dgl/tools/launch.py --workspace /graph-storm/python/graphstorm/run/gsgnn_lp --num_trainers 1 --num_servers 1 --num_samplers 4 --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json --ip_config /data/ip_list_p2.txt --ssh_port 2222 --graph_format csc,coo "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"
```
Output:
```
Epoch 00000 | Batch 000 | Train Loss: 13.5191 | Time: 3.2363
Epoch 00000 | Batch 020 | Train Loss: 3.2547 | Time: 0.4499
Epoch 00000 | Batch 040 | Train Loss: 2.0744 | Time: 0.5477
Epoch 00000 | Batch 060 | Train Loss: 1.6599 | Time: 0.5524
Epoch 00000 | Batch 080 | Train Loss: 1.4543 | Time: 0.4597
Epoch 00000 | Batch 100 | Train Loss: 1.2397 | Time: 0.4665
Epoch 00000 | Batch 120 | Train Loss: 1.0915 | Time: 0.4823
Epoch 00000 | Batch 140 | Train Loss: 0.9683 | Time: 0.4576
Epoch 00000 | Batch 160 | Train Loss: 0.8798 | Time: 0.5382
Epoch 00000 | Batch 180 | Train Loss: 0.7762 | Time: 0.5681
Epoch 00000 | Batch 200 | Train Loss: 0.7021 | Time: 0.4492
Epoch 00000 | Batch 220 | Train Loss: 0.6619 | Time: 0.4450
Epoch 00000 | Batch 240 | Train Loss: 0.6001 | Time: 0.4437
Epoch 00000 | Batch 260 | Train Loss: 0.5591 | Time: 0.4540
Epoch 00000 | Batch 280 | Train Loss: 0.5115 | Time: 0.3577
Epoch 0 take 134.6200098991394
```

### Experiment 2: 4 trainers and 4 samplers
```
python3 -u /dgl/tools/launch.py --workspace /graph-storm/python/graphstorm/run/gsgnn_lp --num_trainers 4 --num_servers 1 --num_samplers 4 --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json --ip_config /data/ip_list_p2.txt --ssh_port 2222 --graph_format csc,coo "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"
```
Output:
```
Epoch 00000 | Batch 000 | Train Loss: 11.1130 | Time: 4.6957
Epoch 00000 | Batch 020 | Train Loss: 3.3098 | Time: 0.7897
Epoch 00000 | Batch 040 | Train Loss: 1.9996 | Time: 0.8633
Epoch 00000 | Batch 060 | Train Loss: 1.5202 | Time: 0.4229
Epoch 0 take 56.44491267204285
successfully save the model to /data/ogbn-map-lp/model/epoch-0
Time on save model 5.461951017379761
```

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
🐛 Bug
The training script for link prediction does not work with multiple samplers on PyTorch 1.13. So far, three different bugs have been found. In summary:
- `KeyError: 'dataloader-0'`
- error from `dgl/distributed/dist_context.py`
Note:
Details
Bug 1:
Run command:
Error:
Bug 2:
Run command:
Error:
Bug 3:
Error:
Environment