
Update docker to Torch 2.1.0+CUDA11.8 to resolve multi-sampler issue #377

Merged
isratnisa merged 17 commits into awslabs:main on Oct 27, 2023

Conversation

@isratnisa (Contributor) commented Aug 10, 2023:

Resolves issue #199

This PR updates the torch version from torch==1.13 to torch==2.1.0 in the Docker file. Torch versions later than 1.12 had a bug that prevented us from using num_samplers > 0; the bug is fixed in the PyTorch 2.1.0 release. We have verified the fix through the following experiments.
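For context, a minimal sketch of the kind of change this implies in the Docker file (the exact pinned versions, wheel index URLs, and surrounding layers in the real docker/Dockerfile.local may differ; treat this as illustrative only):

    # Before: PyTorch 1.13 built against CUDA 11.7 (illustrative)
    # RUN pip3 install torch==1.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
    # After: PyTorch 2.1.0 built against CUDA 11.8
    RUN pip3 install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118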

Experiment setup:

Dataset: ogbn-mag (partitioned into 2)
DGL versions: '1.0.4+cu117' and '1.1.1+cu113'
Torch version: '2.1.0+cu118'
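As a quick sanity check inside the updated container (a hypothetical verification step, not part of the original report), the installed versions can be confirmed with:

    python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
    python3 -c "import dgl; print(dgl.__version__)"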

Experiment 1:

1 trainer and 4 samplers

python3 -u /dgl/tools/launch.py \
    --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
    --num_trainers 1 \
    --num_servers 1 \
    --num_samplers 4 \
    --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json \
    --ip_config /data/ip_list_p2.txt \
    --ssh_port 2222 \
    --graph_format csc,coo \
    "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"

Output:

Epoch 00000 | Batch 000 | Train Loss: 13.5191 | Time: 3.2363
Epoch 00000 | Batch 020 | Train Loss: 3.2547 | Time: 0.4499
Epoch 00000 | Batch 040 | Train Loss: 2.0744 | Time: 0.5477
Epoch 00000 | Batch 060 | Train Loss: 1.6599 | Time: 0.5524
Epoch 00000 | Batch 080 | Train Loss: 1.4543 | Time: 0.4597
Epoch 00000 | Batch 100 | Train Loss: 1.2397 | Time: 0.4665
Epoch 00000 | Batch 120 | Train Loss: 1.0915 | Time: 0.4823
Epoch 00000 | Batch 140 | Train Loss: 0.9683 | Time: 0.4576
Epoch 00000 | Batch 160 | Train Loss: 0.8798 | Time: 0.5382
Epoch 00000 | Batch 180 | Train Loss: 0.7762 | Time: 0.5681
Epoch 00000 | Batch 200 | Train Loss: 0.7021 | Time: 0.4492
Epoch 00000 | Batch 220 | Train Loss: 0.6619 | Time: 0.4450
Epoch 00000 | Batch 240 | Train Loss: 0.6001 | Time: 0.4437
Epoch 00000 | Batch 260 | Train Loss: 0.5591 | Time: 0.4540
Epoch 00000 | Batch 280 | Train Loss: 0.5115 | Time: 0.3577
Epoch 0 take 134.6200098991394

Experiment 2:

4 trainers and 4 samplers:

python3 -u /dgl/tools/launch.py \
    --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
    --num_trainers 4 \
    --num_servers 1 \
    --num_samplers 4 \
    --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json \
    --ip_config /data/ip_list_p2.txt \
    --ssh_port 2222 \
    --graph_format csc,coo \
    "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"

Output:

Epoch 00000 | Batch 000 | Train Loss: 11.1130 | Time: 4.6957
Epoch 00000 | Batch 020 | Train Loss: 3.3098 | Time: 0.7897
Epoch 00000 | Batch 040 | Train Loss: 1.9996 | Time: 0.8633
Epoch 00000 | Batch 060 | Train Loss: 1.5202 | Time: 0.4229
Epoch 0 take 56.44491267204285
successfully save the model to /data/ogbn-map-lp/model/epoch-0
Time on save model 5.461951017379761

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@classicsong (Contributor):

Did you test it with multiple samplers? Any performance improvement?

@classicsong (Contributor):

Can you also update the README and wiki?

@isratnisa isratnisa self-assigned this Aug 11, 2023
@isratnisa isratnisa requested a review from classicsong August 11, 2023 00:49
@isratnisa (Contributor, Author):

@classicsong Yes! I tried it with the ogbn-mag dataset and it worked, but I didn't see any performance improvement, probably because the dataset is too small.
Once the PR is merged I will update the Wiki and README.

@isratnisa isratnisa requested a review from zhjwy9343 August 11, 2023 00:53
@classicsong (Contributor):

> @classicsong Yes! I tried it with the ogbn-mag dataset and it worked, but I didn't see any performance improvement, probably because the dataset is too small. Once the PR is merged I will update the Wiki and README.

You can put the updates to the README and rst files in the same PR.

@isratnisa (Contributor, Author):

@classicsong Any reason we are still using DGL 1.0.4 and not the latest release, 1.1.1?

@zheng-da (Contributor):

Does GraphStorm work with DGL 1.1.1?

@isratnisa (Contributor, Author):

@zheng-da I have been using DGL 1.1.1 for GSF for a while now. Is there any known breaking change?

@classicsong (Contributor):

> @zheng-da I have been using DGL 1.1.1 for GSF for a while now. Is there any known breaking change?

We don't know. Let's run our regression test with DGL 1.1.1.

@zhjwy9343 (Contributor):

If this works, could you also revise this rst file to give the proper Torch and DGL installation commands?

https://github.com/awslabs/graphstorm/blob/main/docs/source/install/env-setup.rst#install-dependencies
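For reference, the updated installation commands in env-setup.rst might look roughly like the following (a hedged sketch; the exact DGL version pin and wheel index URLs should follow the official PyTorch and DGL installation pages):

    pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
    pip install dgl==1.1.1 -f https://data.dgl.ai/wheels/cu118/repo.html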

@zhjwy9343 (Contributor) left a review comment:

Please update the rst file too. LGTM

@classicsong classicsong added 0.2.1 and removed v0.2 labels Sep 29, 2023
@isratnisa isratnisa changed the title Update docker to Torch 2.0.1+CUDA11.7 to resolve multi-sampler issue Update docker to Torch 2.1.0+CUDA11.8 to resolve multi-sampler issue Oct 5, 2023
@zheng-da zheng-da added the ready (able to trigger the CI) label Oct 6, 2023
@isratnisa isratnisa merged commit 51b4256 into awslabs:main Oct 27, 2023
6 checks passed
@isratnisa isratnisa deleted the update-torch branch October 27, 2023 17:54