
Update docker to Torch 2.1.0+CUDA11.8 to resolve multi-sampler issue #377

Merged
isratnisa merged 17 commits into awslabs:main on Oct 27, 2023

Conversation

@isratnisa (Contributor) commented Aug 10, 2023:

Resolves issue #199

This PR updates the torch version from torch==1.13 to torch==2.1.0 in the Docker file. Torch versions later than 1.12 had a bug that prevented us from using num_samplers > 0; the bug is fixed in the PyTorch 2.1.0 release. We have verified the fix through the following experiments.
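For context, a minimal sketch of the kind of change this implies in the Docker file (the exact pinned versions, wheel index URLs, and surrounding layers in the real docker/Dockerfile.local may differ; treat this as illustrative only):

    # Before: PyTorch 1.13 built against CUDA 11.7 (illustrative)
    # RUN pip3 install torch==1.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
    # After: PyTorch 2.1.0 built against CUDA 11.8
    RUN pip3 install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118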

Experiment setup:

Dataset: ogbn-mag (partitioned into 2)
DGL versions: '1.0.4+cu117' and '1.1.1+cu113'
Torch version: '2.1.0+cu118'
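As a quick sanity check inside the updated container (a hypothetical verification step, not part of the original report), the installed versions can be confirmed with:

    python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
    python3 -c "import dgl; print(dgl.__version__)"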

Experiment 1:

1 trainer and 4 samplers

python3 -u /dgl/tools/launch.py \
    --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
    --num_trainers 1 \
    --num_servers 1 \
    --num_samplers 4 \
    --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json \
    --ip_config /data/ip_list_p2.txt \
    --ssh_port 2222 \
    --graph_format csc,coo \
    "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"

Output:

Epoch 00000 | Batch 000 | Train Loss: 13.5191 | Time: 3.2363
Epoch 00000 | Batch 020 | Train Loss: 3.2547 | Time: 0.4499
Epoch 00000 | Batch 040 | Train Loss: 2.0744 | Time: 0.5477
Epoch 00000 | Batch 060 | Train Loss: 1.6599 | Time: 0.5524
Epoch 00000 | Batch 080 | Train Loss: 1.4543 | Time: 0.4597
Epoch 00000 | Batch 100 | Train Loss: 1.2397 | Time: 0.4665
Epoch 00000 | Batch 120 | Train Loss: 1.0915 | Time: 0.4823
Epoch 00000 | Batch 140 | Train Loss: 0.9683 | Time: 0.4576
Epoch 00000 | Batch 160 | Train Loss: 0.8798 | Time: 0.5382
Epoch 00000 | Batch 180 | Train Loss: 0.7762 | Time: 0.5681
Epoch 00000 | Batch 200 | Train Loss: 0.7021 | Time: 0.4492
Epoch 00000 | Batch 220 | Train Loss: 0.6619 | Time: 0.4450
Epoch 00000 | Batch 240 | Train Loss: 0.6001 | Time: 0.4437
Epoch 00000 | Batch 260 | Train Loss: 0.5591 | Time: 0.4540
Epoch 00000 | Batch 280 | Train Loss: 0.5115 | Time: 0.3577
Epoch 0 take 134.6200098991394

Experiment 2:

4 trainers and 4 samplers:

python3 -u /dgl/tools/launch.py \
    --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
    --num_trainers 4 \
    --num_servers 1 \
    --num_samplers 4 \
    --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json \
    --ip_config /data/ip_list_p2.txt \
    --ssh_port 2222 \
    --graph_format csc,coo \
    "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"

Output:

Epoch 00000 | Batch 000 | Train Loss: 11.1130 | Time: 4.6957
Epoch 00000 | Batch 020 | Train Loss: 3.3098 | Time: 0.7897
Epoch 00000 | Batch 040 | Train Loss: 1.9996 | Time: 0.8633
Epoch 00000 | Batch 060 | Train Loss: 1.5202 | Time: 0.4229
Epoch 0 take 56.44491267204285
successfully save the model to /data/ogbn-map-lp/model/epoch-0
Time on save model 5.461951017379761

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@classicsong (Contributor):

Did you test it with multiple samplers? Any performance improvement?

@classicsong (Contributor):

Can you also update the README and wiki?

@isratnisa isratnisa self-assigned this Aug 11, 2023
@isratnisa isratnisa requested a review from classicsong August 11, 2023 00:49
@isratnisa (Contributor, Author):

@classicsong Yes! I tried it with the ogbn-mag dataset and it worked, but I didn't see any performance improvement, probably because the dataset is too small.
Once the PR is merged I will update the Wiki and README.

@isratnisa isratnisa requested a review from zhjwy9343 August 11, 2023 00:53
@classicsong (Contributor):

> @classicsong Yes! I tried it with the ogbn-mag dataset and it worked, but I didn't see any performance improvement, probably because the dataset is too small. Once the PR is merged I will update the Wiki and README.

You can put the updates to the README and rst files in the same PR.

@isratnisa (Contributor, Author):

@classicsong Any reason we are still using DGL 1.0.4 and not the latest release, 1.1.1?

@zheng-da (Contributor):

Does GraphStorm work with DGL 1.1.1?

@isratnisa (Contributor, Author):

@zheng-da I have been using DGL 1.1.1 for GSF for a while now. Is there any known breaking change?

@classicsong (Contributor):

> @zheng-da I have been using DGL 1.1.1 for GSF for a while now. Is there any known breaking change?

We don't know. Let's run our regression test with DGL 1.1.1.

@zhjwy9343 (Contributor):

If this works, could you also revise this rst file to give the proper Torch and DGL installation commands?

https://github.com/awslabs/graphstorm/blob/main/docs/source/install/env-setup.rst#install-dependencies
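For reference, the updated installation commands in env-setup.rst might look roughly like the following (a hedged sketch; the exact DGL version pin and wheel index URLs should follow the official PyTorch and DGL installation pages):

    pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
    pip install dgl==1.1.1 -f https://data.dgl.ai/wheels/cu118/repo.html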

@zhjwy9343 (Contributor) left a review comment:

Please update the rst file too. LGTM

@classicsong classicsong added 0.2.1 and removed v0.2 labels Sep 29, 2023
@isratnisa isratnisa changed the title Update docker to Torch 2.0.1+CUDA11.7 to resolve multi-sampler issue Update docker to Torch 2.1.0+CUDA11.8 to resolve multi-sampler issue Oct 5, 2023
@zheng-da zheng-da added the ready (able to trigger the CI) label Oct 6, 2023
@isratnisa isratnisa merged commit 51b4256 into awslabs:main Oct 27, 2023
6 checks passed
@isratnisa isratnisa deleted the update-torch branch October 27, 2023 17:54