Update docker to Torch 2.1.0+CUDA11.8 to resolve multi-sampler issue (#377)

Resolves issue #199

Update the torch version from `torch==1.13` to `torch==2.1.0` in the
Dockerfile. PyTorch versions later than `1.12` had a bug that prevented
the use of `num_samplers` > 0. The bug is resolved in the PyTorch 2.1.0
release. We verified the fix with the following experiments.
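
The version rule described above can be sketched as a small guard. This is only an illustration, not code from GraphStorm: the helper names `supports_multiple_samplers` and `safe_num_samplers` are hypothetical, and treating every `1.12.x` patch release as "<= 1.12" is an assumption. It also does not cover the separate SageMaker restriction noted in the README change.

```python
import re


def supports_multiple_samplers(torch_version: str) -> bool:
    """Return True if the version is <= 1.12.x or >= 2.1.0 (hypothetical helper)."""
    # Accept strings such as "2.1.0+cu118" or "1.12"; the local tag (+cu118) is ignored.
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", torch_version)
    if match is None:
        raise ValueError(f"Unrecognized PyTorch version: {torch_version!r}")
    version = tuple(int(part) if part is not None else 0 for part in match.groups())
    # Assumption: all 1.12.x patch releases count as "<= 1.12".
    return version < (1, 13, 0) or version >= (2, 1, 0)


def safe_num_samplers(torch_version: str, requested: int) -> int:
    """Fall back to 0 samplers when the PyTorch version is in the affected range."""
    return requested if supports_multiple_samplers(torch_version) else 0


if __name__ == "__main__":
    for version in ("1.12.1", "1.13.1+cu116", "2.0.1", "2.1.0+cu118"):
        print(version, safe_num_samplers(version, 4))
```

In practice the version string would come from `torch.__version__`, and the result would be passed as the `--num_samplers` value of the launch command.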

### Experiment setup:
Dataset: ogbn-mag (partitioned into 2)
DGL versions: '1.0.4+cu117' and '1.1.1+cu113'
Torch versions: '2.1.0+cu118'

### Experiment 1:
1 trainer and 4 samplers
```
python3 -u /dgl/tools/launch.py \
     --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
     --num_trainers 1 \
     --num_servers 1 \
     --num_samplers 4 \
     --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json \
     --ip_config /data/ip_list_p2.txt \
     --ssh_port 2222 \
     --graph_format csc,coo \
     "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"
```
Output:
```
Epoch 00000 | Batch 000 | Train Loss: 13.5191 | Time: 3.2363
Epoch 00000 | Batch 020 | Train Loss: 3.2547 | Time: 0.4499
Epoch 00000 | Batch 040 | Train Loss: 2.0744 | Time: 0.5477
Epoch 00000 | Batch 060 | Train Loss: 1.6599 | Time: 0.5524
Epoch 00000 | Batch 080 | Train Loss: 1.4543 | Time: 0.4597
Epoch 00000 | Batch 100 | Train Loss: 1.2397 | Time: 0.4665
Epoch 00000 | Batch 120 | Train Loss: 1.0915 | Time: 0.4823
Epoch 00000 | Batch 140 | Train Loss: 0.9683 | Time: 0.4576
Epoch 00000 | Batch 160 | Train Loss: 0.8798 | Time: 0.5382
Epoch 00000 | Batch 180 | Train Loss: 0.7762 | Time: 0.5681
Epoch 00000 | Batch 200 | Train Loss: 0.7021 | Time: 0.4492
Epoch 00000 | Batch 220 | Train Loss: 0.6619 | Time: 0.4450
Epoch 00000 | Batch 240 | Train Loss: 0.6001 | Time: 0.4437
Epoch 00000 | Batch 260 | Train Loss: 0.5591 | Time: 0.4540
Epoch 00000 | Batch 280 | Train Loss: 0.5115 | Time: 0.3577
Epoch 0 take 134.6200098991394
```

### Experiment 2:
4 trainers and 4 samplers
```
python3 -u /dgl/tools/launch.py \
     --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
     --num_trainers 4 \
     --num_servers 1 \
     --num_samplers 4 \
     --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json \
     --ip_config /data/ip_list_p2.txt \
     --ssh_port 2222 \
     --graph_format csc,coo \
     "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"
```
Output:
```
Epoch 00000 | Batch 000 | Train Loss: 11.1130 | Time: 4.6957
Epoch 00000 | Batch 020 | Train Loss: 3.3098 | Time: 0.7897
Epoch 00000 | Batch 040 | Train Loss: 1.9996 | Time: 0.8633
Epoch 00000 | Batch 060 | Train Loss: 1.5202 | Time: 0.4229
Epoch 0 take 56.44491267204285
successfully save the model to /data/ogbn-map-lp/model/epoch-0
Time on save model 5.461951017379761
```



By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
isratnisa authored Oct 27, 2023
1 parent 1230ac6 commit 51b4256
Showing 3 changed files with 8 additions and 6 deletions.
4 changes: 3 additions & 1 deletion README.md
@@ -107,7 +107,9 @@ python3 -m graphstorm.run.gs_link_prediction \
## Limitation
GraphStorm framework now supports using CPU or NVidia GPU for model training and inference. But it only works with PyTorch-gloo backend. It was only tested on AWS CPU instances or AWS GPU instances equipped with NVidia GPUs including P4, V100, A10 and A100.

-Multiple samplers are not supported for PyTorch versions greater than 1.12. Please use `--num-samplers 0` when your PyTorch version is above 1.12. You can find more details [here](https://github.com/awslabs/graphstorm/issues/199).
+Multiple samplers are supported in PyTorch versions <= 1.12 and >= 2.1.0. Please use `--num-samplers 0` for other PyTorch versions. More details [here](https://github.com/awslabs/graphstorm/issues/199).
+
+To use multiple samplers on sagemaker please use PyTorch versions <= 1.12.

## License
This project is licensed under the Apache-2.0 License.
4 changes: 2 additions & 2 deletions docker/Dockerfile.local
@@ -9,7 +9,7 @@ RUN apt-get install -y python3-pip git wget psmisc
RUN apt-get install -y cmake

# Install Pytorch
-RUN pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
+RUN pip3 install torch==2.1.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

# Install DGL
RUN pip3 install dgl==1.0.4+cu117 -f https://data.dgl.ai/wheels/cu117/repo.html
@@ -49,4 +49,4 @@ RUN cp ${SSHDIR}/id_rsa.pub ${SSHDIR}/authorized_keys

EXPOSE 2222
RUN mkdir /run/sshd
-CMD ["/usr/sbin/sshd", "-D"]
+CMD ["/usr/sbin/sshd", "-D"]
6 changes: 3 additions & 3 deletions docs/source/install/env-setup.rst
@@ -29,20 +29,20 @@ Users can use ``pip`` or ``pip3`` to install GraphStorm.
Install Dependencies
.....................
-Users should install PyTorch v1.13.1 and DGL v1.0.4 that is the core dependency of GraphStorm using the following commands.
+Users should install PyTorch v2.1.0 and DGL v1.0.4 that is the core dependency of GraphStorm using the following commands.

For Nvidia GPU environment:

.. code-block:: bash
-pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
+pip3 install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install dgl==1.0.4+cu117 -f https://data.dgl.ai/wheels/cu117/repo.html
For CPU environment:

.. code-block:: bash
-pip install torch==1.13.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu
+pip3 install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install dgl==1.0.4 -f https://data.dgl.ai/wheels-internal/repo.html
Configure SSH No-password login