
This is the authors' implementation of the NeurIPS 2022 Datasets and Benchmarks Track paper "A Comprehensive Study on Large Scale Graph Training: Benchmarking and Rethinking" in PyTorch.

Authors: Keyu Duan, Zirui Liu, Wenqing Zheng, Peihao Wang, Kaixiong Zhou, Tianlong Chen, Zhangyang Wang, Xia Hu.

Introduction

A bag of approaches for training GNNs on large-scale graphs, including methods based on sub-graph sampling, precomputation, and label propagation.

Requirements

We recommend using Anaconda to manage the Python environment. To create the environment for our benchmark, please follow the instructions below.

conda create -n $your_env_name
conda activate $your_env_name

Install PyTorch following the official PyTorch installation instructions:

conda install pytorch torchvision torchaudio cudatoolkit -c pytorch

Install PyTorch Geometric following the PyG installation instructions:

conda install pyg -c pyg -c conda-forge

Install the other dependencies:

pip install ogb # package for ogb datasets
pip install texttable # show the running hyperparameters
pip install h5py # for Label Propagation
cd GraphSampling && pip install -v -e . # install our implemented sampler

Our Installation Notes for torch-geometric

Environment configurations we tried that succeeded: Mac/Linux + CUDA driver 11.2 + PyTorch built with CUDA 11.1 + torch_geometric/torch_sparse/etc. built with CUDA 11.1.

Environment configurations we tried that did not work: Linux + CUDA 11.1/11.0/10.2 + any version of PyTorch.

In the case that worked, we used the following installation commands; they automatically downloaded prebuilt wheels and the installation completed within seconds. Commands that worked on Linux with CUDA driver 11.2:

  pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
  pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html
  pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html
  pip install torch-geometric
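
As a quick sanity check (our suggestion, not part of the original instructions), you can verify from Python that the installed packages and their CUDA builds are consistent:

import torch
import torch_scatter
import torch_sparse
import torch_geometric

print(torch.__version__, torch.version.cuda)  # e.g. 1.9.0+cu111 / 11.1
print(torch_geometric.__version__)
print(torch.cuda.is_available())              # should be True on a GPU machine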

At this point, you should be able to run all of our implemented models except Label Propagation (LP). To run LP, please follow the installation notes below.

Installation guide for Julia (only required for certain modes of Label Propagation, inherited from C&S)

First install Julia and PyJulia, following the instructions below or those at https://pyjulia.readthedocs.io/en/latest/installation.html#install-julia.

Installation guide for PyJulia on Linux:

Download Julia from the official website and extract it to any directory on your machine; the extracted folder contains a bin/ directory.

export PATH=$PATH:/path-to-your-extracted-julia/bin

After this step, running julia in a terminal should show the Julia logo.

python3 -m pip install --user julia

Then use Python to finish the PyJulia setup:

>>> import julia
>>> julia.install()

Finally, start Julia and install the required packages. Run julia until you see the julia> prompt, then type the following lines in the Julia console:

import Pkg; Pkg.add("LinearMaps")
import Pkg; Pkg.add("Arpack")
import Pkg; Pkg.add("MAT")

Play with our implemented models

To train a scalable graph training model, simply run:

python main.py --cuda_num=0  --type_model=$type_model --dataset=$dataset
# type_model in ['GraphSAGE', 'FastGCN', 'LADIES', 'ClusterGCN', 'GraphSAINT', 'SGC', 'SIGN', 'SIGN_MLP', 'LP_Adj', 'SAGN', 'GAMLP']
# dataset in ['Flickr', 'Reddit', 'Products', 'Yelp', 'AmazonProducts']

To test the throughput and memory usage for a certain model on a dataset, simply add --debug_mem_speed

python main.py --cuda_num=0  --type_model=$type_model --dataset=$dataset --debug_mem_speed

To perform the same greedy hyperparameter search as described in our paper, please run

python run_HP.py $cuda_num $type_model $dataset

For detailed configuration, please refer to run_HP.py.
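
For intuition, greedy search tunes one hyperparameter at a time while keeping the best values found so far. The sketch below is a simplified illustration; the function names and search grids here are hypothetical, and the actual ones live in run_HP.py.

# Hypothetical sketch of greedy (coordinate-wise) hyperparameter search;
# the real grids and training entry point are defined in run_HP.py.
def greedy_search(grids, train_and_eval, base_config):
    # grids: {hyperparameter name -> list of candidate values}
    # train_and_eval: hypothetical callable returning validation accuracy for a config
    # base_config: default value for every hyperparameter
    best_config = dict(base_config)
    for name, candidates in grids.items():      # tune one hyperparameter at a time
        best_acc, best_value = float("-inf"), best_config[name]
        for value in candidates:
            acc = train_and_eval({**best_config, name: value})
            if acc > best_acc:
                best_acc, best_value = acc, value
        best_config[name] = best_value          # keep the best value before moving on
    return best_config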

Reproduce results of EnGCN

Updates: As reported in the issue on label leakage, our originally reported EnGCN results were not correct. We retested EnGCN on the four datasets and updated the code in the corresponding PR; the updated code is required to reproduce the results correctly. We have updated the arXiv version accordingly.

| EnGCN | Flickr | Reddit | ogbn-arxiv | ogbn-products |
| --- | --- | --- | --- | --- |
| Test Accuracy | 56.2094 ± 0.0063 | 96.6573 ± 0.0673 | 71.5892 ± 0.9943 | 75.7893 ± 0.0828 |
| Validation Accuracy | 55.7234 ± 0.0793 | 96.7060 ± 0.0326 | 73.1803 ± 0.0453 | 90.0440 ± 0.0943 |

To reproduce the results, simply run

# dataset = [Flickr, Reddit, ogbn-arxiv, ogbn-products]
bash scripts/$dataset/EnGCN.sh

Some tricks for reducing the memory footprint

  1. When using PyG, as illustrated in the official post, it is recommended to use the transposed sparse matrix instead of the edge_index, which can significantly reduce both the memory footprint and the computation overhead. PyG provides a transform called ToSparseTensor to convert the edge index into the transposed sparse adjacency matrix.

  2. PyG can be used with mixed-precision training or NVIDIA Apex to significantly reduce the memory footprint. Note that the SpMM operator has officially supported half precision since the end of August; you might need to upgrade the torch_sparse package to use this feature. A short sketch of both tricks follows this list.
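
Below is a minimal sketch of both tricks on the Flickr dataset with a single GCN layer; the model, learning rate, and data path are placeholder choices for illustration, not the benchmark's actual training setup.

import torch
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.datasets import Flickr
from torch_geometric.nn import GCNConv

# Trick 1: ToSparseTensor() replaces data.edge_index with a transposed SparseTensor data.adj_t.
dataset = Flickr(root='./data/Flickr', transform=T.ToSparseTensor())
data = dataset[0].to('cuda')

model = GCNConv(dataset.num_features, dataset.num_classes).to('cuda')
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Trick 2: mixed-precision training with torch.cuda.amp (requires a recent torch_sparse for half-precision SpMM).
scaler = torch.cuda.amp.GradScaler()
optimizer.zero_grad()
with torch.cuda.amp.autocast():
    out = model(data.x, data.adj_t)  # GCNConv accepts the SparseTensor adjacency directly
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()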

Citation

If you find this repo useful, please star the repo and cite:

@inproceedings{duancomprehensive,
  title={A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking},
  author={Duan, Keyu and Liu, Zirui and Wang, Peihao and Zheng, Wenqing and Zhou, Kaixiong and Chen, Tianlong and Hu, Xia and Wang, Zhangyang},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year=2022,
}
