Ginex is a GNN training system for efficient training on billion-scale datasets on a single machine, using SSD as a memory extension. Ginex accelerates the entire training procedure through provably optimal in-memory caching of the feature vectors residing on SSD, with no negative effect on training quality.
Please refer to the full paper for details (see the citation below).
Follow the instructions below to install the requirements and run a toy example using the ogbn_papers100M dataset.
- Disable `read_ahead`.
  ```
  sudo -s
  echo 0 > /sys/block/$block_device_name/queue/read_ahead_kb
  ```
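  To confirm the change took effect:
  ```
  # Should print 0 after the step above
  cat /sys/block/$block_device_name/queue/read_ahead_kb
  ```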
- Install necessary Linux packages.
  ```
  sudo apt-get install -y build-essential
  sudo apt-get install -y cgroup-tools
  sudo apt-get install -y unzip
  sudo apt-get install -y python3-pip
  pip3 install --upgrade pip
  ```
- Install a compatible NVIDIA CUDA driver and toolkit. Visit the NVIDIA CUDA Installation Guide for Linux for details.
- Install necessary Python modules.
  - PyTorch (version >= 1.9.0). Visit here for details.
  - `pip3 install tqdm`
  - `pip3 install ogb`
  - PyG. Visit here for details.
  - Ninja
    ```
    sudo wget https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip
    sudo unzip ninja-linux.zip -d /usr/local/bin/
    sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force
    ```
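    To check that the Ninja install is picked up:
    ```
    # Should print the installed version, e.g. 1.8.2
    ninja --version
    ```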
- Use cgroup to mimic the setting assumed in the paper, where the dataset size is much larger than the main memory size, with the ogbn_papers100M dataset. We recommend limiting the memory size to 8GB.
  ```
  sudo -s
  cgcreate -g memory:8gb
  echo 8000000000 > /sys/fs/cgroup/memory/8gb/memory.limit_in_bytes
  ```
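  You can verify that the limit was applied (the path below follows the cgroup v1 layout used above):
  ```
  # Should print 8000000000
  cat /sys/fs/cgroup/memory/8gb/memory.limit_in_bytes
  ```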
- Make sure to allocate enough swap space. We recommend allocating at least 4GB.
  ```
  sudo fallocate -l 4G swap.img
  sudo chmod 600 swap.img
  sudo mkswap swap.img
  sudo swapon swap.img
  ```
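  To confirm the swap file is active:
  ```
  # swap.img should be listed with a 4G size
  swapon --show
  ```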
- Clone our repository.
  ```
  git clone https://github.com/SNU-ARC/Ginex.git
  ```
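  The remaining steps assume you run the scripts from the repository root:
  ```
  cd Ginex
  ```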
- Prepare the dataset.
  ```
  python3 prepare_dataset.py
  ```
- Preprocess (neighbor cache construction).
  ```
  python3 create_neigh_cache.py --neigh-cache-size 6000000000
  ```
- Get `PYTHONPATH`.
  ```
  python3 get_pythonpath.py
  ```
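  Assuming `get_pythonpath.py` prints only the site-packages path to stdout (which the replacement instructions below imply), you can capture it into a shell variable instead of copying it by hand; `PYPATH` is just an illustrative variable name:
  ```
  # Capture the printed path for reuse in the run commands below
  PYPATH=$(python3 get_pythonpath.py)
  echo $PYPATH   # e.g. /home/user/.local/lib/python3.8/site-packages
  ```
  Then pass `PYTHONPATH=$PYPATH` in the commands below.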
- Run the baseline, i.e., PyG extended to support disk-based processing of graph datasets (denoted as PyG+ in the paper). Replace the `PYTHONPATH=...` value with the output of the `get_pythonpath.py` step above. The `-W ignore` option suppresses warnings.
  ```
  sudo PYTHONPATH=/home/user/.local/lib/python3.8/site-packages cgexec -g memory:8gb python3 -W ignore run_baseline.py
  ```
- Run Ginex. Replace the `PYTHONPATH=...` value with the output of the `get_pythonpath.py` step above. The `-W ignore` option suppresses warnings.
  ```
  sudo PYTHONPATH=/home/user/.local/lib/python3.8/site-packages cgexec -g memory:8gb python3 -W ignore run_ginex.py --neigh-cache-size 6000000000 --feature-cache-size 6000000000 --sb-size 1500
  ```
The following is the result of the toy example on our local server.
- CPU: Intel Xeon Gold 6244 CPU 8-core (16 logical cores with hyper-threading) @ 3.60GHz
- GPU: NVIDIA Tesla V100 16GB PCIe
- Memory: Samsung DDR4-2666 64GB (32GB X 2) (cgroup of 8GB is used)
- Storage: Samsung PM1725b 8TB PCIe Gen3 8-lane
- S/W: Ubuntu 18.04.5 & CUDA 11.4 & Python 3.6.9 & PyTorch 1.9
Per epoch training time (PyG+ baseline): 216.1687 sec
Per epoch training time (Ginex): 99.5562 sec
(Speedup of about 2.2x: 216.1687 / 99.5562 ≈ 2.17)
For questions, please contact:
- Yeonhong Park (parkyh96@gmail.com)
- Sunhong Min (sunhongmin@snu.ac.kr)
Please cite our paper if you find it useful for your work:
```
@inproceedings{park2022vldb,
  author    = {Yeonhong Park and Sunhong Min and Jae W. Lee},
  title     = {Ginex: SSD-enabled Billion-scale Graph Neural Network Training on a Single Machine via Provably Optimal In-memory Caching},
  booktitle = {Proceedings of the VLDB Endowment},
  volume    = {15},
  number    = {11},
  year      = {2022}
}
```