This is the official code repo for the ICML 2022 paper -- GNNRank: Learning Global Rankings from Pairwise Comparisons via Directed Graph Neural Networks. A recorded video of the live talk at ICML 2022 is available via this link. You are also welcome to read our poster.
Citing
If you find our repo or paper useful in your research, please consider adding the following citation:
@inproceedings{he2022gnnrank,
  title={GNNRank: Learning Global Rankings from Pairwise Comparisons via Directed Graph Neural Networks},
  author={He, Yixuan and Gan, Quan and Wipf, David and Reinert, Gesine D and Yan, Junchi and Cucuringu, Mihai},
  booktitle={International Conference on Machine Learning},
  pages={8581--8612},
  year={2022},
  organization={PMLR}
}
The project has been tested on the following environment specification:
- Ubuntu 18.04.6 LTS (Other x86_64 based Linux distributions should also be fine, such as Fedora 32)
- NVIDIA graphics card (NVIDIA Tesla T4 with driver version 450.142.00) and CPU (Intel Core i7-10700 @ 2.90GHz)
- Python 3.7 (and Python 3.6.12)
- CUDA 11.0 (and CUDA 9.2)
- PyTorch 1.10.1 (built against CUDA 11.0) and PyTorch 1.8.0 (built against CUDA 10.2)
- Other libraries and python packages (See below)
You should handle (1) and (2) yourself. For (3), (4), (5) and (6), we provide a list of installation steps.
We provide two examples of environment setup: one with CUDA 11.0 and a GPU, the other with CPU only.
The following steps assume you are done with (1) and (2).
- Install conda. Both Miniconda and Anaconda are OK.
- Create an environment and install python packages (GPU):
conda env create -f environment_GPU.yml
- Create an environment and install python packages (CPU):
conda env create -f environment_CPU.yml
The codebase is implemented in Python 3.6.12. Package versions used for development are listed below.
networkx 2.6.3
tqdm 4.62.3
numpy 1.20.3
pandas 1.3.4
texttable 1.6.4
latextable 0.2.1
scipy 1.7.1
argparse 1.1.0
scikit-learn 1.0.1
stellargraph 1.2.1 (for link direction prediction: conda install -c stellargraph stellargraph)
torch 1.10.1
torch-scatter 2.0.9
pyg 2.0.3 (follow https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html)
sparse 0.13.0
When installation is done, you can check your environment via:
cd execution
bash setup_test.sh
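If you prefer a quick sanity check from Python rather than the shell script, a minimal sketch (our illustration, not part of the repo; it assumes the package versions listed above) is:
# Sanity-check a few key packages and CUDA availability (illustrative only).
import torch
import torch_scatter
import torch_geometric
import scipy

print("PyTorch:", torch.__version__)
print("torch-scatter:", torch_scatter.__version__)
print("PyG:", torch_geometric.__version__)
print("SciPy:", scipy.__version__)
print("CUDA available:", torch.cuda.is_available())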
- ./execution/ stores files that can be executed to generate outputs. For the large number of experiments, we use GNU parallel, which can be downloaded from the command line and made executable via:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 ./parallel
- ./joblog/ stores job logs from parallel. You might need to create it via:
mkdir joblog
- ./Output/ stores raw outputs (ignored by Git) from parallel. You might need to create it via:
mkdir Output
- ./data/ stores processed data sets.
- ./src/ stores files to train various models, utils and metrics.
- ./result_arrays/ stores results for different data sets. Each data set has a separate subfolder.
- ./result_anlysis/ stores notebooks for generating result plots or tables.
- ./logs/ stores trained models and logs, as well as predicted results (optional). When you are in debug mode (see below), your logs will be stored in the ./debug_logs/ folder.
GNNRank provides various command line arguments, which can be viewed in ./src/param_parser.py. Some examples are listed below; a minimal argparse sketch follows the list.
--epochs INT Number of GNNRank (maximum) training epochs. Default is 1000.
--early_stopping INT Number of GNNRank early stopping epochs. Default is 200.
--num_trials INT Number of trials to generate results. Default is 10.
--lr FLOAT Initial learning rate. Default is 0.01.
--weight_decay FLOAT Weight decay (L2 loss on parameters). Default is 5*10^-4.
--dropout FLOAT Dropout rate (1 - keep probability). Default is 0.5.
--hidden INT Number of embedding dimension divided by 2. Default is 32.
--seed INT Random seed. Default is 31.
--no-cuda BOOL Disables CUDA training. Default is False.
--debug, -D BOOL Debug with minimal training setting, not to get results. Default is False.
--AllTrain, -All BOOL Whether to use all data to do gradient descent. Default is False.
--SavePred, -SP BOOL Whether to save predicted results. Default is False.
--dataset STR Data set to consider. Default is 'ERO/'.
--all_methods LST Methods to use to generate results. Default is ['btl','DIGRAC'].
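For orientation, here is a stripped-down argparse sketch mirroring a few of the options above. It is an illustration only; the actual definitions and defaults live in ./src/param_parser.py and may differ.
# Illustrative sketch of how the options above could be defined with argparse;
# see ./src/param_parser.py for the real definitions.
import argparse

parser = argparse.ArgumentParser(description="GNNRank training options (sketch)")
parser.add_argument('--epochs', type=int, default=1000, help='Maximum number of training epochs.')
parser.add_argument('--early_stopping', type=int, default=200, help='Number of early stopping epochs.')
parser.add_argument('--lr', type=float, default=0.01, help='Initial learning rate.')
parser.add_argument('--weight_decay', type=float, default=5e-4, help='Weight decay (L2 loss on parameters).')
parser.add_argument('--no-cuda', action='store_true', default=False, help='Disables CUDA training.')
parser.add_argument('--dataset', type=str, default='ERO/', help='Data set to consider.')
parser.add_argument('--all_methods', nargs='+', default=['btl', 'DIGRAC'], help='Methods to use to generate results.')

args = parser.parse_args()  # e.g. python train.py --lr 0.05 --all_methods DIGRAC ib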
First, get into the ./execution/ folder:
cd execution
To reproduce basketball results executed on CUDA device 1:
bash basketball1.sh
To reproduce results on synthetic data:
bash 0ERO.sh
Other execution files can be run in a similar way.
Note that if you are operating on CPU, you may delete the prefix ``CUDA_VISIBLE_DEVICES=xx". You can also set your own number of parallel jobs, not necessarily following the -j numbers in the .sh files.
You can also use CPU for training if you add ``--no-cuda", or GPU if you delete this flag.
First, get into the ./src/ folder:
cd src
Then, below are various options to try:
Creating a GNNRank model for the animal data set using DIGRAC as the GNN, also producing results for syncRank:
python ./train.py --all_methods DIGRAC syncRank --dataset animal
Creating a GNNRank model for ERO data with 350 nodes, using both DIGRAC and ib as GNNs and 0.05 as the learning rate:
python ./train.py --N 350 --all_methods DIGRAC ib --lr 0.05
Creating a GNNRank model for basketball data in season 2010 using all baselines excluding mvr, also saving predicted results:
python ./train.py --dataset basketball --season 2010 -SP --all_methods baselines_shorter
Creating a model for the HeadToHead data set with a specific number of trials and hidden units, using the CPU:
python ./train.py --dataset HeadToHead --no-cuda --num_trials 5 --hidden 8
- For certain applications such as financial data sets, the original adjacency matrices might be skew-symmetric with negative edge weights. For our models here, however, we need to preprocess the data so that we keep only the positive edge weights, as our current pipeline, including the loss functions, is restricted to directed unsigned networks as inputs.
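As an illustration of this preprocessing step (our sketch, not the repo's actual preprocessing code), the snippet below keeps only the positive part of a small dense NumPy adjacency matrix:
# Keep only positive edge weights of a (possibly skew-symmetric) adjacency
# matrix, yielding a directed unsigned network. Dense NumPy sketch; sparse
# matrices would need e.g. scipy.sparse instead.
import numpy as np

A = np.array([[0., 2., -1.],
              [-2., 0., 3.],
              [1., -3., 0.]])  # toy skew-symmetric matrix with signed weights

A_pos = np.maximum(A, 0.)      # drop negative entries, keep positive weights
print(A_pos)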