This repo is the official code for KDD-24 "Can Modifying Data Address Graph Domain Adaptation?".
Our examination reveals the limitations inherent in existing model-centric methods for unsupervised graph domain adaptation (UGDA), while a data-centric method that is allowed to modify the source graph provably demonstrates considerable potential.
- By revisiting the theoretical generalization bound for UGDA, we identify two data-centric principles for UGDA: alignment principle and rescaling principle.
- Guided by these principles, we propose a novel approach, GraphAlign, which generates a small yet transferable graph. By training a GNN exclusively on this new graph with classic Empirical Risk Minimization (ERM), GraphAlign attains exceptional performance on the target graph.
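As a toy illustration of the alignment principle, the sketch below measures the Maximum Mean Discrepancy (MMD) between a small synthetic feature set and target features, the family of distance that the `mmd-un` option later refers to. This is not the repository's actual implementation: the Gaussian data and the crude mean-shift "alignment step" are assumptions made purely for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel matrix between the rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    # Biased estimate of the squared Maximum Mean Discrepancy.
    return (rbf_kernel(X, X, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
target = rng.normal(loc=2.0, size=(200, 5))  # stand-in for target-graph features
synth = rng.normal(loc=0.0, size=(20, 5))    # small synthetic source features

before = mmd2(synth, target)
# Crude illustrative alignment step: shift the synthetic features toward the
# target mean. GraphAlign instead optimizes the generated graph end-to-end.
synth_aligned = synth + (target.mean(0) - synth.mean(0))
after = mmd2(synth_aligned, target)
print(f"MMD^2 before alignment: {before:.4f}, after: {after:.4f}")
```

Aligning the synthetic features with the target distribution drives the discrepancy down, which is the intuition behind generating a transferable source graph.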
For more technical details, please refer to the following links (to be continued).
1. File Structure [Back to Top]
.
├── data
│   ├── new
│   │   ├── acm
│   │   │   └── acmv9.mat
│   │   ├── citation
│   │   │   └── citationv1.mat
│   │   └── dblp
│   │       └── dblpv7.mat
│   └── old
│       ├── acm
│       │   ├── processed
│       │   │   ├── data.pt
│       │   │   ├── pre_filter.pt
│       │   │   └── pre_transform.pt
│       │   └── raw
│       │       ├── acm_docs.txt
│       │       ├── acm_edgelist.txt
│       │       └── acm_labels.txt
│       └── dblp
│           ├── processed
│           │   ├── data.pt
│           │   ├── pre_filter.pt
│           │   └── pre_transform.pt
│           └── raw
│               ├── dblp_docs.txt
│               ├── dblp_edgelist.txt
│               └── dblp_labels.txt
├── dual_gnn
│   ├── cached_gcn_conv.py
│   ├── dataset
│   │   ├── DomainDataNew.py
│   │   ├── DomainData.py
│   │   ├── __init__.py
│   │   └── pre_cora.py
│   ├── __init__.py
│   ├── main.py
│   ├── models
│   │   ├── augmentation.py
│   │   ├── basic_gnn.py
│   │   └── SAGEEncoder.py
│   └── ppmi_conv.py
├── data.py
├── generator.py
├── models
│   ├── gat.py
│   ├── gcn.py
│   ├── mygatconv.py
│   ├── mygraphsage.py
│   ├── parametrized_adj.py
│   ├── sgc_multi.py
│   └── sgc.py
├── README.md
├── requirements.txt
├── train.py
└── utils.py
Below, we explain the important folders to help users better understand the file structure.
data.zip: needs to be unzipped; contains the data for ACMv9 (A), DBLPv7 (D), and Citationv1 (C).
dual_gnn: contains the code for loading data.
models: contains the classic GNN models used in GraphAlign.
utils: contains the class definition for GraphAlign.
2. Environment dependencies [Back to Top]
The script has been tested running under Python 3.9, with the following packages installed (along with their dependencies):
torch==1.7.0
torch_geometric==1.6.3
scipy==1.6.2
numpy==1.19.2
ogb==1.3.0
tqdm==4.59.0
torch_sparse==0.6.9
deeprobust==0.2.4
scikit_learn==1.0.2
Python module dependencies are listed in requirements.txt, which can be easily installed with pip:
pip install -r requirements.txt
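If you want to verify the pinned packages after installation, a small stdlib-only convenience sketch (not part of the repository) can compare installed versions against the pins listed above:

```python
from importlib import metadata

# Pinned versions copied from requirements.txt above.
pinned = {"torch": "1.7.0", "torch_geometric": "1.6.3", "scipy": "1.6.2",
          "numpy": "1.19.2", "ogb": "1.3.0", "tqdm": "4.59.0",
          "torch_sparse": "0.6.9", "deeprobust": "0.2.4",
          "scikit_learn": "1.0.2"}

for name, want in pinned.items():
    try:
        # Distribution names on PyPI use hyphens (e.g. scikit-learn).
        have = metadata.version(name.replace("_", "-"))
    except metadata.PackageNotFoundError:
        have = "missing"
    status = "ok" if have == want else "mismatch"
    print(f"{name}: want {want}, have {have} ({status})")
```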
3. Usage: How to run the code [Back to Top]
The GraphAlign paradigm consists of two stages: (1) generating a new source graph under the two data-centric principles, and (2) evaluating performance with ERM.
To run GraphAlign, execute train.py as follows:
python train.py \
--source <source dataset> \
--target <target dataset> \
--epoch <epoch for GraphAlign> \
--alpha <coefficient for GraphAlign> \
--gpu_id <gpu id>
For more details, the help information of the main script train.py can be obtained by executing the following command.
python train.py -h
optional arguments:
-h, --help show this help message and exit
--gpu_id GPU_ID gpu id
--source SOURCE
--target TARGET
--savefile SAVEFILE where to save the newly generated source graph
--method METHOD computation method for alignment principle (default: mmd-un)
--init_method INIT_METHOD initialization method
--surrogate SURROGATE whether to use surrogate model for alignment principle (default: True)
--epochs EPOCHS epochs number (default: 500)
--nlayers NLAYERS number of layers for GNN (default: 2)
--hidden HIDDEN (default: 256)
--lr_adj LR_ADJ (default: 0.01)
--lr_feat LR_FEAT (default: 0.01)
--lr_model LR_MODEL (default: 0.01)
--normalize_features NORMALIZE_FEATURES
--reduction_rate REDUCTION_RATE (default: 0.01)
--seed SEED Random seed.
--alpha ALPHA coefficient for regularization term (default: 30).
--beta BETA coefficient for alignment term (default: 30).
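The --reduction_rate argument controls how small the generated source graph is relative to the original, which is where the rescaling principle enters. As a hedged sketch, a per-class node budget can be derived as below; this proportional allocation is a common convention in graph condensation, not necessarily the exact scheme used in this repository.

```python
from collections import Counter

def class_budgets(labels, reduction_rate=0.01):
    # Number of synthetic nodes to generate per class, proportional to
    # the class distribution of the source graph (at least one per class).
    counts = Counter(labels)
    return {c: max(1, round(n * reduction_rate)) for c, n in counts.items()}

# Hypothetical source graph: 900 nodes of class 0 and 100 nodes of class 1.
budget = class_budgets([0] * 900 + [1] * 100, reduction_rate=0.01)
print(budget)  # {0: 9, 1: 1}
```

With reduction_rate=0.01, a 1,000-node source graph shrinks to roughly 10 synthetic nodes, which is what makes ERM training on the generated graph so cheap.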
Note that the newly generated source graph is saved at the location specified by '--savefile'.
Demo: using DBLPv7 (D) as the source graph and Citationv1 (C) as the target graph:
python train.py --source dblp --target citation --epoch 500 --dis_metric ours --alpha 30 --method mmd-un --gpu_id 0
After obtaining the newly generated graph, we evaluate it by simply conducting ERM (Empirical Risk Minimization) training. For now, this evaluation is combined into train.py so that users can directly obtain the performance after training; in the future, we will provide more diverse evaluation options.
Demo: the test.sh script can be used for evaluation as follows.
nohup bash test.sh 0 > result.out 2>&1 &
cat result.out | grep "Performance"
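If you prefer to post-process the log programmatically instead of grepping, a small stdlib-only sketch can collect the reported numbers. The exact format of the "Performance" lines in result.out is an assumption here; adjust the pattern to match your log.

```python
import re

def collect_performance(path="result.out"):
    # Extract all decimal numbers from log lines that mention "Performance".
    scores = []
    with open(path) as f:
        for line in f:
            if "Performance" in line:
                scores.extend(float(x) for x in re.findall(r"\d+\.\d+", line))
    return scores

# Usage (hypothetical log): print the best score across runs, e.g.
#   max(collect_performance("result.out"))
```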
If you have any questions about the code or the paper, feel free to contact me. Email: renh2@zju.edu.cn
If you find this work helpful, please cite (to be continued)
Part of this code is inspired by Jin et al.'s GCond.