# LiteGEM: Solution to PCQM4M-LSC of KDD Cup 2021

- Please refer to our technical report for details of the implementation and performance.

## Installation requirements

```
ogb==1.3.0
rdkit>=2019.03.1
obabel>=3.1.0
torch>=1.7.0
paddlepaddle-gpu>=2.1.0
pgl>=2.1.4
```

## Data preparation

Under the root directory, run the following commands to download the original PCQM4M dataset, the DFT results for the auxiliary tasks, and the cross-validation split indexes.

```bash
mkdir dataset && cd dataset
wget http://ogb-data.stanford.edu/data/lsc/pcqm4m_kddcup2021.zip
unzip pcqm4m_kddcup2021.zip
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/PCQM_pretrain/sdf.tar.gz
mv sdf.tar.gz pcqm_pyscf_sdf.tar.gz
tar -xzvf pcqm_pyscf_sdf.tar.gz
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/PCQM_pretrain/cross_split.pkl
cd ..
```
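After running the commands above, a quick sanity check can confirm that the downloaded archives and split file landed in the right place. A minimal sketch (the helper `check_dataset_files` and the exact file list are our own illustration, not part of the repo):

```python
from pathlib import Path

# Entries expected under ./dataset after the download commands above
# (names taken from those commands; adjust if the layout differs).
EXPECTED = [
    "pcqm4m_kddcup2021.zip",
    "pcqm_pyscf_sdf.tar.gz",
    "cross_split.pkl",
]

def check_dataset_files(root):
    """Return the list of expected entries missing under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]

if __name__ == "__main__":
    missing = check_dataset_files("dataset")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected dataset files are present.")
```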

## Directory structure

```
|-- ogbg_lsc
    |-- README.md
    |-- src             # scripts
    |-- models          # model definition
    |-- utils
    |-- outputs         # model predictions (generated by training)
    |-- dataset         # the original dataset and custom splits
    |-- checkpoints     # model checkpoints (generated by training)
    |-- ensemble        # submitted predictions and code for ensemble
    |-- logs            # training logs (generated by training)
```
## Configuration

Model hyper-parameters and other arguments are all defined in `./src/config.yaml`.
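Conceptually, a YAML-driven setup like this amounts to a dictionary of defaults that command-line flags can override. A minimal stdlib sketch (the keys `lr`, `batch_size`, and `num_layers` are invented for illustration and are not the repo's actual schema):

```python
import argparse

# Illustrative defaults only -- the real keys live in ./src/config.yaml.
DEFAULTS = {"lr": 1e-3, "batch_size": 256, "num_layers": 12}

def parse_args(argv=None):
    """Build a config dict from defaults, with optional CLI overrides."""
    parser = argparse.ArgumentParser()
    for key, value in DEFAULTS.items():
        parser.add_argument(f"--{key}", type=type(value), default=value)
    return vars(parser.parse_args(argv))

if __name__ == "__main__":
    # Override only the learning rate; everything else keeps its default.
    print(parse_args(["--lr", "0.01"]))
```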

## How to run

Our pipeline consists of four steps:

1. **Data preprocessing**

   ```bash
   cd ./features
   python mol_tree.py          # takes about 30 minutes
   ```
2. **Model training**

   Train the model with 2-fold cross-validation:

   ```bash
   cd ./src
   . ./cross_run.sh 0 1        # training on the whole dataset takes about 10 days;
                               # "0 1" specifies the CUDA devices
   ```

   Alternatively, train a single model using only the original validation set:

   ```bash
   cd ./src
   export CUDA_VISIBLE_DEVICES=0
   python main.py --config config.yaml
   ```
3. **Test inference (optional)**

   There is no need to run inference separately, since it is included in the training program. If needed, set the `infer_from` hyper-parameter in `./src/config.yaml` to the saved model path after training, then run:

   ```bash
   cd ./src
   python test.py --config config.yaml --output_path ./test_result
   ```

   The test result will be saved in `./src/test_result`.

4. **Ensemble**

   Copy the model predictions and run the ensemble:

   ```bash
   cd ../outputs
   rsync -av * ../ensemble/model_pred/new_run

   cd ../ensemble
   python ensemble.py
   ```

The whole training/ensemble pipeline is collectively defined in `./src/main.sh`. As a shortcut, start the default training with 2-fold cross-validation:

```bash
cd ./src
sh main.sh
```

## Performance

| Model | Test MAE | #Parameters | Hardware |
| --- | --- | --- | --- |
| LiteGEM | 0.1204 | 74M | Nvidia Tesla P40 (24GB GPU) |
| GIN\* | 0.1678 | 3.8M | GeForce RTX 2080 (11GB GPU) |
| GIN-virtual\* | 0.1487 | 6.7M | GeForce RTX 2080 (11GB GPU) |
| GCN\* | 0.1838 | 2.0M | GeForce RTX 2080 (11GB GPU) |
| GCN-virtual\* | 0.1579 | 4.9M | GeForce RTX 2080 (11GB GPU) |
| MLP+Fingerprint\* | 0.2068 | 16.1M | GeForce RTX 2080 (11GB GPU) |

\* Results copied from the baseline performance for PCQM4M-LSC.

## Citation

```bibtex
@misc{fang2021chemrlgem,
    title={ChemRL-GEM: Geometry Enhanced Molecular Representation Learning for Property Prediction},
    author={Xiaomin Fang and Lihang Liu and Jieqiong Lei and Donglong He and Shanzhuo Zhang and Jingbo Zhou and Fan Wang and Hua Wu and Haifeng Wang},
    year={2021},
    eprint={2106.06130},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```