# LiteGEM: Solution to PCQM4M-LSC of KDD Cup 2021

- Please refer to our technical report for details of the implementation and performance.

## Installation requirements

```
ogb==1.3.0
rdkit>=2019.03.1
obabel>=3.1.0
torch>=1.7.0
paddlepaddle-gpu>=2.1.0
pgl>=2.1.4
```

## Data preparation

Under the root directory, run the following commands to download the original PCQM4M dataset, the DFT results for the auxiliary tasks, and the cross-validation split indexes.

```bash
mkdir dataset && cd dataset
wget http://ogb-data.stanford.edu/data/lsc/pcqm4m_kddcup2021.zip
unzip pcqm4m_kddcup2021.zip
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/PCQM_pretrain/sdf.tar.gz
mv sdf.tar.gz pcqm_pyscf_sdf.tar.gz
tar -xzvf pcqm_pyscf_sdf.tar.gz
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/PCQM_pretrain/cross_split.pkl
cd ..
```
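After running the commands above, a quick sanity check can confirm that the downloaded archives and split file landed in the right place. A minimal sketch (the helper `check_dataset_files` and the exact file list are our own illustration, not part of the repo):

```python
from pathlib import Path

# Entries expected under ./dataset after the download commands above
# (names taken from those commands; adjust if the layout differs).
EXPECTED = [
    "pcqm4m_kddcup2021.zip",
    "pcqm_pyscf_sdf.tar.gz",
    "cross_split.pkl",
]

def check_dataset_files(root):
    """Return the list of expected entries missing under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]

if __name__ == "__main__":
    missing = check_dataset_files("dataset")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected dataset files are present.")
```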

## Directory structure

```
|-- ogbg_lsc
    |-- README.md
    |-- src             # scripts
    |-- models          # model definition
    |-- utils
    |-- outputs         # model predictions (generated by training)
    |-- dataset         # the original dataset and custom splits
    |-- checkpoints     # model checkpoints (generated by training)
    |-- ensemble        # submitted predictions and code for ensemble
    |-- logs            # training logs (generated by training)
```
## Configuration

Model hyper-parameters and other arguments are all defined in `./src/config.yaml`.
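Conceptually, a YAML-driven setup like this amounts to a dictionary of defaults that command-line flags can override. A minimal stdlib sketch (the keys `lr`, `batch_size`, and `num_layers` are invented for illustration and are not the repo's actual schema):

```python
import argparse

# Illustrative defaults only -- the real keys live in ./src/config.yaml.
DEFAULTS = {"lr": 1e-3, "batch_size": 256, "num_layers": 12}

def parse_args(argv=None):
    """Build a config dict from defaults, with optional CLI overrides."""
    parser = argparse.ArgumentParser()
    for key, value in DEFAULTS.items():
        parser.add_argument(f"--{key}", type=type(value), default=value)
    return vars(parser.parse_args(argv))

if __name__ == "__main__":
    # Override only the learning rate; everything else keeps its default.
    print(parse_args(["--lr", "0.01"]))
```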

## How to run

Our pipeline consists of four steps:

1. **Data preprocessing**

   ```bash
   cd ./features
   python mol_tree.py          # takes about 30 minutes
   ```
2. **Model training**

   Train the model with 2-fold cross-validation:

   ```bash
   cd ./src
   . ./cross_run.sh 0 1        # training on the whole dataset takes about 10 days;
                               # "0 1" specifies the CUDA devices
   ```

   Alternatively, train a single model using only the original validation set:

   ```bash
   cd ./src
   export CUDA_VISIBLE_DEVICES=0
   python main.py --config config.yaml
   ```
3. **Test inference (optional)**

   There is no need to run inference separately, since it is included in the training program. If needed, set the `infer_from` hyper-parameter in `./src/config.yaml` to the saved model path after training, then run:

   ```bash
   cd ./src
   python test.py --config config.yaml --output_path ./test_result
   ```

   The test result will be saved in `./src/test_result`.

4. **Ensemble**

   Copy the model predictions and run the ensemble:

   ```bash
   cd ../outputs
   rsync -av * ../ensemble/model_pred/new_run

   cd ../ensemble
   python ensemble.py
   ```

The whole training/ensemble pipeline is collectively defined in `./src/main.sh`. As a shortcut, start the default training with 2-fold cross-validation:

```bash
cd ./src
sh main.sh
```

## Performance

| Model | Test MAE | #Parameters | Hardware |
| --- | --- | --- | --- |
| LiteGEM | 0.1204 | 74M | Nvidia Tesla P40 (24GB GPU) |
| GIN\* | 0.1678 | 3.8M | GeForce RTX 2080 (11GB GPU) |
| GIN-virtual\* | 0.1487 | 6.7M | GeForce RTX 2080 (11GB GPU) |
| GCN\* | 0.1838 | 2.0M | GeForce RTX 2080 (11GB GPU) |
| GCN-virtual\* | 0.1579 | 4.9M | GeForce RTX 2080 (11GB GPU) |
| MLP+Fingerprint\* | 0.2068 | 16.1M | GeForce RTX 2080 (11GB GPU) |

\* Results copied from the baseline performance for PCQM4M-LSC.

## Citation

```bibtex
@misc{fang2021chemrlgem,
    title={ChemRL-GEM: Geometry Enhanced Molecular Representation Learning for Property Prediction},
    author={Xiaomin Fang and Lihang Liu and Jieqiong Lei and Donglong He and Shanzhuo Zhang and Jingbo Zhou and Fan Wang and Hua Wu and Haifeng Wang},
    year={2021},
    eprint={2106.06130},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```