CPDiffusion-SS is composed of three primary components: a sequence encoder, a latent diffusion generator, and an autoregressive decoder.
The sequence encoder embeds AA sequences into a latent space of secondary structure-level (SS-level) representations, while the decoder maps the generated SS-level latent representation back to the AA space. The central module is a latent graph diffusion model that generates diverse SS-level hidden representations within the established latent space.
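The end-to-end generation flow can be sketched roughly as follows (a minimal sketch with hypothetical module names and interfaces, not the repository's actual API):

```python
def generate(encoder, diffusion, decoder, aa_sequence: str) -> str:
    """Minimal sketch of the three-stage CPDiffusion-SS flow (hypothetical interfaces)."""
    # 1) Sequence encoder: amino-acid sequence -> secondary-structure-level latent embeddings.
    ss_latents = encoder(aa_sequence)                # e.g. [num_ss_segments, hidden_dim]
    # 2) Latent graph diffusion: sample diverse SS-level representations in this latent space
    #    (conditioning on the template's latents is an assumption of this sketch).
    new_latents = diffusion.sample(cond=ss_latents)
    # 3) Autoregressive decoder: map the generated SS-level latents back to an AA sequence.
    return decoder.generate(new_latents)
```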
- [2024.06.29] Our paper was presented at the ICML 2024 workshop.
- [2024.10.14] Our paper was accepted as a regular paper at BIBM 2024 (we will update our results with the RoseTTAFold and ESM3 baselines soon).
We compare CPDiffusion-SS with both sequence- and structure-based baseline methods on 10 evaluation metrics covering the diversity, novelty, and consistency of the generated sequences. CPDiffusion-SS outperforms the baseline methods on 9 of the 10 metrics, with TM_new being the only exception.
Predicted 3D structures and secondary-structure compositions for three cases from the test dataset. Red, yellow, and blue represent helices (H), sheets (E), and coils (C), respectively.
Please make sure you have installed Anaconda3 or Miniconda3.
- Manually download the ESM2 model from https://huggingface.co/facebook/esm2_t33_650M_UR50D.
- Save the downloaded model files to the directory CPDiffusion-SS/models/esm2_t33_650M_UR50D/ (a programmatic download sketch is given after this list).
- Install DSSP 2.2.1:
conda install -c ostrokach dssp
- Install biopython and torch_geometric:
pip install biopython
pip install torch_geometric==2.4.0
- Install any other required packages.
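If you would rather fetch ESM2 programmatically than download it by hand, the huggingface_hub client can place the files in the directory mentioned above (a sketch; the manual download works just as well):

```python
from huggingface_hub import snapshot_download

# Download facebook/esm2_t33_650M_UR50D into the directory the repository expects
# (path relative to the CPDiffusion-SS root; adjust if your layout differs).
snapshot_download(
    repo_id="facebook/esm2_t33_650M_UR50D",
    local_dir="models/esm2_t33_650M_UR50D",
)
```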
You will need a dataset in PDB format.
Split the dataset and place the PDB files under "data/<data_split_name>/pdb" (a helper sketch for this step follows the command below).
python dataset_pipeline.py --dataset CATH43_S40_SPLIT_TRAIN_VAL
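As a concrete illustration of this layout, the sketch below (a hypothetical helper, not part of the repository) copies a folder of PDB files into two randomly split datasets, each under data/<dataset_name>/pdb:

```python
import random
import shutil
from pathlib import Path

# Hypothetical helper: randomly split a folder of PDB files into two datasets,
# each laid out as data/<dataset_name>/pdb/ as expected by dataset_pipeline.py.
def split_pdbs(src_dir, train_name, test_name, train_ratio=0.9, seed=0):
    pdbs = sorted(Path(src_dir).glob("*.pdb"))
    random.Random(seed).shuffle(pdbs)
    n_train = int(len(pdbs) * train_ratio)
    for name, files in [(train_name, pdbs[:n_train]), (test_name, pdbs[n_train:])]:
        out = Path("data") / name / "pdb"
        out.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy2(f, out / f.name)

# Example usage (dataset names are illustrative):
# split_pdbs("raw_pdbs", "CATH43_S40_SPLIT_TRAIN_VAL", "CATH43_S40_SPLIT_TEST")
```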
python train_encoder_decoder.py --batch_size 10 --wandb --dataset AFDB --val_num 360000 --encoder_type AttentionPooling --patience 4 --wandb_run_name decoder_ESM2_AFDB_AttentionPooling_train --model_name decoder_ESM2_AFDB_AttentionPooling.pt
args:
- --decoder_ckpt (optional): resume training from an existing Encoder-Decoder checkpoint.
- --model_name: the Encoder-Decoder checkpoint will be saved as results/decoder/ckpt/<date>/<model_name>.
python train_diff.py --wandb --wandb_run_name diffusion_CATH43S40_SPLIT --dataset CATH43_S40_SPLIT_TRAIN_VAL --decoder_ckpt './results/decoder/ckpt/20240227/decoder_ESM2_AFDB_AttentionPooling.pt' --diff_batch_size 100
args:
- --wandb (optional): remove this flag to turn off wandb (https://wandb.ai/).
- --wandb_run_name: the run name in wandb; the diffusion model is saved as results/diffusion/weight/<date>/<wandb_run_name>.pt.
- --decoder_ckpt: path to the Encoder-Decoder checkpoint.
Metric Name | Type | Detailed information |
---|---|---|
TM_new | Diversity | average pairwise TM-score among all generated sequences |
RMSD | Diversity | average pairwise RMSD among all generated sequences |
Seq. ID | Diversity | average pairwise sequence identity among all generated sequences |
TM_wt | Novelty | average TM-score between each generated protein and the most similar protein in the training set |
ID | Consistency | average secondary-structure sequence identity between the top-10 generated proteins and the target |
ID_max | Consistency | highest secondary-structure sequence identity between the generated proteins and the target |
MSE | Consistency | average MSE between the secondary-structure compositions of the generated proteins and the target |
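For reference, the MSE consistency metric can be read as follows (an illustrative sketch of the definition above, assuming three-state H/E/C secondary-structure strings; metrics.py in this repository is the authoritative implementation):

```python
# Illustrative sketch of the MSE consistency metric, assuming three-state (H/E/C)
# secondary-structure strings; see metrics.py for the authoritative implementation.
def ss_composition(ss: str) -> list:
    # Fraction of helix (H), sheet (E), and coil (C) in a secondary-structure string.
    return [ss.count(s) / len(ss) for s in "HEC"]

def composition_mse(generated_ss: list, target_ss: str) -> float:
    target = ss_composition(target_ss)
    per_protein = []
    for ss in generated_ss:
        comp = ss_composition(ss)
        per_protein.append(sum((c - t) ** 2 for c, t in zip(comp, target)) / len(target))
    return sum(per_protein) / len(per_protein)
```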
pip install fair-esm
See README.md in "./baselines" for more details.
- Download source code and checkpoint files from the official repositories:
  - ProstT5: https://github.com/mheinzinger/ProstT5
  - ProteinMPNN: https://github.com/dauparas/ProteinMPNN
- Unzip the source code into the directory "baselines/".
conda install -c conda-forge -c bioconda mmseqs2 -y
conda install -c conda-forge -c bioconda foldseek -y
conda install -c schrodinger tmalign -y
conda install -c conda-forge pymol-open-source -y
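The TM-score-based metrics above (TM_new, TM_wt) rely on the TMalign binary installed here; one minimal way to obtain a pairwise score is to call the executable and parse its output (a sketch; metrics.py may compute this differently):

```python
import re
import subprocess

# Sketch: compute the TM-score between two PDB files by calling the TMalign binary
# installed above (assumed to be on PATH) and parsing its stdout.
def tm_score(pdb_a: str, pdb_b: str) -> float:
    out = subprocess.run(["TMalign", pdb_a, pdb_b],
                         capture_output=True, text=True, check=True).stdout
    # TMalign reports two TM-scores (one normalized by each chain); take the larger one here.
    scores = [float(s) for s in re.findall(r"TM-score=\s*([\d.]+)", out)]
    return max(scores)
```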
Download checkpoints from the Hugging Face repository:
huggingface-cli download --resume-download riacd/CPDiffusion-SS
Move the decoder model weights to the directory "results/decoder/ckpt/".
Move the diffusion model weights to the directory "results/diffusion/weight/".
python metrics.py
Please cite our work if you use our code or data.
@article{hu2024cpdiffusion-ss,
title={Secondary Structure-Guided Novel Protein Sequence Generation with Latent Graph Diffusion},
author={Yutong Hu and Yang Tan and Andi Han and Lirong Zheng and Liang Hong and Bingxin Zhou},
journal={arXiv preprint arXiv:2407.07443},
year={2024}
}