Skip to content
forked from lezhang7/MSAGen

[NeurIPS 2024] MSA generative pretraining for advancing protein structure prediction

Notifications You must be signed in to change notification settings

BEAM-Labs/MSAGen

 
 

Repository files navigation

MSA-Augmentor codebase

method (1)

codebase for paper Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation arxiv

Pretrain

All the commands are designed for slurm cluster, we use huggingface trainer to pretrain the model, more details could be find here

  1. Construct local binary dataset ( load training data from cluster is too slow, so it's better to fisrt construct all your dataset to .bin file as shown in datasets )

    python utils.py \
       --output_dir ./datasets/ \
       --random_src --src_seq_per_msa_l 5\
       --src_seq_per_msa_u 10 \
       --total_seq_per_msa 25 \
       --local_file_path  path_to_pretrained_dataset 
    
  2. install dependency libraries pip install -r requirements.txt

  3. bash run.sh

Inference

  1. download checkpoints
  2. run inference by bash scripts/inference.sh

Note: all inference code is in inference.py

Evaluation

DATASET MSA STRUCTURE
CASP15 https://zenodo.org/record/8126538 google drive

Alphafold2 Prediction

  1. Please refer to Alphafold2 GitHub to learn more about set up af2.

  2. We provide scripts to use alphafold2 to launch protein structure prediction by bash scripts/run_af2, one need to modify msa directory

LDDT

  1. follow this document for lddt evaluation tool download https://www.openstructure.org/
  2. follow this document for https://www.openstructure.org/docs/2.4/mol/alg/lddt/ usage

Ensemble

Directly run following to get .json file of final results.

python ensemble.py --predicted_pdb_root_dir ./af2/casp15/orphan/A1T3R1.5/

📎 Citation

@misc{zhang2023enhancing,
      title={Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation}, 
      author={Le Zhang and Jiayang Chen and Tao Shen and Yu Li and Siqi Sun},
      year={2023},
      eprint={2306.01824},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM}
}

📧 Contact

please let us know if you have further questions or comments, reach out to [le.zhang@mila.quebec](

About

[NeurIPS 2024] MSA generative pretraining for advancing protein structure prediction

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 92.1%
  • Shell 4.5%
  • Jupyter Notebook 3.4%