Antibody-SGM is a score-based generative modeling for de novo antibody heavy chain design. This repository contains the codebase for [Antibody-SGM, A score-based generative model for de novo antibody heavy chain design]
Run the following commands to install AB-SGM and the necessary dependencies; installation should take less than 15 minutes. We recommend using the conda environment supplied in this repository.
- Clone repository ` git clone https://github.com/xxiexuezhi/ABSGM.git ' or download the zip files and extract all.
- Install conda environment
create -f ab_env.yaml
- Activate conda environment
conda activate ab_env
To run AB-SGM, download and extract the model parameters,Saved weights
The code is tested on Python 3.8.17 and needs the Pytorch GPU version for training and sampling. We are using a GPU for model training, 6D coordinate sampling, and CPU batch processing for Rosetta. More specifically, we used a single NVIDIA V100/A100 for training and inference and at least 2 core CPU/8GB RAM per Rosetta job. Please refer to the shell script inside GitHub for detailed configurations.
All related codes are in the CDR_inpainting_conditional_generations directory. Structures are generated by sampling 6D coordinates from the model and running PyRosetta. The encoded examples (pdb id: 6nmv, 6hga, 1i9r) are uploaded. They can be found from:
The encoded examples (pdb id: 6nmv, 6hga, 1i9r) are uploaded. you could find from
- H1 inpaint encoded examples. this link
- H2 inpaint encoded examples.this link
- H3 inpaint encoded examples. this link
To encode the antigen-antibody complex structures, please refer to the readme inside the CDR_inpainting_conditional_generations/encoding/.
Please use python sampling_6d.py to generate 6d coordinates and sequences. We used the shell scripts for generations (please refer to h3_6d_sample.sh for more details). For instance,
python sampling_6d.py ./configs/inpainting_ch6.yml ../saved_weights/h3_inpaint.pth --pkl proteindataset_example_singlecdr_inpaint_h3_6nmv_6hga_1i9r.pkl --chain A --index 1 --tag singlecdr_inpaint_h3_Jun_2024_fixed_padding
The descriptions of each parameter are as below:
-
./configs/inpainting_ch6.yml are the config files containing hyperparameters like batch size and data dimensions.
-
../saved_weights/h3_inpaint.pth This is to load the saved weight.
-
--pkl proteindataset_example_singlecdr_inpaint_h3_6nmv_6hga_1i9r.pkl is the pickle file containing all encoded data with H3 regions indicated to be masked out.
-
--index 1 refers to the index number inside the proteindataset_example_singlecdr_inpaint_h3_6nmv_6hga_1i9r.pkl to be generated. This index number would match the generated file number; for example, the generated file is named samples_index.pkl.
-
--tag singlecdr_inpaint_h3_Jun_2024_fixed_padding is the generated folder name. The code would create this folder under the current directory and store all the generated data.
6D coordinate sampling should ~ take 1 minute per sample on a normal GPU. Rosetta minimization should take about 3 hours per iteration, depending on the selected H1, H2, or H3 region size for design.
Please use python 'convert_6d_seq_to_pdb.py' to convert the 6d coordinates and sequence pairs into the pdb using PyRosetta. Please refer to h3_shell_job_6d_to_pdb.sh for more details. For instance,
python convert_6d_seq_to_pdb.py singlecdr_inpaint_h3_Jun_2024_fixed_padding/samples_${SLURM_ARRAY_TASK_ID}.pkl single_hv_example_pdb/${files[${SLURM_ARRAY_TASK_ID}]} ${SLURM_ARRAY_TASK_ID} 3
- samples_${SLURM_ARRAY_TASK_ID}.pkl is the 6d data generated with the previous shell script. single_hv_example_pdb/${files[${SLURM_ARRAY_TASK_ID}]} is the pdb files read in by PyRosetta to do inpainting. This file is needed to avoid doing any superposing.
- ${SLURM_ARRAY_TASK_ID} is just the index file number for the pkl file.
- "3". The last number (1, 2, or 3) indicates the inpainting regions (h1, h2, or h3).
- The PyRosetta protocol saves all iterations and intermediate structures to subdirectories in outPath. The default name is test_single_cdr_inpaint_generations_single_h[123]. The final minimized structure can be found under the outPath with the name b_n1_n2.pdb. n1 matching ${SLURM_ARRAY_TASK_ID}, the index inside the pickle files, while n2 is just the index for the generated files
Raw antigen-antibody complex dataset for CDR conditional generations can be downloaded here
The 'sabdab_downloader.py' is also available for download.
Antibody-SGM (to be updated)