
Prot2Mol

Target-based molecule generation using protein embeddings and the SELFIES molecule representation.

Installation

git clone https://github.com/atabeyunlu/Prot2Mol.git

pip install -r requirements.yaml

How to run?

The Prot2Mol model can be run by following the documentation below from top to bottom. Preparing the protein-compound data is required; however, processing a single protein embedding is sufficient to run the model.

Data Processing


Prepare Papyrus Protein-Compound Data

This script prepares Papyrus data by downloading and decompressing molecule and protein files from specified URLs. It then processes the data by filtering based on specified thresholds and parameters. The processed data is saved as a CSV file in the specified output directory.

Usage:

python papyrus_data.py [--pchembl_threshold P] [--prot_len L] [--human_only H]
Arguments:
    --pchembl_threshold (int): pchembl threshold for filtering compounds (default: None)
    --prot_len (int): maximum protein length for filtering proteins (default: None)
    --human_only (bool): flag to filter only human proteins (default: False)

Example:

python papyrus_data.py --pchembl_threshold 6 --prot_len 500 --human_only True
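
A minimal sketch of the filtering logic in pandas is shown below. The column names (pchembl_value, Sequence, Organism) and file names are assumptions for illustration; the actual Papyrus columns used by the script may differ.

import pandas as pd

def filter_papyrus(df, pchembl_threshold=None, prot_len=None, human_only=False):
    # Keep only bioactivities at or above the pChEMBL threshold.
    if pchembl_threshold is not None:
        df = df[df["pchembl_value"] >= pchembl_threshold]
    # Drop proteins whose sequence exceeds the maximum length.
    if prot_len is not None:
        df = df[df["Sequence"].str.len() <= prot_len]
    # Optionally restrict the set to human targets.
    if human_only:
        df = df[df["Organism"] == "Homo sapiens"]
    return df

df = pd.read_csv("papyrus_raw.csv")  # hypothetical input file
filtered = filter_papyrus(df, pchembl_threshold=6, prot_len=500, human_only=True)
filtered.to_csv("prot_comp_set_pchembl_6_protlen_500_human_True.csv", index=False)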


AlphaFold2 Embedding Generator for Protein Sequences

This script downloads, processes, and organizes AlphaFold2 (AF2) embeddings into a format suitable for further analysis or model training. It handles downloading the zipped AF2 embedding files, unzipping them, padding the embeddings to a fixed protein length, and saving them in .npz format.

Usage:

python af2_embeddings.py [--max_len L]
Arguments:
    --max_len (int): Maximum protein length to pad the embeddings to (default: 500).

Example:

python af2_embeddings.py --max_len 500
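
The padding and packaging step can be sketched as follows, assuming each protein's AF2 embedding arrives as a (seq_len, hidden_dim) NumPy array; the file names and protein IDs are placeholders.

import numpy as np

def pad_embedding(emb, max_len=500):
    # Zero-pad (or truncate) along the sequence axis so every
    # protein tensor has the same fixed shape.
    seq_len, dim = emb.shape
    if seq_len >= max_len:
        return emb[:max_len]
    padded = np.zeros((max_len, dim), dtype=emb.dtype)
    padded[:seq_len] = emb
    return padded

protein_ids = ["P00533", "P04637"]  # hypothetical UniProt IDs
embeddings = {pid: pad_embedding(np.load(f"{pid}.npy"), max_len=500)
              for pid in protein_ids}
np.savez("af2_embeddings.npz", **embeddings)  # illustrative output path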


ESM-2 Embedding Generator for Protein Sequences

This script generates embeddings for protein sequences using the ESM-2 model. It processes a dataset of protein sequences, filters them based on a specified maximum length, and saves the resulting embeddings along with the corresponding protein IDs.

Usage:

python esm2_embeddings.py [--dataset PATH] [--prot_len L]
Arguments:
    --dataset (str): Path to the input dataset containing protein sequences (default: ../data/papyrus/prot_comp_set_pchembl_None_protlen_500_human_False.csv).
    --prot_len (int): Maximum length of the protein sequences to be considered for embedding (default: 500).

Example:

python esm2_embeddings.py --dataset ../data/my_dataset.csv --prot_len 500
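
For orientation, the core embedding loop might look like the sketch below, which uses a Hugging Face ESM-2 checkpoint; the repository script may use the fair-esm package or a different model size instead.

import torch
from transformers import AutoTokenizer, EsmModel

checkpoint = "facebook/esm2_t33_650M_UR50D"  # assumed model size
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmModel.from_pretrained(checkpoint).eval()

sequences = ["MKTAYIAKQR"]  # placeholder sequence(s), each <= prot_len
with torch.no_grad():
    inputs = tokenizer(sequences, return_tensors="pt", padding=True)
    outputs = model(**inputs)
# Per-residue embeddings, shape (batch, seq_len, hidden_dim).
embeddings = outputs.last_hidden_state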


ESM-3 Embedding Generator for Protein Sequences

This script generates embeddings for protein sequences using the ESM-3 model. It processes a dataset of protein sequences, applies padding, and filters sequences based on a specified maximum length. The resulting embeddings are then saved along with the corresponding protein IDs.

Usage:

python esm3_embeddings.py [--dataset PATH] [--max_len L] [--huggingface_token TOKEN]
Arguments:
    --dataset (str): Path to the input dataset containing protein sequences (default: ../data/papyrus/prot_comp_set_pchembl_6_protlen_500.csv).
    --max_len (int): Maximum length of the protein sequences to be considered for embedding (default: 500).
    --huggingface_token (str): User's Hugging Face token for authentication (required).

Example:

python esm3_embeddings.py --dataset ../data/my_dataset.csv --max_len 500 --huggingface_token my_hf_token
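
A hedged sketch with the EvolutionaryScale esm package is shown below; the exact SDK calls in the script may differ, and the gated weights require a Hugging Face login.

from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, SamplingConfig

login(token="my_hf_token")  # placeholder token
model = ESM3.from_pretrained("esm3_sm_open_v1")  # assumed open checkpoint

protein = ESMProtein(sequence="MKTAYIAKQR")  # placeholder sequence
protein_tensor = model.encode(protein)
# Run a forward pass and request per-residue embeddings.
output = model.forward_and_sample(
    protein_tensor, SamplingConfig(return_per_residue_embeddings=True))
embedding = output.per_residue_embedding  # (seq_len, hidden_dim)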


ProtT5 Embedding Generator for Protein Sequences

This script generates protein embeddings using the ProtT5 model from Rostlab. It processes a dataset containing protein sequences, encodes the sequences with the ProtT5 model, and saves the resulting embeddings in .npz format.

Usage:

python prot_t5_embeddings.py [--dataset DATASET_PATH] [--prot_len PROTEIN_LENGTH]
Arguments:
    --dataset (str): Path to the input CSV file containing protein sequences (default: ../data/papyrus/prot_comp_set_pchembl_8_protlen_150_human_False.csv).
    --prot_len (int): Maximum length of the protein sequences to consider (default: 500).

Example:

python prot_t5_embeddings.py --dataset ../data/my_protein_data.csv --prot_len 200
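
The standard Rostlab encoding recipe, which this script likely follows, looks like the sketch below; the checkpoint name is an assumption.

import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

checkpoint = "Rostlab/prot_t5_xl_half_uniref50-enc"  # assumed checkpoint
tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)
model = T5EncoderModel.from_pretrained(checkpoint).eval()

seq = "MKTAYIAKQR"  # placeholder sequence
# ProtT5 expects space-separated residues, with rare amino acids mapped to X.
seq = " ".join(re.sub(r"[UZOB]", "X", seq))
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    # Per-residue embeddings, shape (1, seq_len, hidden_dim).
    embedding = model(**inputs).last_hidden_state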


Model & Training


Prot2Mol Pre-Training Script

This script trains and evaluates a GPT-2 model with cross-attention for generating molecular structures conditioned on protein embeddings. Molecules are represented as SELFIES strings, and the protein embeddings can be derived from models such as ProtT5, ESM-2, ESM-3, or AlphaFold2.

Usage:

python pretrain.py [--selfies_path SELFIES_PATH] [--prot_emb_model PROT_EMB_MODEL] [--prot_ID PROT_ID] [--learning_rate LEARNING_RATE] [--train_batch_size TRAIN_BATCH_SIZE] [--valid_batch_size VALID_BATCH_SIZE] [--epoch EPOCH] [--weight_decay WEIGHT_DECAY] [--n_layer N_LAYER] [--n_head N_HEAD]
Arguments:

Dataset Parameters:
    --selfies_path (str): Path to the CSV file containing SELFIES strings and other related data (default: ../data/papyrus/prot_comp_set_pchembl_8_protlen_150_human_False.csv).
    --prot_emb_model (str): Specifies the protein embedding model to use (choices: prot_t5, esm2, esm3, af2_single, af2_struct, af2_combined; default: prot_t5).
    --prot_ID (str): Protein ID for filtering the dataset (default: CHEMBL4282).

Model Parameters:
    --learning_rate (float): Learning rate for the optimizer (default: 1.0e-5).
    --train_batch_size (int): Batch size for training (default: 64).
    --valid_batch_size (int): Batch size for validation (default: 64).
    --epoch (int): Number of training epochs (default: 50).
    --weight_decay (float): Weight decay for the optimizer (default: 0.0005).
    --n_layer (int): Number of layers in the GPT-2 model (default: 1).
    --n_head (int): Number of attention heads in the GPT-2 model (default: 4).

Example:

python pretrain.py --selfies_path ../data/my_selfies_data.csv --prot_emb_model esm3 --prot_ID CHEMBL4296327 --learning_rate 2e-5 --train_batch_size 32 --epoch 30 --n_layer 4 --n_head 8
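
The architectural core can be sketched in a few lines with Hugging Face transformers: a GPT-2 decoder built with add_cross_attention=True attends over the protein embedding matrix via encoder_hidden_states. The wiring below is illustrative rather than the script's exact code, and it assumes the protein embeddings already match GPT-2's hidden size.

import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_layer=1, n_head=4, add_cross_attention=True)
model = GPT2LMHeadModel(config)

input_ids = torch.randint(0, config.vocab_size, (2, 32))  # dummy SELFIES token IDs
prot_emb = torch.randn(2, 500, config.n_embd)  # dummy padded protein embeddings
outputs = model(input_ids=input_ids,
                labels=input_ids,
                encoder_hidden_states=prot_emb)
loss = outputs.loss  # causal LM loss used for pre-training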


Prot2Mol Fine-Tuning Script

This script fine-tunes a pre-trained Prot2Mol model on a specific target protein embedding. The fine-tuning process is tailored to a specific target ID (e.g., a ChEMBL ID) and involves further training the model on a subset of data related to that target.

Usage:

python finetune.py [--selfies_path SELFIES_PATH] [--target_id TARGET_ID] [--prot_emb_model PROT_EMB_MODEL] [--pretrained_model_path PRETRAINED_MODEL_PATH] [--learning_rate LEARNING_RATE] [--train_batch_size TRAIN_BATCH_SIZE] [--valid_batch_size VALID_BATCH_SIZE] [--epoch EPOCH] [--weight_decay WEIGHT_DECAY] [--n_layer N_LAYER]
Arguments:

Dataset Parameters:
    --selfies_path (str): Path to the CSV file containing SELFIES strings and other related data (default: ../data/fasta_to_selfies_500.csv).
    --target_id (str): The ChEMBL ID of the target protein for fine-tuning (default: CHEMBL4282).
    --prot_emb_model (str): Specifies the protein embedding model to use (choices: prot_t5, esm2, esm3, af2_single, af2_struct, af2_combined; default: prot_t5).

Model Parameters:
    --pretrained_model_path (str): Path to the pre-trained model checkpoint to be fine-tuned (default: ./saved_models/set_100_saved_model/checkpoint-31628).
    --learning_rate (float): Learning rate for the optimizer during fine-tuning (default: 1.0e-5).
    --train_batch_size (int): Batch size for fine-tuning (default: 64).
    --valid_batch_size (int): Batch size for validation during fine-tuning (default: 64).
    --epoch (int): Number of epochs for fine-tuning (default: 50).
    --weight_decay (float): Weight decay for the optimizer during fine-tuning (default: 0.0005).
    --n_layer (int): Number of layers in the GPT-2 model during fine-tuning (default: 4).

Example:

python finetune.py --selfies_path ../data/my_selfies_data.csv --target_id CHEMBL12345 --prot_emb_model esm3 --pretrained_model_path ./saved_models/my_pretrained_model --learning_rate 2e-5 --train_batch_size 32 --epoch 30 --n_layer 6
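
Conceptually, fine-tuning resumes from the pre-trained checkpoint and restricts the training data to a single target, roughly as sketched below; the column name Target_CHEMBL_ID is a guess for illustration.

import pandas as pd
from transformers import GPT2LMHeadModel

# Resume from the pre-trained Prot2Mol checkpoint.
model = GPT2LMHeadModel.from_pretrained(
    "./saved_models/set_100_saved_model/checkpoint-31628")

# Keep only the rows for the fine-tuning target.
df = pd.read_csv("../data/fasta_to_selfies_500.csv")
target_df = df[df["Target_CHEMBL_ID"] == "CHEMBL4282"]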


Molecule Generation

This script is designed to generate molecular structures based on a pretrained model and evaluate them against a reference dataset. It loads the necessary protein embeddings and molecular data, generates new molecules, and calculates evaluation metrics. The generated molecules and metrics are then saved to specified files.

Usage:

python produce_molecules.py [--model_file PATH] [--prot_emb_model PATH] [--generated_mol_file PATH] [--selfies_path PATH] [--attn_output BOOL] [--prot_id ID] [--num_samples N] [--bs N]
Arguments:
    --model_file (str): Path of the pretrained model file (default: ./finetuned_models/set_100_finetuned_model/checkpoint-3100).
    --prot_emb_model (str): Path of the pretrained protein embedding model (default: ./data/prot_embed/prot_t5/prot_comp_set_pchembl_None_protlen_None/embeddings).
    --generated_mol_file (str): Path of the output file where generated molecules will be saved (default: ./saved_mols/_kt_finetune_mols.csv).
    --selfies_path (str): Path of the input SELFIES dataset (default: ./data/papyrus/prot_comp_set_pchembl_None_protlen_500_human_False).
    --attn_output (bool): Flag to output attention weights during molecule generation (default: False).
    --prot_id (str): Target Protein ID for molecule generation (default: CHEMBL4282).
    --num_samples (int): Number of samples to generate (default: 10000).
    --bs (int): Batch size for molecule generation (default: 100).

Example:

python produce_molecules.py --model_file ./finetuned_models/set_100_finetuned_model/checkpoint-3100 --prot_emb_model ./data/prot_embed/prot_t5/prot_comp_set_pchembl_None_protlen_None/embeddings --generated_mol_file ./saved_mols/generated_molecules.csv --selfies_path ./data/papyrus/prot_comp_set_pchembl_None_protlen_500_human_False --attn_output False --prot_id CHEMBL4282 --num_samples 10000 --bs 100
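
After generation, the SELFIES strings are decoded back to SMILES before evaluation; a minimal sketch of that post-processing step with the selfies package is shown below, with placeholder model output.

import pandas as pd
import selfies as sf

generated_selfies = ["[C][C][O]"]  # placeholder model output
smiles = [sf.decoder(s) for s in generated_selfies]
pd.DataFrame({"selfies": generated_selfies, "smiles": smiles}).to_csv(
    "./saved_mols/generated_molecules.csv", index=False)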

Citation

If you use this work in your research, please cite:

Ünlü, A., Çevrim, E., & Doğan, T. (2024). Prot2Mol: Target-based molecule generation using protein embeddings and SELFIES molecule representation. GitHub. https://github.com/HUBioDataLab/Prot2Mol
