PeptideGPT

A GPT-2 based protein language model for designing proteins with specific properties. The models can be found on Hugging Face.

Important Files

inference.py: This script is used for running inference on the trained models to generate peptide sequences.

Clone this repository.
Install the required dependencies (pip install -r requirements.txt)
To generate sequences and run the inference pipeline, use python inference.py --model_path=path_to_model --num_return_sequences=num --max_length=max_len --starts_with=starting_sequence --output_dir=output_directory --pred_model_path=path_to_prediction_model --seed=random_seed
- model_path: Path of the model to run generation from. You can choose any of the four models from the models folder.
- num_return_sequences: Number of sequences to generate (default is 1000).
- max_length: Maximum length of generated sequences (default is 50).
- starts_with: Starting amino acids for generation (default is an empty string).
- output_dir: Directory for storing all output files.
- pred_model_path: You need to clone PeptideBERT and give its path to run the predictions.
- seed: Random seed for reproducibility (default is 42).
Hugging Face's run_clm.py script can be used to fine-tune the model on a custom dataset.