A GPT-2 based protein language model for designing proteins with specific properties. The models can be found on Hugging Face.
- data: Contains the data needed for model training and evaluation.
- inference.py: Runs inference with a chosen trained model to generate peptide sequences.
- Clone this repository.
- Install the required dependencies (pip install -r requirements.txt).
- To generate sequences and run the inference pipeline, use
  python inference.py --model_path=path_to_model --num_return_sequences=num --max_length=max_len --starts_with=starting_sequence --output_dir=output_directory --pred_model_path=path_to_prediction_model --seed=random_seed
- model_path: Path of the model to run generation from. You can choose any of the four models from the models folder.
- num_return_sequences: Number of sequences to generate (default is 1000).
- max_length: Maximum length of generated sequences (default is 50).
- starts_with: Starting amino acids for generation (default is an empty string).
- output_dir: Directory for storing all output files.
- pred_model_path: Path to a local clone of PeptideBERT, which is required to run the predictions.
- seed: Random seed for reproducibility (default is 42).
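The flags and defaults listed above can be pictured as an argparse parser. This is only a minimal sketch: it assumes inference.py uses argparse, and the choice of which flags are required, the function name build_parser, and the example paths are hypothetical rather than taken from the repository.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the repository's code)."""
    parser = argparse.ArgumentParser(
        description="Generate sequences and run predictions (illustrative sketch)."
    )
    # Assumption: flags without a documented default are treated as required.
    parser.add_argument("--model_path", required=True,
                        help="Path of the model to run generation from.")
    parser.add_argument("--num_return_sequences", type=int, default=1000,
                        help="Number of sequences to generate.")
    parser.add_argument("--max_length", type=int, default=50,
                        help="Maximum length of generated sequences.")
    parser.add_argument("--starts_with", default="",
                        help="Starting amino acids for generation.")
    parser.add_argument("--output_dir", required=True,
                        help="Directory for storing all output files.")
    parser.add_argument("--pred_model_path", required=True,
                        help="Path to a local clone of PeptideBERT.")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducibility.")
    return parser


# Example paths are placeholders; only the three flags without defaults are passed.
args = build_parser().parse_args([
    "--model_path", "models/example_model",
    "--output_dir", "outputs",
    "--pred_model_path", "PeptideBERT",
])
print(args.num_return_sequences, args.max_length, repr(args.starts_with), args.seed)
```

Leaving out the optional flags falls back to the documented defaults, so a first run only needs the three paths.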
- Hugging Face's run_clm.py script can be used to fine-tune the model on a custom dataset.
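As a rough illustration of such a fine-tuning run, the sketch below assembles a run_clm.py command line. It assumes run_clm.py from the Hugging Face Transformers examples has been copied into the working directory. The helper name build_finetune_command, the file paths, and the hyperparameter values are hypothetical placeholders, not settings used by the authors.

```python
import shlex
import sys


def build_finetune_command(model_name_or_path, train_file, validation_file,
                           output_dir, epochs=3, batch_size=8):
    """Assemble a run_clm.py invocation as an argument list (illustrative helper)."""
    return [
        sys.executable, "run_clm.py",
        "--model_name_or_path", model_name_or_path,  # checkpoint to start from
        "--train_file", train_file,                  # custom training sequences
        "--validation_file", validation_file,        # held-out sequences
        "--do_train",
        "--do_eval",
        "--num_train_epochs", str(epochs),           # placeholder value
        "--per_device_train_batch_size", str(batch_size),  # placeholder value
        "--output_dir", output_dir,
    ]


# All paths below are placeholders for illustration.
command = build_finetune_command(
    model_name_or_path="path_to_model",
    train_file="data/train.txt",
    validation_file="data/validation.txt",
    output_dir="finetuned_model",
)
print(shlex.join(command))
# To launch the run: import subprocess; subprocess.run(command, check=True)
```

Building the command as a list keeps each argument separate, so paths containing spaces are passed through unchanged when the list is handed to subprocess.run.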