This repository contains scripts and data to repeat the analyses in Blaabjerg et al.: "Rapid protein stability prediction using deep learning representations".
Overview of files:
src/run_pipeline.py
- Main script for repeating the analyses in the paper.
src/rasp_model.py
- Classes for models and data.
src/helpers.py
- Various helper functions.
src/visualization.py
- Functions for plotting results.
src/pdb_parser_scripts/
- Scripts for parsing PDBs.
Tested on Linux using Miniconda with package versions specified below.
- Clone this repository.
- Install and activate the conda environment with the requirements:
conda create --name rasp-model python=3.6
conda activate rasp-model
conda install pyyaml=5.3.1 pandas=1.1.4 scipy=1.5.3 numpy=1.17.3 scikit-learn=0.24.0 mpl-scatter-density=0.7 pdbfixer=1.5 pytorch=1.2.0 cudatoolkit=10.0 biopython=1.72 openmm=7.3.1 matplotlib=3.1.1 seaborn=0.11.2 ptitprince=0.2.5 dssp=3.0.0 vaex=4.5.0 -c salilab -c omnia -c conda-forge -c anaconda -c defaults
- Install reduce in src/pdb_parser_scripts/. This program is used by the parser to add missing hydrogens to the proteins.
cd src/pdb_parser_scripts
git clone https://github.com/rlabduke/reduce.git
cd reduce/
make; make install  # This might give an error but provides the reduce executable in this directory.
- Download the data file rasp_preds_exp_strucs_gnomad_clinvar.csv from https://sid.erda.dk/sharelink/fFPJWflLeE and add it to the directory data/test/Human/.
- Download the Vaex data file rasp_preds_alphafold_UP000005640_9606_HUMAN_v2_vaex_dataframe.zip from https://sid.erda.dk/sharelink/fFPJWflLeE and add it to the directory data/test/Human/. Unpack the file using the command:
gunzip rasp_preds_alphafold_UP000005640_9606_HUMAN_v2_vaex_dataframe.zip
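If gunzip does not accept the .zip suffix on your system, the archive can also be unpacked from Python. The following is a minimal sketch using only the standard-library zipfile module. The function name unpack_archive is hypothetical and not part of this repository, and the sketch assumes the download is a regular zip archive.

```python
import zipfile
from pathlib import Path


def unpack_archive(zip_path, dest_dir=None):
    """Extract a zip archive and return the paths of its members.

    Hypothetical helper, not part of the repository. By default the
    members are extracted next to the archive itself.
    """
    zip_path = Path(zip_path)
    dest = Path(dest_dir) if dest_dir is not None else zip_path.parent
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]


# Example call (path taken from the installation steps above):
# unpack_archive(
#     "data/test/Human/"
#     "rasp_preds_alphafold_UP000005640_9606_HUMAN_v2_vaex_dataframe.zip"
# )
```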
Execute the pipeline using src/run_pipeline.py.
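Before running the pipeline, it may help to confirm that the installation steps left everything in place. The sketch below is an illustration only: the function missing_inputs and the constants are hypothetical names, the name of the unpacked Vaex file is not stated in this README (so any non-zip file starting with the archive stem is accepted), and the location of the reduce executable inside the cloned directory is assumed to be somewhere below src/pdb_parser_scripts/reduce/.

```python
from pathlib import Path

CSV_NAME = "rasp_preds_exp_strucs_gnomad_clinvar.csv"
VAEX_STEM = "rasp_preds_alphafold_UP000005640_9606_HUMAN_v2_vaex_dataframe"


def missing_inputs(repo_root="."):
    """Return a list of problems; an empty list means all checks passed."""
    root = Path(repo_root)
    human_dir = root / "data" / "test" / "Human"
    problems = []

    if not (human_dir / CSV_NAME).is_file():
        problems.append(f"missing {CSV_NAME} in {human_dir}")

    # Assumption: the unpacked Vaex data keeps the archive stem in its name.
    unpacked = []
    if human_dir.is_dir():
        unpacked = [p for p in human_dir.glob(VAEX_STEM + "*")
                    if p.suffix != ".zip"]
    if not unpacked:
        problems.append(f"no unpacked Vaex data file found in {human_dir}")

    # Assumption: the build leaves a file named "reduce" below this directory.
    reduce_dir = root / "src" / "pdb_parser_scripts" / "reduce"
    has_reduce = reduce_dir.is_dir() and any(
        p.is_file() for p in reduce_dir.rglob("reduce")
    )
    if not has_reduce:
        problems.append(f"no reduce executable found below {reduce_dir}")

    return problems


if __name__ == "__main__":
    # Run from the repository root; prints one line per problem found.
    for problem in missing_inputs("."):
        print(problem)
```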
The RaSP model can be used in Colab using this link.
All data related to the RaSP ddG predictions for the human proteome (alphafold UP000005640_9606_HUMAN_v2) is available at https://sid.erda.dk/sharelink/fFPJWflLeE. Overview of available data files:
rasp_preds_alphafold_UP000005640_9606_HUMAN_v2
- Single directory containing all 23,391 human RaSP ddG predictions. Access to individual protein files is available by clicking through the browser interface.
rasp_preds_alphafold_UP000005640_9606_HUMAN_v2.zip
- Zipped version of the directory above useful for local download.
rasp_preds_alphafold_UP000005640_9606_HUMAN_v2_prism_dir
- Directory containing RaSP ddG predictions sorted into subdirectories using the PRISM default tree folder structure based on UniProt ID. Example: RaSP prediction file for UniProt ID P12345 will be located in P1/23/45/. Access to individual protein files is available by clicking through the browser interface.
rasp_preds_alphafold_UP000005640_9606_HUMAN_v2_prism_dir.zip
- Zipped version of the directory above useful for local download.
rasp_preds_alphafold_UP000005640_9606_HUMAN_v2_vaex_dataframe.zip
- Vaex data file containing all 23,391 human RaSP ddG predictions. The Vaex format enables easy access of data using a single file. Vaex documentation is available here.
rasp_preds_exp_strucs_gnomad_clinvar.csv
- Selected RaSP ddG predictions mapped to relevant gnomAD and ClinVar annotations.
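The PRISM tree layout described above (P12345 is stored under P1/23/45/) can be sketched as a small path helper. This is an illustration based only on that single example: the function name prism_subdir is hypothetical, and because the text does not say how longer ten-character UniProt accessions are placed, the sketch handles six-character IDs only and rejects everything else.

```python
from pathlib import PurePosixPath


def prism_subdir(uniprot_id):
    """Map a six-character UniProt ID to its PRISM-style subdirectory.

    Hypothetical helper: splits the ID into three two-character parts,
    following the P12345 -> P1/23/45 example.
    """
    uid = uniprot_id.strip().upper()
    if len(uid) != 6 or not uid.isalnum():
        raise ValueError(f"expected a six-character UniProt ID, got {uniprot_id!r}")
    return PurePosixPath(uid[0:2]) / uid[2:4] / uid[4:6]


print(prism_subdir("P12345"))  # P1/23/45
```

The returned relative path can be joined onto the local copy of rasp_preds_alphafold_UP000005640_9606_HUMAN_v2_prism_dir to find the directory holding a given protein's prediction file.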
Note that in a few cases, the residue numbering for proteins in the experimental test data has been shifted to align with the residue numbering found in the structural data.
Please report any bugs or other issues via this repository, or contact one of the authors listed in the connected manuscript.
Please cite:
Lasse M. Blaabjerg, Maher M. Kassem, Lydia L. Good, Nicolas Jonsson, Matteo Cagiada, Kristoffer E. Johansson, Wouter Boomsma, Amelie Stein, Kresten Lindorff-Larsen (2022). Rapid protein stability prediction using deep learning representations. bioRxiv, 2022.07.
@article {Blaabjerg2022.07.14.500157,
author = {Lasse M. Blaabjerg and Maher M. Kassem and Lydia L. Good and Nicolas Jonsson and Matteo Cagiada and Kristoffer E. Johansson and Wouter Boomsma and Amelie Stein and Kresten Lindorff-Larsen},
title = {Rapid protein stability prediction using deep learning representations},
year = {2022},
doi = {10.1101/2022.07.14.500157},
URL = {https://www.biorxiv.org/content/early/2022/07/15/2022.07.14.500157},
eprint = {https://www.biorxiv.org/content/early/2022/07/15/2022.07.14.500157.full.pdf},
journal = {bioRxiv}
}
Source code and model weights are licensed under the Apache License, Version 2.0.
Parts of the code, specifically those related to the 3D CNN model, were developed by Maher Kassem and Wouter Boomsma. We thank them for their contributions.