DeepMainMast is a computational tool using deep learning to automatically build full protein complex structure from cryo-EM map.
Copyright (C) 2022 Genki Terashi, Xiao Wang, Devashish Prasad, Tsukasa Nakamura, Daisuke Kihara, and Purdue University.
License: GPL v3. (If you are interested in a different license, for example, for commercial use, please contact us.)
Contact: Daisuke Kihara (dkihara@purdue.edu)
For technical problems or questions, please reach to Genki Terashi (gterashi@purdue.edu).
The newer version of Intel MKL will cause pyTorch to give the following error due to a symbol being removed: ImportError undefined symbol: iJIT_NotifyEvent
is encountered.
We have updated the environment.yml and requirements.txt to fix the version to an older version. Any previous installation should work just fine.
If you have ever encountered this issue, please activate the conda env first and run conda install mkl==2024.0
.
Genki Terashi, Xiao Wang, Devashish Prasad, Tsukasa Nakamura & Daisuke Kihara. DeepMainmast: integrated protocol of protein structure modeling for cryo-EM with deep learning and structure prediction. Nature Methods, 21: 122-131 (2024) Paper
@article{Terashi2024deepmainmast,
title={DeepMainmast: integrated Protocol of Protein Structure Modeling for cryo-EM with Deep Learning and Structure Prediction},
author={Genki Terashi, Xiao Wang, Devashish Prasad, Tsukasa Nakamura, and Daisuke Kihara},
journal={Nature Methods},
year={2023},
volume={21},
number={1},
pages={122-131}
}
Server(Recommended): https://em.kiharalab.org/algorithm/DeepMainMast
Google Colab: https://colab.research.google.com/github/kiharalab/DeepMainMast/blob/main/DeepMainMast.ipynb
Code Ocean: https://codeocean.com/capsule/0749800
Project website: https://kiharalab.org/emsuites
Detailed pipeline instructions can be found https://kiharalab.org/emsuites/deepmainmast.php
DeepMainMast is a computational tool using deep learning to automatically build full protein complex structure from cryo-EM map.
Structure modeling from maps is an indispensable step for studying proteins and their complexes with cryogenic electron microscopy (cryo-EM). Although the resolution of determined cryo-EM maps has generally improved, there are still many cases where tracing protein main-chains is difficult, even in maps determined at a near atomic resolution. Here, we have developed a protein structure modeling method, called DeepMainmast, which employs deep learning to capture the local map features of amino acids and atoms to assist main-chain tracing. Moreover, since Alphafold2 demonstrates high accuracy in protein structure prediction, we have integrated complementary strengths of de novo density tracing using deep learning with Alphafold2’s structure modeling to achieve even higher accuracy than each method alone. Additionally, the protocol is able to accurately assign chain identity to the structure models of homo-multimers.![image](https://github.com/kiharalab/dmm/assets/50850224/f04a5a46-7cd0-4961-a1c4-e7e044ea8870)(2) Tracing Ca path and assigning the target sequence using the Vehicle Routing Problem Solver and the Dynamic Programming algorithm. Different parameter combinations are used.
(3) Assembling Ca fragments using the Constraint Problem (CP) Solver. Colors indicate chain IDs.
(4) Combining Ca models built under different parameter combinations using the CP Solver. Colors indicate the direction of chains from blue to red for the N-terminal to the C-terminal residues.
(5) Full-atom building and refinement using PULCHRA and Rosetta-CM.
(6) Scoring generated full-atom models based on the DAQ(AA) score and the DOT score.
CPU: >=8 cores
Memory (RAM): >=50Gb. For maps with more than 3,000 residues, memory space should be higher than 200GB.
GPU: any GPU supports CUDA with at least 12GB memory.
GPU is required for DeepMainMast and no CPU version is available for DeepMainMast since it is too slow.
1. Install git
git clone https://github.com/kiharalab/DeepMainMast.git && cd DeepMainMast
You have two options to install dependency on your computer:
3.2.1 install anaconda
.
Make sure you are in the DeepMainMast directory and then run
conda env create -f environment.yml
Each time when you want to run this software, simply activate the environment by
conda activate deepmainmast
conda deactivate(when you want to exit)
Run the following command:
bash make_c_programs.sh
If you want to build full-atom protein structure, then you need to install rosetta for DeepMainMast. Please check the instructions here to install Rosetta locally.
Protein complex structure modeling without AlphaFold (DeepMainMast-base)
./dmm_full_multithreads.sh -p [program_path] -m [map_path] -f [fasta_path] -c [contour] -o [output_path] -t [path_training_time] -T [fragment_assembling_time] -C [num_cpu] -M [num_cpu] -x [ROSETTA_PROGRAM_PATH]
[program_path] is the path of DeepMainmast program folder, absolute path suggested.
[map_path] is the path of the experimental cryo-EM map.
[fasta_path] is the path of the input fasta file about sequence information.
[contour] specifies the contour level as a density threshold to remove outside regions to save processing time.
[output_path] specifies the output directory of DeepMainMast modeled structure. Final modeled structure DeepMainmast.pdb will be saved here. If the directory already exists, the final output will be DeepMainmast[random_digit].pdb to avoid overwritting.
[path_training_time] specifies the computational Time Limit for PATH tracing per thread, default: 600 seconds. Suggested time: [total_num_residues].
[fragment_assembling_time] specifies the computational Time Limit for Fragment assembly, default: 600 seconds. Suggested time: [total_num_residues].
[num_cpu] Number of CPUs used for structure modeling processes. Minimum: 8. The more is better to accelerate the process.
[ROSETTA_PROGRAM_PATH] specifies the rosetta program path, please make sure it is the directory that includes "main" directory after rosetta installment. It is optional, if you do not use this parameter, then DeepMainMast will only build Calpha only structures. It is required if you want to build full-atom protein structure.
Please include -H in command line if you have any two or more chains are identical. That is important of correct chain assignment for identical chains.
You can also add -F
argument to accelerate the backbone tracing and sequence assignment steps, but may reduce the quality of final structure.
If you are running on multi-GPU machines, please see FAQ to configure GPU correctly before running.
Protein complex structure modeling with AlphaFold (DeepMainMast)
./dmm_full_multithreads.sh -p [program_path] -m [map_path] -f [fasta_path] -A [alphafold_pdb_path] -c [contour] -o [output_path] -t [path_training_time] -T [fragment_assembling_time] -C [num_cpu] -M [num_cpu] -x [ROSETTA_PROGRAM_PATH]
[program_path] is the path of DeepMainmast program folder, absolute path suggested.
[map_path] is the path of the experimental cryo-EM map.
[fasta_path] is the path of the input fasta file about sequence information.
[alphafold_pdb_path] is the path of the alphafold modeled structure in pdb format. Please combine all single-chain structures in one PDB file, separated by "TER" for different chains' records. The chain ID does not matter, for identical chains, you only need to provide the chain records once in the pdb format.
[contour] specifies the contour level as a density threshold to remove outside regions to save processing time.
[output_path] specifies the output directory of DeepMainMast modeled structure. Final modeled structure DeepMainmast.pdb will be saved here. If the directory already exists, the final output will be DeepMainmast[random_digit].pdb to avoid overwritting.
[path_training_time] specifies the computational Time Limit for PATH tracing per thread, default: 600 seconds. Suggested time: [total_num_residues].
[fragment_assembling_time] specifies the computational Time Limit for Fragment assembly, default: 600 seconds. Suggested time: [total_num_residues].
[num_cpu] Number of CPUs used for structure modeling processes. Minimum: 8. The more is better to accelerate the process.
[ROSETTA_PROGRAM_PATH] specifies the rosetta program path, please make sure it is the directory that includes "main" directory after rosetta installment. It is optional, if you do not use this parameter, then DeepMainMast will only build Calpha only structures. It is required if you want to build full-atom protein structure.
Please include -H in command line if you have any two or more chains are identical. That is important of correct chain assignment for identical chains.
You can also add -F
argument to accelerate the backbone tracing and sequence assignment steps, but may reduce the quality of final structure.
If you are running on multi-GPU machines, please see FAQ to configure GPU correctly before running.
Example: single-chain structure modeling (3j9sA)
Please run the following command in DeepMainMast program folder.
Examples are kept in data/3j9sA directory.
- Target sequence file (-f): 3j9sA.fasta
- MAP file (-m): 3j9sA.mrc
- AlphaFold2 Model (optional -A): 3j9sA_af2.pdb
./dmm_full_multithreads.sh -p ./ -c 0.01 -o output_3j9sA -t 600 -T 600 -C 8 -M 8 -m ./data/3j9sA/3j9sA.mrc -f ./data/3j9sA/3j9sA.fasta
The outputs will be generated in [output_3j9sA/results] directory. The expected output structure is [output_3j9sA/DeepMainmast.pdb].
./dmm_full_multithreads.sh -p ./ -c 0.01 -o output_3j9sa -t 600 -T 600 -C 8 -M 8 -m ./data/3j9sA/3j9sA.mrc -f ./data/3j9sA/3j9sA.fasta -A ./data/3j9sA/3j9sA_af2.pdb
The outputs will be generated in [output_3j9sA/results] directory. The expected output structure is [output_3j9sA/DeepMainmast.pdb].
./dmm_full_multithreads.sh -p ./ -c 0.01 -o output_3j9sa -t 600 -T 600 -C 8 -M 8 -m ./data/3j9sA/3j9sA.mrc -f ./data/3j9sA/3j9sA.fasta -A ./data/3j9sA/3j9sA_af2.pdb -x [ROSETTA PROGRAM PATH]
Please confirm you install rosetta well, then please change -x to the rosetta program directory that includes "main" directory.
The outputs will be generated in [output_3j9sA/results] directory. The expected output structure is [output_3j9sA/DeepMainmast.pdb].
The example output is kept here for your reference. All intermediate outputs are also kept here.
Example: multi-chain structure modeling (EMD-1461, HomoOligomer)
Please run the following command in DeepMainMast program folder.
Examples are kept in data/1461 directory.
- Target sequence file (-f): emd_1461.fasta
- MAP file (-m): emd_1461.mrc
- AlphaFold2 Model (optional -A): emd_1461_af2.pdb
./dmm_full_multithreads.sh -p ./ -c 0.3 -o output_1461 -t 1200 -T 600 -C 8 -M 8 -m ./data/1461/emd_1461.mrc -f ./data/1461/emd_1461.fasta -H
Since homo-oligomer includes identical chains, -H argument is required in the command line.
The outputs will be generated in [output_1461/results] directory. The expected output structure is [output_1461/DeepMainmast.pdb].
./dmm_full_multithreads.sh -p ./ -c 0.3 -o output_1461 -t 1200 -T 600 -C 8 -M 8 -m ./data/1461/emd_1461.mrc -f ./data/1461/emd_1461.fasta -H -A ./data/1461/emd_1461_af2.pdb
Since homo-oligomer includes identical chains, -H argument is required in the command line.
The outputs will be generated in [output_1461/results] directory. The expected output structure is [output_1461/DeepMainmast.pdb].
./dmm_full_multithreads.sh -p ./ -c 0.3 -o output_1461 -t 1200 -T 600 -C 8 -M 8 -m ./data/1461/emd_1461.mrc -f ./data/1461/emd_1461.fasta -H -A ./data/1461/emd_1461_af2.pdb -x [ROSETTA PROGRAM PATH]
Since homo-oligomer includes identical chains, -H argument is required in the command line.
The outputs will be generated in [output_1461/results] directory. The expected output structure is [output_1461/DeepMainmast.pdb].
The example output is kept here for your reference. All intermediate outputs are also kept here.
Example: multi-chain structure modeling (EMD-2513, HeteroOligomer)
Please run the following command in DeepMainMast program folder.
Examples are kept in data/2513 directory.
- Target sequence file (-f): emd_2513.fasta
- MAP file (-m): emd_2513.mrc
- AlphaFold2 Model (optional -A): emd_2513_af2.pdb
./dmm_full_multithreads.sh -p ./ -c 0.01 -o output_2513 -t 1200 -T 1200 -C 8 -M 8 -m ./data/2513/emd_2513.mrc -f ./data/2513/emd_2513.fasta
The outputs will be generated in [output_2513/results] directory. The expected output structure is [output_2513/DeepMainmast.pdb].
This example does not include any identical chains, so we should not add -H in command line. For other examples that may include identical chains in the complex, please do not forget to add -H in command line.
./dmm_full_multithreads.sh -p ./ -c 0.01 -o output_2513 -t 1200 -T 1200 -C 8 -M 8 -m ./data/2513/emd_2513.mrc -f ./data/2513/emd_2513.fasta -A ./data/2513/emd_2513_af2.pdb
The outputs will be generated in [output_2513/results] directory. The expected output structure is [output_2513/DeepMainmast.pdb].
This example does not include any identical chains, so we should not add -H in command line. For other examples that may include identical chains in the complex, please do not forget to add -H in command line.
./dmm_full_multithreads.sh -p ./ -c 0.01 -o output_2513 -t 1200 -T 1200 -C 8 -M 8 -m ./data/2513/emd_2513.mrc -f ./data/2513/emd_2513.fasta -A ./data/2513/emd_2513_af2.pdb -x [ROSETTA PROGRAM PATH]
The outputs will be generated in [output_2513/results] directory. The expected output structure is [output_2513/DeepMainmast.pdb].
This example does not include any identical chains, so we should not add -H in command line. For other examples that may include identical chains in the complex, please do not forget to add -H in command line.
The example output is kept here for your reference. All intermediate outputs are also kept here.
Q: How to configure the GPU correctly for DeepMainMast to run?
Please run the following command line before you run DeepMainMast in multi-GPU machines.
export CUDA_VISIBLE_DEVICES=[gpu_id]
Here the gpu_id should be the GPU ID that you want to use.
Q: How to interpret different models in DeepMainmast.pdb?
The 3D model is colored by DAQ(AA) score scaled from red (-1.0) to blue (1.0) with a 19 residues sliding window.
The 3D model contains four models (with rosetta)/ two models (without rosetta).
In MODEL1, Ca-only structure, all modeled positions are colored by DAQ(AA) score.
In MODEL2, Ca-only structure, amino acids with DAQ(CA) score below -0.5 are excluded, and amino acids with DAQ(AA) score below -0.5 are replaced with UNK.
In MODEL3, full-atomic structure, all modeled positions are colored by DAQ(AA) score.
In MODEL4, full-atomic structure, amino acids with DAQ(CA) score below -0.5 are excluded, and amino acids with DAQ(AA) score below -0.5 are replaced with UNK.