SMSNet

Sequence-Mask-Search framework and SMSNet model for de novo peptide sequencing.

The pre-trained model can be downloaded from FigShare.

SMSNet's predicted amino acid sequences for a public HLA peptidome dataset (MassIVE accession MSV000080527) and phosphoproteome dataset (PRIDE accession PXD009227) can be found on FigShare.

The preprint can be found on bioRxiv.

Dependencies

This project uses Python 3.5.2, with the following lib dependencies:

Tensorflow 1.4 (compatible up to 1.11)
Keras 2.2.4

A list of all python packages can be found in requirement.txt

Instructions

Decode

python run.py --model_dir <model_directory> --inference_input_file <input_file_directory/input_file.mgf> --rescore

<model_directory> is the directory of the model (can be downloaded from the link above).

<input_file_directory/input_file.mgf> is the path to the input file.

Using --rescore flag will generate another probability file with suffix “_rescore” in the same directory. The output will be in “<input_file_directory>_output/”.

Other options can be found in "run.py".

Model parameters (including possible amino acids) can be found in "nmt/input_config.py".

Note

In order to generate the report file, the TITLE lines in .mgf file must end with "scan=".
To switch between m-mod and p-mod model, the following changes are needed (default: p-mod):
- tgt_vocab_size (24 for m_mod / 27 for p-mod) and tgt_vocab_file in run.py line 61-62.
- Comment/uncomment the possible vocab in inverse_vocab in nmt/input_config.py accordingly ('s', 't', 'y' at line 65, 67, 71).
- Select the corresponding AA_weight_table in function create_aa_tables() in nmt/input_config.py (by comment/uncomment line 169-174 or 176-180).

Outputs

For each input file, three output files will be generated in the output directory: <input_file>, <input_file>_prob and, <input_file>_rescore. They are the output sequences, probabilities of each amino acid, and probabilities of each amino acid after rescoring, respectively.
The report summarizing the outputs in .tsv format will be in the same parent directory as the input.

Example

Decoding “test_decode/test_file.mgf” with a model in "model/m_mod/".


python run.py --model_dir model/m_mod/ --inference_input_file test_decode/test_file.mgf --rescore

The report will be in test_decode_output.

The model will also produce three files:

test_decode_output/test_file (sequence)
test_decode_output/test_file_prob (score)
test_decode_output/test_file_rescore (score after rescoring with post-processing model)

Database search

For database searching, change the file name in utils_masking/SMSNet_final_database_search.py, then run python utils_masking/SMSNet_final_database_search.py

Acknowledgement

This code is based on TensorFlow Neural Machine Translation GNMT.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
nmt		nmt
utils_data		utils_data
utils_masking		utils_masking
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
run.py		run.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SMSNet

Sequence-Mask-Search framework and SMSNet model for de novo peptide sequencing.

Dependencies

Instructions

Decode

Note

Outputs

Example

Database search

Acknowledgement

About

Releases

Packages

Languages

License

guhanrv/SMSNet

Folders and files

Latest commit

History

Repository files navigation

SMSNet

Sequence-Mask-Search framework and SMSNet model for de novo peptide sequencing.

Dependencies

Instructions

Decode

Note

Outputs

Example

Database search

Acknowledgement

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages