Introducing the Peptonizer2000 - a tool that combines the capabilities of Unipept and PepGM to analyze metaproteomic mass spectrometry-based samples. PepGM was originally designed for taxonomic inference of viral mass spectrometry-based samples; we've extended its functionality to analyze metaproteomic samples by retrieving taxonomic information from the Unipept database.
PepGM is a probabilistic graphical model developed by the eScience group at BAM (Federal Institute for Materials Research and Testing) that uses belief propagation to infer the taxonomic origin of peptides and taxa in viral samples. You can learn more about PepGM on our GitHub page.
Unipept, on the other hand, is a web-based metaproteomics analysis tool that provides taxonomic information for identified peptides. To make it work seamlessly with PepGM, we've extended Unipept with new functionalities that restrict the taxa queried and provide all potential taxonomic origins of the peptides queried. Check out more information about Unipept here.
With the Peptonizer2000, you can look forward to a comprehensive and streamlined workflow that simplifies the process of identifying peptides and their taxonomic origins in metaproteomic samples.
The Peptonizer2000 workflow comprises the following steps:
- Query all identified peptides, provided by the user in a .tsv file, in the Unipept API, and restrict the taxonomic range queried based on any prior knowledge of the sample.
- Assemble the peptide-taxon associations provided by Unipept into a bipartite graph, where peptides and taxa are represented by different nodes, and an edge is drawn between a peptide and a taxon if the peptide is part of the taxon's proteome.
- Transform the bipartite graph into a factor graph using convolution trees and conditional probability table factors (CPD).
- Run the belief propagation algorithm multiple times with different sets of CPD parameters until convergence, to obtain posterior probabilities of candidate taxa.
- Use an empirically deduced metric to determine the ideal graph parameter set.
- Output the top scoring taxa as a results barchart. The results are also available as comma-separated files for further downstream analysis or visualizations.
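The graph assembly step above can be sketched in a few lines of Python. This is a minimal illustration only, not the Peptonizer2000 implementation: the function name `build_bipartite_graph`, the peptide sequences, and the taxon IDs are made up for the example, and the graph is stored as two plain adjacency maps rather than in a graph library.

```python
from collections import defaultdict


def build_bipartite_graph(associations):
    """Build a peptide-taxon bipartite graph as two adjacency maps.

    `associations` is an iterable of (peptide, taxon_id) pairs, i.e. one
    pair per edge between a peptide node and a taxon node.
    """
    peptide_to_taxa = defaultdict(set)
    taxon_to_peptides = defaultdict(set)
    for peptide, taxon in associations:
        # An edge means the peptide is part of the taxon's proteome.
        peptide_to_taxa[peptide].add(taxon)
        taxon_to_peptides[taxon].add(peptide)
    return dict(peptide_to_taxa), dict(taxon_to_peptides)


# Illustrative, made-up associations (not real Unipept output).
example_associations = [
    ("PEPTIDEA", 562),
    ("PEPTIDEA", 1280),
    ("PEPTIDEK", 562),
]
peptide_to_taxa, taxon_to_peptides = build_bipartite_graph(example_associations)
print(sorted(peptide_to_taxa["PEPTIDEA"]))
print(sorted(taxon_to_peptides[562]))
```

Peptides shared by several taxa, such as the first one here, are what make the inference ambiguous and motivate the probabilistic treatment in the later steps.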
- A .tsv file of your peptides output from any proteomic peptide search method. The first column should be the peptide, the second column its score as attributed by the search engine. An example is provided in the test files.
- A config file with your parameters for the Peptonizer2000. A more detailed description of the configuration file can be found below. Additionally, an example config file is provided in this repository.
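To check that a peptide file has the expected two-column layout before starting the workflow, something like the following sketch may help. It assumes a tab-separated file without a header row; the function name `read_peptide_scores` and the example rows are hypothetical and not part of the Peptonizer2000.

```python
import csv
import io


def read_peptide_scores(handle):
    """Read (peptide, score) rows from a tab-separated file handle."""
    rows = []
    reader = csv.reader(handle, delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"line {line_number}: expected peptide and score")
        # First column: peptide sequence; second column: search engine score.
        rows.append((row[0].strip(), float(row[1])))
    return rows


# Made-up example content standing in for a real .tsv file.
example = io.StringIO("PEPTIDEA\t0.98\nPEPTIDEK\t0.45\n")
peptides = read_peptide_scores(example)
print(peptides)
```

For a file on disk, pass an open handle instead, for example `open(path, newline="")`.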
Make sure you have git installed and clone the repo:
git clone https://github.com/BAMeScience/Peptonizer2000.git
The Peptonizer relies on a snakemake workflow developed with snakemake 5.10.0.
Installing snakemake requires mamba.
To install mamba:
conda install -n <your_env> -c conda-forge mamba
Alternatively, if you do not have conda installed, you can download mamba directly together with miniforge (instructions from the mamba installation guide):
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
To install snakemake:
conda activate <your_env>
mamba install -c conda-forge -c bioconda -n <your_snakemake_env> snakemake
In accordance with the Snakemake recommendations, we suggest saving your sample data in the resources folder. All outputs will be saved in results.
Additional dependencies necessary are Java and GCC.
The Peptonizer2000 is tested on Linux.
All necessary binaries are automatically installed using conda.
The Peptonizer2000 relies on a configuration file in yaml format to set up the workflow.
An example configuration file is provided in config/config.yaml.
Do not change the config file location.
Peptonizer parameter
- DataDir: Relative path to raw spectra
- ResultsDir: Relative path to results
- ResourcesDir: Relative path to resources
- ExperimentName: Name of subfolder in results
- TaxaInPlot: Number of inferred taxa shown in the barplot created from the results csv
- Alpha: Grid search increments for alpha
- Beta: Grid search increments for beta
- prior: grid search increments for prior
Sample specific parameter
- PeptidesAndScores: path to your .tsv file of input peptides
- SampleName: wildcard for spectra file and folder name
UniPept parameter
- TaxaNumber: # of taxa
- targetTaxa: Comma separated list of taxa included in the UniPept query. If querying all of Unipept, use '1'
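The Alpha, Beta and prior entries together define a parameter grid. As a rough sketch of how such a grid expands into individual runs, assuming each entry is given as a list of candidate values (the values below are made up, and `parameter_grid` is a hypothetical helper, not a Peptonizer2000 function):

```python
from itertools import product


def parameter_grid(alpha, beta, prior):
    """Return every (Alpha, Beta, prior) combination as a dict."""
    return [
        {"Alpha": a, "Beta": b, "prior": p}
        for a, b, p in product(alpha, beta, prior)
    ]


# Hypothetical candidate values, chosen only for illustration.
grid = parameter_grid(
    alpha=[0.7, 0.9],
    beta=[0.2, 0.5],
    prior=[0.1, 0.3, 0.5],
)
print(len(grid))  # 2 * 2 * 3 = 12 combinations
```

The number of belief propagation runs grows with the product of the list lengths, so coarse increments keep the grid search short.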
All Peptonizer2000 output files are saved into the results folder and include the following:
Main results:
- Peptonizer_Results.csv: Table with values ID, score, type (contains all taxids under 'ID' and all probabilities under 'score')
- A barplot of the posterior probabilities of the n (default: 15) highest scoring taxa
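For downstream analysis, the results table can be read with the Python standard library. The sketch below assumes the csv has a header row with the columns ID, score and type as described above; the helper `top_taxa` and the example rows (including the placeholder value in the type column) are made up for illustration.

```python
import csv
import io


def top_taxa(handle, n=15):
    """Return the n highest scoring (taxid, score) pairs from a results csv."""
    reader = csv.DictReader(handle)
    rows = [(row["ID"], float(row["score"])) for row in reader]
    # Sort by posterior probability, highest first.
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:n]


# Made-up content standing in for Peptonizer_Results.csv.
example = io.StringIO(
    "ID,score,type\n"
    "562,0.91,example\n"
    "1280,0.12,example\n"
    "287,0.55,example\n"
)
best = top_taxa(example, n=2)
print(best)
```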
Additional (intermediate):
- Intermediate results folders, sorted by prior value, for all grid search parameter combinations
- TaxaWeights.csv: csv file of all taxids that had at least one protein map to them and their weight
- PepGM_graph.graphml: graphml file of the graphical model (without convolution tree factors). Useful to visualize the graph structure and peptide-taxon connections
- paramcheck.png: barplot of the metric used to determine the graphical model parameters for n (default: 15) best performing parameter combinations
- additional .csv files resulting from the clustering of taxa by peptidome
- log files for bug fixing
To test the Peptonizer2000 and see if it is set up correctly on your machine, we provide a test file under resources/test_files. This should be downloaded automatically if you follow the installation instructions above. The test file is a .tsv resulting from sample S03 of the CAMPI study, searched against a sample-specific database using X!Tandem and MS2Rescore. The original files are available through PRIDE under PXD023217.
To execute a test run of the Peptonizer2000 using the provided files:
- Follow the installation instructions above
- In your terminal, go to the folder resources/test_files
- Execute the following command to copy the config file to the right directory
cp ./config.yaml ../../config/
- You need to make some alterations to the provided example config file.
- Input the path to the S03 .tsv file. It should be something like 'path_to_workflow_directory/resources/SampleData/S03_test.tsv'
You should now be all set up to run the Peptonizer2000 on the test files. In your terminal, run
snakemake --use-conda --cores <n>
where <n> is the number of cores available on your machine to run this workflow. Make sure the mamba environment in which you installed snakemake is active.
Distributed under the MIT License. See LICENSE.txt for more information.
Tanja Holstein - @HolsteinTanja - tanja.holstein@ugent.be
Pieter Verschaffelt - pieter.verschaffelt@ugent.be