Skip to content

Latest commit

 

History

History
253 lines (194 loc) · 13.1 KB

readme.md

File metadata and controls

253 lines (194 loc) · 13.1 KB

Logo

The Peptonizer 2000

Integrating PepGM and Unipept for probability-based taxonomic inference of metaproteomic samples

Table of Contents
  1. About The Project
  2. Input
  3. Getting Started
  4. Usage
  5. Roadmap
  6. Contributing
  7. License
  8. Contact

About The Project

Introducing the Peptonizer2000 - a tool that combines the capabilities of Unipept and PepGM to analyze metaproteomic mass spectrometry-based samples. Originally designed for taxonomic inference of viral mass spectrometry-based samples, we've extended PepGM's functionality to analyze metaproteomic samples by retrieving taxonomic information from the Unipept database.

PepGM is a probabilistic graphical model developed by Tanja Holstein et al. that uses belief propagation to infer the taxonomic origin of peptides and taxa in viral samples. You can learn more about PepGM at GitHub page.

Unipept, on the other hand, is a web-based metaproteomics analysis tool that provides taxonomic information for identified peptides. To make it work seamlessly with PepGM, we've extended Unipept with new functionalities that restrict the taxa queried and provide all potential taxonomic origins of the peptides queried. Check out more information about Unipept here.

With the Peptonizer2000, you can look forward to a comprehensive and streamlined workflow that simplifies the process of identifying peptides and their taxonomic origins in metaproteomic samples.

The Peptonizer2000 workflow is comprised of the following steps:

  1. Query all identified peptides, provided by the user in a .tsv file, in the Unipept API, and restrict the taxonomic range queried based on any prior knowledge of the sample.
  2. Assemble the peptide-taxon associations provided by Unipept into a bipartite graph, where peptides and taxa are represented by different nodes, and an edge is drawn between a peptide and a taxon if the peptide is part of the taxon's proteome.
  3. Transform the bipartite graph into a factor graph using convolution trees and conditional probability table factors (CPD).
  4. Run the belief propagation algorithm multiple times with different sets of CPD parameters until convergence, to obtain posterior probabilities of candidate taxa.
  5. Use an empirically deduced metric to determine the ideal graph parameter set.
  6. Output the top scoring taxa as a results barchart. The results are also available as comma-separated files for further downstream analysis or visualizations.
workflow scheme

(back to top)

Input

  • A .tsv file of your peptides output from any protoemic peptide search method. The first column should be the peptide, the second column it's score attributed by the search engine. An example is provided in test files.
  • A config file with your parameters for the peptonizer2000. A more detailed description of the configuration file can be found below. Additionally, an exemplary config file is provided in this repository.

(back to top)

Getting Started

Prerequisites

Python counterpart

The actual code that builds the factor graph and executes the Peptonizer algorithm, is implemented in Python and can be found in the peptonizer folder.

Running as snakemake workflow

In order to run the Peptonizer2000 on your own system, you should install Conda, Mamba and all of its dependencies. Follow the installation instructions step-by-step for an explanation of what you should do.

  • Make sure that Conda and Mamba are installed. If these are not yet present on your system, you can follow the instructions on their README.
  • Go to the "workflow" directory by executing cd snakemake/workflow from the terminal.
  • Run conda env create -f env.yml (make sure to run this command from the workflow directory) in order to install all dependencies and create a new conda environment (which is named "peptonizer" by default).
  • Run mamba install -c conda-forge -c bioconda -n peptonizer snakemake to install snakemake which is required to run the whole workflow from start-to-finish.
  • Run conda activate peptonizer to switch the current Conda environment to the peptonizer environment you created earlier.
  • Start the peptonizer with the command snakemake --use-conda --cores 1. If you have sufficient CPU and memory power available to your system, you can increase the amount of cores in order to speed up the workflow.

Configuration file

The Peptonizer2000 relies on a configuration file in yaml format to set up the workflow. An example configuration file is provided in config/config.yaml.
Do not change the config file location.

Directory parameters
  • data_dir: relative path to output files
  • input_file: relative path to input .tsv
  • log_dir: relative path to log directory
Analysis specific parameter
  • taxa_in_graph: # of inferred taxa that appear in the barplot that is created of the results csv
  • taxa_in_plot: number of taxa reported in bar plot
  • alpha: grid search increments for alpha (list)
  • beta: grid search increments for beta (list)
  • prior: grid search increments for prior (list)
  • regularized: boolean. If True, the probability for the number of parents taxa of a peptide is regularized to be inversely proportional to the number of parents
UniPept query parameters
  • taxon_rank: rank at which results will be reported
  • taxon_query: taxa comprised in the UniPept query. If querying all of Unipept, use 1 (list)

Output files

All Peptonizer2000 output files are saved into the results folder and include the following:

Main results:

  • peptonizer_results.csv: table with values ID, score, type (contains all taxids under 'ID' and all probabilities under 'score'
  • peptonizer_results.png: bar plot of the peptonizer results showing the scores for the #'taxa_in_plot' (see config parameters) highest scoring taxa

Additional files:

  • Intermediate results folders sorted by their prior value for all possible grid search parameter combinations
  • taxa_weights_dataframe.csv: csv file of all taxids that had at least one peptide map to them and their weight
  • pepgm_graph.graphml: graphml file of the graphical model (without convolution tree factors). Useful to visualize the graph structure and peptide-taxon connections
  • sequence_scores_dataframe.csv: dataframe with petides, taxa and scores used to create the graph
  • best_parameter.csv: file with best parameter
  • unipept_responses.json: response of unipept queries
  • clustered_taxa_weights_datatframe: additional .csv file resulting from the clustering of taxa by peptidome used for rbo

(back to top)

Testing the Peptonizer

To test the Peptonizer2000 and see if it is set up correctly on your machine, we provide a test file under resources/test_files. This should be dowloaded automatically if you follow the installation instructions above. There are several test files from different metaproteomic samples. These are:

To execute a test run of the Peptonizer2000 using the provided files:

  1. Follow the installation instructions above
  2. In the config file, make sure to point to the test sample you want to use. By default, this is S03
  3. Start to peptonize with the command snakemake --use-conda --cores 1. If you have sufficient CPU and memory power available to your system, you can increase the amount of cores in order to speed up the workflow.

License

Distributed under the Apache 2.0 License. See LICENSE.txt for more information.

(back to top)

Contact

Tanja Holstein - @HolsteinTanja - tanja.holstein@ugent.be
Pieter Verschaffelt - pieter.verschaffelt@ugent.be

Logo

(back to top)