Updated documentation and Dockerfile
thinkle12 committed Dec 28, 2024
1 parent 5d4e57c commit 7257504
Showing 7 changed files with 144 additions and 48 deletions.
5 changes: 4 additions & 1 deletion Dockerfile
@@ -21,4 +21,7 @@ RUN pip install -r requirements.txt

RUN python setup.py install

WORKDIR /
RUN chmod +x /usr/local/bin/protein_inference_cli.py
RUN chmod +x /usr/local/bin/protein_inference_heuristic_cli.py

WORKDIR /
44 changes: 42 additions & 2 deletions docs/advanced.md
@@ -105,7 +105,7 @@ Advanced usage flags
1. `-p` This flag is a True/False on whether to skip appending alternative proteins from the Fasta database digestion. If this flag is left blank, it will not skip appending alternative proteins (recommended).
2. `-i` True/False on whether to split the IDs in the Fasta database file. If this is left blank, it will not split IDs in the Fasta database file (recommended).

You can run the tool as follows:
You can run the tool as follows with separate target and decoy files:
```shell
protein_inference_cli.py \
-t /path/to/target/file.txt \
@@ -115,6 +115,15 @@ protein_inference_cli.py \
-o /path/to/output/directory/
```

Or from combined files like an mzIdentML file:
```shell
protein_inference_cli.py \
-f /path/to/target/file.mzid \
-db /path/to/database/file.fasta \
-y /path/to/parameter/file.yaml \
-o /path/to/output/directory/
```

Running with multiple input target/decoy files:
```shell
protein_inference_cli.py \
@@ -125,6 +134,15 @@ protein_inference_cli.py \
-o /path/to/output/directory/
```

Or from multiple mzIdentML / idXML / pepXML files:
```shell
protein_inference_cli.py \
-f /path/to/target/file1.mzid /path/to/target/file2.mzid \
-db /path/to/database/file.fasta \
-y /path/to/parameter/file.yaml \
-o /path/to/output/directory/
```


#### Running Within Python
To run within a Python console, please see the following example:
@@ -149,7 +167,29 @@ pipeline = ProteinInferencePipeline(parameter_file=yaml_params,
pipeline.execute()
```

Or, to run mzIdentML files within Python:
```python
from pyproteininference.pipeline import ProteinInferencePipeline

yaml_params = "/path/to/yaml/params.yaml"
database = "/path/to/database/file.fasta"
### mzid_files is passed as combined_files below
mzid_files = ["/path/to/file1.mzid", "/path/to/file2.mzid"]
output_directory_name = "/path/to/output/directory/"

pipeline = ProteinInferencePipeline(parameter_file=yaml_params,
database_file=database,
target_files=None,
decoy_files=None,
combined_files=mzid_files,
output_directory=output_directory_name)
# Calling .execute() will initiate the pipeline with the given data
pipeline.execute()
```
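
When the mzIdentML files sit together in one directory, the list handed to `combined_files` could be assembled rather than typed out. The following is a minimal sketch using only the standard library; the helper name `collect_mzid_files` and the directory layout are hypothetical and not part of the package.

```python
from pathlib import Path


def collect_mzid_files(directory):
    """Return the sorted paths of all .mzid files directly inside `directory`."""
    # Hypothetical helper for illustration; it is not part of pyproteininference.
    return sorted(str(path) for path in Path(directory).glob("*.mzid"))
```

The returned list could then be supplied as the `combined_files` argument in the example above, in place of the hand-written `mzid_files` list.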

### Running the Heuristic Method
**NOTE: The Heuristic Method is experimental and has not been extensively tested on multiple datasets yet. Check back for updates on this tool.** <br> <br>
Py Protein Inference also has a built-in Heuristic that runs through four inference methods (Inclusion, Exclusion, Parsimony, and Peptide Centric) and selects a recommended method for your given dataset.
By default, all four result files will be written, and the optimal method will be highlighted to the user.
The Heuristic method also outputs a density plot that showcases all the inference methods compared to one another to gain further insight. For more information on the Heuristic Method see the [__Heuristic algorithm__](supplementary.md#heuristic-algorithm) section.
@@ -327,6 +367,6 @@ Console Output is as follows and indicates the recommended method at the end:
Below is an example of a Heuristic Density plot. The plot indicates the distribution of the number of standard deviations
from the mean (of identified proteins at a specified FDR) for each inference method for a range of FDRs from 0 to the false discovery rate threshold (100 FDRs are selected incrementally in the range [0, FDR threshold]).
In general, the closer that the peak of a distribution is to 0 the more likely the associated method is to be selected as the recommended method.
For more information on the specifics of the Heuristic Algorithm see [__Heuristic Algorithm Description__](supplementary.md#heuristic-algorithm-description)
For more information on the specifics of the Heuristic Algorithm see [__Heuristic Algorithm Description__](supplementary.md#heuristic-algorithm)

![density](img/swissprot_example_density.png)
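
To make the quantity on the plot's x-axis concrete, here is a rough sketch of how "standard deviations from the mean" could be computed per method at each FDR point. It is not the package's implementation: the function name, the use of the population standard deviation, and the protein counts are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def deviations_from_mean(counts_by_method):
    """Map each method to its per-FDR number of standard deviations from the mean.

    counts_by_method: dict of method name -> list of protein counts, one per FDR point.
    """
    methods = list(counts_by_method)
    n_points = len(counts_by_method[methods[0]])
    deviations = {method: [] for method in methods}
    for i in range(n_points):
        counts = [counts_by_method[method][i] for method in methods]
        mu = mean(counts)
        sigma = pstdev(counts)  # assumption: population standard deviation
        for method in methods:
            # When every method agrees, sigma is 0; report 0 deviations then.
            z = 0.0 if sigma == 0 else (counts_by_method[method][i] - mu) / sigma
            deviations[method].append(z)
    return deviations


# Made-up protein counts at three FDR points, for illustration only.
example_counts = {
    "inclusion": [120, 150, 180],
    "exclusion": [80, 100, 120],
    "parsimony": [100, 125, 150],
    "peptide_centric": [100, 125, 150],
}
example_deviations = deviations_from_mean(example_counts)
```

In this toy input the `parsimony` and `peptide_centric` counts always equal the mean, so their deviations are all 0. In a density plot their distributions would therefore peak at 0.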
7 changes: 3 additions & 4 deletions docs/index.md
@@ -1,10 +1,9 @@
## Introduction

**PyProteinInference** is a Python package for running various protein inference algorithms on tandem mass spectrometry search results and generating protein to peptide mappings with protein level false discovery rates.

For a quick start guide please click [here](quickstart.md).

**PyProteinInference** is a Python package for running various protein inference algorithms on tandem mass spectrometry search results and generating protein to peptide mappings with protein level false discovery rates..

**Key Features** <br>

* **Protein Inference and Scoring**:
@@ -14,7 +13,7 @@ For a quick start guide please click [here](quickstart.md).
* **Supported Input Formats**:
* Search Result File Types: [idXML](input_format.md#idxml), [mzIdentML](input_format.md#mzidentml), or [pepXML](input_format.md#pepxml).
* PSM files from [Percolator](https://github.com/percolator/percolator).
* Custom tab-delimited files (see [input formats](input_format.md#custom-input)).
* Custom tab-delimited [files](input_format.md#custom-input).
* **Output**:
* User-friendly CSV file containing Proteins, Peptides, q-values, and Protein Scores.
* Details on output formats: [supplementary](supplementary.md#export-types).
@@ -49,5 +48,5 @@ Please see the [__Protein Score Types__](supplementary.md#protein-score-types) s
## Using Py Protein Inference
1. [Yaml Parameter File](parameters.md#yaml-parameter-file-outline)
2. [Input File Examples](input_format.md#input-file-examples) (idXML, mzIdentML, pepXML, Tab Delimited)
3. [Fasta Database](input_format.md#fasta-file-example)
3. [Fasta Database](input_format.md#fasta-file)
4. [Running Py Protein Inference](advanced.md#running-py-protein-inference)
12 changes: 8 additions & 4 deletions docs/input_format.md
@@ -50,17 +50,19 @@ For example, lets say you wanted to use the `Comet:xcorr` value (for Comet Xcorr
```
One of the Accession values can also be used. For example, `MS:1001493` actually means `Posterior Error Probability`. <br>
Since this is the Posterior Error Probability from percolator, the input psm_score would actually be posterior_error_prob. This will be changed in a future release to be the accession `MS:1001493` instead. <br>
You would simply enter `MS:1001493` to use that output PSM score.

So to use that value as the PSM score you might have this as your score section for your parameter file:

```yaml
score:
protein_score: multiplicative_log
psm_score: posterior_error_prob
psm_score: MS:1001493
psm_score_type: multiplicative
```

Remember, `MS:1001493` actually means `Posterior Error Probability`. Please refer to your specific search algorithm's documentation to learn more about what each accession means.
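
As a rough illustration of why a lower-is-better score such as a posterior error probability is treated as multiplicative, the sketch below combines PSM scores by taking the negative log10 of their product. It assumes that this is what `multiplicative_log` denotes, which this page does not spell out. The function name and the example values are hypothetical.

```python
import math


def negative_log10_of_product(psm_scores):
    """Combine lower-is-better PSM scores into one higher-is-better value."""
    # Summing logs is equivalent to taking the log of the product
    # and avoids underflow when many small probabilities are multiplied.
    return -sum(math.log10(score) for score in psm_scores)


# Hypothetical posterior error probabilities for the peptides of one protein.
example_score = negative_log10_of_product([0.01, 0.1])
```

Under this assumption, smaller error probabilities yield a larger combined value, so stronger evidence translates into a higher protein score.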

### mzIdentML
Let's inspect a standard MSGF+ based mzIdentML file for assistance in running pyProteinInference from an mzIdentML file. <br>

Expand Down Expand Up @@ -93,8 +95,8 @@ When inspecting a file you might see a spectrum identification as such:
</SpectrumIdentificationItem>
```

When selecting custom scores to use from your mzIdentML file you can select any of the cvParam options under SpectrumIdentificationItem. For example, we might want to use the MS-GF:PepQValue (MS:1002055). <br>
However, when selecting custom scores from mzIdentML files you will actually input the `accession` listed above instead of the `name`. <br>
When selecting custom scores to use from your mzIdentML file you can select any of the `cvParam` options under SpectrumIdentificationItem. For example, we might want to use the `MS-GF:PepQValue` (`MS:1002055`). <br>
An important note when selecting custom scores from mzIdentML files is that you will actually input the `accession` listed above instead of the `name`. <br>
So, for selecting `MS-GF:PepQValue`, you would actually input `MS:1002055` into the parameter file (See below). <br>
Also, because lower `MS:1002055` scores are better, we would enter it as a multiplicative score. <br>
The `score` section of the parameter file in this case might look like this:
@@ -106,6 +108,8 @@ The `score` section of the parameter file in this case might look like this:
psm_score_type: multiplicative
```

Remember, you can use any `cvParam` accession from your mzIdentML file. Note whether the selected score is better when lower or higher so that protein scores are calculated correctly.
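
To see which accessions a file offers, the `cvParam` entries can be listed next to their names. The sketch below uses Python's standard `xml.etree.ElementTree` on a hand-written fragment. The fragment, its values, and the `EX:0000001` entry are invented for illustration; only `MS-GF:PepQValue` (`MS:1002055`) comes from the text above. Real mzIdentML files typically declare an XML namespace, so tag names would need the namespace prefix there.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment with made-up values; not taken from a real file.
FRAGMENT = """
<SpectrumIdentificationItem id="SII_1_1" rank="1" passThreshold="true">
  <cvParam cvRef="PSI-MS" accession="MS:1002055" name="MS-GF:PepQValue" value="0.01"/>
  <cvParam cvRef="EX" accession="EX:0000001" name="Example:Score" value="42.0"/>
</SpectrumIdentificationItem>
"""


def list_cv_params(xml_text):
    """Return (accession, name, value) for every cvParam in the fragment."""
    root = ET.fromstring(xml_text)
    return [
        (param.get("accession"), param.get("name"), param.get("value"))
        for param in root.iter("cvParam")
    ]


cv_params = list_cv_params(FRAGMENT)
```

The first element of each tuple is the accession, which is the value that would go into `psm_score`. The name is only there to help identify the score.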

### pepXML

Let's next inspect a standard MSFragger Version 4.0 based pepXML file for assistance in running pyProteinInference from a pepXML file.