Updated documentation and Dockerfile

thinkle12 · Dec 28, 2024 · 7257504
1 parent 5d4e57c
commit 7257504

Showing 7 changed files with 144 additions and 48 deletions.
diff --git a/Dockerfile b/Dockerfile
@@ -21,4 +21,7 @@ RUN pip install -r requirements.txt
 
 RUN python setup.py install
 
-WORKDIR /
+RUN chmod +x /usr/local/bin/protein_inference_cli.py
+RUN chmod +x /usr/local/bin/protein_inference_heuristic_cli.py
+
+WORKDIR /
diff --git a/docs/advanced.md b/docs/advanced.md
@@ -105,7 +105,7 @@ Advanced usage flags
 1. `-p` This flag is a True/False on whether to skip appending alternative proteins from the Fasta database digestion. If this flag is left blank, it will not skip appending alternative proteins (recommended). 
 2. `-i` True/False on whether to split the IDs in the Fasta database file. If this is left blank, it will not split IDs in the Fasta database file (recommended).
 
-You can run the tool as follows:
+You can run the tool as follows with separate target and decoy files:
 ```shell
 protein_inference_cli.py \
     -t /path/to/target/file.txt \
@@ -115,6 +115,15 @@ protein_inference_cli.py \
     -o /path/to/output/directory/
 ```
 
+Or from combined files like an mzIdentML file:
+```shell
+protein_inference_cli.py \
+    -f /path/to/target/file.mzid \
+    -db /path/to/database/file.fasta \
+    -y /path/to/parameter/file.yaml \
+    -o /path/to/output/directory/
+```
+
 Running with multiple input target/decoy files:
 ```shell
 protein_inference_cli.py \
@@ -125,6 +134,15 @@ protein_inference_cli.py \
     -o /path/to/output/directory/
 ```
 
+Or from multiple mzIdentML / idXML / pepXML files:
+```shell
+protein_inference_cli.py \
+    -f /path/to/target/file1.mzid /path/to/target/file2.mzid \
+    -db /path/to/database/file.fasta \
+    -y /path/to/parameter/file.yaml \
+    -o /path/to/output/directory/
+```
+
 
 #### Running Within Python
 To run within a python console please see the following example:
@@ -149,7 +167,29 @@ pipeline = ProteinInferencePipeline(parameter_file=yaml_params,
 pipeline.execute()
 ```
 
+Or, to run mzIdentML files within Python:
+```python
+from pyproteininference.pipeline import ProteinInferencePipeline
+
+yaml_params = "/path/to/yaml/params.yaml"
+database = "/path/to/database/file.fasta"
+# combined_files can either be a list of files or one file
+mzid_files = ["/path/to/file1.mzid", "/path/to/file2.mzid"]
+# target_files and decoy_files are left as None when combined files are supplied
+output_directory_name = "/path/to/output/directory/"
+
+pipeline = ProteinInferencePipeline(parameter_file=yaml_params,
+                                    database_file=database,
+                                    target_files=None,
+                                    decoy_files=None,
+                                    combined_files=mzid_files,
+                                    output_directory=output_directory_name)
+# Calling .execute() will initiate the pipeline with the given data
+pipeline.execute()
+```
+
 ### Running the Heuristic Method
+**NOTE: The Heuristic Method is experimental and has not yet been extensively tested on multiple datasets. Check back for updates on this tool.** <br> <br>
 Py Protein Inference also has a built-in Heuristic that runs through four inference methods (Inclusion, Exclusion, Parsimony, and Peptide Centric) and selects a recommended method for your given dataset. 
 By default, all four result files will be written, and the optimal method will be highlighted to the user.
 The Heuristic method also outputs a density plot that showcases all the inference methods compared to one another to gain further insight. For more information on the Heuristic Method see the [__Heuristic algorithm__](supplementary.md#heuristic-algorithm) section.
@@ -327,6 +367,6 @@ Console Output is as follows and indicates the recommended method at the end:
 Below is an example of a Heuristic Density plot. The plot indicates the distribution of the number of standard deviations 
 from the mean (of identified proteins at a specified FDR) for each inference method for a range of FDRs from 0 to the false discovery rate threshold (100 FDRs are selected incrementally in the range [0, FDR threshold]).
 In general, the closer that the peak of a distribution is to 0 the more likely the associated method is to be selected as the recommended method.
-For more information on the specifics of the Heuristic Algorithm see [__Heuristic Algorithm Description__](supplementary.md#heuristic-algorithm-description)
+For more information on the specifics of the Heuristic Algorithm see [__Heuristic Algorithm Description__](supplementary.md#heuristic-algorithm)
 
 ![density](img/swissprot_example_density.png)
diff --git a/docs/index.md b/docs/index.md
@@ -1,10 +1,9 @@
 ## Introduction
 
+**PyProteinInference** is a Python package for running various protein inference algorithms on tandem mass spectrometry search results and generating protein to peptide mappings with protein level false discovery rates.  
 
 For a quick start guide please click [here](quickstart.md).
 
-**PyProteinInference** is a Python package for running various protein inference algorithms on tandem mass spectrometry search results and generating protein to peptide mappings with protein level false discovery rates..  
-
 **Key Features** <br>
 
 * **Protein Inference and Scoring**:
@@ -14,7 +13,7 @@ For a quick start guide please click [here](quickstart.md).
 * **Supported Input Formats**:
     * Search Result File Types: [idXML](input_format.md#idxml), [mzIdentML](input_format.md#mzidentml), or [pepXML](input_format.md#pepxml).  
     * PSM files from [Percolator](https://github.com/percolator/percolator).
-    * Custom tab-delimited files (see [input formats](input_format.md#custom-input)).  
+    * Custom tab-delimited [files](input_format.md#custom-input).  
 * **Output**:
     * User-friendly CSV file containing Proteins, Peptides, q-values, and Protein Scores.  
     * Details on output formats: [supplementary](supplementary.md#export-types).  
@@ -49,5 +48,5 @@ Please see the [__Protein Score Types__](supplementary.md#protein-score-types) s
 ## Using Py Protein Inference
  1. [Yaml Parameter File](parameters.md#yaml-parameter-file-outline)
  2. [Input File Examples](input_format.md#input-file-examples) (idXML, mzIdentML, pepXML, Tab Delimited)
- 3. [Fasta Database](input_format.md#fasta-file-example)
+ 3. [Fasta Database](input_format.md#fasta-file)
  4. [Running Py Protein Inference](advanced.md#running-py-protein-inference)
diff --git a/docs/input_format.md b/docs/input_format.md
@@ -50,17 +50,19 @@ For example, lets say you wanted to use the `Comet:xcorr` value (for Comet Xcorr
 ```
 
 One of the Accession values can also be used. For example, `MS:1001493` actually means `Posterior Error Probability`. <br>
-Since this is the Posterior Error Probability from percolator, the input psm_score would actually be posterior_error_prob. This will be changed in a future release to be the accession `MS:1001493` instead. <br>
+You would simply enter `MS:1001493` to use that PSM score.
 
 So to use that value as the PSM score you might have this as your score section for your parameter file:
 
 ```yaml
   score:
     protein_score: multiplicative_log
-    psm_score: posterior_error_prob
+    psm_score: MS:1001493
     psm_score_type: multiplicative
 ```
 
+Remember, `MS:1001493` actually means `Posterior Error Probability`. Please refer to your specific search algorithm's documentation to learn more about what each accession means.
+
 ### mzIdentML
 Let's inspect a standard MSGF+ based mzIdentML file for assistance in running pyProteinInference from an mzIdentML file. <br>
 
@@ -93,8 +95,8 @@ When inspecting a file you might see a spectrum identification as such:
 </SpectrumIdentificationItem>
 ```
 
-When selecting custom scores to use from your mzIdentML file you can select any of the cvParam options under SpectrumIdentificationItem. For example, we might want to use the MS-GF:PepQValue (MS:1002055). <br>
-However, when selecting custom scores from mzIdentML files you will actually input the `accession` listed above instead of the `name`. <br>
+When selecting custom scores to use from your mzIdentML file you can select any of the `cvParam` options under SpectrumIdentificationItem. For example, we might want to use the `MS-GF:PepQValue` (`MS:1002055`). <br>
+An important note when selecting custom scores from mzIdentML files is that you will actually input the `accession` listed above instead of the `name`. <br>
 So, for selecting `MS-GF:PepQValue`, you would actually input `MS:1002055` into the parameter file (See below). <br>
 Also, given that `MS:1002055` scores are better when they are lower, we would insert it as a multiplicative score. <br>
 The `score` section of the parameter file in this case might look like this:
@@ -106,6 +108,8 @@ The `score` section of the parameter file in this case might look like this:
     psm_score_type: multiplicative
 ```
 
+Remember, you can use any `cvParam` accession from your mzIdentML file. Make sure to note if the selected score is better when lower or higher to calculate protein scores correctly.
+
 ### pepXML
 
 Next, let's inspect a standard MSFragger Version 4.0 based pepXML file for assistance in running pyProteinInference from a pepXML file.