Fix IPF Export with Unimod to Codename Table #101
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Exporting TSV IPF Results
When exporting peptidoform IPF results, I noticed that a few of the theoretically generated forms (non-experimentally seen forms from the library) are being reported/exported. To my understanding, we would only want to report the forms that we actually have experimental evidence for that we identify in the library (i.e. from DDA data), and the theoretical forms are only used for the scoring and assessment of localization of peptidoforms to peak-group?
It also seems to report duplicate entries for some peptidoforms that have the same RT, but a different feature id from.
To fix these issues (if they are issues and not intented behaviour), I implemented a method to generate a UNIMOD_TO_CODENAME mapping table, which is just a mapping of the experimental UniMod entry peptide ID mapping to the corresponding CodeName entry peptide ID. I use this table in the
export
module to join the SCORE_IPF table on the UniMod peptide ID, instead of the CodeName peptide IDs. I wasn't sure where was best to put this table generation step, so I just put it in a separate module callededit-osw
. Ideally, I think this table should be created from the library generation step in OpenMS (OpenSwathAssayGenerator/Decoy Generator), but for now this is easier to implement. I also added a method to add a column to the PEPTIDE table that groups the sequence and peptidoforms, more so for downstream analyses, but is not necessary. I useNOTE: These methods (
create_unimod_codename_mapping
andcreate_peptidoform_group_mapping
) introduce two new dependencies/suggestions. The first dependency isfrom pyopenms import AASequence
required for handling the modified sequence strings conversions from UniMod to CodeName. The second is more of a suggested module, but not necessary. I use Dask to create a partioned DataFrame, which allows for parallel computation on the small subset data frames. If the user doesn't have this module, then it will just perform regulargroupby
andapply
serially.Run Command
Help
Example
Using the following export command:
pyprophet export --in merged_original.osw --out merged.tsv --format=legacy_merged --max_rs_peakgroup_qvalue=0.2 --ipf_max_peptidoform_pep=0.4 --max_global_peptide_qvalue=0.01 --max_global_protein_qvalue=0.01 --no-transition_quantification
If we look at a specific example sequence, IASPIQHEHDSGSR from the synthetic phospho dilution series data
Library
The library has two experimentally seen peptidoforms IAS(Phospho)PIQHEHDSGSR(Label:13C(6)15N(4)) and IASPIQHEHDS(Phospho)GSR(Label:13C(6)15N(4)), and then there is also a theoretical generated peptidoform IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4))
Current Export Results
If we look at the results that get exported, one of the experimentally seen forms IASPIQHEHDS(Phospho)GSR(Label:13C(6)15N(4)) is identified in run 2848806567456108792. The theoretical peptidoform IASPIQHEHDSGS(Phospho)R(Label:13C(6)15N(4)) also get reported. Depending on what cutoff you use for m_score, some of these would probably get removed, so maybe it's not that big of deal? But there are a few cases where it might still pass through the cutoff as in 2848806567456108792, it gets an m_score of 0.009810127. As mentioned above, there are also duplciate entries of the same feature, from different assays; in 2848806567456108792 there is -4795326927976004004 and -7026778265845113779 mapping to the same RT.
Export Results Merged on UNIMOD_CODENAME_MAPPING
If we join the SCORE_IP table on the UniMod peptide ID instead, then we only get the experimentally seen peptidoform in the results. One caveat that comes up now, is the FullPeptideName that is returned is the one with the UniMod entry instead of the CodeName, which I guess would leave the user to have to manually look up what it responds to. If this is an issue, I could add a step to convert all UniMod to CodeName using AASequence.
Example of the UNIMOD_CODENAME_MAPPING
When generating the UNIMOD_CODENAME_MAPPING, if there is either no matching UniMod peptide ID to a corresponding CodeName peptideID or vice-versa, I simiply put a -1.
Example of the PEPTIDOFORM_GROUPING column in PEPTIDE table
Allowing Color Adjustment for Color Blind Friendly Reports
This is a minor addition to add color blind friendly color adjustment to the reports, as requested by #100
This change includes
color_blind_friendly()
method for extracting a pair of color blind friendly colors. There are 4 color palettes: 'normal'. 'protan', 'deutran', or 'tritan'.This introduces an extra argument
color_palette
for extracting one of the 4 color palette pairs, with the default being set to 'normal'.Example
Report and Message when pi0 Estimation fails
I extended the error messaged when the estimated pi0 <= 0, to show what the current lambda values were used. I also included a report pdf of the pvalue histogram used in the pi0 estimation, to give the user a better idea of why it may have failed.
Error: The estimated pi0 <= 0. Check that you have valid p-values or use a different range of lambda. Current lambda: [0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75
0.8 0.85 0.9 0.95]
Example