Some Rescue Results in SQANTI3 Not Comprehensible #327

klenhart · 2024-09-12T15:25:34Z

Hello :)

I ran SQANTI Rescue on my dataset and noticed that NA values in the ratio_TSS column of the classification file are not handled when running Rescue with the ML filter option. I manually replaced NAs with 1 (as I saw it being done when preparing files for running the ML filter in the SQANTI3_MLfilter.R script). I'm not sure if this is intentional. If so, it would be helpful to understand the reasoning behind it.

After successfully running SQANTI3 Rescue, I observed some inconsistencies when running it in "full" mode.

For example: I have a rescue candidate, G10096.7, which was subject to rescue by mapping. This rescue candidate appears twice in the rescue table. It was only mapped to reference transcripts (ENST00000342988.7 and ENST00000398417.6), according to the mapping hits table. Only one mapping (ENST00000342988.7) passed the filter with hit_POS_MLprob = 1. However, the candidate G10096.7 was not rescued due to the exclusion reason "reference_already_present," and the "best_match_for_candidate" is listed as a "reference_transcript" (presumably ENST00000342988.7, which is also present in the best_match_id column). Given this result, I expected that transcript ENST00000342988.7 would be present in the rescued GTF file, but it is not.

When I searched for other matches to this transcript in the classification file (column associated_transcript, obtained prior to Rescue when running the ML filter), I found one long-read transcript (G10096.1), defined as Isoform (POS_MLprob = 0.752), which belongs to the ISM category.

According to the documentation, "we consider all reference or long-read-defined transcripts from genes that have at least one rescue candidate to be rescue targets." Now, I'm wondering why the rescue candidate was not mapped to this long-read transcript G10096.1 (since it’s a long-read transcript from the same gene as the candidate). I understand that, to avoid redundancy, the reference transcript was ignored because a long-read transcript with the same associated reference transcript is already present in the transcriptome. However, this long-read transcript has a lower ML probability, and the candidate was never mapped to it. It seems counterintuitive that the reference transcript was not added to the final transcriptome.

Has anyone had similar experiences or can explain the rationale behind this behavior?

Cheers,
Katharina

aarzalluz (Maintainer) · 2024-09-24T10:22:06Z

Hi @klenhart,

I am no longer a part of the SQANTI3 dev team, but I occasionally jump in to see if I can help. I saw your question and thought I could add some insight:

> I ran SQANTI Rescue on my dataset and noticed that NA values in the ratio_TSS column of the classification file are not handled when running Rescue with the ML filter option. I manually replaced NAs with 1 (as I saw it being done when preparing files for running the ML filter in the SQANTI3_MLfilter.R script). I'm not sure if this is intentional. If so, it would be helpful to understand the reasoning behind it.

I just realized that the script to run the ML classifier on the reference lacks a replacement value for the ratio_TSS column. I will submit a PR for the team to check.

Anyhow, the rationale behind NA handling in the SQANTI3 ML filter is as follows: we assume that an NA is equivalent to a low-quality value for any classification column (quality attribute), so we replace NAs with a different value depending on the attribute. For the ratio_TSS column, a value of 1 is equivalent to having the same coverage upstream and downstream of the TSS, which means no short-read validation of the site. In summary, an NA will "penalize" the transcript as much as a poor-quality value for that attribute would. This was thoroughly discussed with @FJPardoPalacios during the SQANTI3 upgrade. We decided to do it mainly because the random forest classifier that is trained under the hood could not deal with NAs, at least to the best of our knowledge at the time.
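As a rough illustration of that replacement (not the SQANTI3 code itself), here is a minimal Python sketch that sets missing ratio_TSS values to 1 in a tab-separated classification table. The `fill_ratio_tss` helper, the set of NA tokens, and the toy ratio_TSS value of 2.5 are assumptions made for the example; only the ratio_TSS column name and the replacement value of 1 come from the discussion above.

```python
import csv
import io

# Strings treated as missing values (an assumption for this sketch).
NA_TOKENS = {"", "NA", "NaN"}


def fill_ratio_tss(rows, replacement="1"):
    """Return copies of the rows with missing ratio_TSS set to `replacement`."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row.get("ratio_TSS", "") in NA_TOKENS:
            row["ratio_TSS"] = replacement
        filled.append(row)
    return filled


# Toy table: the isoform IDs are from this thread, the values are made up.
toy = "isoform\tratio_TSS\nG10096.1\t2.5\nG10096.7\tNA\n"
rows = list(csv.DictReader(io.StringIO(toy), delimiter="\t"))
filled = fill_ratio_tss(rows)
```

Applying something like this to the classification file before running Rescue mirrors the manual workaround described in the question.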

> I understand that, to avoid redundancy, the reference transcript was ignored because a long-read transcript with the same associated reference transcript is already present in the transcriptome. However, this long-read transcript has a lower ML probability, and the candidate was never mapped to it. It seems counterintuitive that the reference transcript was not added to the final transcriptome.

That is exactly the case. When several long-read (LR) defined transcripts are associated with the same reference, we make two assumptions: a) if sufficiently supported, the LR transcript can be considered an alternative isoform of the same gene, and b) if it does not pass the filter, the LR transcript can be considered a false positive variant (sequencing error or library prep artifact) of another, true positive transcript, whether or not that transcript is represented in the LR transcriptome. Rescue-by-mapping tries to find that "best match" (or closest match) true positive transcript by running minimap2 under the hood (rescue candidates vs. targets from the same gene). We do not really control which transcript maps to which; the mapping is based purely on sequence similarity, and we then select the hit with the best quality according to the ML filter probability.
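The selection step can be sketched roughly as follows. This is an illustration of the idea, not the SQANTI3 implementation: the `best_match` function, the dictionary layout, the 0.7 threshold and the 0.2 probability for the second hit are all assumptions; only the transcript IDs, the hit_POS_MLprob name and the passing probability of 1 come from the question.

```python
def best_match(hits, threshold=0.7):
    """Per candidate, keep the filter-passing target with the highest ML probability.

    `threshold` is a hypothetical cutoff chosen for this sketch.
    """
    best = {}
    for hit in hits:
        if hit["hit_POS_MLprob"] < threshold:
            continue  # target did not pass the ML filter
        current = best.get(hit["candidate"])
        if current is None or hit["hit_POS_MLprob"] > current["hit_POS_MLprob"]:
            best[hit["candidate"]] = hit
    return {candidate: hit["target"] for candidate, hit in best.items()}


# Toy mapping hits for the candidate discussed in this thread.
hits = [
    {"candidate": "G10096.7", "target": "ENST00000342988.7", "hit_POS_MLprob": 1.0},
    {"candidate": "G10096.7", "target": "ENST00000398417.6", "hit_POS_MLprob": 0.2},
]
selected = best_match(hits)
```

Note that a target can only be selected if minimap2 reported a hit to it in the first place, which is why G10096.1 never shows up as a possible match for G10096.7.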

At the beginning of the rescue process (automatic rescue), we just select all reference isoforms lacking an LR-associated transcript (i.e. all of their LR transcripts were eliminated from the transcriptome during filtering). The removed representative can be an FSM or an ISM. If these lost references have an FSM, the FSM artifact is rescued. If not, the ISM artifacts are considered rescue candidates.

I think that your concern is valid in that it is an ISM that is "representing" that reference in the transcriptome, and thus the reference transcript never made it to the step in which we discard the ISMs as valid representatives. It just depends on how much you trust the ISM category and the filtering itself. Since this ISM was validated, we consider it reliable, and we try to minimize redundancy by not also incorporating the reference. I understand that some people may want a more complete, even if more redundant, transcriptome, which is why we made the output as detailed as we could: to allow users to define their own rescue criteria, if desired. I would suggest playing around with the output tables to also rescue references that are only supported by a filter-passing ISM, if another long-read artifact maps to them.
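One way to express such a custom criterion is sketched below. It is only a starting point under stated assumptions: `references_to_add` is a hypothetical helper, and the key names (`exclusion_reason`, `isoform`, `structural_category`) and category labels (`full-splice_match`, `incomplete-splice_match`) are assumed and should be checked against the headers of your own output tables. `best_match_id`, `associated_transcript` and `reference_already_present` are the names mentioned in this thread.

```python
def references_to_add(rescue_rows, classification_rows, passing_isoforms):
    """List references excluded as already present but supported only by non-FSM isoforms."""
    # Structural categories of the filter-passing isoforms behind each reference.
    support = {}
    for row in classification_rows:
        if row["isoform"] in passing_isoforms:
            support.setdefault(row["associated_transcript"], set()).add(
                row["structural_category"]
            )

    extra = set()
    for row in rescue_rows:
        if row["exclusion_reason"] != "reference_already_present":
            continue
        reference = row["best_match_id"]
        categories = support.get(reference, set())
        # Keep the reference only if no passing FSM already represents it.
        if categories and "full-splice_match" not in categories:
            extra.add(reference)
    return sorted(extra)


# Toy rows modelled on the case described in the question.
classification_rows = [
    {"isoform": "G10096.1", "associated_transcript": "ENST00000342988.7",
     "structural_category": "incomplete-splice_match"},
]
rescue_rows = [
    {"candidate": "G10096.7", "best_match_id": "ENST00000342988.7",
     "exclusion_reason": "reference_already_present"},
]
extra_refs = references_to_add(rescue_rows, classification_rows, {"G10096.1"})
```

The resulting IDs could then be used to pull the matching records from the reference GTF and append them to the rescued transcriptome.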

I hope that helps,

Ángeles
