Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework single-end MAPQ computation #328

Merged
merged 4 commits into from
Nov 15, 2023
Merged

Rework single-end MAPQ computation #328

merged 4 commits into from
Nov 15, 2023

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Aug 21, 2023

This contains two changes: A bugfix and a change to the way MAPQ values are computed for single-end mapping.

First, the fix: There was a bug in MAPQ computation due to initializing best_score to -1000. In the first iteration, min_mapq_diff is updated in this way:

min_mapq_diff = std::max(0, alignment.score - best_score);

With best_score=-1000, this is just alignment.score + 1000, so most of the time, min_mapq_diff ends up being something like 1600 or so for typical read lengths. This is then clamped to 60 in the output, which is probably the reason why we see MAPQ=60 most of the time.

Even with the fix, I didn’t understand what the code was doing. I think the intention is to use the second best score and relate it to the best score to compute the mapping quality, but it appears this isn’t implemented that way. So the second change is to compute MAPQ values differently:

  • We keep track of the highest and second-highest alignment score in best_score and second_best_score.
  • MAPQ is computed such that it is
    • 60 for second_best_score=0 and
    • 0 for second_best_score=best_score,
      where anything in between is interpolated:
    uint8_t mapq = 60.0 * (best_score - second_best_score) / best_score;

To me, the two extreme cases (MAPQ=0 and MAPQ=60) make sense.

I’m not so sure about the linear interpolation, but it also seems to give the results one would want in some cases. For example, if best_score is 600 and second_best_score is 595, the formula gives 60 * (600-595)/600 = 0.5, which is rounded down and gives mapq=0, so if the scores are very close, the read is still considered a multimapper.

This is intended to address the single-end part of #25.

@ksahlin
Copy link
Owner

ksahlin commented Aug 21, 2023

Welcome back from vacation :) Great that you found a bug and I like most of your solution! I think I like the linear projection too.

For example, if a 150nt read has a perfect match to one place and one SNP to another, then there is a still preference reflected in the MAPQ score to the perfect location 60 * (300-290)/300 = 2 if I have understood correctly. But what happens to reads with a perfect mapping as primary but with a secondary containing a softclipp?

@marcelm
Copy link
Collaborator Author

marcelm commented Aug 23, 2023

For example, if a 150nt read has a perfect match to one place and one SNP to another, then there is a still preference reflected in the MAPQ score to the perfect location 60 * (300-290)/300 = 2 if I have understood correctly.

Yes. For longer reads, though, the situation would become as described in my example (with the score 595 vs 600), so you would get MAPQ zero again.

Hm, maybe dividing by the best score isn’t such a good idea after all? Consider the same situation for a 1000 nt read: If it has one location where it maps perfectly and one where it maps with one mismatch, then I would be quite confident that the location without the mismatch is the correct one. And my (intuitive) level of confidence would be quite similar even when the read is significantly shorter.

But also, this changes when there are more mismatches: If one location gives 9 and another 10 mismatches, I’d say the difference doesn’t matter that much and both are almost equally good. (Unless perhaps if the 9 mismatches were shared between the two, but taking that into account goes too far here.)

But what happens to reads with a perfect mapping as primary but with a secondary containing a softclipp?

The secondary should have a much lower score, so the mapping quality would be high. That is the desired outcome, right?

@ksahlin
Copy link
Owner

ksahlin commented Aug 23, 2023

Hm, maybe dividing by the best score isn’t such a good idea after all?

I am not sure what would be a good MAPQ score calculation. I just found this biostars post but haven't studied it closely. Looks a bit like magic and claims to come from this paper.

The secondary should have a much lower score, so the mapping quality would be high. That is the desired outcome, right?

Yes.

@marcelm
Copy link
Collaborator Author

marcelm commented Nov 13, 2023

Getting back to this as I’m working on #359, where I need to touch the same code: Can this PR be merged? My feeling is that the PR is maybe not the best thinkable solution, but an improvement over the status quo.

@ksahlin
Copy link
Owner

ksahlin commented Nov 15, 2023

Can this PR be merged?

Yes. However, as discussed over email, it is probably good to add the case: if mapq == 0 and best_score > second_best_score then MAPQ=1 to avoid the linear interpolation assigning MAPQ 0 when there is an actual score difference.

First, there was a bug in MAPQ computation due to initializing
best_score to -1000. In the first iteration, min_mapq_diff is updated thus:

min_mapq_diff = std::max(0, alignment.score - best_score);

This is just alignment.score + 1000, so most of the time, min_mapq_diff
ends up being something like 1600 or so for typical read lengths. (This is
clamped to 60 in the output, but still.)

Here, we keep track of the highest and second-highest alignment score in
best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated.

See #25
Keep mapping quality separate from the alignment itself appears to make more
sense:
- Mapping quality for an alignment is not a property of an alignment per se,
  but is derived from its relationship to other alignments.
- get_alignment returns an Alignment, but leaves mapq uninitialized, that
  is, we must remember to fill it in afterwards, which can easily be
  forgotten.
- Sam::add_pair is passed two mapq values and also two Alignment instances,
  which also contain mapq values. It is unclear for the caller which of the
  values is chosen.
- Adding a mapq parameter to Sam::add makes it mirror Sam::add_pair.
@marcelm
Copy link
Collaborator Author

marcelm commented Nov 15, 2023

Is it ok If I use the following alternative (pushed in commit acac70b)? The idea is to round up instead of down. It has the same effect, that is, we get MAPQ=0 only if second_best == best:

>>> def f(best, second_best):
...  return (60 * (best - second_best) + best - 1) // best
... 
>>> f(100, 0)
60
>>> f(100, 50)
30
>>> f(100, 100)
0
>>> f(10000, 9999)
1

I checked the distribution of MAPQ values on the drosophila test dataset (the one on which the CI baseline comparison script runs).

  • MAPQ=0 occurs at the same frequency as before
  • MAPQ=60 occurs at the same frequency as before
    For MAPQ=1..59, it looks like this:
    mapq-before
    mapq-after
    Not sure what I would have expected, but it does look nicer ...

@ksahlin
Copy link
Owner

ksahlin commented Nov 15, 2023

Great, I agree on your formula which is better than what I suggested. About the plot: seems like a sound distribution assuming there are repeats with 1,2,3,... mutations (and indels) to the best location.

approved to merge.

@marcelm marcelm merged commit 8c74210 into main Nov 15, 2023
9 checks passed
@marcelm marcelm deleted the mapq branch November 15, 2023 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants