Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid MAPQ=60 for single-end multimappers #327

Merged
merged 2 commits into from
Aug 23, 2023
Merged

Avoid MAPQ=60 for single-end multimappers #327

merged 2 commits into from
Aug 23, 2023

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Aug 21, 2023

by ensuring that we compute at least two alignments even if the first that is found has edit distance 0.

Closes #293

Because I still need to test whether this actually fixes #293, I’m marking this as draft.

@ksahlin
Copy link
Owner

ksahlin commented Aug 21, 2023

approved.

Base automatically changed from refactor to main August 21, 2023 19:03
@marcelm marcelm marked this pull request as draft August 21, 2023 19:04
by ensuring that we compute at least two alignments even if the first that
is found has edit distance 0.

See #293
@marcelm
Copy link
Collaborator Author

marcelm commented Aug 23, 2023

(Note that I had a typo/logic error in the PR and pushed a fix.)

I tried to do some measurements.

It’s a bit hard to inspect multimappers only, so I looked at incorrectly mapped reads instead, which should all be multimappers: Assuming a couple of things about how the read mapper works, these should have at least two roughly equivalent mapping locations (from the read mapper’s point of view): The correct one and the one the read mapper actually outputs. So they should get MAPQ=0.

So issue #293 is essentially that we get many incorrectly mapped reads that are assigned MAPQ>0, which should actually have been assigned MAPQ=0.

I generated a test dataset with know mapping locations by picking every tenth 100-mer from the Drosophila reference (about 14.4 million reads). Then I could count incorrectly mapped reads that have MAPQ>0 and MAPQ=0, respectively. For good measure, I also counted the correctly mapped reads with MAPQ>0 and MAPQ=0. The results are as follows:

Read category main (40748fe) this PR (097807b) BWA-MEM
Incorrect and MAPQ>0 (bad) 9.87% 0.54% 0%
Incorrect and MAPQ=0 (good) 0.03% 9.36% 9.94%
Correct and MAPQ>0 90.08% 87.20% 86.63%
Correct and MAPQ=0 0% 2.88% 3.41%

I think the relevant numbers are the two bold ones: The "bad" alignments are much closer to zero and the number of "good" ones has gone up. Also, all four percentages are now closer to BWA-MEM.

Note that the percentage of "correct and MAPQ>0" goes down (and "correct and MAPQ=0" up accordingly). I assume these are mostly multimappers that strobealign by chance placed into the "correct" position (the one from which the k-mer was sampled). They previously got MAPQ=60, but since they are multimappers, they should get MAPQ=0, which they now do.

@marcelm marcelm marked this pull request as ready for review August 23, 2023 16:23
@ksahlin
Copy link
Owner

ksahlin commented Aug 23, 2023

great improvement, approved!

@marcelm marcelm merged commit 678654c into main Aug 23, 2023
9 checks passed
@marcelm marcelm deleted the mapqse branch August 23, 2023 19:25
@marcelm
Copy link
Collaborator Author

marcelm commented Aug 23, 2023

I forgot to add a changelog entry in this PR and took the liberty of pushing that directly to main without a PR.

@marcelm
Copy link
Collaborator Author

marcelm commented Aug 30, 2023

I realized I hadn’t check the performance impact of this change, so here are some measurements on 100 bp reads.

  • Drosophila: ~3% slower
  • Maize: ~9% slower
  • CHM13: ~5% slower
  • Rye: ~5% slower

These measurements were quite noisy, especially for rye.

@ksahlin
Copy link
Owner

ksahlin commented Aug 30, 2023

Ok, worth it.

@marcelm
Copy link
Collaborator Author

marcelm commented Aug 30, 2023

Agreed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reads get MAPQ=60 although they are multimappers
2 participants