Choose random mapping locations for multimappers #364
Conversation
... by shuffling the NAMs that have the same score as the best one. See #359
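The shuffling idea can be sketched as follows. This is a minimal Python sketch with a hypothetical `Nam` record; the actual implementation in strobealign is C++ and its data structures differ:

```python
import random
from dataclasses import dataclass

@dataclass
class Nam:
    score: int
    ref_start: int

def shuffle_top_nams(nams, rng=random):
    """Sort NAMs by score (best first) and shuffle the prefix that
    shares the best score, so that a multimapper's reported location
    is chosen randomly among equally good candidates."""
    if not nams:
        return nams
    nams = sorted(nams, key=lambda n: n.score, reverse=True)
    k = 1
    while k < len(nams) and nams[k].score == nams[0].score:
        k += 1
    head = nams[:k]
    rng.shuffle(head)
    return head + nams[k:]
```

Since only the tied prefix is permuted, lower-scoring NAMs keep their order and the alignment dropoff logic downstream is unaffected.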
Awesome! I will check on the 10 E. coli genomes dataset that I generated before. (also pinging @psj1997 @luispedro FYI)
Oh, I realize that this is currently only for SE. I will do the check in SE mode on the 10 E. colis.
One could do the exact same thing with the list of NAMs in paired-end mode. Do you think this could help a little bit until we have something smarter?
I added that now. It seems to do something ...:
I first compared a30d779 (strobealign-random) with v0.11.0 (denoted strobealign). I accidentally ran in PE mode; the uniformity was still as biased as in v0.11.0 (as expected). However, I noticed a small change in accuracy, stats below. Is this expected?

Then you posted commit 57fe7df, so I tried that too (strobealign-random). Here are the accuracy stats: The bump in accuracy is quite bizarre if you ask me. This is a subset of 10k reads, so the sample size is small; however, it is nearly a 1% improvement. I have to test on a larger dataset to see if it evens out.

Now to what I was actually gonna test (the randomness). The name of the game is to get close to 0.945-0.950 in the last column (based on bowtie2's and BWA's distributions). This column measures the "fraction of bases on the chromosome with depth equal to column 2 (i.e., 0 in our case)". The true numbers for each genome are unknown. However, since reads were simulated at random, I expect them to be the same if there is no mapping bias. As we can see, the random version improves the numbers a lot, although not as evenly as BWA/Bt2 (but we do obtain higher accuracy than both of them on this small dataset).

v0.11.0
commit 57fe7df
BWA-MEM
Bowtie2
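As a side note, the target zero-depth fraction has a simple interpretation under uniform random read placement: per-base depth is approximately Poisson-distributed, so the fraction of bases with depth 0 is about exp(-coverage). This little sketch (my own, not part of the PR) makes the relation concrete:

```python
import math

def zero_depth_fraction(coverage):
    """Expected fraction of bases with depth 0 when reads land
    uniformly at random: per-base depth ~ Poisson(coverage)."""
    return math.exp(-coverage)

def implied_coverage(zero_fraction):
    """Mean coverage implied by an observed zero-depth fraction."""
    return -math.log(zero_fraction)
```

For example, a zero-depth fraction of 0.947 corresponds to a mean coverage of only about 0.054, which is plausible for a small 10k-read subset.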
I also checked the distribution in SE mode, and it evened out the distribution substantially (I don't have the BWA/Bt2 comparison stats), so I believe it works as expected.
Finally, I also checked this in mapping mode. First, in the 50 genomes dataset, the true highest and lowest numbers of reads simulated from the genomes are 4,530 and 3,680, respectively. Counting the number of reads mapping to each of the 50 genomes, I observed 8,699 and 1,974 reads mapped to the highest and lowest genome for v0.11.0, respectively. For this PR I observed 6,913 and 2,788, respectively. So a big improvement, but still quite far off the true values. However, we need to be careful not to get too disappointed here, because statistical variation comes into play (too tired to derive how large it is, but the upper bound, all reads placed completely at random, probably follows some common distribution).
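To put a rough number on that statistical variation (my own back-of-the-envelope calculation, assuming for simplicity that all 50 genomes are equally likely targets), the read count for one genome under fully random assignment is binomially distributed:

```python
import math

def binomial_sd(n_reads, p):
    """Standard deviation of one genome's read count when each of
    n_reads is assigned to it independently with probability p."""
    return math.sqrt(n_reads * p * (1 - p))
```

With roughly 50 × 4,100 ≈ 205,000 reads and p = 1/50, the standard deviation is about 63 reads per genome, so random fluctuation alone cannot explain deviations of thousands of reads from the simulated counts.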
Yes, that change is from #336. Here is the change in accuracy just from this PR. Comparing accuracy
Average difference se: +0.0028
…omly

This is only for single-end reads. This is taken from #360, but without the slowdown. The idea is that NAM shuffling takes care of the cases where we have multiple perfect hits, and this takes care of the cases where we have multiple non-perfect hits.
As I suggested by e-mail, I combined this PR with #360. PR #360 was problematic because it would not allow the optimization that one can stop doing alignments once an exact match has been found. This is taken care of by NAM shuffling: assuming that two exact matches also give the same NAM score, we pick one of these exact matches randomly. However, if the first match is not an exact one, we compute alignments until the dropoff is reached or the maximum number of sites has been tried. Since we do that already, we can, without additional cost, do the exact version from #360. This changes the samdiff output thus:
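The control flow described above can be sketched roughly like this (my own Python sketch; `align` is a hypothetical callback returning a score and an exact-match flag, and the real C++ code differs in its details):

```python
def pick_alignment(nams, align, max_tries=20, dropoff=0.5):
    """Try alignments for NAMs in order. Stops early when an exact
    match is found (NAM shuffling already randomized the tied best
    candidates) or when the NAM score drops below `dropoff` times
    the best NAM score."""
    best, best_score = None, float("-inf")
    top = nams[0].score if nams else 0
    for tries, nam in enumerate(nams):
        if tries >= max_tries:
            break
        if top and nam.score / top < dropoff:
            break  # remaining candidates are too weak to bother with
        score, is_exact = align(nam)
        if score > best_score:
            best, best_score = nam, score
        if is_exact:
            break  # an exact match cannot be improved upon
    return best
```

Because the tied top-scoring NAMs were shuffled beforehand, stopping at the first exact match still yields a random location among exact multimappers, which is what makes the early-exit optimization compatible with #360.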
So instead of 7421 (see output pasted somewhere above), now 9716 multimappers are affected. I also reproduced what you did on the 50 E. coli reference. I used different genomes and simulated 1 million reads, so the numbers are different (should be somewhat reproducible now, will document later). Here is the last column of
This is great! Did you see any noticeable slowdown? Would this approach work for PE?
It did not become measurably slower.
I think something similar could work; I will look into it.
So this PR now:
I would like to suggest that we merge this PR with the three features above included and that I open a separate PR for adding random read pair selection for paired-end reads (so as not to make the discussion here too long).
Awesome! Approved to merge.
This is an alternative to #360 that works by shuffling the top NAMs that have the same score as the best one.
See #359
samdiff.py outputs this when comparing to main:

The high number of reads that get the same alignment score before and after is a sign that this works as intended. There are some reads that get a different score, but this is as expected and the price to pay for this faster variant that only shuffles the NAMs.
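For illustration, here is a much-simplified version of the kind of count a samdiff-style comparison produces (my own toy sketch, not the actual samdiff.py): reads that moved to a different location while keeping the same AS alignment score.

```python
def parse_sam(lines):
    """Map read name -> (reference, position, AS score) for minimal
    header-less single-end SAM body lines."""
    recs = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        score = None
        for tag in fields[11:]:
            if tag.startswith("AS:i:"):
                score = int(tag[5:])
        recs[fields[0]] = (fields[2], int(fields[3]), score)
    return recs

def moved_same_score(lines_a, lines_b):
    """Count reads placed at a different location but with an unchanged
    alignment score: the signature of a reshuffled multimapper."""
    a, b = parse_sam(lines_a), parse_sam(lines_b)
    return sum(1 for name in a.keys() & b.keys()
               if a[name][:2] != b[name][:2] and a[name][2] == b[name][2])
```

Runs that only reshuffle multimappers should show many such reads but few with a changed score.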
Accuracy changes in an unremarkable way (I ran this comparison only on a subset of the datasets):
Average difference se: -0.0015
We don’t have a way at the moment to test whether this results in the intended outcome, that is, whether locations for multimappers are less biased.
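One possible way to test this (a suggestion of mine, not something in the PR) would be a chi-square goodness-of-fit of the per-genome mapped-read counts against the known simulated counts:

```python
def chi_square_stat(observed, expected):
    """Chi-square goodness-of-fit statistic: large values mean the
    observed per-genome counts deviate from the simulated truth by
    more than random fluctuation would allow."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

An unbiased mapper should produce a statistic comparable to that of BWA/Bowtie2 on the same simulated dataset; a strongly biased one would stand out by orders of magnitude.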