Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Choose random mapping locations for multimappers #364

Merged
merged 7 commits into from
Nov 20, 2023
Merged

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Nov 16, 2023

This is an alternative to #360 that works by shuffling the top NAMs that have the same score as the best one.

See #359

samdiff.py outputs this when comparing to main:

       9643 reads were unmapped before and after
          0 reads became mapped
          0 reads became unmapped
      82741 reads were mapped to same locus before and after
*      7421 reads were multimapper before and after, same alignment score (AS)
*        17 reads were multimapper before and after, better alignment score (AS)
*         5 reads were multimapper before and after, worse alignment score (AS)
*       173 reads changed in another way
     100000 total reads

The high number of reads that get the same alignment score before and after is a sign that this works as intended. There are some reads that get a different score, but this is as expected and the price to pay for this faster variant that only shuffles the NAMs.

Accuracy changes in an unremarkable way (I ran this comparison only on a subset of the datasets):

dataset main (8c74210) this PR difference
drosophila-50-se 82.5175 82.5108 -0.0067
drosophila-100-se 89.2626 89.255 -0.0076
drosophila-150-se 90.937 90.9448 +0.0078
drosophila-200-se 91.9463 91.9559 +0.0096
drosophila-300-se 93.2153 93.2187 +0.0034
maize-50-se 47.4622 47.4417 -0.0205
maize-100-se 70.5148 70.503 -0.0118
maize-200-se 86.7001 86.7083 +0.0082
maize-300-se 91.8516 91.8522 +0.0006
CHM13-50-se 81.7231 81.7321 +0.0090
CHM13-100-se 90.6468 90.6401 -0.0067
CHM13-150-se 92.4066 92.4081 +0.0015
CHM13-200-se 93.221 93.2167 -0.0043
CHM13-300-se 94.1508 94.1472 -0.0036

Average difference se: -0.0015

We don’t have a way at the moment to test whether this results in the intended outcome, that is, whether locations for multimappers are less biased.

... by shuffling the NAMs that have the same score as the best one

See #359
@ksahlin
Copy link
Owner

ksahlin commented Nov 16, 2023

Awesome! I will check on the 10 E. coli genomes dataset that I generated before. (also pinging @psj1997 @luispedro FYI)

@ksahlin
Copy link
Owner

ksahlin commented Nov 16, 2023

Oh, I realize that this is currently only for SE. I will do the check in SE mode on the 10 E. colis.

@marcelm
Copy link
Collaborator Author

marcelm commented Nov 16, 2023

One could do the exact same thing with the list of NAMs in paired-end mode. Do you think this could help a little bit until we have something smarter?

@marcelm
Copy link
Collaborator Author

marcelm commented Nov 16, 2023

One could do the exact same thing with the list of NAMs in paired-end mode. Do you think this could help a little bit until we have something smarter?

I added that now. It seems to do something ...:

      17339 reads were unmapped before and after
*         5 reads became mapped
*         2 reads became unmapped
     171879 reads were mapped to same locus before and after
*      9761 reads were multimapper before and after, same alignment score (AS)
*        22 reads were multimapper before and after, better alignment score (AS)
*        11 reads were multimapper before and after, worse alignment score (AS)
*       981 reads changed in another way
     200000 total reads

@ksahlin
Copy link
Owner

ksahlin commented Nov 16, 2023

I first compared a30d779 (strobealign-random) with v0.11.0 (denoted strobealign). I accidentally ran in PE mode, the uniformity was still as biased as in v0.11.0 (as expected). However, I noticed a small change in accuracy, stats below. Is this expected?

Screenshot 2023-11-16 at 18 02 34

Then you posted commit 57fe7df so I tried that too (strobealign-random). Here are the accuracy stats:

Screenshot 2023-11-16 at 18 09 18

The bump in accuracy is quite bizarre if you ask me. This is a subset of 10k reads so sample size is small, however it is nearly 1% improvement. I have to test on a larger dataset to see if it evens out.


Now to what I was actually gonna test (the randomness). The name of the game is to get close to 0.945-0.950 in the last column (based on bowtie2 and BWA's distributions). This column measures "fraction of bases on chromosome with depth equal to column 2 (i.e., 0 in our case)". The true numbers for each genome are unknown. However, since reads were simulated at random, I expect then to be the same if there is no mapping bias.

As we can see the random version improves the numbers a lot - although not as even as BWA/Bt2 (but we do obtain higher accuracy than both of them on this small dataset).

v0.11.0

$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.strobealign_default.bam | grep -E "[0-9]+\t0\t"
0	0	4900606	5269732	0.929954
3	0	4878646	5217379	0.935076
5	0	5009761	5270094	0.950602
8	0	4955499	5255484	0.94292
11	0	5051905	5214647	0.968791
15	0	5370618	5703439	0.941646
18	0	5510964	5730399	0.961707
21	0	5139976	5536811	0.928328
26	0	5250803	5668289	0.926347
32	0	5674359	5770266	0.983379

commit 57fe7df

$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.strobealign_random.bam | grep -E "[0-9]+\t0\t"
0	0	4968465	5269732	0.942831
3	0	4928918	5217379	0.944712
5	0	4989804	5270094	0.946815
8	0	4958396	5255484	0.943471
11	0	4946985	5214647	0.948671
15	0	5334311	5703439	0.93528
18	0	5464417	5730399	0.953584
21	0	5227056	5536811	0.944055
26	0	5396273	5668289	0.952011
32	0	5511444	5770266	0.955146

BWA-MEM

$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.bwa_mem.bam | grep -E "[0-9]+\t0\t"
0	0	4979591	5269732	0.944942
3	0	4928776	5217379	0.944684
5	0	4981872	5270094	0.94531
8	0	4988722	5255484	0.949241
11	0	4930204	5214647	0.945453
15	0	5411526	5703439	0.948818
18	0	5431322	5730399	0.947809
21	0	5234316	5536811	0.945367
26	0	5364423	5668289	0.946392
32	0	5466397	5770266	0.947339

Bowtie2

$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_10/150.bowtie2.bam | grep -E "[0-9]+\t0\t"
0	0	4981502	5269732	0.945305
3	0	4929527	5217379	0.944828
5	0	4988778	5270094	0.94662
8	0	4978964	5255484	0.947384
11	0	4927270	5214647	0.94489
15	0	5398555	5703439	0.946544
18	0	5431461	5730399	0.947833
21	0	5251857	5536811	0.948535
26	0	5360894	5668289	0.945769
32	0	5469114	5770266	0.94781

(I also checked the distribution in SE mode and it evend out the distribution substantially (I don't have the BWA/Bt2 comparison stats) but I beleive it works as expected.

@ksahlin
Copy link
Owner

ksahlin commented Nov 16, 2023

I couldn't believe the accuracy increase, and rightfully so. I simulated three replicate instances with 100k PE reads in each instance, below are the PE alignment results. There seems to be no noticeable effect on accuracy, which we expect, I guess.

I will to the same analysis for 100 genomes (I want to check the effect as more repeated NAMs are introduced)

Screenshot 2023-11-16 at 18 50 20

@ksahlin
Copy link
Owner

ksahlin commented Nov 16, 2023

The read simulator complained about the 100 ecoli dataset, but here are the results for 50 ecoli genomes.

Screenshot 2023-11-16 at 18 59 37

I also checked the read placement randomness (coverage uniformity across genomes). Looks much better - centered around 0.88-0.89 (below I dump the data for that for documentation - sry).

I think we can merge this. What do you say? We still have the potential improvement in PE NAM pairing that can be added before the random shuffle, but for now this version should be an improvement over baseline.

(devel) math172l2:OBJ_C ksahlin$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_default.bam | grep -E "[0-9]+\t0\t"
0	0	4425709	5269732	0.839836
3	0	4785792	5217379	0.917279
5	0	4729574	5270094	0.897436
8	0	4642790	5255484	0.883418
11	0	4801312	5214647	0.920736
15	0	4957996	5703439	0.869299
18	0	5266074	5730399	0.918972
21	0	5163049	5536811	0.932495
26	0	5181816	5668289	0.914176
32	0	5296838	5770266	0.917954
37	0	5288767	5767030	0.917069
41	0	5232706	5747479	0.910435
44	0	5150896	5770126	0.892683
47	0	5540245	5833706	0.949696
53	0	5184368	5722815	0.905912
58	0	4758577	5366488	0.886721
62	0	4831584	5226109	0.924509
65	0	4628852	5029720	0.9203
72	0	4629632	5061873	0.914608
75	0	4498989	5110834	0.880285
80	0	4732493	5132219	0.922114
85	0	4712882	5255408	0.896768
87	0	4542237	5111512	0.888629
89	0	4415368	5062632	0.872149
92	0	4291407	5079315	0.844879
95	0	4636143	5197709	0.891959
99	0	4407837	5041928	0.874236
102	0	4614709	5138060	0.898142
108	0	4492315	5138043	0.874324
114	0	4749347	5147947	0.922571
122	0	4628475	5029969	0.92018
126	0	4582979	5103692	0.897973
132	0	4076951	5098211	0.799683
133	0	4464208	5098635	0.875569
135	0	4519142	5255987	0.859808
137	0	4671210	5256017	0.888736
146	0	4708633	5099979	0.923265
147	0	4636300	5085570	0.911658
148	0	4624204	5099806	0.906741
149	0	4629665	5088968	0.909745
151	0	4668302	5089992	0.917153
153	0	4516693	5088968	0.887546
155	0	4759101	5285418	0.900421
159	0	4673995	5072410	0.921454
161	0	4717066	5060072	0.932213
163	0	4794753	5090382	0.941924
165	0	4679791	5088970	0.919595
167	0	4532985	5285512	0.857625
169	0	4851099	5285097	0.917883
171	0	4387097	5284087	0.830247
(devel) math172l2:OBJ_C ksahlin$ samtools view -b /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_random.sam | samtools sort -o /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_random.bam
(devel) math172l2:OBJ_C ksahlin$ bedtools genomecov -ibam /Users/ksahlin/tmp/STROBEALIGN_ECOLI/1/ecoli_50/150.strobealign_random.bam | grep -E "[0-9]+\t0\t"
0	0	4683567	5269732	0.888768
3	0	4702112	5217379	0.90124
5	0	4709841	5270094	0.893692
8	0	4632677	5255484	0.881494
11	0	4728239	5214647	0.906723
15	0	4962045	5703439	0.870009
18	0	5287916	5730399	0.922783
21	0	5061211	5536811	0.914102
26	0	5200693	5668289	0.917507
32	0	5280821	5770266	0.915178
37	0	5306395	5767030	0.920126
41	0	5262653	5747479	0.915645
44	0	5176862	5770126	0.897184
47	0	5383307	5833706	0.922794
53	0	5259823	5722815	0.919097
58	0	4761994	5366488	0.887358
62	0	4776485	5226109	0.913966
65	0	4493271	5029720	0.893344
72	0	4590887	5061873	0.906954
75	0	4497195	5110834	0.879934
80	0	4639345	5132219	0.903965
85	0	4705894	5255408	0.895438
87	0	4568653	5111512	0.893797
89	0	4399648	5062632	0.869044
92	0	4248203	5079315	0.836373
95	0	4609376	5197709	0.886809
99	0	4386183	5041928	0.869942
102	0	4571898	5138060	0.88981
108	0	4535183	5138043	0.882667
114	0	4673614	5147947	0.90786
122	0	4565770	5029969	0.907713
126	0	4563826	5103692	0.89422
132	0	4550035	5098211	0.892477
133	0	4472361	5098635	0.877168
135	0	4533234	5255987	0.86249
137	0	4624282	5256017	0.879807
146	0	4680840	5099979	0.917816
147	0	4626446	5085570	0.90972
148	0	4601493	5099806	0.902288
149	0	4630542	5088968	0.909918
151	0	4658783	5089992	0.915283
153	0	4560742	5088968	0.896202
155	0	4674017	5285418	0.884323
159	0	4635390	5072410	0.913844
161	0	4657656	5060072	0.920472
163	0	4684726	5090382	0.920309
165	0	4675791	5088970	0.918809
167	0	4610326	5285512	0.872257
169	0	4702754	5285097	0.889814
171	0	4606436	5284087	0.871756

@ksahlin
Copy link
Owner

ksahlin commented Nov 16, 2023

Finally, I also checked that in mapping mode -x, the accuracy stayed the same (statistically) the randomness improved.

First, in the 50 genomes dataset the true highest and lowest number of reads simulated from the genomes are and 4,530 and 3,680, respectively.

Counting the number of reads mapping to each of the 50 genomes, I observed 8699 and 1974 reads mapped to highest and lowest genome for v0.11.0, respectively.

For this PR I observed 6913 and 2788, respectively.

So a big improvement but still quite far of the true values. However, we need to be careful not to get too disappointed here because statistical variation comes into play (too tired to derive how large it is, but the upper bound - all reads completely random - probably follows some common distribution).

@marcelm marcelm changed the title Choose random mapping locations for single-end multimappers Choose random mapping locations for multimappers Nov 16, 2023
@marcelm
Copy link
Collaborator Author

marcelm commented Nov 17, 2023

However, I noticed a small change in accuracy, stats below. Is this expected?

Yes, that change is from #336.

Here is the change in accuracy just from this PR.

Comparing accuracy

dataset 8c74210 57fe7df difference
drosophila-50-se 82.5175 82.5108 -0.0067
drosophila-75-se 87.9797 88.033 +0.0533
drosophila-100-se 89.2626 89.255 -0.0076
drosophila-150-se 90.937 90.9448 +0.0078
drosophila-200-se 91.9463 91.9559 +0.0096
drosophila-300-se 93.2153 93.2187 +0.0034
drosophila-500-se 94.5722 94.5586 -0.0136
maize-50-se 47.4622 47.4417 -0.0205
maize-75-se 61.8938 61.9045 +0.0107
maize-100-se 70.5148 70.503 -0.0118
maize-150-se 81.1606 81.154 -0.0066
maize-200-se 86.7001 86.7083 +0.0082
maize-300-se 91.8516 91.8522 +0.0006
maize-500-se 95.3796 95.4038 +0.0242
CHM13-50-se 81.7231 81.7321 +0.0090
CHM13-75-se 88.9376 88.9318 -0.0058
CHM13-100-se 90.6468 90.6401 -0.0067
CHM13-150-se 92.4066 92.4081 +0.0015
CHM13-200-se 93.221 93.2167 -0.0043
CHM13-300-se 94.1508 94.1472 -0.0036
CHM13-500-se 95.0531 95.0499 -0.0032
rye-50-se 44.7328 44.7464 +0.0136
rye-75-se 60.236 60.2692 +0.0332
rye-100-se 69.3509 69.345 -0.0059
rye-150-se 80.4219 80.412 -0.0099
rye-200-se 85.9117 85.9279 +0.0162
rye-300-se 90.5181 90.5152 -0.0029
rye-500-se 93.5172 93.5133 -0.0039
drosophila-50-pe 90.1814 90.17505 -0.0063
drosophila-75-pe 91.6055 91.62985 +0.0243
drosophila-100-pe 92.38555 92.37945 -0.0061
drosophila-150-pe 93.2124 93.1991 -0.0133
drosophila-200-pe 93.5107 93.5143 +0.0036
drosophila-300-pe 95.36815 95.37105 +0.0029
drosophila-500-pe 95.70805 95.673 -0.0350
maize-50-pe 71.48315 71.48905 +0.0059
maize-75-pe 82.1135 82.1222 +0.0087
maize-100-pe 87.1423 87.1474 +0.0051
maize-150-pe 91.68815 91.6918 +0.0037
maize-200-pe 92.9413 92.93575 -0.0055
maize-300-pe 96.71515 96.71725 +0.0021
maize-500-pe 97.2744 97.29125 +0.0168
CHM13-50-pe 90.648 90.64065 -0.0073
CHM13-75-pe 92.51265 92.51435 +0.0017
CHM13-100-pe 93.2113 93.22665 +0.0154
CHM13-150-pe 94.1508 94.1352 -0.0156
CHM13-200-pe 94.4201 94.4225 +0.0024
CHM13-300-pe 95.62845 95.63475 +0.0063
CHM13-500-pe 95.9551 95.97645 +0.0213
rye-50-pe 69.146 69.185 +0.0390
rye-75-pe 80.5645 80.59005 +0.0255
rye-100-pe 85.6234 85.59965 -0.0237
rye-150-pe 90.23185 90.22415 -0.0077
rye-200-pe 91.4807 91.47285 -0.0079
rye-300-pe 94.55785 94.56635 +0.0085
rye-500-pe 95.15445 95.16665 +0.0122

Average difference se: +0.0028
Average difference pe: +0.0027

…omly

This is only for single-end reads.

This is taken from #360, but without the slowdown.

The idea is that NAM shuffling takes care of the cases where we have
multiple perfect hits, and this takes care of the cases where we have
multiple non-perfect hits.
@marcelm
Copy link
Collaborator Author

marcelm commented Nov 17, 2023

As I suggested by e-mail, I combined this PR with #360.

PR #360 was problematic because it would not allow the optimization that one can stop doing alignments if an exact match has been found. This is taken care of by NAM shuffling: Assuming that two exact matches also give the exact NAM score, we pick one of these exact matches randomly.

However, if the first match is not an exact one, we do compute alignments until the dropoff is reached or the maximum number of sites has been tried. So we do that already and can, without additional cost, do the exact version from #360.

This changes the samdiff output thus:

       9643 reads were unmapped before and after
          0 reads became mapped
          0 reads became unmapped
      80441 reads were mapped to same locus before and after
*      9716 reads were multimapper before and after, same alignment score (AS)
*        14 reads were multimapper before and after, better alignment score (AS)
*         4 reads were multimapper before and after, worse alignment score (AS)
*       182 reads changed in another way
     100000 total reads

So instead of 7421 (see output pasted somewhere above), now 9716 multimappers are affected.

I also reproduced what you did on the 50 E. coli reference. I used different genomes and simulated 1 million reads, so the numbers are different (should be somewhat reproducible now, will document later). Here is the last column of bedtools genomecov (for those rows where the second column is 0) plotted for the various program versions ("shuffle-more" is the mentioned change):

genomes

@ksahlin
Copy link
Owner

ksahlin commented Nov 17, 2023

This is great!

Did you see any noticeable slowdown?

Would this approach work for PE?

@marcelm
Copy link
Collaborator Author

marcelm commented Nov 17, 2023

Did you see any noticeable slowdown?

It did not become measurably slower.

Would this approach work for PE?

I think something similar could work; I will look into it.

@marcelm
Copy link
Collaborator Author

marcelm commented Nov 20, 2023

So this PR now

  • adds NAM shuffling for single-end reads,
  • adds NAM shuffling for paired-end reads,
  • and adds random read selection (among those with the same score) for single-end reads.

I would like to suggest that we merge this PR with the three features above included and that I open a separate PR for adding random read pair selection for paired-end reads (so as not to make the discussion here too long).

@ksahlin
Copy link
Owner

ksahlin commented Nov 20, 2023

Awesome! Approved to merge.

@marcelm marcelm merged commit c91e994 into main Nov 20, 2023
9 checks passed
@marcelm marcelm deleted the shuffle-nams branch November 20, 2023 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants