How does using a multi part index affect the accuracy? #141
Given a multi-part index, minimap2 aligns all query sequences against each part in turn. It doesn't know sequences in other parts and assigns mapping quality based on sequences in one part only. This is why your highlighted read got mapQ=60 twice, though clearly the chr17 hit is the optimal one.
Unfortunately, no. As it is often impractical to load all queries into memory, doing this would require minimap2 to save the exact internal state of each query to disk. This would involve a lot of effort and code refactoring. A possibly easier solution is to save the SAM/PAF output for each part to a separate file and then merge the files at the end. However, mapping quality calculated this way will sometimes differ from the mapQ computed with a uni-part index. In addition, this strategy is still tricky to implement.
When you use a multi-part index, I believe accuracy is not reduced as long as you merge alignments to each part at the end. However, the results will be different in corner cases.
Unfortunately, no. I know your intended applications. However, minimap2 is not designed for those. A multi-part index is mostly used for read overlapping or for searching hits against gigantic databases (e.g. nt) when you care about all local hits.
Thank you. I will try the approach you suggested: writing the output for each part to a separate file and then merging the files at the end.
Hi, I actually want to do something along the lines of mapping query reads/assemblies against a gigantic database, more gigantic than nt/nr. People have recommended minimap2, but I suppose a multi-index strategy is a must. A few questions, if I may ask here:
The minimap2 index size varies greatly with
mapQ is a global score that considers all hits to the database. The Blast bit score is proportional to the Smith-Waterman (SW) score. It is a local score in that it is computed from a single hit. The two scores are used for distinct purposes. With base-level alignment, minimap2 computes a local SW score.
When you use a multi-part index, minimap2 is broadly similar to blastn in functionality. It will be faster than blastn but less sensitive in their typical settings. What are your query sequences? What level of sensitivity do you want to achieve? |
Thanks! So it does have an option to give a local SW score for each query, that's great! I understand mapQ is probably more similar to the e-value of blastn, which takes the queried database into consideration but is not necessarily comparable between different databases. The query sequences are environmental metagenomic samples, so the database is very large in the hope of classifying as many reads/contigs as possible. I am basically using all of the blastn hits for each query (limited to 5 per reference for each query) and doing LCA to get the most conservative estimate of the taxonomy. I have been using contigs assembled by megahit because blasting every read would take years to finish for the data I am working with. I have been using dc-megablast, so I suppose a sensitivity level comparable to that algorithm would be ideal. Based on experience, it appears that with this setup my pipeline was able to pick up sequences that are 70% or more similar (also the reason to use contigs, as 70% similarity is too shaky for reads).
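The LCA step described above can be sketched roughly as follows. This is only an illustration, assuming each hit has already been translated into a root-to-leaf taxonomic lineage; the function name `lowest_common_ancestor` and the lineage lists are hypothetical and not part of blastn or minimap2.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    Each lineage is a list of taxon names ordered from root to leaf
    (hypothetical input format). Returns None if nothing is shared.
    """
    if not lineages:
        return None
    common = None
    # zip() stops at the shortest lineage, so ranks are compared level by level.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            common = ranks[0]
        else:
            break
    return common


hits = [
    ["Bacteria", "Proteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Salmonella"],
]
print(lowest_common_ancestor(hits))
```

The more hits disagree, the higher in the tree the assignment lands, which is what makes the estimate conservative.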
@hasindu2008 I had a look at the minimap2 code and realized that it is not that difficult to dump the internal states of each query against each part of the index to disk. For a multi-part index, the ideal pipeline in my head would be:
The output of this procedure will be very similar, though not identical, to the alignment against a uni-part index. The procedure will definitely be cleaner and perhaps even simpler than manipulating SAM/PAF from the current multi-part minimap2 output. Unfortunately, I am busy with another project, so I may not be able to implement this feature soon.
@Confurious in your case, I think you can give minimap2 a try. The command line could be something like: minimap2 -c -k15 -w10 -t 8 db.fa contigs.fa You can reduce |
mapQ largely evaluates if there is a hit of similar SW score. E-value measures how likely the hit is random. They are distinct, too. Blast e-value is a local score in effect. It is different from the initial definition of e-value which is computed by fitting the tail of the extreme-value distribution of all hits. The initial e-value definition needs to see all hits across the entire database. |
"mapQ largely evaluates if there is a hit of similar SW score. E-value measures how likely the hit is random. They are distinct, too. Blast e-value is a local score in effect. It is different from the initial definition of e-value which is computed by fitting the tail of the extreme-value distribution of all hits. The initial e-value definition needs to see all hits across the entire database." So it looks like, to make results truly comparable in a multi-index situation, the SW score is the absolute, objective measure of how "good" the alignment is for each query. So instead of a list of hits from blastn, I will be dealing with a list of hits with the top SW scores after pooling the results from the multi-part index. mapQ, blastn e-value, and even blastn bit score all take the size of the search space/database into account, directly or indirectly (based on reading https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html). Unless, of course, you implement the internal-state merging steps. Btw, thanks for your tips on using minimap2! I am giving it a try. I assume the peak memory of 20-30 GB in your post is based on a database of a particular size, right? You are not implying that minimap2 can complete the mapping, perhaps more slowly, under a constant memory cap? Sorry for hijacking the thread, thanks!
When you use a multi-part index, the peak memory is capped by the total number of bases loaded into memory per batch, regardless of the total size of the database. The default batch size is 4Gb (option
Mapping with a multi-part index will be slower. You essentially trade speed for lower peak memory. It is hard to say about accuracy. Probably similar. |
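To get a feel for this trade-off, the number of index parts can be estimated from the database size and the batch size. The sketch below is back-of-the-envelope arithmetic only: it assumes the 4Gb default means 4 billion bases, and the helper name `estimate_index_parts` and the example database size are illustrative. The actual number of parts may differ slightly, since batch boundaries depend on the sequences in the database.

```python
def estimate_index_parts(total_bases, batch_bases=4_000_000_000):
    """Estimate how many index parts a database is split into.

    total_bases: total number of bases in the database (assumed known).
    batch_bases: bases loaded per batch; 4 billion is assumed here
                 to correspond to the 4Gb default mentioned above.
    """
    if total_bases <= 0 or batch_bases <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division: every started batch counts as one part.
    return -(-total_bases // batch_bases)


# Illustrative database of 500 billion bases with the assumed default batch size.
print(estimate_index_parts(500_000_000_000))
```

Every query is aligned against each part in turn, so run time grows with the number of parts while peak memory stays tied to the batch size.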
That makes sense. I assume it would be in my best interest to build the index for my database first; the database is around 500 GB. I assume I will need more than 1 TB of memory for that? I can manage to find a powerful enough computer for this one-time job, but considering that the index will essentially be searched in a multi-part way, I am wondering if there is a better way of doing this. Or do you suggest simply using the 500 GB FASTA file directly with the -I option? I assume in this case the program basically takes a 4 GB chunk (the default), builds an index, and does the alignment? This would seem to redo the indexing work for every set of query sequences. Thanks!
Minimap2 indexing is very fast. If your network file system is slow, indexing can sometimes be faster than loading the index from disk. For a first try, I would just use the command line above: minimap2 -c -k15 -w10 -t 16 db.fa contigs.fa This will take 20-30GB. You can consider using a large |
Thank you. Meanwhile, I will also attempt an implementation, at least for primary mappings of non-paired reads. If secondary mappings and paired reads are considered, more information would need to be dumped, wouldn't it? I think merging would be further complicated in such a case.
Just want to update here that your suggestion worked beautifully for a 500 GB database!! Now I just need to dig through all the tags to come up with an objective "score". What would you recommend for calculating an objective "score" to decide which alignment is best across split indexes? I noticed that in my case AS is almost always the same as ms, and I do not know what that means. Posting some output here for reference. Thanks!
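One possible way to pool hits across index parts, following the earlier remark that the SW score is local to a single hit, is to rank each query's hits by the AS tag. The sketch below is not a recommendation from the maintainer; it assumes PAF output produced with -c so that an `AS:i:` tag is present, and the helper names `parse_paf_line` and `pool_hits` are hypothetical.

```python
from collections import defaultdict


def parse_paf_line(line):
    """Extract query name, target name, mapQ and the AS tag from one PAF line."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    # Columns after the 12 mandatory PAF fields are TAG:TYPE:VALUE triples.
    for tag in fields[12:]:
        key, _type, value = tag.split(":", 2)
        tags[key] = value
    return {
        "query": fields[0],
        "target": fields[5],
        "mapq": int(fields[11]),
        "AS": int(tags["AS"]) if "AS" in tags else None,
    }


def pool_hits(paf_parts, top_n=5):
    """Pool hits from several per-part PAF outputs; keep the top_n by AS per query.

    paf_parts: iterable of iterables of PAF lines, one per index part.
    """
    pooled = defaultdict(list)
    for lines in paf_parts:
        for line in lines:
            if not line.strip():
                continue
            hit = parse_paf_line(line)
            if hit["AS"] is not None:  # skip hits without base-level alignment
                pooled[hit["query"]].append(hit)
    return {
        query: sorted(hits, key=lambda h: h["AS"], reverse=True)[:top_n]
        for query, hits in pooled.items()
    }
```

In practice each element of `paf_parts` could be an open file handle for one part's PAF output. The per-part mapQ values are carried along but not used for ranking, since each was computed from a single part only.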
@hasindu2008 For long reads, minimap2 reports secondary mappings, which should often be adequate for merging. Minimap2 doesn't report secondary mappings in the short read mode (as there will be too many). You can still estimate mapping quality from primary alignments, but occasionally such an estimate can be inaccurate. @Confurious Thanks for the confirmation. |
@lh3
By doing so I have verified that the dumping and reading back work perfectly to give the same output. Now I am attempting to generate the MAPQ. I see that the function mm_set_mapq(b->km, n_regs0, regs0, opt->min_chain_score, opt->a, rep_len, is_sr) is used for this. Can you briefly tell me about this replen as it is the only parameter that is not stored inside the mm_reg1_t? Can it be estimated using the parameters inside the mm_reg1_t or should I dump all the replen values for all mappings and then add them back? |
That's great! Thanks a lot.
replen is calculated by collect_matches() in map.c. It is the sum of the lengths of regions covered by highly repetitive k-mers. You can't infer replen from a list of mm_reg1_t, unfortunately.
You have to save replen for each read in each batch. You don't need to save it for each mm_reg1_t (that's why replen is not a member of mm_reg1_t). When you generate the final MAPQ, you read all replen values and choose the largest (or the average; I am not sure which) as the value fed to mm_set_mapq(). Note that it is not possible to generate identical alignments for every read, because minimap2 discards k-mers with high occurrences and this set of k-mers differs between a uni-part index and a multi-part index. Nonetheless, this only affects a tiny fraction of reads, and these reads tend to have low MAPQ. The impact on downstream processing should be small.
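The per-read combination of replen values can be sketched as below. This is a toy illustration in Python rather than minimap2's C code; the function name `combine_rep_len` and the input layout (read name mapped to one replen value per index part) are assumptions. Both the "largest" and "average" choices mentioned above are offered, since the thread leaves open which one is better.

```python
from statistics import mean


def combine_rep_len(rep_len_by_read, how="max"):
    """Combine the replen values saved for each read across index parts.

    rep_len_by_read: dict mapping read name -> list of replen values,
                     one per index part (hypothetical dump format).
    how: "max" for the largest value, "mean" for the rounded average.
    """
    if how not in ("max", "mean"):
        raise ValueError("how must be 'max' or 'mean'")
    combined = {}
    for read, values in rep_len_by_read.items():
        if not values:
            raise ValueError(f"no replen values saved for {read}")
        if how == "max":
            combined[read] = max(values)
        else:
            # Round to an integer, since replen is a length in bases.
            combined[read] = int(round(mean(values)))
    return combined


saved = {"read1": [120, 40, 0], "read2": [10, 10]}
print(combine_rep_len(saved, how="max"))
print(combine_rep_len(saved, how="mean"))
```

The combined value is what would then be passed, per read, to the MAPQ computation after all parts have been merged.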
I do have a few files that I run minimap2 on before every release. I am out of town for two days. I will make them available tomorrow.
@lh3 |
Sorry for the late response. I was on another trip. You can find files here. More specifically:
I actually forgot this score0. It is the score from initial chaining. Initial chains may be broken into smaller pieces or joined together. "score" is computed after these operations. "score0" was added to improve accuracy in a corner case. How did you choose "score0" when combining multiple mm_reg1_t? |
Under mm_reg1_t there is a score0.
I directly used this value from each entry when computing the MAPQ after merging. However, the value stored in score0 for the same hit seems to differ between a uni-part and a multi-part index. Is this the score0 you were asking about?
With #189 effectively pulled, I am closing this issue. |
A nanopore read when mapped using minimap2 to the human genome (the whole index is in memory) gives an output as follows.
When using a multi-part index (I manually split the human genome into 2 pieces and ran minimap2 on each part separately - same as using -I option in minimap I presume), the combined output is as follows.
I made the following observations
I have the following two questions.
How is the mapping accuracy affected when a multi-part index is used rather than a single whole index?