Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent mapping against full reference and a single chromosome #107

Open
alexeigurevich opened this issue Feb 8, 2018 · 4 comments

Comments

@alexeigurevich
Copy link

Hi, Heng!
We experimented with Drosophila melanogaster genome and found several strange things in minimap2 behaviour (Version: 2.8-r686-dirty). I've attached two our files for reproducing the issues (full chrX extracted from the reference genome and its fragment from our simulated assembly which should perfectly map to it). We also used the full reference Drosophila_melanogaster.BDGP6.dna.toplevel.fa which could be downloaded e.g. here.

The issues are:

  1. ./minimap2 Drosophila_melanogaster.chrX.fa chrX_fragment.fa gives empty output while we expect to see several perfect mappings. Note that ./minimap2 chrX_fragment.fa Drosophila_melanogaster.chrX.fa results in many alignments.
  2. ./minimap2 Drosophila_melanogaster.BDGP6.dna.toplevel.fa chrX_fragment.fa produces expected matches and all of them are from chrX, so (1) looks strange even without taking into account the issue with target/query order.
  3. ./minimap2 -x asm10 Drosophila_melanogaster.BDGP6.dna.toplevel.fa chrX_fragment.fa results in no alignments again. Note that all alignments from (2) are perfect, so should not be excluded because of -x asm10 in theory.

Drosophila_melanogaster.chrX.fa.gz
chrX_fragment.fa.gz

@lh3
Copy link
Owner

lh3 commented Feb 8, 2018

This is all because chrX_fragment.fa has 65 copies on chrX.

  1. If you look at the minimap2 log, you will see this line: mid_occ = 37. This means minimizers occurring 37 times or more will be ignored. This fragment has 65 copies, more than the threshold.

  2. When you map to the whole genome, the threshold is changed to 115, probably because other regions have more copies. At 115, the fragment become mappable.

  3. -x asm10 uses k-mer of different lengths. The threshold will change.

You can choose a fixed threshold with -f 200. You will see consistent results from these three runs.

I wish minimap2 could output a random mapping even if a query has thousands of copies. However, I could not find an efficient way to achieve that. This is one of the things suffix array/BWT based algorithms are better at.

@lh3
Copy link
Owner

lh3 commented Feb 8, 2018

PS: a high -f will make minimap2 slower. I have not tested how much slower this option may cause.

@lh3
Copy link
Owner

lh3 commented Feb 8, 2018

PSS: there are not as many as 65 copies. Some of the inexact copies are duplicates.

@alexeigurevich
Copy link
Author

Thanks for the explanation! Will try to use -f in our experiments.

lh3 added a commit that referenced this issue Mar 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants