Minimap2 --split-prefix option #887

Closed
YasirKusay opened this issue Mar 13, 2022 · 5 comments

@YasirKusay

Hi, I would like to know more about the --split-prefix option.

I have a very large (1200 GB) index that I want to align against, and of course I don't want to load the entire index into memory. To see how it works, I tested minimap2 --split-prefix on an assembly of 17000 contigs with relatively modest resources (48 GB of RAM, 16 threads, and 150 GB of disk space). I expected the command to fail (by running out of disk space to store the index partitions), but it actually ran to completion, taking 134 minutes. I am confused: did the alignment happen against the entire index, or did something else happen?

If this helps, I did inspect the index partitions during execution and they were about 20 MB each.

@YasirKusay
Author

YasirKusay commented Mar 14, 2022

Hi, this is just an extension of the above. I would like to be able to run minimap2 incrementally (e.g. load 4 GB of the index into RAM, align, move on to the next 4 GB of the index, and so on). I tried running the program on default settings with 5 GB and 10 GB of RAM just to see what would happen, but the process was killed both times: "24781 Killed" and "17044 Killed" respectively.

I don't know why minimap2 failed. I thought that it loaded 4 GB of the index at a time, aligned against it, and then moved on to the next index batch.

@hasindu2008
Contributor

Here is some information from the first implementation of --split-prefix:

It loads the index part by part, while iteratively mapping the queries to each index partition. The intermediate results will be saved as temporary files. Finally, it will go through all the temporary files and merge the results. Detailed methodology is available at https://www.nature.com/articles/s41598-019-40739-8 and technical information in https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-019-40739-8/MediaObjects/41598_2019_40739_MOESM1_ESM.pdf.
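
As a rough illustration of the load, map, and merge flow described above, here is a toy Python sketch. It is not minimap2's code: the k-mer counting "score", the `map_to_part` and `split_prefix_map` names, and the pickle temp-file format are all made up for illustration, and real merging involves much more than keeping the top-scoring hit.

```python
import os
import pickle
import tempfile


def kmers(seq, k=3):
    """Set of all k-mers in a sequence (toy stand-in for minimizers)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def map_to_part(query, part):
    """Best toy hit (score, ref_name) for one query within one index part."""
    q = kmers(query)
    return max((len(q & kmers(ref_seq)), name) for name, ref_seq in part.items())


def split_prefix_map(queries, parts, prefix):
    """Map every query to each part in turn, dump hits to temp files, then merge."""
    tmp_files = []
    for i, part in enumerate(parts):  # only one part is "loaded" at a time
        hits = {qname: map_to_part(qseq, part) for qname, qseq in queries.items()}
        path = f"{prefix}.{i:04d}.tmp"
        with open(path, "wb") as fh:
            pickle.dump(hits, fh)
        tmp_files.append(path)

    merged = {}
    for path in tmp_files:  # merge pass: keep the best hit per query across parts
        with open(path, "rb") as fh:
            for qname, hit in pickle.load(fh).items():
                if qname not in merged or hit > merged[qname]:
                    merged[qname] = hit
        os.remove(path)
    return merged


# Hypothetical two-part "index" and two reads.
parts = [{"chrA": "ACGTACGTAA"}, {"chrB": "TTGACCGGTT"}]
queries = {"read1": "ACGTAC", "read2": "GACCGG"}
with tempfile.TemporaryDirectory() as tmp_dir:
    result = split_prefix_map(queries, parts, os.path.join(tmp_dir, "toy"))
print(result)
```

Note that the temporary files here hold per-query hits, not pieces of the index, so in this toy their size grows with the number of queries rather than with the reference.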

Here are some example commands we originally used for testing on the human genome:
https://github.com/hasindu2008/minimap2-arm/tree/master/misc/idxtools

  1. Is the 1200GB index a fasta file or a minimap2 index?
  2. When you say 4GB of the index, are you referring to the -I 4G option? As far as I am aware, that means 4 gigabases of the reference, which can take around 10-12GB of RAM depending on the reference characteristics. minimap2 also loads a batch of queries by default (1G, as I remember), and that batch plus the intermediate data structures need extra RAM.
  3. The disk usage for temporary files will depend on the size of your query reads rather than the reference. What is the size of your fastq?
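
To put rough numbers on point 2: if 4 gigabases of reference take about 10-12GB in memory, that is roughly 2.5-3 bytes per base, so a single index part may already exceed a 5GB or 10GB allowance before any query batch is loaded. A back-of-envelope sketch in Python follows. The bytes-per-base range is derived only from the figures quoted above, and `index_part_ram_gb` is a hypothetical helper, not anything from minimap2.

```python
def index_part_ram_gb(part_gbases, bytes_per_base):
    """Approximate RAM in GB for one index part: gigabases * bytes per base."""
    return part_gbases * bytes_per_base


# Assumption: 10-12 GB per 4 Gbases, i.e. about 2.5-3.0 bytes per base.
low = index_part_ram_gb(4, 2.5)
high = index_part_ram_gb(4, 3.0)
print(f"one 4 Gbase part: ~{low:.0f}-{high:.0f} GB, before queries and intermediate data")

# Compare against the RAM limits mentioned in this thread.
for limit_gb in (5, 10, 48):
    verdict = "above the upper estimate" if limit_gb > high else "at or below the upper estimate"
    print(f"{limit_gb} GB limit: {verdict}")
```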

@YasirKusay
Author

Hi @hasindu2008, Thank you for your reply.

I was actually confused by the --split-prefix option initially, as I assumed that the temporary files were partitioned indexes rather than the results, so thank you for clearing that up.

  1. The 1200GB index is the actual minimap2 index.
  2. I am not referring to the -I 4G option, as I already have the index. Does it still load 4 gigabases of the index while doing the alignment with the default settings?

My primary concern is to be able to use the entire 1200 GB index, with as little RAM as possible. Based on what you have told me, I think that I can achieve this using the --split-prefix option. Does that mean the default settings for alignments will load the entire index at once? If not, what is the difference between the default settings and --split-prefix option?

@hasindu2008
Contributor

If you created the minimap2 index with default options, it will have created an index with multiple parts, each with 4 Gbases. So yes, it will load only 4 gigabases at a time with default options.

The default settings will not load the whole index. minimap2 will still map iteratively, part by part, and output all mappings to each part. In summary, if --split-prefix is used, the mappings will be more accurate than without it; https://www.nature.com/articles/s41598-019-40739-8 contains all the information.
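
For comparison, the two kinds of run discussed here differ only by the `--split-prefix` flag. Below is a small Python sketch that builds both command lines without running them. The file names, the `map-ont` preset, the `tmp_split` prefix, and the `build_cmd` helper are placeholders chosen for illustration, so substitute whatever matches your data and the parameters your index was built with.

```python
import shlex


def build_cmd(index, reads, split_prefix=None, preset="map-ont"):
    """Build a minimap2 argv list; add --split-prefix only when a prefix is given."""
    cmd = ["minimap2", "-a", "-x", preset]
    if split_prefix is not None:
        # Prefix for the temporary per-part result files that are merged at the end.
        cmd.append(f"--split-prefix={split_prefix}")
    cmd += [index, reads]
    return cmd


# Placeholder inputs: a prebuilt multi-part index and a read file.
default_cmd = build_cmd("ref.mmi", "reads.fq")
split_cmd = build_cmd("ref.mmi", "reads.fq", split_prefix="tmp_split")

print(shlex.join(default_cmd))
print(shlex.join(split_cmd))
```

The lists can be handed to `subprocess.run` once the paths are real. Pointing the prefix at a disk with enough free space for the intermediate results is probably worth checking first.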

@YasirKusay
Author

YasirKusay commented Mar 15, 2022

Thanks for your reply!

I will now use minimap2 with the --split-prefix option (I did notice that runs with --split-prefix and with the default settings took similar times).

On a side note, has this tool ever been benchmarked using the NCBI NT index?
