Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improving parallelism when mapping very long sequences #491

Closed
ekg opened this issue Oct 9, 2019 · 5 comments
Closed

improving parallelism when mapping very long sequences #491

ekg opened this issue Oct 9, 2019 · 5 comments

Comments

@ekg
Copy link

ekg commented Oct 9, 2019

I'm mapping some long sequences (pseudo-unitigs from an assembly graph) to the human reference genome. It seems that some sequences are taking a very long time. Almost all of my runtime (>80%) is spent running single-threaded while waiting for minimap2 to finish mapping the longest contigs in each batch.

I've attempted to increase the batch size to 8G by setting -K 8000M, but this seems to be making minimap2 "hang" from my perspective. After setting up its minimizer index, It's sat for 15 minutes without increasing its memory or using more than a single thread.

For reference, here's my command line:

minimap2 -t 72 -K 8000M -c --cs hg38.fa unitig.fq.gz

Can minimap2 be convinced to pick up a larger set of reads in the minibatch (am I doing this with -K?), or just pick up each read and align it one at a time?

@tseemann
Copy link
Contributor

tseemann commented Oct 10, 2019

Is hg38.fa pre-indexed?

    -I NUM       split index for every ~NUM input bases [4G]
    -d FILE      dump index to FILE []

@ekg
Copy link
Author

ekg commented Oct 10, 2019

@tseemann It's not, but the indexing wasn't the problem. That takes ten minutes or so at maximum.

It looks like there is a integer truncation in the option parsing that will limit the batch size to 2^31. The parsing of the option returns a 64-bit integer, but it's cast to a plain int.

After looking at this again, I'm not sure how much utility there is in fixing this. I was unable to allocate enough memory at 2000M (which doesn't overflow and so can be specified via -K) and I have a system with 137G of RAM.

I was able to improve throughput slightly by sorting my sequences from largest to smallest. This results in most of the sequences in a given batch having approximately the same runtime, and so we at least achive parallelism on average close to the number of sequences. This could be in the 1s and 10s for really long sequences, but it's better than 1 which tends to be the result when the order is randomized. When the job starts to run into the smaller sequences the throughput goes up significantly, and so things complete pretty efficiently.

I would think that much of the alignment of long sequences is basically serial. To me this suggests that it wouldn't be hard to make alignment of long sequences parallel, by completing the time consuming steps like traceback/cigar generation in parallel. This would probably require restructuring the alignment algorithm. With whole genome alignment becoming more important, it's worth thinking about how to do this. Perhaps a different algorithm is needed for such applications.

@tseemann
Copy link
Contributor

tseemann commented Oct 11, 2019

Ah yes, this (int) cast loses the int64_t goodness:
else if (c == 'K') opt.mini_batch_size = (int)mm_parse_num(o.arg);
https://github.com/lh3/minimap2/blob/master/main.c#L169

that it got from here:
static inline int64_t mm_parse_num(const char *str)
https://github.com/lh3/minimap2/blob/master/main.c#L81-L90

The struct only uses an int. should be uint64_t perhaps?

typedef struct {
	int mini_batch_size;
	uint64_t batch_size, sum_len;
	mm_bseq_file_t *fp;
	mm_idx_t *mi;
} pipeline_t;

@ekg
Copy link
Author

ekg commented Oct 11, 2019

I guess it should be. The memory requirements for extremely long reads are prohibitive. This could be mitigated by using temporary files or memory mapping to hold parts of the alignment that aren't being worked on currently. @lh3 it's not that we need to store the entire alignment structure in memory when doing operations like traceback?

@lh3
Copy link
Owner

lh3 commented Oct 30, 2019

@tseemann I am aware of the integer overflow issue. However, changing the integer in pipeline_t alone is not enough. There are a few other lines to change.

@ekg minimap2 keeps all intermediate data (including the full trackback matrix) in memory and doesn't produce temporary files unless --split-prefix is used for a huge multi-part reference genome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants