improving parallelism when mapping very long sequences #491

ekg · 2019-10-09T15:56:03Z

I'm mapping some long sequences (pseudo-unitigs from an assembly graph) to the human reference genome. It seems that some sequences are taking a very long time. Almost all of my runtime (>80%) is spent running single-threaded while waiting for minimap2 to finish mapping the longest contigs in each batch.

I've attempted to increase the batch size to 8G by setting -K 8000M, but this seems to be making minimap2 "hang" from my perspective. After setting up its minimizer index, It's sat for 15 minutes without increasing its memory or using more than a single thread.

For reference, here's my command line:

minimap2 -t 72 -K 8000M -c --cs hg38.fa unitig.fq.gz

Can minimap2 be convinced to pick up a larger set of reads in the minibatch (am I doing this with -K?), or just pick up each read and align it one at a time?

The text was updated successfully, but these errors were encountered:

tseemann · 2019-10-10T02:39:51Z

Is hg38.fa pre-indexed?

    -I NUM       split index for every ~NUM input bases [4G]
    -d FILE      dump index to FILE []

ekg · 2019-10-10T08:32:37Z

@tseemann It's not, but the indexing wasn't the problem. That takes ten minutes or so at maximum.

It looks like there is a integer truncation in the option parsing that will limit the batch size to 2^31. The parsing of the option returns a 64-bit integer, but it's cast to a plain int.

After looking at this again, I'm not sure how much utility there is in fixing this. I was unable to allocate enough memory at 2000M (which doesn't overflow and so can be specified via -K) and I have a system with 137G of RAM.

I was able to improve throughput slightly by sorting my sequences from largest to smallest. This results in most of the sequences in a given batch having approximately the same runtime, and so we at least achive parallelism on average close to the number of sequences. This could be in the 1s and 10s for really long sequences, but it's better than 1 which tends to be the result when the order is randomized. When the job starts to run into the smaller sequences the throughput goes up significantly, and so things complete pretty efficiently.

I would think that much of the alignment of long sequences is basically serial. To me this suggests that it wouldn't be hard to make alignment of long sequences parallel, by completing the time consuming steps like traceback/cigar generation in parallel. This would probably require restructuring the alignment algorithm. With whole genome alignment becoming more important, it's worth thinking about how to do this. Perhaps a different algorithm is needed for such applications.

tseemann · 2019-10-11T07:04:11Z

Ah yes, this (int) cast loses the int64_t goodness:
else if (c == 'K') opt.mini_batch_size = (int)mm_parse_num(o.arg);
https://github.com/lh3/minimap2/blob/master/main.c#L169

that it got from here:
static inline int64_t mm_parse_num(const char *str)
https://github.com/lh3/minimap2/blob/master/main.c#L81-L90

The struct only uses an int. should be uint64_t perhaps?

typedef struct {
	int mini_batch_size;
	uint64_t batch_size, sum_len;
	mm_bseq_file_t *fp;
	mm_idx_t *mi;
} pipeline_t;

ekg · 2019-10-11T11:23:04Z

I guess it should be. The memory requirements for extremely long reads are prohibitive. This could be mitigated by using temporary files or memory mapping to hold parts of the alignment that aren't being worked on currently. @lh3 it's not that we need to store the entire alignment structure in memory when doing operations like traceback?

lh3 · 2019-10-30T13:21:55Z

@tseemann I am aware of the integer overflow issue. However, changing the integer in pipeline_t alone is not enough. There are a few other lines to change.

@ekg minimap2 keeps all intermediate data (including the full trackback matrix) in memory and doesn't produce temporary files unless --split-prefix is used for a huge multi-part reference genome.

lh3 added the enhancement label Jan 18, 2020

lh3 closed this as completed in d2e1470 Jan 18, 2020

tseemann mentioned this issue Feb 19, 2020

-K 4000M overflow bug? #562

Closed

lh3 mentioned this issue May 22, 2020

minimap2 does not respect large -K option? #613

Closed

mrvollger mentioned this issue May 22, 2020

Large -K is not respected lh3/yak#2

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improving parallelism when mapping very long sequences #491

improving parallelism when mapping very long sequences #491

ekg commented Oct 9, 2019

tseemann commented Oct 10, 2019 •

edited

Loading

ekg commented Oct 10, 2019

tseemann commented Oct 11, 2019 •

edited

Loading

ekg commented Oct 11, 2019

lh3 commented Oct 30, 2019

improving parallelism when mapping very long sequences #491

improving parallelism when mapping very long sequences #491

Comments

ekg commented Oct 9, 2019

tseemann commented Oct 10, 2019 • edited Loading

ekg commented Oct 10, 2019

tseemann commented Oct 11, 2019 • edited Loading

ekg commented Oct 11, 2019

lh3 commented Oct 30, 2019

tseemann commented Oct 10, 2019 •

edited

Loading

tseemann commented Oct 11, 2019 •

edited

Loading