
Improved memory requirements of CollectReadCounts. #5715

Merged (2 commits into master, Feb 27, 2019)

Conversation

samuelklee (Contributor) commented Feb 25, 2019

The goal was to get WGS coverage collection at 100bp down to ~15 cents per sample. Since the tool is I/O bound (it takes ~2 hours to stream or localize a BAM, or about the same to decompress a CRAM), cost reduction is most easily achieved by reducing the memory requirements and moving down to a cheaper VM.

Memory requirements at 100bp are dominated by manipulations of the list of ~30M intervals. There were a few easy fixes to reduce requirements that did not require changing the collection method (which can be easily modified for future investigations, see #4551):

- removed WellformedReadFilter (see #5233). EDIT: We decided after PR review to retain this filter by default and disable it at the WDL level when Best Practices is released. Leaving the issue open.
- initialized HashMultiset capacity
- removed an unnecessary call to OverlapDetector.getAll
- avoided a redundant defensive copy in SimpleCountCollection
- used per-contig OverlapDetectors, rather than a global one
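The per-contig change can be sketched roughly as follows. This is a minimal illustration, not the actual CollectReadCounts code: the class, method names, and the fixed-bin-width assumption are all hypothetical. The point is that keeping one sorted array of bin starts per contig lets each lookup search only that contig's bins, rather than a global structure over ~30M intervals:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-contig bin lookup (not the GATK implementation). */
public final class PerContigBins {
    private final Map<String, int[]> startsByContig = new HashMap<>();

    /** bins: contig -> sorted, non-overlapping bin start positions (fixed bin width assumed). */
    public PerContigBins(final Map<String, int[]> bins) {
        startsByContig.putAll(bins);
    }

    /** Returns the index of the bin containing position on contig, or -1 if none. */
    public int findBin(final String contig, final int position, final int binWidth) {
        final int[] starts = startsByContig.get(contig);
        if (starts == null) {
            return -1;
        }
        int i = Arrays.binarySearch(starts, position);
        if (i < 0) {
            i = -i - 2;  // insertion point - 1 = index of the last start <= position
        }
        // position falls in bin i only if it lies before that bin's end
        return (i >= 0 && position < starts[i] + binWidth) ? i : -1;
    }
}
```

A global detector would instead have to compare contig names on every query; per-contig arrays make the contig comparison a single hash lookup and shrink each binary search.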

This brought the cost down to ~9 cents per sample using n1-standard-2 instances with 7.5GB of memory when collecting on BAMs with NIO. Note that I didn't optimize disk size, which accounts for ~50% of the total cost and is unused when running with NIO, so we are closer to ~5 cents per sample. It is possible that using CRAMs, with or without NIO and with or without SSDs, might be cheaper.

Note that OverlapDetectors may be overkill for our case, since bins are guaranteed to be sorted and non-overlapping and queries are also sorted. We could probably roll something that is O(1) in memory. However, since we are I/O bound, as long as we are satisfied with the current cost, I am willing to trade some memory for lower implementation and maintenance costs, as well as for the option to change strategies easily. In any case, @lbergelson found some easy wins in OverlapDetector that may bring the memory usage down further, and will issue a fix in htsjdk soon.
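For the record, the O(1)-memory idea could look something like the two-pointer sweep below. This is a hypothetical sketch, not a proposed patch: coordinates are 1-based inclusive, reads are represented only by their sorted start positions, and memory beyond the output counts array is constant. Because bins are sorted and non-overlapping and reads arrive in coordinate order, each bin is passed over exactly once:

```java
import java.util.Arrays;

/** Hypothetical sketch of counting sorted read starts against sorted, non-overlapping bins. */
public final class SweepCounter {
    /**
     * binStarts/binEnds: parallel arrays of sorted, non-overlapping bins (1-based, inclusive).
     * readStarts: sorted read start positions.
     */
    public static int[] countReadsPerBin(final int[] binStarts,
                                         final int[] binEnds,
                                         final int[] readStarts) {
        final int[] counts = new int[binStarts.length];
        int bin = 0;
        for (final int readStart : readStarts) {
            // advance past bins that end before this read starts; they are never revisited
            while (bin < binStarts.length && binEnds[bin] < readStart) {
                bin++;
            }
            if (bin < binStarts.length && binStarts[bin] <= readStart) {
                counts[bin]++;  // read start falls inside the current bin
            }
            // otherwise the read start falls in a gap between bins and is not counted
        }
        return counts;
    }

    public static void main(final String[] args) {
        final int[] starts = {1, 101, 201};
        final int[] ends = {100, 200, 300};
        final int[] reads = {5, 50, 150, 250, 250};
        System.out.println(Arrays.toString(countReadsPerBin(starts, ends, reads)));  // [2, 1, 2]
    }
}
```

The sweep never backtracks, so it does no per-query search at all; the trade-off mentioned above is that it hard-wires the sorted/non-overlapping assumptions that OverlapDetector does not require.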

"Input intervals may not be overlapping.");
final SAMFileHeader.SortOrder sortOrder = getHeaderForReads().getSortOrder();
isCoordinateSorted = sortOrder == SAMFileHeader.SortOrder.coordinate;
if (!isCoordinateSorted) {
samuelklee (Contributor, author) commented on this diff:

This and other checks might be overkill. I'm not sure if they were inherited from older tools where sorting wasn't guaranteed. We can get rid of them if anyone feels strongly.

mwalker174 (Contributor):

IMO, I don't think the coordinate-sorted check is justified if we currently require the bam to be indexed. How expensive is the interval overlap check? I would say keep it unless it takes more than a minute or so on 100bp WGS intervals.

samuelklee (Contributor, author):

Sure, I agree. Interval-overlap check is cheap and I'm OK with keeping it.

codecov-io commented Feb 25, 2019

Codecov Report

Merging #5715 into master will decrease coverage by 0.008%.
The diff coverage is 83.333%.

@@               Coverage Diff               @@
##              master     #5715       +/-   ##
===============================================
- Coverage     87.069%   87.062%   -0.008%     
- Complexity     31875     31880        +5     
===============================================
  Files           1940      1940               
  Lines         146738    146804       +66     
  Branches       16226     16234        +8     
===============================================
+ Hits          127764    127810       +46     
- Misses         13061     13073       +12     
- Partials        5913      5921        +8
Impacted Files Coverage Δ Complexity Δ
...hellbender/tools/copynumber/CollectReadCounts.java 84.746% <83.333%> (-0.439%) 11 <3> (+1)
...oadinstitute/hellbender/utils/text/XReadLines.java 81.818% <0%> (-3.182%) 18% <0%> (+1%)
...stitute/hellbender/utils/nio/PathLineIterator.java 61.111% <0%> (-3.175%) 4% <0%> (ø)
...rs/variantutils/SelectVariantsIntegrationTest.java 98% <0%> (-2%) 71% <0%> (ø)
...llbender/tools/walkers/validation/Concordance.java 87.179% <0%> (-1.417%) 41% <0%> (+2%)
...walkers/validation/ConcordanceIntegrationTest.java 98.601% <0%> (-1.399%) 8% <0%> (+2%)
...org/broadinstitute/hellbender/engine/GATKTool.java 91.163% <0%> (-0.426%) 101% <0%> (+1%)
...lbender/utils/variant/GATKVariantContextUtils.java 84.892% <0%> (-0.172%) 256% <0%> (-4%)
...ls/walkers/mutect/CreateSomaticPanelOfNormals.java 93.846% <0%> (ø) 21% <0%> (ø) ⬇️
.../walkers/vqsr/TruthSensitivityTrancheUnitTest.java 85.714% <0%> (ø) 12% <0%> (ø) ⬇️
... and 7 more

mwalker174 (Contributor) left a comment:

@samuelklee Thanks for this! Looks good - just a couple of questions regarding the input validation.


@Override
public List<ReadFilter> getDefaultReadFilters() {
-    final List<ReadFilter> filters = new ArrayList<>(super.getDefaultReadFilters());
+    final List<ReadFilter> filters = new ArrayList<>();
mwalker174 (Contributor):

What was the motivation for this? I know we experimented with it some. @vruano This was related to CRAM performance, correct?

samuelklee (Contributor, author):

See #5233. I don't know if we should disable this by default or just do it at the WDL level for the production WDL. I'm also not sure if we should disable it over in CollectAllelicCounts, or whether that would have any effects. With the cost reductions from the other fixes, I don't think it's absolutely necessary to disable the filter (although we should rerun cost estimates to be sure memory requirements don't change). Up to you.

mwalker174 (Contributor):

I think we should keep it enabled by default, since that's the expected behavior for all GATK tools, but disable it from the WDL (and make it part of best practices).


samuelklee (Contributor, author) commented Feb 26, 2019

CRAM + NIO looks to be ~3 cents per sample. This essentially includes disk optimizations, since the disk size is determined by the CRAM size; the CRAMs are not too large, so disk costs come to ~0.3 cents per sample.

Note that I ran on the CRAMs in gs://broad-sv-dev-data/TCGA_blood_normals.

samuelklee (Contributor, author) commented Feb 26, 2019

CRAM w/o NIO is also ~3 cents per sample (it was marginally more expensive than CRAM w/ NIO, but within the noise). CRAM w/o NIO w/ SSD is ~5 cents.

So I'd say CRAM w/ or w/o NIO is fine. Strictly speaking, we can't directly compare the BAM and CRAM costs, since they were done on different sets of TCGA samples. But both are well under the goal of ~15 cents per sample, so I think it's safe to say that we can turn our attention to optimizing inference costs.

mwalker174 (Contributor) left a comment:

Thanks, looks good to merge assuming the tests pass.

@samuelklee samuelklee merged commit 213f99c into master Feb 27, 2019
@samuelklee samuelklee deleted the sl_crc_fixes branch February 27, 2019 05:06
3 participants