
Improved memory requirements of CollectReadCounts. #5715

Merged (2 commits into master, Feb 27, 2019)

Conversation

samuelklee (Contributor) commented Feb 25, 2019

The goal was to get WGS coverage collection at 100bp down to ~15 cents per sample. Since the tool is I/O bound (it takes ~2 hours to stream or localize a BAM, or about the same to decompress a CRAM), cost reduction is most easily achieved by reducing the memory requirements and moving down to a cheaper VM.

Memory requirements at 100bp are dominated by manipulations of the list of ~30M intervals. There were a few easy fixes to reduce requirements that did not require changing the collection method (which can be easily modified for future investigations, see #4551):

- removed WellformedReadFilter (see #5233). EDIT: We decided after PR review to retain this filter by default and disable it at the WDL level when Best Practices is released. Leaving the issue open.
- initialized HashMultiset capacity
- removed an unnecessary call to OverlapDetector.getAll
- avoided a redundant defensive copy in SimpleCountCollection
- used per-contig OverlapDetectors, rather than a global one
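The per-contig change can be sketched roughly as follows. This is a minimal illustration, not the actual CollectReadCounts code: the class, method names, and the fixed-bin-width assumption are all hypothetical. The point is that keeping one sorted array of bin starts per contig lets each lookup search only that contig's bins, rather than a global structure over ~30M intervals:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-contig bin lookup (not the GATK implementation). */
public final class PerContigBins {
    private final Map<String, int[]> startsByContig = new HashMap<>();

    /** bins: contig -> sorted, non-overlapping bin start positions (fixed bin width assumed). */
    public PerContigBins(final Map<String, int[]> bins) {
        startsByContig.putAll(bins);
    }

    /** Returns the index of the bin containing position on contig, or -1 if none. */
    public int findBin(final String contig, final int position, final int binWidth) {
        final int[] starts = startsByContig.get(contig);
        if (starts == null) {
            return -1;
        }
        int i = Arrays.binarySearch(starts, position);
        if (i < 0) {
            i = -i - 2;  // insertion point - 1 = index of the last start <= position
        }
        // position falls in bin i only if it lies before that bin's end
        return (i >= 0 && position < starts[i] + binWidth) ? i : -1;
    }
}
```

A global detector would instead have to compare contig names on every query; per-contig arrays make the contig comparison a single hash lookup and shrink each binary search.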

This brought the cost down to ~9 cents per sample using n1-standard-2 instances with 7.5GB of memory when collecting on BAMs with NIO. Note that I didn't optimize disk size, which accounts for ~50% of the total cost and is unused when running with NIO, so we are closer to ~5 cents per sample. It is possible that using CRAMs, with or without NIO and with or without SSDs, might be cheaper.

Note that OverlapDetectors may be overkill for our case, since bins are guaranteed to be sorted and non-overlapping and queries are also sorted. We could probably roll something that is O(1) in memory. However, since we are I/O bound, as long as we are satisfied with the current cost, I am willing to trade some memory for lower implementation and maintenance costs, as well as for the option to change strategies easily. In any case, @lbergelson found some easy wins in OverlapDetector that may bring the memory usage down further, and will issue a fix in htsjdk soon.
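For the record, the O(1)-memory idea could look something like the two-pointer sweep below. This is a hypothetical sketch, not a proposed patch: coordinates are 1-based inclusive, reads are represented only by their sorted start positions, and memory beyond the output counts array is constant. Because bins are sorted and non-overlapping and reads arrive in coordinate order, each bin is passed over exactly once:

```java
import java.util.Arrays;

/** Hypothetical sketch of counting sorted read starts against sorted, non-overlapping bins. */
public final class SweepCounter {
    /**
     * binStarts/binEnds: parallel arrays of sorted, non-overlapping bins (1-based, inclusive).
     * readStarts: sorted read start positions.
     */
    public static int[] countReadsPerBin(final int[] binStarts,
                                         final int[] binEnds,
                                         final int[] readStarts) {
        final int[] counts = new int[binStarts.length];
        int bin = 0;
        for (final int readStart : readStarts) {
            // advance past bins that end before this read starts; they are never revisited
            while (bin < binStarts.length && binEnds[bin] < readStart) {
                bin++;
            }
            if (bin < binStarts.length && binStarts[bin] <= readStart) {
                counts[bin]++;  // read start falls inside the current bin
            }
            // otherwise the read start falls in a gap between bins and is not counted
        }
        return counts;
    }

    public static void main(final String[] args) {
        final int[] starts = {1, 101, 201};
        final int[] ends = {100, 200, 300};
        final int[] reads = {5, 50, 150, 250, 250};
        System.out.println(Arrays.toString(countReadsPerBin(starts, ends, reads)));  // [2, 1, 2]
    }
}
```

The sweep never backtracks, so it does no per-query search at all; the trade-off mentioned above is that it hard-wires the sorted/non-overlapping assumptions that OverlapDetector does not require.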

"Input intervals may not be overlapping.");
final SAMFileHeader.SortOrder sortOrder = getHeaderForReads().getSortOrder();
isCoordinateSorted = sortOrder == SAMFileHeader.SortOrder.coordinate;
if (!isCoordinateSorted) {
samuelklee (Contributor, author) commented on this diff:

This and other checks might be overkill. I'm not sure if they were inherited from older tools where sorting wasn't guaranteed. We can get rid of them if anyone feels strongly.

mwalker174 (Contributor):

IMO, I don't think the coordinate-sorted check is justified if we currently require the bam to be indexed. How expensive is the interval overlap check? I would say keep it unless it takes more than a minute or so on 100bp WGS intervals.

samuelklee (Contributor, author):

Sure, I agree. Interval-overlap check is cheap and I'm OK with keeping it.

codecov-io commented Feb 25, 2019

Codecov Report

Merging #5715 into master will decrease coverage by 0.008%.
The diff coverage is 83.333%.

@@               Coverage Diff               @@
##              master     #5715       +/-   ##
===============================================
- Coverage     87.069%   87.062%   -0.008%     
- Complexity     31875     31880        +5     
===============================================
  Files           1940      1940               
  Lines         146738    146804       +66     
  Branches       16226     16234        +8     
===============================================
+ Hits          127764    127810       +46     
- Misses         13061     13073       +12     
- Partials        5913      5921        +8
Impacted Files Coverage Δ Complexity Δ
...hellbender/tools/copynumber/CollectReadCounts.java 84.746% <83.333%> (-0.439%) 11 <3> (+1)
...oadinstitute/hellbender/utils/text/XReadLines.java 81.818% <0%> (-3.182%) 18% <0%> (+1%)
...stitute/hellbender/utils/nio/PathLineIterator.java 61.111% <0%> (-3.175%) 4% <0%> (ø)
...rs/variantutils/SelectVariantsIntegrationTest.java 98% <0%> (-2%) 71% <0%> (ø)
...llbender/tools/walkers/validation/Concordance.java 87.179% <0%> (-1.417%) 41% <0%> (+2%)
...walkers/validation/ConcordanceIntegrationTest.java 98.601% <0%> (-1.399%) 8% <0%> (+2%)
...org/broadinstitute/hellbender/engine/GATKTool.java 91.163% <0%> (-0.426%) 101% <0%> (+1%)
...lbender/utils/variant/GATKVariantContextUtils.java 84.892% <0%> (-0.172%) 256% <0%> (-4%)
...ls/walkers/mutect/CreateSomaticPanelOfNormals.java 93.846% <0%> (ø) 21% <0%> (ø) ⬇️
.../walkers/vqsr/TruthSensitivityTrancheUnitTest.java 85.714% <0%> (ø) 12% <0%> (ø) ⬇️
... and 7 more

mwalker174 (Contributor) left a comment:

@samuelklee Thanks for this! Looks good - just a couple of questions regarding the input validation.


@Override
public List<ReadFilter> getDefaultReadFilters() {
-    final List<ReadFilter> filters = new ArrayList<>(super.getDefaultReadFilters());
+    final List<ReadFilter> filters = new ArrayList<>();
mwalker174 (Contributor):

What was the motivation for this? I know we experimented with it some. @vruano This was related to CRAM performance, correct?

samuelklee (Contributor, author):

See #5233. I don't know if we should disable this by default or just do it at the WDL level for the production WDL. I'm also not sure if we should disable it over in CollectAllelicCounts, or whether that would have any effects. With the cost reductions from the other fixes, I don't think it's absolutely necessary to disable the filter (although we should rerun cost estimates to be sure memory requirements don't change). Up to you.

mwalker174 (Contributor):

I think we should keep it enabled by default, since that's the expected behavior for all GATK tools, but disable it from the WDL (and make it part of best practices).


samuelklee (Contributor, author) commented Feb 26, 2019

CRAM + NIO looks to be ~3 cents per sample. This essentially includes disk optimizations, since the disk size is determined by the CRAM size; the CRAMs are not too large, so disk costs come to ~0.3 cents per sample.

Note that I ran on the CRAMs in gs://broad-sv-dev-data/TCGA_blood_normals.

samuelklee (Contributor, author) commented Feb 26, 2019

CRAM w/o NIO is also ~3 cents per sample (it was marginally more expensive than CRAM w/ NIO, but within the noise). CRAM w/o NIO w/ SSD is ~5 cents.

So I'd say CRAM w/ or w/o NIO is fine. Strictly speaking, we can't directly compare the BAM and CRAM costs, since they were done on different sets of TCGA samples. But both are well under the goal of ~15 cents per sample, so I think it's safe to say that we can turn our attention to optimizing inference costs.

mwalker174 (Contributor) left a comment:

Thanks, looks good to merge assuming the tests pass.

@samuelklee samuelklee merged commit 213f99c into master Feb 27, 2019
@samuelklee samuelklee deleted the sl_crc_fixes branch February 27, 2019 05:06
3 participants