
Clarified behavior of MarkDuplicatesSpark when given multiple input bams #5901

Merged — jamesemery merged 2 commits into master from je_clarifyMultipleInputBamErrorMessage on Apr 24, 2019

Conversation

jamesemery (Collaborator) commented:
Also improved the sorting behavior if given a mix of queryname sorted and query-grouped bams.

@bhanugandham

jamesemery (Collaborator, Author) commented:

This also raises another question. There is no particularly strong reason to restrict multi-bam mode to inputs of a particular sort order. We could happily treat multiple arbitrarily unsorted inputs as just that and sort them as the first step. Perhaps that's better than requiring the best-case sort order for multiple inputs to work at all? Let me know your thoughts @droazen.

codecov-io commented Apr 23, 2019:

Codecov Report

Merging #5901 into master will increase coverage by 0.003%.
The diff coverage is 100%.

@@              Coverage Diff               @@
##              master    #5901       +/-   ##
==============================================
+ Coverage     86.838%   86.84%   +0.003%     
- Complexity     32325    32327        +2     
==============================================
  Files           1991     1991               
  Lines         149341   149347        +6     
  Branches       16483    16482        -1     
==============================================
+ Hits          129684   129693        +9     
+ Misses         13647    13646        -1     
+ Partials        6010     6008        -2
Impacted Files                                          Coverage Δ                Complexity Δ
...stitute/hellbender/engine/spark/GATKSparkTool.java   82.778% <100%> (+0.91%)   78 <2> (+1)
.../pipelines/MarkDuplicatesSparkIntegrationTest.java   91.367% <100%> (+0.223%)  42 <0> (ø)
...transforms/markduplicates/MarkDuplicatesSpark.java   94.595% <100%> (+0.074%)  36 <6> (ø)
...roadinstitute/hellbender/utils/read/ReadUtils.java   80.328% <0%> (+0.234%)    208 <0> (+1)

@droazen droazen self-assigned this Apr 23, 2019
droazen (Collaborator) left a comment:
Minor comments only -- merge after addressing them @jamesemery

@@ -47,6 +47,7 @@
* <li>Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. </li>
* <li>Collecting duplicate metrics slows down performance and thus the metrics collection is optional and must be specified for the Spark version of the tool with '-M'. It is possible to collect the metrics with the standalone Picard tool <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_EstimateLibraryComplexity.php'>EstimateLibraryComplexity</a>.</li>
* <li>MarkDuplicatesSpark is optimized to run locally on a single machine by leveraging core parallelism that MarkDuplicates and SortSam cannot. It will typically run faster than MarkDuplicates and SortSam by a factor of 15% over the same data at 2 cores and will scale linearly to upwards of 16 cores. This means MarkDuplicatesSpark, even without access to a Spark cluster, is faster than MarkDuplicates.</li>
* <li>MarkDuplicatesSpark can be run with multiple input bams. If this is the case all of the inputs must be querygroup ordered or queryname sorted.</li>
droazen (Collaborator) commented:
So a mix of querygroup-sorted and queryname-sorted inputs is ok? Or do they all have to be querygroup-sorted?

droazen (Collaborator) commented:
Also: briefly explain the difference between the two orderings here (since it's not necessarily widely known).

// Check if we are using multiple inputs that the headers are all in the correct querygrouped ordering
final SAMFileHeader header = getHeaderForReads();

// Check if we are using multiple inputs that the headers are all in the correct querygrouped ordering, if so set the aggregate header to reflect this
Map<String, SAMFileHeader> headerMap = getReadSouceHeaderMap();
droazen (Collaborator) commented:
Souce?

@@ -288,12 +289,15 @@ public int getPartition(Object key) {

@Override
protected void runTool(final JavaSparkContext ctx) {
// Check if we are using multiple inputs that the headers are all in the correct querygrouped ordering
final SAMFileHeader header = getHeaderForReads();
droazen (Collaborator) commented:
This gets a merged header -- what will the underlying SamFileHeaderMerger do in the case where the sort order does not agree across headers? Will it throw in that case?

droazen (Collaborator) commented:
Also, rename header to mergedHeader

  Map<String, SAMFileHeader> headerMap = getReadSouceHeaderMap();
  if (headerMap.size() > 1) {
      headerMap.entrySet().stream().forEach(h -> {if(!ReadUtils.isReadNameGroupedBam(h.getValue())) {
-         throw new UserException("Multiple inputs to MarkDuplicatesSpark detected but input "+h.getKey()+" was sorted in "+h.getValue().getSortOrder()+" order");
+         throw new UserException("Multiple inputs to MarkDuplicatesSpark detected. MarkDuplicatesSpark requires all inputs be queryname sorted or querygrouped for multi-input processing but input "+h.getKey()+" was sorted in "+h.getValue().getSortOrder()+" order");
droazen (Collaborator) commented:
be -> to be
querygrouped -> querygroup-sorted

}});
header.setGroupOrder(SAMFileHeader.GroupOrder.query);
droazen (Collaborator) commented:
Doesn't SAMFileHeader.GroupOrder.query correspond to querygrouped ordering? Is it ok to set this here if not all inputs are necessarily querygrouped?

jamesemery (Collaborator, Author) replied:
Query-grouped bams are a superset of queryname-sorted bams. So long as all the inputs are query-grouped (and, I suppose, also that they don't have overlapping read names...) the inputs are valid.
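The superset relationship James describes can be sketched with a simplified stand-in for htsjdk's header fields. The enum values below mirror the names of htsjdk's SAMFileHeader.SortOrder and SAMFileHeader.GroupOrder, but the Header record, class name, and validation method are hypothetical simplifications for illustration, not GATK's actual ReadUtils.isReadNameGroupedBam or the tool's real validation code:

```java
import java.util.Map;

public class SortOrderCheck {
    // Simplified stand-ins for htsjdk's SAMFileHeader.SortOrder / GroupOrder enums.
    enum SortOrder { unsorted, queryname, coordinate }
    enum GroupOrder { none, query, reference }

    // Hypothetical minimal header: just the two fields relevant to this check.
    record Header(SortOrder sortOrder, GroupOrder groupOrder) {}

    // A queryname-sorted bam is always query-grouped; a query-grouped bam
    // need not be fully sorted by name, only clustered by it. Hence the OR.
    static boolean isReadNameGrouped(Header h) {
        return h.sortOrder() == SortOrder.queryname || h.groupOrder() == GroupOrder.query;
    }

    // Mirrors the shape of the multi-input validation in the diff above:
    // with more than one input, every header must pass the check, else fail fast.
    static void validateMultipleInputs(Map<String, Header> headerMap) {
        if (headerMap.size() > 1) {
            headerMap.forEach((name, h) -> {
                if (!isReadNameGrouped(h)) {
                    throw new IllegalArgumentException(
                            "Input " + name + " was sorted in " + h.sortOrder() + " order");
                }
            });
        }
    }

    public static void main(String[] args) {
        Header querynameSorted = new Header(SortOrder.queryname, GroupOrder.none);
        Header queryGrouped = new Header(SortOrder.unsorted, GroupOrder.query);
        Header coordSorted = new Header(SortOrder.coordinate, GroupOrder.none);

        // A mix of queryname-sorted and query-grouped inputs passes.
        validateMultipleInputs(Map.of("a.bam", querynameSorted, "b.bam", queryGrouped));

        // A coordinate-sorted input among multiple inputs is rejected.
        try {
            validateMultipleInputs(Map.of("a.bam", querynameSorted, "c.bam", coordSorted));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under this model a mixed set of queryname-sorted and query-grouped inputs is accepted, matching the answer above: the check only requires membership in the broader query-grouped class.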

@droazen droazen assigned jamesemery and unassigned droazen Apr 23, 2019
@jamesemery jamesemery merged commit bd8ad14 into master Apr 24, 2019
@jamesemery jamesemery deleted the je_clarifyMultipleInputBamErrorMessage branch April 24, 2019 14:04
bhanugandham (Contributor) left a comment:
This is great! Thank you James.
