Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added the ability for MarkDuplicatesSpark to accept multiple inputs #5430

Merged
merged 9 commits into from
Jan 3, 2019

Conversation

jamesemery
Copy link
Collaborator

I'm not sure how much I like the changes to GATKSparkTool.

Resolves #5398

doNotMerge,
concatMerge,
mergeAndSort
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mergeAndSort policy is not used or implemented, so shouldn't be added yet. The walker version doesn't have such a policy as far as I can see, it effectively implements concatMerge, so just implementing the equivalent would be sufficient, no?

@@ -158,6 +160,21 @@ public boolean requiresReads() {
return false;
}

/**
* Does this tool support multiple inputs? Tools that do should
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfinished sentence.

@@ -273,21 +293,36 @@ public SAMFileHeader getHeaderForReads() {
traversalParameters = null; // no intervals were specified so return all reads (mapped and unmapped)
}

// TODO: This if statement is a temporary hack until #959 gets resolved.
if (readInput.endsWith(".adam")) {
// TODO: This if statement is a temporary hack until #959 gets resolve
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't apply to the next line - it should be moved to the code in getGatkReadJavaRDD.

* @return doNotMerge by default
*/
public ReadInputMergingPolicy getReadInputMergingPolicy() {
return ReadInputMergingPolicy.doNotMerge;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not default to concatMerge and support multiple inputs for all Spark tools?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated this with @lbergelson. It seems like an undesirable behavior for tools that tailor their behavior to the header sort order to union the RDDs of multiple inputs potentially invalidating any assumptions about input ordering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of tools opting in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

headers.addAll(readInputs.values());

SamFileHeaderMerger headerMerger = new SamFileHeaderMerger(identifySortOrder(headers), headers, true);
return headerMerger;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for headerMerger variable.

*/
private SamFileHeaderMerger createHeaderMerger() {
List<SAMFileHeader> headers = new ArrayList<>(readInputs.size());
headers.addAll(readInputs.values());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a list? You are using a LinkedHashMap, so you know values() will be ordered correctly. SamFileHeaderMerger takes a Collection, and identifySortOrder could too.


@Test (expectedExceptions = UserException.class)
public void testAssertCorrectSortOrderMultipleBams() {
// This test asserts that the handling of two read pairs with the same start positions but on different in such a way
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment wording is a bit unclear.

@jamesemery
Copy link
Collaborator Author

@tomwhite Can I get another round of review on this branch?

@codecov-io
Copy link

codecov-io commented Nov 28, 2018

Codecov Report

Merging #5430 into master will decrease coverage by 0.003%.
The diff coverage is 86.25%.

@@               Coverage Diff               @@
##              master     #5430       +/-   ##
===============================================
- Coverage     87.085%   87.082%   -0.003%     
- Complexity     31222     31251       +29     
===============================================
  Files           1915      1915               
  Lines         144079    144178       +99     
  Branches       15891     15910       +19     
===============================================
+ Hits          125471    125553       +82     
- Misses         12837     12839        +2     
- Partials        5771      5786       +15
Impacted Files Coverage Δ Complexity Δ
...pelines/metrics/QualityScoreDistributionSpark.java 95.238% <100%> (ø) 16 <0> (ø) ⬇️
...transforms/markduplicates/MarkDuplicatesSpark.java 94.872% <100%> (+0.427%) 38 <9> (+4) ⬆️
...pipelines/metrics/CollectMultipleMetricsSpark.java 92.593% <100%> (ø) 9 <0> (ø) ⬇️
...k/pipelines/metrics/MetricsCollectorSparkTool.java 75% <100%> (ø) 3 <0> (ø) ⬇️
...ark/pipelines/metrics/MeanQualityByCycleSpark.java 90.816% <100%> (ø) 11 <0> (ø) ⬇️
...s/metrics/CollectBaseDistributionByCycleSpark.java 87.037% <100%> (ø) 9 <0> (ø) ⬇️
...stitute/hellbender/engine/spark/GATKSparkTool.java 82.286% <76.923%> (-0.147%) 78 <9> (+10)
.../pipelines/MarkDuplicatesSparkIntegrationTest.java 91.144% <93.103%> (-0.262%) 42 <1> (+6)
...hellbender/utils/runtime/RuntimeUtilsUnitTest.java 87.5% <0%> (-12.5%) 4% <0%> (+1%)
...te/hellbender/tools/funcotator/OutputRenderer.java 92.857% <0%> (-7.143%) 4% <0%> (ø)
... and 17 more

@droazen
Copy link
Collaborator

droazen commented Dec 10, 2018

@tomwhite Looks like your latest comments have been addressed -- could you have another look at this branch when you get a chance?

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery Looks good to me. A few nitpicks but feel free to disregard them.

@@ -442,7 +474,17 @@ public boolean useVariantAnnotations() {
* Returns the name of the source of reads data. It can be a file name or URL.
*/
protected String getReadSourceName(){
return readInput;
if (readInputs.size() > 1) {
throw new GATKException("Multiple ReadsDataSources specificed but a single source requested by the tool");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not ideal maybe but I don't know what else to do about this method... maybe change it to getReadSourceNames and return a list?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I wanted to avoid returning a list because there are a bunch of places in the code where tools expect there to be only one read source kicking around and I didn't want to risk breaking something or having to uproot everything... I agree its pretty gross... Theoretically it shouldn't be a problem for most tools which don't accept multiple inputs anyway

Copy link
Collaborator

@cmnbroad cmnbroad Jan 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should change it to a list, or even get rid of it. All of the existing consumers except one use it to generate an output name for saving metrics (relatively easy to change); the other one (getRecommendedNumReducers) uses it to determine the number of reducers, which won't work correctly on multiple inputs with this implementation. I think getRecommendedNumReducers should be updated either way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad Alright, I'll return a list. getRecommendedNumReducers has already been updated in this branch to sum over the all the read input files.

* @return doNotMerge by default
*/
public ReadInputMergingPolicy getReadInputMergingPolicy() {
return ReadInputMergingPolicy.doNotMerge;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of tools opting in.

}
@VisibleForTesting
static SAMFileHeader.SortOrder identifySortOrder(final Collection<SAMFileHeader> headers){
final Set<SAMFileHeader.SortOrder> sortOrders = headers.stream().map(SAMFileHeader::getSortOrder).collect(Collectors.toSet());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever way to check this.

*/
private SamFileHeaderMerger createHeaderMerger() {
return new SamFileHeaderMerger(identifySortOrder(readInputs.values()), readInputs.values(), true);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a newline here:

Suggested change
}
}

readsHeader = createHeaderMerger().getMergedHeader();
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many newlines here...

@lbergelson lbergelson dismissed tomwhite’s stale review December 10, 2018 21:17

James responded to Tom's review already.

@lbergelson lbergelson assigned jamesemery and unassigned tomwhite Dec 10, 2018
@jamesemery jamesemery force-pushed the je_makeMarkDuplicatesSparkAcceptMultipleInputs branch from b558fb2 to bf58fd3 Compare January 2, 2019 20:36
@jamesemery jamesemery merged commit 1fb980c into master Jan 3, 2019
@jamesemery jamesemery deleted the je_makeMarkDuplicatesSparkAcceptMultipleInputs branch January 3, 2019 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants