Added the ability for MarkDuplicatesSpark to accept multiple inputs #5430
    doNotMerge,
    concatMerge,
    mergeAndSort
}
The mergeAndSort policy is not used or implemented, so it shouldn't be added yet. As far as I can see, the walker version has no such policy; it effectively implements concatMerge, so just implementing the equivalent would be sufficient, no?
@@ -158,6 +160,21 @@ public boolean requiresReads() {
    return false;
}

/**
 * Does this tool support multiple inputs? Tools that do should
Unfinished sentence.
@@ -273,21 +293,36 @@ public SAMFileHeader getHeaderForReads() {
    traversalParameters = null; // no intervals were specified so return all reads (mapped and unmapped)
}

// TODO: This if statement is a temporary hack until #959 gets resolved.
if (readInput.endsWith(".adam")) {
// TODO: This if statement is a temporary hack until #959 gets resolve
This comment doesn't apply to the next line; it should be moved to the code in getGatkReadJavaRDD.
 * @return doNotMerge by default
 */
public ReadInputMergingPolicy getReadInputMergingPolicy() {
    return ReadInputMergingPolicy.doNotMerge;
Why not default to concatMerge and support multiple inputs for all Spark tools?
I debated this with @lbergelson. For tools that tailor their behavior to the header sort order, unioning the RDDs of multiple inputs seems undesirable, because it could invalidate any assumptions about input ordering.
I'm in favor of tools opting in.
OK
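The opt-in design agreed on above could look roughly like the following sketch. It is not the GATK code: SparkToolSketch, MarkDuplicatesSketch, OrderSensitiveToolSketch and validateInputs are hypothetical names, and only the doNotMerge default and the concatMerge policy come from the diff under review.

```java
import java.util.List;

class MergingPolicySketch {
    // Only the two policies discussed as implemented in the review.
    enum ReadInputMergingPolicy { doNotMerge, concatMerge }

    // Hypothetical stand-in for the Spark tool base class.
    static abstract class SparkToolSketch {
        // Tools must override this to opt in to multiple inputs.
        ReadInputMergingPolicy getReadInputMergingPolicy() {
            return ReadInputMergingPolicy.doNotMerge;
        }

        // Hypothetical check: reject several inputs unless the tool opted in.
        void validateInputs(List<String> inputs) {
            if (inputs.size() > 1
                    && getReadInputMergingPolicy() == ReadInputMergingPolicy.doNotMerge) {
                throw new IllegalArgumentException(
                        "This tool does not accept multiple read inputs");
            }
        }
    }

    // A tool that opts in, as MarkDuplicatesSpark does in this PR.
    static class MarkDuplicatesSketch extends SparkToolSketch {
        @Override
        ReadInputMergingPolicy getReadInputMergingPolicy() {
            return ReadInputMergingPolicy.concatMerge;
        }
    }

    // A tool that relies on input ordering and keeps the default.
    static class OrderSensitiveToolSketch extends SparkToolSketch { }

    public static void main(String[] args) {
        List<String> inputs = List.of("a.bam", "b.bam");
        new MarkDuplicatesSketch().validateInputs(inputs); // accepted
        try {
            new OrderSensitiveToolSketch().validateInputs(inputs);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the default left at doNotMerge, a tool that depends on the header sort order never sees a unioned RDD unless its author explicitly overrides the policy.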
headers.addAll(readInputs.values());

SamFileHeaderMerger headerMerger = new SamFileHeaderMerger(identifySortOrder(headers), headers, true);
return headerMerger;
No need for the headerMerger variable.
 */
private SamFileHeaderMerger createHeaderMerger() {
    List<SAMFileHeader> headers = new ArrayList<>(readInputs.size());
    headers.addAll(readInputs.values());
Does this need to be a list? You are using a LinkedHashMap, so you know values() will be ordered correctly. SamFileHeaderMerger takes a Collection, and identifySortOrder could too.
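To illustrate the point about LinkedHashMap, here is a small sketch that uses plain strings in place of SAMFileHeader objects. The describe method is a hypothetical consumer that accepts any Collection; the file names and header labels are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OrderedValuesSketch {
    // Hypothetical consumer: accepts any Collection, so no List copy is required.
    static List<String> describe(Collection<String> headers) {
        return new ArrayList<>(headers);
    }

    static List<String> demo() {
        // LinkedHashMap iterates in insertion order, and so does values().
        Map<String, String> readInputs = new LinkedHashMap<>();
        readInputs.put("b.bam", "header-b");
        readInputs.put("a.bam", "header-a");
        readInputs.put("c.bam", "header-c");
        return describe(readInputs.values());
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [header-b, header-a, header-c]
    }
}
```

Because values() already reflects insertion order, copying it into an ArrayList first adds nothing when the consumer only needs a Collection.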
@Test (expectedExceptions = UserException.class)
public void testAssertCorrectSortOrderMultipleBams() {
    // This test asserts that the handling of two read pairs with the same start positions but on different in such a way
Comment wording is a bit unclear.
@tomwhite Can I get another round of review on this branch?
Codecov Report
@@ Coverage Diff @@
## master #5430 +/- ##
===============================================
- Coverage 87.085% 87.082% -0.003%
- Complexity 31222 31251 +29
===============================================
Files 1915 1915
Lines 144079 144178 +99
Branches 15891 15910 +19
===============================================
+ Hits 125471 125553 +82
- Misses 12837 12839 +2
- Partials 5771 5786 +15
@tomwhite Looks like your latest comments have been addressed -- could you have another look at this branch when you get a chance?
@jamesemery Looks good to me. A few nitpicks but feel free to disregard them.
@@ -442,7 +474,17 @@ public boolean useVariantAnnotations() {
 * Returns the name of the source of reads data. It can be a file name or URL.
 */
protected String getReadSourceName(){
    return readInput;
    if (readInputs.size() > 1) {
        throw new GATKException("Multiple ReadsDataSources specificed but a single source requested by the tool");
Not ideal, maybe, but I don't know what else to do about this method... maybe change it to getReadSourceNames and return a list?
Right, I wanted to avoid returning a list because there are a bunch of places in the code where tools expect only one read source, and I didn't want to risk breaking something or having to uproot everything... I agree it's pretty gross... Theoretically it shouldn't be a problem for most tools, which don't accept multiple inputs anyway.
I think we should change it to a list, or even get rid of it. All of the existing consumers except one use it to generate an output name for saving metrics (relatively easy to change); the other one (getRecommendedNumReducers) uses it to determine the number of reducers, which won't work correctly on multiple inputs with this implementation. I think getRecommendedNumReducers should be updated either way.
@cmnbroad Alright, I'll return a list. getRecommendedNumReducers has already been updated in this branch to sum over all the read input files.
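As a rough illustration of summing over all inputs, the sketch below derives a reducer count from the combined size of every read source. The formula (one reducer plus one per full target partition) and the names recommendedNumReducers, inputSizesBytes and targetPartitionSizeBytes are assumptions made for this example, not the code in the branch.

```java
import java.util.Collection;
import java.util.List;

class ReducerCountSketch {
    // Assumed formula: 1 + (total bytes across all inputs / target partition size).
    static int recommendedNumReducers(Collection<Long> inputSizesBytes,
                                      long targetPartitionSizeBytes) {
        if (targetPartitionSizeBytes <= 0) {
            throw new IllegalArgumentException("target partition size must be positive");
        }
        // Sum over every read input rather than looking at a single source.
        long total = inputSizesBytes.stream().mapToLong(Long::longValue).sum();
        return 1 + (int) (total / targetPartitionSizeBytes);
    }

    public static void main(String[] args) {
        // Two hypothetical inputs of 300 and 500 bytes, 256-byte target partitions.
        System.out.println(recommendedNumReducers(List.of(300L, 500L), 256L)); // prints 4
    }
}
```

Sizing from a single source would undercount here: 300 bytes alone gives 2 reducers, while the combined 800 bytes gives 4.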
}
@VisibleForTesting
static SAMFileHeader.SortOrder identifySortOrder(final Collection<SAMFileHeader> headers){
    final Set<SAMFileHeader.SortOrder> sortOrders = headers.stream().map(SAMFileHeader::getSortOrder).collect(Collectors.toSet());
Clever way to check this.
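The set-based check can be sketched in isolation as follows. Only the first line of identifySortOrder is visible in the diff, so the rest is an assumption: this version returns the shared order when all headers agree and falls back to unsorted otherwise. SortOrder and Header are minimal stand-ins for the htsjdk types.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class SortOrderSketch {
    // Minimal stand-in for SAMFileHeader.SortOrder.
    enum SortOrder { unsorted, queryname, coordinate }

    // Minimal stand-in for SAMFileHeader.
    static final class Header {
        private final SortOrder sortOrder;
        Header(SortOrder sortOrder) { this.sortOrder = sortOrder; }
        SortOrder getSortOrder() { return sortOrder; }
    }

    static SortOrder identifySortOrder(Collection<Header> headers) {
        // Collapsing to a set leaves exactly one element when all inputs agree.
        Set<SortOrder> sortOrders = headers.stream()
                .map(Header::getSortOrder)
                .collect(Collectors.toSet());
        // Assumed fallback: mixed (or no) sort orders are treated as unsorted.
        return sortOrders.size() == 1 ? sortOrders.iterator().next() : SortOrder.unsorted;
    }

    public static void main(String[] args) {
        System.out.println(identifySortOrder(List.of(
                new Header(SortOrder.queryname), new Header(SortOrder.queryname)))); // prints queryname
        System.out.println(identifySortOrder(List.of(
                new Header(SortOrder.queryname), new Header(SortOrder.coordinate)))); // prints unsorted
    }
}
```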
 */
private SamFileHeaderMerger createHeaderMerger() {
    return new SamFileHeaderMerger(identifySortOrder(readInputs.values()), readInputs.values(), true);
}
There should be a newline here:
}
}
    readsHeader = createHeaderMerger().getMergedHeader();
}



too many newlines here...
James responded to Tom's review already.
… if they are queryname sorted
b558fb2 to bf58fd3
I'm not sure how much I like the changes to GATKSparkTool.
Resolves #5398