Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added an option to opt out of the protections to sorting multiple bam inputs #5974

Merged
merged 6 commits into from
May 31, 2019

Conversation

jamesemery
Copy link
Collaborator

An advanced argument to skip multiple inputs. We still want to discourage this behavior for performance reasons so its not the default behavior.

Fixes #5973

@codecov
Copy link

codecov bot commented May 29, 2019

Codecov Report

Merging #5974 into master will decrease coverage by 0.002%.
The diff coverage is 88.095%.

@@               Coverage Diff               @@
##              master     #5974       +/-   ##
===============================================
- Coverage     86.931%   86.929%   -0.002%     
- Complexity     32715     32721        +6     
===============================================
  Files           2013      2013               
  Lines         151268    151306       +38     
  Branches       16604     16610        +6     
===============================================
+ Hits          131499    131529       +30     
- Misses         13716     13720        +4     
- Partials        6053      6057        +4
Impacted Files Coverage Δ Complexity Δ
.../pipelines/MarkDuplicatesSparkIntegrationTest.java 91.837% <100%> (+0.47%) 44 <2> (+2) ⬆️
...transforms/markduplicates/MarkDuplicatesSpark.java 90.625% <80.769%> (-3.97%) 41 <5> (+5)
...tools/walkers/haplotypecaller/graphs/SeqGraph.java 87.059% <0%> (-2.353%) 23% <0%> (-1%)
...lotypecaller/readthreading/ReadThreadingGraph.java 88.725% <0%> (-0.245%) 159% <0%> (ø)

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery I think this should work slightly differently. Good to go after comments are addressed or let me know why I'm wrong :)

@@ -140,6 +141,11 @@
fullName = StandardArgumentDefinitions.METRICS_FILE_LONG_NAME)
protected String metricsFile;

@Advanced
@Argument(doc = "Override to allow non-queryname sorted inputs for multiple input bams.", optional=true,
fullName = "allow-multiple-sort-orders-in-input")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets change this to treat-unsorted-files-as-queryname-grouped. Make it apply to both single and multiple files. It should only overrule the case where the files are listed as unsorted or don't have a given sort order. A file that says it's coordinate sorted should still be treated as such.

The comment needs to mention that it will silently give incorrect results if the file's are not actually queryname grouped, and that this is an unsafe workaround for legacy files.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually doesn't do that. What this change actually does is override the hard failure that used to throw an exception if all the multiple input files didin't have query grouped sort order.

If that override fails, the header associated with the input gets set to "unsorted" with no GO tag, which means the tool treats the input as being non-queryname sorted. Thus it will just happily sort the entire input for you at the outset. What the above check was doing before was silently asserting that the input was GO:query if all the inputs were indeed at least GO:query ordered according to their original headers.

if (!allowMultipleSortOrders) {
headerMap.entrySet().stream().forEach(h -> {
if (!ReadUtils.isReadNameGroupedBam(h.getValue())) {
throw new UserException("Multiple inputs to MarkDuplicatesSpark detected. MarkDuplicatesSpark requires all inputs to be queryname sorted or querygroup-sorted for multi-input processing but input " + h.getKey() + " was sorted in " + h.getValue().getSortOrder() + " order. Try running with '--allow-multiple-sort-orders-in-input' to run by sorting all the input.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make a constant for the fullname of the argument and use it in the comment here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -482,4 +482,15 @@ public void testAssertCorrectSortOrderMultipleBams() {
args.addInput(new File(TEST_DATA_DIR,"example.chr1.1-1K.unmarkedDups.noDups.bam"));
runCommandLine(args);
}

@Test
public void testAssertCorrectSortOrderMultipleBamsOverriding() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test case for a single unsorted bam, and one that rejects when there's a coordinate sorted bam along side an unsorted bam

@jamesemery
Copy link
Collaborator Author

@lbergelson I added a second argument per your request. I still disagree that this is the right argument to expose because it is very dangerous.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery I have a few last nitpicky dock changes.

@@ -127,6 +128,8 @@
programGroup = ReadDataManipulationProgramGroup.class)
public final class MarkDuplicatesSpark extends GATKSparkTool {
private static final long serialVersionUID = 1L;
public static final String ALLOW_MULTIPLE_SORT_ORDERS_IN_INPUT_ARG = "allow-multiple-sort-orders-in-input";
public static final String TREAT_UNSORTED_AS_OREDED = "treat-unsorted-as-querygroup-ordered-for-multiple-inputs";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo OREDED -> ORDERED

@@ -140,6 +143,16 @@
fullName = StandardArgumentDefinitions.METRICS_FILE_LONG_NAME)
protected String metricsFile;

@Advanced
@Argument(doc = "Override to allow non-queryname sorted inputs for multiple input bams.", optional=true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@Argument(doc = "Override to allow non-queryname sorted inputs for multiple input bams.", optional=true,
@Argument(doc = "Allow non-queryname sorted inputs when specifying multiple input bams.", optional=true,

protected boolean allowMultipleSortOrders = false;

@Advanced
@Argument(doc = "Treat unsorted files as query-group orderd files. NOTE: this may result in mark duplicates crashing if the file is unordered", optional=true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@Argument(doc = "Treat unsorted files as query-group orderd files. NOTE: this may result in mark duplicates crashing if the file is unordered", optional=true,
@Argument(doc = "Treat files that are marked as unsorted or have no defined sort order files as query-group ordered files. WARNING: This option disables a basic safety check and may result in crashes or incorrect output if the files are not actually queryname grouped", optional=true,

I might make this warning a bit stronger.

mergedHeader.setGroupOrder(SAMFileHeader.GroupOrder.query);
if (!allowMultipleSortOrders) {
headerMap.entrySet().stream().forEach(h -> {
if (!ReadUtils.isReadNameGroupedBam(h.getValue()) && (!treatUnsortedAsOrdered && h.getValue().getSortOrder().equals(SAMFileHeader.SortOrder.unsorted))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

treatUnsortedAsOrdered should work for either SortOrder.unknown or SortOrder.unsorted

I would probably pull out a function here that does this check to make it a little clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Treat unsortedAsOrdered should also apply when using a single input bam.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

throw new UserException("Multiple inputs to MarkDuplicatesSpark detected. MarkDuplicatesSpark requires all inputs to be queryname sorted or querygroup-sorted for multi-input processing but input "+h.getKey()+" was sorted in "+h.getValue().getSortOrder()+" order");
}});
mergedHeader.setGroupOrder(SAMFileHeader.GroupOrder.query);
if (!allowMultipleSortOrders) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if this is set then multiple bams will always be sorted even if they are all coordinate grouped already? It probably shouldn't do that.

@lbergelson
Copy link
Member

Ack, I posted a bunch of outdated comments by accident, ignore the ones that make no sense now.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery A few tiny doc changes and I think this is good to go.

@@ -313,7 +313,7 @@ protected void runTool(final JavaSparkContext ctx) {
if (headerMap.size() > 1) {
final Optional<Map.Entry<String, SAMFileHeader>> badlySorted = headerMap.entrySet()
.stream()
.filter(h -> treatAsReadGroupOrdered(h.getValue(), treatUnsortedAsOrdered))
.filter(h -> !treatAsReadGroupOrdered(h.getValue(), treatUnsortedAsOrdered))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops, sorry

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats worse is that I actually caught this yesterday but forgot to push the branch until today... sigh...

@jamesemery jamesemery merged commit 44ca2e9 into master May 31, 2019
@jamesemery jamesemery deleted the je_enableMixedInputTypesForMarkDuplicatesSpark branch May 31, 2019 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add the option for mixed-sorting inputs to MarkDuplicatesSpark
2 participants