Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding --sort-order option to SortSamSpark #4545

Merged
merged 7 commits into from
Jun 11, 2018
Merged

Conversation

lbergelson
Copy link
Member

adding a --sort-order option to SortSamSpark to let users specify the what order to sort in
enabling disabled tests
fixing the tests which weren't actually asserting anything

closes #1260

@codecov-io
Copy link

codecov-io commented Mar 20, 2018

Codecov Report

Merging #4545 into master will increase coverage by 0.287%.
The diff coverage is 87.097%.

@@               Coverage Diff               @@
##              master     #4545       +/-   ##
===============================================
+ Coverage     80.355%   80.642%   +0.287%     
- Complexity     17714     18478      +764     
===============================================
  Files           1088      1089        +1     
  Lines          63975     66116     +2141     
  Branches       10313     10913      +600     
===============================================
+ Hits           51407     53317     +1910     
- Misses          8555      8661      +106     
- Partials        4013      4138      +125
Impacted Files Coverage Δ Complexity Δ
...ender/engine/spark/datasources/ReadsSparkSink.java 77.027% <ø> (ø) 33 <0> (ø) ⬇️
...hellbender/tools/spark/pipelines/SortSamSpark.java 100% <100%> (+29.412%) 6 <3> (+2) ⬆️
...broadinstitute/hellbender/utils/test/BaseTest.java 65.541% <40%> (-0.432%) 39 <1> (+2)
...der/engine/spark/datasources/ReadsSparkSource.java 80.208% <87.5%> (+0.663%) 31 <0> (ø) ⬇️
...itute/hellbender/utils/test/SamAssertionUtils.java 70.256% <0%> (-2.564%) 42% <0%> (-4%)
...nder/tools/funcotator/TranscriptSelectionMode.java 89.326% <0%> (-1.15%) 1% <0%> (ø)
...tools/funcotator/DataSourceFuncotationFactory.java 89.655% <0%> (-0.345%) 21% <0%> (+4%)
...adinstitute/hellbender/utils/R/RScriptLibrary.java 100% <0%> (ø) 11% <0%> (+5%) ⬆️
...te/hellbender/tools/funcotator/FuncotationMap.java 84.211% <0%> (ø) 28% <0%> (?)
...org/broadinstitute/hellbender/engine/GATKTool.java 91.463% <0%> (+0.392%) 108% <0%> (+9%) ⬆️
... and 21 more

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review complete, back to @lbergelson for changes. There is an issue with the tool unconditionally using ReadCoordinateComparator, and the tests not being valid due to using the input as the expected output.

protected String outputFile;
private String outputFile;

@Argument(doc="the order to sort the file into", fullName = SORT_ORDER_LONG_NAME, optional = true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the order to sort the file into" -> "sort order of output file" (the latter seems more grammatical to me)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -27,17 +29,29 @@
public final class SortSamSpark extends GATKSparkTool {
private static final long serialVersionUID = 1L;

public static final String SORT_ORDER_LONG_NAME = "sort-order";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a blank line after this declaration

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


@Override
public List<ReadFilter> getDefaultReadFilters() {
return Collections.singletonList(ReadFilterLibrary.ALLOW_ALL_READS);
}

@Override
protected void onStartup() {
if( sortOrder.getComparatorInstance() == null){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unbalanced paren spacing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. I'm sure intellij will find a different place to unbalance though

@Override
protected void onStartup() {
if( sortOrder.getComparatorInstance() == null){
throw new UserException.BadInput("Cannot sort a file in " + sortOrder + " order. That ordering doesnt define a valid comparator. "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"That ordering doesnt define a valid comparator" -> "There is no comparator defined for that ordering"


@Override
public List<ReadFilter> getDefaultReadFilters() {
return Collections.singletonList(ReadFilterLibrary.ALLOW_ALL_READS);
}

@Override
protected void onStartup() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call super.onStartup() in the first line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On startup is documented as having an empty default implementation. I can add it anyway though to future proof.

}
args.add("--num-reducers"); args.add("1");
args.addArgument(GATKSparkTool.NUM_REDUCERS_LONG_NAME, "1");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does num reducers need to be 1 for this test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know... Any number should be fine I think.

// {"count_reads.bam", "count_reads.bam", null, ".bam", "queryname"},
// {"count_reads.cram", "count_reads.cram", "count_reads.fasta", ".cram", "queryname"},
{"count_reads.bam", "count_reads.bam", null, ".bam", "queryname"},
{"count_reads.cram", "count_reads.cram", "count_reads.fasta", ".cram", "queryname"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a valid test? The inputs and outputs are the same! I'm not convinced that queryname sorting is actually working in this branch, since you use a ReadCoordinateComparator in the tool no matter what.

args.add("--num-reducers"); args.add("1");
args.addInput(unsortedBam);
args.addOutput(outputBam);
args.addArgument(GATKSparkTool.NUM_REDUCERS_LONG_NAME, "1");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, why set the number of reducers to 1?

@@ -61,13 +59,32 @@ public void test() throws Exception {
final File sortedBam = new File(getTestDataDir(), "count_reads_sorted.bam");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test method needs a better name than just test(). Also, why do we need this test case in addition to testSortBAMs()? Can they be unified?

@@ -7,7 +7,7 @@
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
@RG ID:0 SM:Hi,Mom!
@RG ID:0 SM:Hi,Mom! PL:ILLUMINA
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you need to make this change?

adding a --sort-order option to SortSamSpark to let users specify the what order to sort in
enabling disabled tests
fixing the tests which weren't actually asserting anything

closes #1260

work in progress

in progress

refactoring

using new SparkUtils method
@jamesemery
Copy link
Collaborator

@lbergelson looks like the tests passed this time around. We should open a ticket in Hadoop-Bam to fix the issue

@lbergelson
Copy link
Member Author

@jamesemery I've opened a ticket in hadoop bam.
@droazen Could you re-review this when you get a chance? It's super useful functionality. If we don't trust the fix for sharded files, we could globally disable them for cram/sam instead. I think the problem is not unique to this branch, it's just the first to test the file order or reading sharded cram/sam.


if( splits.stream().allMatch(split -> split instanceof FileVirtualSplit || split instanceof FileSplit)) {
splits.sort(Comparator.comparing(split -> {
if (split instanceof FileVirtualSplit) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mix the virtual and non-virtual splits here. This makes me uncomfortable as I don't know that its guaranteed either is going to return the same thing for getPath(). Could you use the filename instead? As that seems to be what is being sorted here anyway.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

@lbergelson lbergelson dismissed droazen’s stale review June 11, 2018 22:20

james took over the review

@lbergelson lbergelson merged commit 1751c85 into master Jun 11, 2018
@lbergelson lbergelson deleted the lb_sort_order_option branch June 11, 2018 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SortBamSpark is missing SORT_ORDER parameter
4 participants