adding --sort-order option to SortSamSpark #4545

Merged

lbergelson merged 7 commits into master from lb_sort_order_option

Jun 11, 2018

Member

lbergelson commented Mar 20, 2018

adding a --sort-order option to SortSamSpark to let users specify what order to sort in
enabling disabled tests
fixing the tests which weren't actually asserting anything

closes #1260

lbergelson assigned droazen

lbergelson requested a review from droazen

March 20, 2018 20:05

codecov-io commented Mar 20, 2018

Codecov Report

Merging #4545 into master will increase coverage by 0.287%.
The diff coverage is 87.097%.

@@               Coverage Diff               @@
##              master     #4545       +/-   ##
===============================================
+ Coverage     80.355%   80.642%   +0.287%     
- Complexity     17714     18478      +764     
===============================================
  Files           1088      1089        +1     
  Lines          63975     66116     +2141     
  Branches       10313     10913      +600     
===============================================
+ Hits           51407     53317     +1910     
- Misses          8555      8661      +106     
- Partials        4013      4138      +125

Impacted Files	Coverage Δ	Complexity Δ
...ender/engine/spark/datasources/ReadsSparkSink.java	`77.027% <ø> (ø)`	`33 <0> (ø)`	⬇️
...hellbender/tools/spark/pipelines/SortSamSpark.java	`100% <100%> (+29.412%)`	`6 <3> (+2)`	⬆️
...broadinstitute/hellbender/utils/test/BaseTest.java	`65.541% <40%> (-0.432%)`	`39 <1> (+2)`
...der/engine/spark/datasources/ReadsSparkSource.java	`80.208% <87.5%> (+0.663%)`	`31 <0> (ø)`	⬇️
...itute/hellbender/utils/test/SamAssertionUtils.java	`70.256% <0%> (-2.564%)`	`42% <0%> (-4%)`
...nder/tools/funcotator/TranscriptSelectionMode.java	`89.326% <0%> (-1.15%)`	`1% <0%> (ø)`
...tools/funcotator/DataSourceFuncotationFactory.java	`89.655% <0%> (-0.345%)`	`21% <0%> (+4%)`
...adinstitute/hellbender/utils/R/RScriptLibrary.java	`100% <0%> (ø)`	`11% <0%> (+5%)`	⬆️
...te/hellbender/tools/funcotator/FuncotationMap.java	`84.211% <0%> (ø)`	`28% <0%> (?)`
...org/broadinstitute/hellbender/engine/GATKTool.java	`91.463% <0%> (+0.392%)`	`108% <0%> (+9%)`	⬆️
... and 21 more

droazen previously requested changes

Collaborator

droazen left a comment

Review complete, back to @lbergelson for changes. There is an issue with the tool unconditionally using ReadCoordinateComparator, and the tests not being valid due to using the input as the expected output.

src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSpark.java Outdated

-                  protected String outputFile;
+                  private String outputFile;
+                  @Argument(doc="the order to sort the file into", fullName = SORT_ORDER_LONG_NAME, optional = true)

Collaborator

droazen Mar 21, 2018

"the order to sort the file into" -> "sort order of output file" (the latter seems more grammatical to me)

Member Author

lbergelson Apr 27, 2018

done

src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSpark.java

@@ -27,17 +29,29 @@
               public final class SortSamSpark extends GATKSparkTool {
                   private static final long serialVersionUID = 1L;
+                  public static final String SORT_ORDER_LONG_NAME = "sort-order";

Collaborator

droazen Mar 21, 2018

Add a blank line after this declaration

Member Author

lbergelson Apr 27, 2018

done

src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSpark.java Outdated

                   @Override
                   public List<ReadFilter> getDefaultReadFilters() {
                       return Collections.singletonList(ReadFilterLibrary.ALLOW_ALL_READS);
                   }
+                  @Override
+                  protected void onStartup() {
+                      if( sortOrder.getComparatorInstance() == null){

Collaborator

droazen Mar 21, 2018

unbalanced paren spacing

Member Author

lbergelson Apr 27, 2018

Fixed. I'm sure IntelliJ will find a different place to unbalance, though.

src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSpark.java Outdated

+                  @Override
+                  protected void onStartup() {
+                      if( sortOrder.getComparatorInstance() == null){
+                          throw new UserException.BadInput("Cannot sort a file in " + sortOrder + " order.  That ordering doesnt define a valid comparator.  "

Collaborator

droazen Mar 21, 2018

"That ordering doesnt define a valid comparator" -> "There is no comparator defined for that ordering"

src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSpark.java

                   @Override
                   public List<ReadFilter> getDefaultReadFilters() {
                       return Collections.singletonList(ReadFilterLibrary.ALLOW_ALL_READS);
                   }
+                  @Override
+                  protected void onStartup() {

Collaborator

droazen Mar 21, 2018

Call super.onStartup() in the first line?

Member Author

lbergelson Apr 27, 2018

onStartup() is documented as having an empty default implementation. I can add the call anyway, though, to future-proof it.
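
The two points in this part of the review (calling the base hook first, and rejecting a sort order that has no comparator) can be sketched in isolation. This is a minimal illustration only, not the code from the PR: `BaseTool`, `SortTool`, the local `SortOrder` enum, and `start()` are hypothetical stand-ins, and `IllegalArgumentException` is used in place of GATK's `UserException.BadInput` so the example has no dependencies.

```java
import java.util.Comparator;

public class StartupCheckSketch {
    // Hypothetical base class standing in for a tool with a startup hook.
    public static class BaseTool {
        protected boolean baseStarted = false;

        // The default does very little (it only records that it ran),
        // but subclasses should not rely on it staying that way.
        protected void onStartup() {
            baseStarted = true;
        }
    }

    // Hypothetical sort orders; "unsorted" deliberately has no comparator.
    public enum SortOrder {
        coordinate(Comparator.<String>naturalOrder()),
        queryname(Comparator.<String>naturalOrder()),
        unsorted(null);

        private final Comparator<String> comparator;

        SortOrder(Comparator<String> comparator) {
            this.comparator = comparator;
        }

        public Comparator<String> getComparatorInstance() {
            return comparator;
        }
    }

    public static class SortTool extends BaseTool {
        private final SortOrder sortOrder;

        public SortTool(SortOrder sortOrder) {
            this.sortOrder = sortOrder;
        }

        @Override
        protected void onStartup() {
            super.onStartup(); // run the base hook first, even if it is currently a no-op
            if (sortOrder.getComparatorInstance() == null) {
                throw new IllegalArgumentException("Cannot sort a file in " + sortOrder
                        + " order. There is no comparator defined for that ordering.");
            }
        }

        // Public entry point so callers can trigger the protected hook.
        public boolean start() {
            onStartup();
            return baseStarted;
        }
    }

    public static void main(String[] args) {
        System.out.println(new SortTool(SortOrder.queryname).start());
        try {
            new SortTool(SortOrder.unsorted).start();
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing in the startup hook means an unsupported ordering is reported before any reads are loaded, rather than partway through the sort.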

...st/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSparkIntegrationTest.java Outdated

                       }
-                      args.add("--num-reducers"); args.add("1");
+                      args.addArgument(GATKSparkTool.NUM_REDUCERS_LONG_NAME, "1");

Collaborator

droazen Mar 21, 2018

Why does num reducers need to be 1 for this test?

Member Author

lbergelson Apr 27, 2018

I don't know... Any number should be fine I think.

...st/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSparkIntegrationTest.java Outdated

-              //                {"count_reads.bam", "count_reads.bam", null, ".bam", "queryname"},
-              //                {"count_reads.cram", "count_reads.cram", "count_reads.fasta", ".cram", "queryname"},
+                              {"count_reads.bam", "count_reads.bam", null, ".bam", "queryname"},
+                              {"count_reads.cram", "count_reads.cram", "count_reads.fasta", ".cram", "queryname"},

Collaborator

droazen Mar 21, 2018

Is this a valid test? The inputs and outputs are the same! I'm not convinced that queryname sorting is actually working in this branch, since you use a ReadCoordinateComparator in the tool no matter what.
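
The concern here can be illustrated with a small, self-contained sketch. It does not use GATK or htsjdk classes: `Read`, the local `SortOrder` enum, and `sortedNames` are hypothetical stand-ins, and the queryname comparator is plain lexicographic ordering, which may differ from the ordering a real SAM library applies. The sketch picks the comparator from the requested sort order instead of fixing it to coordinate order, and it checks the result against an input that is not already sorted, so the two orderings give visibly different outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortOrderSketch {
    // Hypothetical stand-in for a read: a name plus a start position.
    public static final class Read {
        private final String name;
        private final int start;

        public Read(String name, int start) {
            this.name = name;
            this.start = start;
        }

        public String getName() { return name; }
        public int getStart() { return start; }
    }

    // Hypothetical sort orders, each carrying its own comparator.
    public enum SortOrder {
        coordinate(Comparator.comparingInt(Read::getStart)),
        queryname(Comparator.comparing(Read::getName));

        private final Comparator<Read> comparator;

        SortOrder(Comparator<Read> comparator) {
            this.comparator = comparator;
        }

        public Comparator<Read> getComparator() {
            return comparator;
        }
    }

    // Sort a copy of the reads with the comparator for the requested order
    // and return the read names in the resulting order.
    public static List<String> sortedNames(List<Read> reads, SortOrder order) {
        List<Read> copy = new ArrayList<>(reads);
        copy.sort(order.getComparator());
        List<String> names = new ArrayList<>();
        for (Read r : copy) {
            names.add(r.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Input is sorted by neither name nor position.
        List<Read> unsorted = Arrays.asList(
                new Read("b", 10), new Read("a", 30), new Read("c", 20));
        System.out.println(sortedNames(unsorted, SortOrder.coordinate)); // [b, c, a]
        System.out.println(sortedNames(unsorted, SortOrder.queryname));  // [a, b, c]
    }
}
```

A test whose expected file is the same as its input would pass even if the tool ignored the requested order, which is why the input above is chosen so that each ordering produces a different result.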

...st/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSparkIntegrationTest.java Outdated

    
                      args.add("--num-reducers"); args.add("1");
                      args.addInput(unsortedBam);
                      args.addOutput(outputBam);
                      args.addArgument(GATKSparkTool.NUM_REDUCERS_LONG_NAME, "1");

Collaborator

droazen Mar 21, 2018

Again, why set the number of reducers to 1?

...st/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSparkIntegrationTest.java Outdated

		@@ -61,13 +59,32 @@ public void test() throws Exception {
		final File sortedBam = new File(getTestDataDir(), "count_reads_sorted.bam");

Collaborator

droazen Mar 21, 2018

This test method needs a better name than just test(). Also, why do we need this test case in addition to testSortBAMs()? Can they be unified?

src/test/resources/org/broadinstitute/hellbender/tools/count_reads_sorted.sam

@@ -7,7 +7,7 @@
               @SQ	SN:chr6	LN:101
               @SQ	SN:chr7	LN:404
               @SQ	SN:chr8	LN:202
-              @RG	ID:0	SM:Hi,Mom!
+              @RG	ID:0	SM:Hi,Mom!	PL:ILLUMINA

Collaborator

droazen Mar 21, 2018

Why did you need to make this change?

droazen assigned lbergelson and unassigned droazen

lbergelson force-pushed the lb_sort_order_option branch from 74bcb52 to 9f8c7c5 Compare

May 25, 2018 22:07

lbergelson added 3 commits

June 5, 2018 14:55


          adding --sort-order option to SortSamSpark

a29ab53

adding a --sort-order option to SortSamSpark to let users specify what order to sort in
enabling disabled tests
fixing the tests which weren't actually asserting anything

closes #1260

work in progress

in progress

refactoring

using new SparkUtils method


          fix tests

d20579b


          more test output to try to figure out what's going on in travis

4d4265b

jamesemery force-pushed the lb_sort_order_option branch from 0722c0f to 4d4265b Compare

June 5, 2018 18:55

lbergelson added 2 commits

June 5, 2018 15:54


          trying to sort splits

8855b4a


          trying again with better casts, but still bad

Collaborator

jamesemery commented Jun 6, 2018

@lbergelson looks like the tests passed this time around. We should open a ticket in Hadoop-Bam to fix the issue

jamesemery mentioned this pull request

Removed check in PrintReadsSpark for coordinate sorted bam #4853

Merged


          adding doc to SplitSortingSamInputFormat

558a13f

Member Author

lbergelson commented Jun 6, 2018

@jamesemery I've opened a ticket in hadoop bam.
@droazen Could you re-review this when you get a chance? It's super useful functionality. If we don't trust the fix for sharded files, we could globally disable them for cram/sam instead. I think the problem is not unique to this branch; it's just the first to test the file order when reading sharded cram/sam.

jamesemery reviewed

src/main/java/org/broadinstitute/hellbender/engine/spark/datasources/ReadsSparkSource.java Outdated

+                          if( splits.stream().allMatch(split -> split instanceof FileVirtualSplit || split instanceof FileSplit)) {
+                              splits.sort(Comparator.comparing(split -> {
+                                  if (split instanceof FileVirtualSplit) {

Collaborator

jamesemery Jun 7, 2018

You mix the virtual and non-virtual splits here. This makes me uncomfortable, as I don't know that it's guaranteed that both will return the same thing for getPath(). Could you use the filename instead, since that seems to be what is being sorted here anyway?
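
A rough sketch of the suggestion, under stated assumptions: `Split`, `PlainSplit`, and `VirtualSplit` below are hypothetical stand-ins rather than the Hadoop or Hadoop-BAM classes, and the differing path prefixes are invented purely to show why comparing full paths across two split types could misorder shards. Sorting on the final path component avoids depending on how each type renders its path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SplitSortSketch {
    // Hypothetical common view of an input split: it can report a path string.
    public interface Split {
        String getPathString();
    }

    // Hypothetical split type that reports a bare path.
    public static final class PlainSplit implements Split {
        private final String path;
        public PlainSplit(String path) { this.path = path; }
        @Override public String getPathString() { return path; }
    }

    // Hypothetical split type that (in this sketch only) reports a path with a scheme prefix.
    public static final class VirtualSplit implements Split {
        private final String path;
        public VirtualSplit(String path) { this.path = path; }
        @Override public String getPathString() { return "file:" + path; }
    }

    // Final path component, e.g. "part-00001" for "file:/data/part-00001".
    static String fileName(Split split) {
        String path = split.getPathString();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Sort a copy of the splits by file name and return the names in order.
    public static List<String> sortedFileNames(List<Split> splits) {
        List<Split> copy = new ArrayList<>(splits);
        copy.sort(Comparator.comparing(SplitSortSketch::fileName));
        List<String> names = new ArrayList<>();
        for (Split s : copy) {
            names.add(fileName(s));
        }
        return names;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
                new PlainSplit("/data/part-00002"),
                new VirtualSplit("/data/part-00001"),
                new PlainSplit("/data/part-00000"));
        System.out.println(sortedFileNames(splits)); // [part-00000, part-00001, part-00002]
    }
}
```

With these invented prefixes, sorting on the full path string would place both bare paths ahead of the prefixed one and put part-00002 before part-00001, whereas the file-name key keeps the shards in numeric order.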

jamesemery mentioned this pull request

Added an option to ReadsSparkSink specifying whether to sort the reads on output. #4874

Merged


          responding to james

72997c5

jamesemery approved these changes

Collaborator

jamesemery left a comment

Looks good 👍

lbergelson dismissed droazen’s stale review

June 11, 2018 22:20

james took over the review

lbergelson merged commit 1751c85 into master

lbergelson deleted the lb_sort_order_option branch

June 11, 2018 22:22
