
Added an option to ReadsSparkSink specifying whether to sort the reads on output. #4874

Merged: 10 commits from je_parameterizeReadsSparkSourceSort into master on Feb 12, 2019

Conversation

jamesemery (Collaborator):

Fixes #4859

Depends on #4545

jamesemery force-pushed the je_parameterizeReadsSparkSourceSort branch from 6d4f8fc to 12b502d on June 14, 2018 16:11
jamesemery (Collaborator, Author):

@lbergelson Can we revisit this branch? We need to clean up the current sorting behavior, as it is totally inconsistent and confusing. This is an attempt to make the behavior more uniform.

jamesemery force-pushed the je_parameterizeReadsSparkSourceSort branch from 12b502d to 538658a on July 6, 2018 14:51
droazen requested a review from tomwhite on August 27, 2018 19:20
droazen assigned tomwhite and unassigned lbergelson on Aug 27, 2018
droazen (Collaborator) commented Aug 27, 2018:

@tomwhite Can you please have a look at this PR when you have time?

jamesemery (Collaborator, Author):

The idea behind this branch: make the sorting of ReadsSparkSink output consistent and configurable, so that if a tool alters reads without changing their sort order, no sort is performed by default. It also means that if you request sharded output, you can ask ReadsSparkSink to sort the file for you.
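
A minimal sketch of the two call patterns this enables, assuming the nine-argument writeReads overload and the sortReadsToHeader parameter name shown in the diffs below (class and package names are GATK's):

import java.io.IOException;
import htsjdk.samtools.SAMFileHeader;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink;
import org.broadinstitute.hellbender.utils.read.GATKRead;
import org.broadinstitute.hellbender.utils.read.ReadsWriteFormat;

final class SortOnOutputExample {
    // A tool that altered reads without changing their order can ask the sink not to re-sort.
    static void writeUnsorted(final JavaSparkContext ctx, final String out, final String ref,
                              final JavaRDD<GATKRead> reads, final SAMFileHeader header,
                              final int numReducers, final String partsDir) throws IOException {
        ReadsSparkSink.writeReads(ctx, out, ref, reads, header,
                ReadsWriteFormat.SHARDED, numReducers, partsDir, /* sortReadsToHeader */ false);
    }

    // Sharded output can now opt in to being sorted to match the header.
    static void writeShardedSorted(final JavaSparkContext ctx, final String out, final String ref,
                                   final JavaRDD<GATKRead> reads, final SAMFileHeader header,
                                   final int numReducers, final String partsDir) throws IOException {
        ReadsSparkSink.writeReads(ctx, out, ref, reads, header,
                ReadsWriteFormat.SHARDED, numReducers, partsDir, /* sortReadsToHeader */ true);
    }
}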

tomwhite (Contributor) left a comment:

Overall change looks like a good one, but the test isn't quite done.

@@ -166,7 +166,7 @@ public SparkHeaderlessCRAMOutputFormat() {
 public static void writeReads(
         final JavaSparkContext ctx, final String outputFile, final String referenceFile, final JavaRDD<GATKRead> reads,
         final SAMFileHeader header, ReadsWriteFormat format) throws IOException {
-    writeReads(ctx, outputFile, referenceFile, reads, header, format, 0, null);
+    writeReads(ctx, outputFile, referenceFile, reads, header, format, 0, null,format==ReadsWriteFormat.SINGLE);
tomwhite (Contributor):

Nit: missing space before comma

lbergelson (Member):

Let's add a comment about how this is here to maintain the old weird behavior.

@@ -242,6 +242,26 @@ public void testUnSorted() throws Exception {
     SamAssertionUtils.assertSamsEqual(outBam, inBam);
 }

+@Test( groups = "spark")
+public void testUnSorted() throws Exception {
tomwhite (Contributor):

This test looks like it's not finished? It shares a name with an existing test, and print_reads.mismatchedHeader.sam is unused.

lbergelson (Member) left a comment:

@jamesemery The basic change looks OK to me. It's failing to compile the tests because of the duplicated test name that Tom pointed out. I have a few minor comments, but once the test is fixed I think it should be good to go, assuming nothing else is failing.

Should we open a new ticket after this to revisit the different default treatment of sharded and non-sharded output? It seems like we should get rid of that because it's confusing, but I assume you left that bit in to avoid breaking things and to land the option to control it sooner.

@@ -201,19 +202,20 @@ public static void writeReads(
     // SAMRecords, this will effectively be a no-op. The SAMRecords will be headerless
     // for efficient serialization.
     final JavaRDD<SAMRecord> samReads = reads.map(read -> read.convertToSAMRecord(null));
+    final JavaRDD<SAMRecord> readsToUse = sortReadsToHeader ? sortSamRecordsToMatchHeader(samReads, header, numReducers) : samReads;
lbergelson (Member):

Maybe readsToUse -> readsToOutput would be clearer?
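
For readers outside the codebase: sortSamRecordsToMatchHeader reorders the RDD of SAMRecords into the sort order declared by the header before writing. A rough, hypothetical sketch of the coordinate-sorted case (the real helper in ReadsSparkSink handles every sort order and edge cases such as unmapped reads):

import htsjdk.samtools.SAMRecord;
import org.apache.spark.api.java.JavaRDD;

final class CoordinateSortSketch {
    // Hypothetical sketch only; not the actual ReadsSparkSink implementation.
    static JavaRDD<SAMRecord> sortToCoordinateOrder(final JavaRDD<SAMRecord> reads, final int numReducers) {
        final int numPartitions = numReducers > 0 ? numReducers : reads.getNumPartitions();
        return reads.sortBy(
                // Pack (referenceIndex, alignmentStart) into a single long sort key;
                // unmapped reads (reference index -1) sort first in this simplified version.
                read -> (((long) read.getReferenceIndex()) << 32) | (read.getAlignmentStart() & 0xFFFFFFFFL),
                true,           // ascending
                numPartitions); // number of output partitions
    }
}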

jamesemery (Collaborator, Author):

@lbergelson @droazen This branch has been brought up to speed with master now; I need somebody to review it, however...

codecov-io commented Feb 8, 2019:

Codecov Report

Merging #4874 into master will increase coverage by 0.005%.
The diff coverage is 90.909%.

@@               Coverage Diff               @@
##              master     #4874       +/-   ##
===============================================
+ Coverage     87.044%   87.049%   +0.005%     
- Complexity     31693     31697        +4     
===============================================
  Files           1938      1938               
  Lines         146041    146097       +56     
  Branches       16124     16128        +4     
===============================================
+ Hits          127120    127176       +56     
  Misses         13036     13036               
  Partials        5885      5885
Impacted Files Coverage Δ Complexity Δ
...nder/tools/spark/pathseq/PathSeqPipelineSpark.java 81.25% <ø> (ø) 7 <0> (ø) ⬇️
...lbender/tools/spark/pathseq/PathSeqScoreSpark.java 57.407% <ø> (ø) 7 <0> (ø) ⬇️
...llbender/engine/spark/DataprocIntegrationTest.java 1.786% <0%> (+0.062%) 1 <0> (ø) ⬇️
...hellbender/tools/spark/pipelines/SortSamSpark.java 100% <100%> (ø) 5 <0> (-1) ⬇️
...ellbender/tools/spark/pathseq/PathSeqBwaSpark.java 67.391% <100%> (ø) 7 <0> (ø) ⬇️
...ls/ExtractOriginalAlignmentRecordsByNameSpark.java 90.909% <100%> (ø) 10 <0> (ø) ⬇️
...stitute/hellbender/tools/spark/RevertSamSpark.java 83.895% <100%> (ø) 86 <0> (ø) ⬇️
...institute/hellbender/tools/spark/bwa/BwaSpark.java 78.947% <100%> (+1.17%) 7 <0> (ø) ⬇️
...stitute/hellbender/engine/spark/GATKSparkTool.java 82.418% <100%> (ø) 78 <1> (ø) ⬇️
...nder/tools/spark/pipelines/ReadsPipelineSpark.java 90.741% <100%> (ø) 13 <0> (ø) ⬇️
... and 12 more

lbergelson (Member) left a comment:

I think you have a bug, and a few random comments.

-    writeReads(ctx, outputFile, referenceFile, reads, header, format, 0, null);
+    // NOTE, we must include 'format==ReadsWriteFormat.SINGLE' to preserve the old default behavior for writing spark output
+    // which would not sort the bam if outputting to ReadsWriteFormat.SINGLE. Please use the overload for different sorting behavior.
+    writeReads(ctx, outputFile, referenceFile, reads, header, format, 0, null, format==ReadsWriteFormat.SINGLE);
lbergelson (Member):

I'd like to remove this particular weirdness. How hard is it to just eliminate this behavior?

lbergelson (Member):

I think the only remaining use is in BwaSpark, and that could be changed to just always sort.

lbergelson (Member):

Or to take a parameter to sort, I guess.

lbergelson (Member):

Either way...

-        final SAMFileHeader header, ReadsWriteFormat format, final int numReducers, final String outputPartsDir) throws IOException {
-    writeReads(ctx, outputFile, referenceFile, reads, header, format, numReducers, outputPartsDir, true, true);
+        final SAMFileHeader header, ReadsWriteFormat format, final int numReducers, final String outputPartsDir, final boolean sortReadsToHeader) throws IOException {
+    writeReads(ctx, outputFile, referenceFile, reads, header, format, numReducers, outputPartsDir, true, true, true);
lbergelson (Member):

This seems like a bug... you added a boolean parameter to the method and then set it to always be true.
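
Presumably the intended fix inside ReadsSparkSink is to forward the caller's flag rather than hard-coding it; a minimal sketch under that assumption:

public static void writeReads(
        final JavaSparkContext ctx, final String outputFile, final String referenceFile,
        final JavaRDD<GATKRead> reads, final SAMFileHeader header, final ReadsWriteFormat format,
        final int numReducers, final String outputPartsDir, final boolean sortReadsToHeader) throws IOException {
    // Forward sortReadsToHeader instead of a hard-coded true; the two preceding
    // booleans are the writeBai/writeSbi defaults seen elsewhere in this diff.
    writeReads(ctx, outputFile, referenceFile, reads, header, format, numReducers,
            outputPartsDir, true, true, sortReadsToHeader);
}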

@@ -48,9 +48,6 @@ private void shutdownMiniCluster() {
 public Object[][] loadReadsBAM() {
     return new Object[][]{
             {testDataDir + "tools/BQSR/HiSeq.1mb.1RG.2k_lines.bam", "ReadsSparkSinkUnitTest1", null, ".bam", true, true},
-            {testDataDir + "tools/BQSR/HiSeq.1mb.1RG.2k_lines.bam", "ReadsSparkSinkUnitTest1", null, ".bam", true, false}, // write BAI, don't write SBI
lbergelson (Member):

Did you mean to delete these test cases?

@@ -149,7 +146,7 @@ private void assertSingleShardedWritingWorks(String inputBam, String referenceFi
     JavaRDD<GATKRead> rddParallelReads = readSource.getParallelReads(inputBam, referenceFile);
     SAMFileHeader header = readSource.getHeader(inputBam, referenceFile);

-    ReadsSparkSink.writeReads(ctx, outputPath, referenceFile, rddParallelReads, header, ReadsWriteFormat.SINGLE, 0, outputPartsPath, writeBai, writeSbi);
+    ReadsSparkSink.writeReads(ctx, outputPath, referenceFile, rddParallelReads, header, ReadsWriteFormat.SINGLE, 0, outputPartsPath, writeBai, writeSbi, true);
lbergelson (Member):

You should have a test case for writing unsorted reads in this class, I think.

}
final File outBam = GATKBaseTest.createTempFile("print_reads", ".bam");
ArgumentsBuilder args = new ArgumentsBuilder();
args.add("--" + StandardArgumentDefinitions.INPUT_LONG_NAME);
lbergelson (Member):

We have convenience methods for this.
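
The convenience methods presumably meant here are ArgumentsBuilder's addInput/addOutput-style helpers (a hedged example; exact method names may vary across GATK versions):

final ArgumentsBuilder args = new ArgumentsBuilder();
// Assumed helpers that wrap the standard --input/--output arguments:
args.addInput(inBam);    // replaces args.add("--" + StandardArgumentDefinitions.INPUT_LONG_NAME) plus the value
args.addOutput(outBam);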

jamesemery (Collaborator, Author):

@lbergelson Responded to your comments. I left the behavior in BwaSpark for sharded writing, as I suspect that was the specific use case for which it was introduced. Back to you...

jamesemery force-pushed the je_parameterizeReadsSparkSourceSort branch from a874279 to 34c8e34 on February 11, 2019 16:59
-    writeReads(ctx, outputFile, referenceFile, reads, header, format, 0, null);
+    // NOTE, we must include 'format==ReadsWriteFormat.SINGLE' to preserve the old default behavior for writing spark output
+    // which would not sort the bam if outputting to ReadsWriteFormat.SINGLE. Please use the overload for different sorting behavior.
+    writeReads(ctx, outputFile, referenceFile, reads, header, format, 0, null, true);
lbergelson (Member):

If you think this behavior is important to BwaSpark, push it down there. This is going to bite us in the future if we leave it in.

lbergelson (Member):

Oh, you did, you just forgot the comment.

@@ -74,7 +74,7 @@ protected void runTool(final JavaSparkContext ctx) {
     }
     try {
         ReadsSparkSink.writeReads(ctx, output, null, reads, bwaEngine.getHeader(),
-                shardedOutput ? ReadsWriteFormat.SHARDED : ReadsWriteFormat.SINGLE);
+                shardedOutput ? ReadsWriteFormat.SHARDED : ReadsWriteFormat.SINGLE, getRecommendedNumReducers(), shardedPartsDir, shardedOutput);
lbergelson (Member):

Maybe comment here so people know that something weird is going on?
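
The requested note might read something like the following (hypothetical wording; the actual comment lands in a later push):

// NOTE: the sort flag is deliberately tied to shardedOutput here to preserve
// BwaSpark's historical output behavior. (Hypothetical wording.)
ReadsSparkSink.writeReads(ctx, output, null, reads, bwaEngine.getHeader(),
        shardedOutput ? ReadsWriteFormat.SHARDED : ReadsWriteFormat.SINGLE,
        getRecommendedNumReducers(), shardedPartsDir, shardedOutput);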

jamesemery (Collaborator, Author):

@lbergelson Added a comment to BwaSpark. Can this be merged?

lbergelson (Member) left a comment:

👍

jamesemery dismissed tomwhite's stale review February 12, 2019 14:33: "It's been approved by Louis."

jamesemery merged commit 9c22c34 into master on Feb 12, 2019
jamesemery deleted the je_parameterizeReadsSparkSourceSort branch on February 12, 2019 14:34