Replace Hadoop-BAM with Disq #5138
Preview of the changes needed to use Disq.
There are currently two failing tests, both of which need fixes in htsjdk.
@tomwhite After spending some time searching for this feature for my testing purposes, I think it would be helpful to expose the NIO adapter toggle directly from the command line in this branch.
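For reference, a minimal sketch of what such a toggle could look like as a Barclay argument; the flag name "use-nio" and its placement are my assumptions for illustration, not this branch's actual API:

```java
import org.broadinstitute.barclay.argparser.Argument;

// Hypothetical sketch: expose the NIO adapter toggle as a command-line flag.
// The name "use-nio" and the default value are assumptions, not the PR's API.
public abstract class NioToggleSketch {
    @Argument(fullName = "use-nio",
            doc = "If true, use the NIO adapter to access files directly rather than the Hadoop filesystem layer.",
            optional = true)
    public boolean useNio = false;
}
```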
Force-pushed from 0be0976 to e7afd79.
htsjdk and the Disq snapshot have been updated, and the previously failing tests now pass.
Codecov Report
@@             Coverage Diff              @@
##             master     #5138       +/-   ##
================================================
- Coverage    86.988%   73.113%   -13.875%
+ Complexity    31224     24503      -6721
================================================
  Files          1914      1813       -101
  Lines        144264    134638      -9626
  Branches      15956     14915      -1041
================================================
- Hits         125492     98438     -27054
- Misses        13003     31489     +18486
+ Partials       5769      4711      -1058
================================================
Generally speaking the changes are good and clear up some of the clutter in our Spark methods. Unfortunately, some unrelated changes are bundled into this branch, which makes it difficult to evaluate properly. I will take a closer look at the changes made in the GVCF code soon.
// The underlying reads are required to be in SAMRecord format in order to be
// written out, so we convert them to SAMRecord explicitly here. If they're already
// SAMRecords, this will effectively be a no-op. The SAMRecords will be headerless
// for efficient serialization.
// TODO: add header here
TODO
Done
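For clarity, here is a minimal sketch of the conversion the comment above describes, assuming GATKRead's convertToSAMRecord(SAMFileHeader) method; the helper name is hypothetical:

```java
import htsjdk.samtools.SAMRecord;
import org.apache.spark.api.java.JavaRDD;
import org.broadinstitute.hellbender.utils.read.GATKRead;

// Hypothetical helper: convert GATKReads to SAMRecords for writing. Passing a
// null header keeps the records headerless, so they serialize compactly; reads
// already backed by a SAMRecord convert in effectively a no-op.
static JavaRDD<SAMRecord> toHeaderlessSamRecords(final JavaRDD<GATKRead> reads) {
    return reads.map(read -> read.convertToSAMRecord(null));
}
```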
final JavaSparkContext ctx, final String outputFile, final String referenceFile, final JavaRDD<SAMRecord> reads,
final SAMFileHeader header, final int numReducers, final WriteOption... writeOptions) throws IOException {

    final JavaRDD<SAMRecord> sortedReads = sortSamRecordsToMatchHeader(reads, header, numReducers);
Did this not break tests? There are tests in the codebase right now that forcing a sort on every sharded output should and does break; #4874 ran into this problem.
Additionally, something should be written to stdout before and after this stage: when we added the sort, we found that it turned the final Spark output into an opaque step that appeared to produce no output.
A line like logger.info("Finished sorting the bam file and dumping read shards to disk, proceeding to merge the shards into a single file using the master thread");
from the old incarnation of this method should be preserved in the appropriate case.
Though on second thought this may or may not be necessary, given that we stay within Spark for the next several stages...
No tests have been removed, and they all pass. Do you think there is another case that needs covering?
We can add logging to Disq for some of these operations.
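To make the discussion concrete, here is a hedged sketch of what a sortSamRecordsToMatchHeader along these lines might do; the key functions are illustrative assumptions, not this branch's implementation:

```java
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;
import org.apache.spark.api.java.JavaRDD;

// Sketch: sort an RDD of SAMRecords to match the sort order declared in the header.
static JavaRDD<SAMRecord> sortToMatchHeader(final JavaRDD<SAMRecord> reads,
                                            final SAMFileHeader header,
                                            final int numReducers) {
    switch (header.getSortOrder()) {
        case coordinate:
            // key by (reference index, alignment start); packing both into one long
            // is an illustrative shortcut (unmapped reads would need special casing)
            return reads.sortBy(r -> ((long) r.getReferenceIndex() << 32) | r.getAlignmentStart(),
                    true, numReducers);
        case queryname:
            return reads.sortBy(SAMRecord::getReadName, true, numReducers);
        default:
            return reads; // unsorted/unknown: leave the order alone
    }
}
```

A logger.info call before and after the sort, as suggested above, would restore the visibility the old method had.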
}
else {
if (null == referenceName) {
static String checkCramReference(final JavaSparkContext ctx, final String filePath, final String referencePath) {
This needs a more informative javadoc that enumerates which cases are accepted and which aren't.
Done
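As an example of the enumeration being asked for, here is a hedged javadoc sketch; the accepted and rejected cases listed are my assumptions about typical CRAM handling, not necessarily the final wording:

```java
/**
 * Checks that the reference supplied for a CRAM input is usable on Spark.
 *
 * Accepted: a non-CRAM input (no reference required), or a CRAM input with a
 * fasta reference readable by every executor (e.g. a local or NIO path).
 * Rejected: a CRAM input with no reference, or a CRAM input with a .2bit
 * reference; these throw a UserException.
 *
 * @return the validated reference path, or null when none is required
 */
static String checkCramReference(final JavaSparkContext ctx, final String filePath, final String referencePath) {
    // validation logic elided in this sketch
    return referencePath;
}
```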
if (outputFile.endsWith(IOUtil.BCF_FILE_EXTENSION) || outputFile.endsWith(IOUtil.BCF_FILE_EXTENSION + ".gz")) {
    throw new UserException.UnimplementedFeature("It is currently not possible to write a BCF file on spark. See https://github.com/broadinstitute/gatk/issues/4303 for more details.");
}

if (outputFile.endsWith(BGZFCodec.DEFAULT_EXTENSION) || outputFile.endsWith(".gz")) {
Are we no longer handling vcf.gz extension files?
We do; this case is now handled in Disq. See e.g. https://github.com/disq-bio/disq/blob/master/src/test/java/org/disq_bio/disq/HtsjdkVariantsRddTest.java#L32.
@@ -65,16 +59,13 @@ public VariantsSparkSource(JavaSparkContext ctx) {
 * @return JavaRDD<VariantContext> of variants from all files.
 */
public JavaRDD<VariantContext> getParallelVariantContexts(final String vcf, final List<SimpleInterval> intervals) {
Configuration conf = new Configuration();
conf.setStrings("io.compression.codecs", BGZFEnhancedGzipCodec.class.getCanonicalName(),
These codecs? Where did they go?
Handled by Disq.
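For context, this is roughly what the Disq read path looks like in place of the codec configuration, assuming Disq's HtsjdkVariantsRddStorage API as exercised by the test linked above:

```java
import htsjdk.variant.variantcontext.VariantContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.disq_bio.disq.HtsjdkVariantsRdd;
import org.disq_bio.disq.HtsjdkVariantsRddStorage;

// Sketch: Disq detects plain, gzipped, and BGZF-compressed VCFs itself, so the
// explicit io.compression.codecs Hadoop configuration is no longer needed here.
static JavaRDD<VariantContext> readVariants(final JavaSparkContext ctx, final String vcf) throws java.io.IOException {
    final HtsjdkVariantsRdd variantsRdd = HtsjdkVariantsRddStorage.makeDefault(ctx).read(vcf);
    return variantsRdd.getVariants();
}
```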
@@ -152,7 +153,7 @@ private void assertSingleShardedWritingWorks(String inputBam, String referenceFi

// check that a splitting bai file is created
if (IOUtils.isBamFileName(outputPath)) {
    Assert.assertTrue(Files.exists(IOUtils.getPath(outputPath + SplittingBAMIndexer.OUTPUT_FILE_EXTENSION)));
    //Assert.assertTrue(Files.exists(IOUtils.getPath(outputPath + SBIIndex.FILE_EXTENSION)));
Commented-out code should probably be removed.
I opened disq-bio/disq#45 to do this. It shouldn't block this being merged though.
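Once that lands, the commented-out assertion above could presumably be re-enabled along these lines, using htsjdk's SBIIndex.FILE_EXTENSION (".sbi"); a sketch contingent on disq-bio/disq#45:

```java
// check that an .sbi splitting index is created alongside the BAM
if (IOUtils.isBamFileName(outputPath)) {
    Assert.assertTrue(Files.exists(IOUtils.getPath(outputPath + SBIIndex.FILE_EXTENSION)));
}
```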
import java.util.*;

public class GVCFBlockMergingIterator extends PushToPullIterator<VariantContext> {
I would add a comment here making explicit that you are reusing PushToPullIterator, which talks about downsampling in its comments and indeed appears to have that hardwired into its behavior. Or I would make the commenting on PushToPullIterator more generic (referring to downsampling as an example) to avoid confusion.
I've reworked the javadoc in #5311 to address this.
We should pull out the GVCFBlockMergingIterator refactor here into another PR so it can be given its own review.
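For reviewers of that future PR, the composition at play is roughly the following; the constructor shapes are my assumptions for illustration, not the branch's exact signatures:

```java
import htsjdk.variant.variantcontext.VariantContext;
import java.util.Iterator;
import java.util.List;

// Sketch: the iterator pushes variants through a GVCFBlockCombiner (a
// PushPullTransformer) and pulls merged GVCF blocks back out on demand.
public class GVCFBlockMergingIterator extends PushToPullIterator<VariantContext> {
    public GVCFBlockMergingIterator(final Iterator<VariantContext> variants,
                                    final List<Integer> gqPartitions,
                                    final int defaultPloidy) {
        super(variants, new GVCFBlockCombiner(gqPartitions, defaultPloidy));
    }
}
```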
import static htsjdk.variant.vcf.VCFConstants.MAX_GENOTYPE_QUAL;
import static org.broadinstitute.hellbender.utils.variant.writers.GVCFWriter.GVCF_BLOCK;

public final class GVCFBlockCombiner implements PushPullTransformer<VariantContext> {
Could you actually spin all of the push/pull iterator code into a separate branch? It deserves a separate PR from the rest of the code in this branch.
Agreed - the push/pull stuff needs to be reviewed separately.
Opened #5311 for the push/pull part.
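For anyone following along, the contract being reviewed there looks roughly like this; the method set is my understanding from GATK's downsampler transformers, so treat it as an assumption:

```java
import java.util.List;

// Sketch of the push/pull contract that GVCFBlockCombiner implements:
// items are pushed in, buffered and combined, then pulled out in batches.
public interface PushPullTransformer<T> {
    void submit(T item);             // push one item into the transformer
    boolean hasFinalizedItems();     // true once combined items are ready
    List<T> consumeFinalizedItems(); // pull the ready items, clearing the buffer
    void signalEndOfInput();         // flush any trailing partial state
}
```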
These changes look good to me.
Force-pushed from b9d467a to b64ce48.
…e-sorted, but it was not. Used 'sort -k1,1 -s' to sort by QNAME field.
Yay!