
Add strict mode to AssemblyRegionWalkerSpark and HaplotypeCallerSpark #5416

Merged: 9 commits merged into master on Dec 5, 2018

Conversation

tomwhite
Contributor

There are tests for both the strict and non-strict modes for HaplotypeCallerSpark and for ExampleAssemblyRegionWalkerSpark. The strict modes produce identical results to the walker versions.

Strict mode is off by default and still needs work to scale to exome-sized data (another PR). This is an improvement over the current HaplotypeCallerSpark implementation since it fixes two bugs (reads overlapping more than two intervals; and editIntervals not being picked up) that caused the output to differ significantly from the walker version. While the output does still differ in non-strict mode, the difference is a lot less (compare expected.testVCFMode.gatk4.vcf and expected.testVCFMode.gatk4.nonstrict.vcf).
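For context on the first bug above (reads overlapping more than two intervals): a read spanning many shard intervals has to be assigned to every interval it overlaps, not just the first and last. The following is a minimal, hypothetical illustration of that assignment, not the actual SparkSharder code; the types and names are stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustration (not the SparkSharder implementation): a read must be
// assigned to every interval it overlaps, however many that is.
final class OverlapSketch {
    static final class Interval {
        final int start;
        final int end;
        Interval(int start, int end) { this.start = start; this.end = end; }
        boolean overlaps(int readStart, int readEnd) {
            // Closed-interval overlap test.
            return readStart <= end && readEnd >= start;
        }
    }

    /** Return every interval that the read [readStart, readEnd] overlaps. */
    static List<Interval> overlapping(List<Interval> intervals, int readStart, int readEnd) {
        List<Interval> hits = new ArrayList<>();
        for (Interval interval : intervals) {
            if (interval.overlaps(readStart, readEnd)) {
                hits.add(interval);
            }
        }
        return hits;
    }
}
```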

JavaRDD<AssemblyRegion> assemblyRegions = assemblyRegionShardedReads.map((Function<Shard<GATKRead>, AssemblyRegion>) shard -> toAssemblyRegion(shard, header));

// 7. Add reference and feature context.
return assemblyRegions.mapPartitions(getAssemblyRegionWalkerContextFunction(referenceFileName, bFeatureManager));
Contributor Author

These 7 steps make up the salient part of the strict algorithm.
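As a rough, hypothetical illustration of the shape of the final two steps quoted above (one assembly region per shard, then per-partition attachment of reference and feature context), here is a simplified Spark sketch. Only the Spark API calls (map, mapPartitions) are real; the String placeholders stand in for Shard<GATKRead>, AssemblyRegion and AssemblyRegionWalkerContext, and this is not the actual GATK code.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public final class StrictPipelineShapeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("strict-pipeline-shape").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Stand-in for the sharded reads produced by the earlier steps.
            JavaRDD<String> shards = sc.parallelize(Arrays.asList("shard1", "shard2", "shard3"));

            // Step 6 (analogue): one assembly region per shard.
            JavaRDD<String> assemblyRegions =
                    shards.map((Function<String, String>) shard -> "region(" + shard + ")");

            // Step 7 (analogue): add reference/feature context per partition, so any
            // per-partition setup (e.g. opening a reference source) happens once.
            JavaRDD<String> contexts = assemblyRegions.mapPartitions(
                    (FlatMapFunction<Iterator<String>, String>) it -> {
                        List<String> out = new ArrayList<>();
                        it.forEachRemaining(region -> out.add("context(" + region + ")"));
                        return out.iterator();
                    });

            contexts.collect().forEach(System.out::println);
        }
    }
}
```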

@tomwhite force-pushed the tw_strict_assembly_region_walker_spark_hc branch from 9deb813 to 4a7e67c on November 16, 2018 07:52
codecov-io commented Nov 16, 2018

Codecov Report

Merging #5416 into master will increase coverage by 0.032%.
The diff coverage is 86.755%.

@@               Coverage Diff               @@
##              master     #5416       +/-   ##
===============================================
+ Coverage     86.984%   87.016%   +0.032%     
- Complexity     31208     31272       +64     
===============================================
  Files           1909      1914        +5     
  Lines         144202    144440      +238     
  Branches       15954     15982       +28     
===============================================
+ Hits          125433    125686      +253     
+ Misses         12999     12963       -36     
- Partials        5770      5791       +21
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| .../hellbender/engine/spark/SparkSharderUnitTest.java | 93.846% <100%> (+0.926%) | 8 <1> (+1) ⬆️ |
| ...bender/engine/spark/AssemblyRegionWalkerSpark.java | 100% <100%> (+22.917%) | 13 <0> (-2) ⬇️ |
| ...nstitute/hellbender/engine/spark/SparkSharder.java | 90.789% <100%> (-0.367%) | 28 <3> (-3) |
| ...der/tools/HaplotypeCallerSparkIntegrationTest.java | 68.293% <100%> (+9.563%) | 16 <5> (+4) ⬆️ |
| ...nder/tools/spark/pipelines/ReadsPipelineSpark.java | 90.741% <100%> (+0.175%) | 13 <0> (ø) ⬇️ |
| ...nstitute/hellbender/engine/ShardBoundaryShard.java | 100% <100%> (ø) | 5 <1> (+1) ⬆️ |
| ...ampleAssemblyRegionWalkerSparkIntegrationTest.java | 100% <100%> (+88.889%) | 4 <1> (+2) ⬆️ |
| ...tute/hellbender/engine/ReadlessAssemblyRegion.java | 46.667% <46.667%> (ø) | 4 <4> (?) |
| ...stitute/hellbender/engine/spark/GATKSparkTool.java | 82.432% <60%> (-0.444%) | 68 <4> (ø) |
| ...stitute/hellbender/tools/HaplotypeCallerSpark.java | 82.759% <69.231%> (-5.703%) | 20 <1> (-2) |

... and 20 more

@jamesemery (Collaborator) left a comment

Some comments for now. I think we really need to think long and hard about the downsampling, as I suspect it's a key source of the differences you have observed between Spark and the base tool.


// We wrap our LocusIteratorByState inside an IntervalAlignmentContextIterator so that we get empty loci
// for uncovered locations. This is critical for reproducing GATK 3.x behavior!
LocusIteratorByState libs = new LocusIteratorByState(readShard.iterator(), DownsamplingMethod.NONE, false, ReadUtils.getSamplesFromHeader(readHeader), readHeader, includeReadsWithDeletionsInIsActivePileups);
Collaborator

DownsamplingMethod.NONE doesn't seem right here. In the AssemblyRegionWalker we use the PositionalDownsampler over each shard as we do the band-pass filtering, i.e. at the stage where the active regions are developed. Perhaps we should look into whether there is a way to keep the downsampling consistent between the two approaches, especially if we are creating assembly regions with no reads and adding them back in later. Perhaps this accounts for some of the difference.

Contributor Author

ActivityProfileStateIterator is the part of AssemblyRegionIterator that just does the iteration over activity profiles. It's the same code, and should arguably be refactored to share it (e.g. AssemblyRegionIterator might compose ActivityProfileStateIterator and AssemblyRegionFromActivityProfileStateIterator), although it's not that simple, since AssemblyRegionIterator does read caching, which we don't want in the Spark case.

Also, I don't think we see downsampling kicking in for the tests, so it's probably not the reason for the difference in the tests. That seems to come from read shard boundary artifacts.

// TODO: interfaces could be improved to avoid casting
ReadlessAssemblyRegion readlessAssemblyRegion = (ReadlessAssemblyRegion) ((ShardBoundaryShard<GATKRead>) shard).getShardBoundary();
int extension = Math.max(shard.getInterval().getStart() - shard.getPaddedInterval().getStart(), shard.getPaddedInterval().getEnd() - shard.getInterval().getEnd());
AssemblyRegion assemblyRegion = new AssemblyRegion(shard.getInterval(), Collections.emptyList(), readlessAssemblyRegion.isActive(), extension, header);
Collaborator

I see the issue with this interface... It looks like you may be able to make the regular AssemblyRegion also extend ShardBoundaryShard. I'm not sure that helps anything really... but then at least AssemblyRegion and ReadlessAssemblyRegion would be related in the class hierarchy.

Collaborator

That is an abstract interface though, of course.

Contributor Author

Thanks for the suggestion. I need to spend a bit more time trying to improve this. (Could be done in another issue or PR.)


return Utils.stream(shardedReadIterator)
.map(shardedRead -> new ShardToMultiIntervalShardAdapter<>(shardedRead))
// TODO: reinstate downsampling (not yet working)
Collaborator

We should think about how we want to do this. Notably, however, as of this implementation you would need to apply the same downsampling to the reads at step 5, since it is currently only hooked up to downsample for the band-pass filtering, not for the reads when they get re-added later. One obvious solution is persisting the underlying BAM, but that's expensive; another is to make the random seed calculable in some way, e.g. basing it on a hash of the read if necessary, so that we could get an exact match with the regular HaplotypeCaller.
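A minimal sketch of the "calculable seed" idea, assuming a simple fraction-based keep/drop decision rather than GATK's actual to-coverage downsampling; all names here are illustrative, not GATK API. The point is that the decision depends only on the read itself, so the walker and Spark code paths would make identical choices without sharing a Random instance.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.CRC32;

// Hypothetical, simplified downsampler whose decisions are deterministic per read.
final class DeterministicDownsampler {
    private final double keepFraction;

    DeterministicDownsampler(double keepFraction) {
        this.keepFraction = keepFraction;
    }

    /** Decide whether to keep a read, seeding the RNG from the read name. */
    boolean keep(String readName) {
        CRC32 crc = new CRC32();
        crc.update(readName.getBytes(StandardCharsets.UTF_8));
        // Same read name -> same seed -> same decision on every executor and
        // in the non-Spark walker, with no shared state required.
        Random rng = new Random(crc.getValue());
        return rng.nextDouble() < keepFraction;
    }
}
```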

Contributor Author

The reason I excluded this is that I found an underlying bug in the downsampling code where it would not downsample correctly unless the iterator was created twice. (This needs further investigation.)

But you are right that downsampling needs to be carried out for the reads too, which isn't currently happening in this code. Let's explore your ideas further.

* A cut-down version of {@link AssemblyRegion} that doesn't store reads, used in the strict implementation of
* {@link org.broadinstitute.hellbender.engine.spark.FindAssemblyRegionsSpark}.
*/
public class ReadlessAssemblyRegion extends ShardBoundary {
Collaborator

There should probably be a common interface linking this to AssemblyRegion, or a different name distinguishing it.
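One hypothetical shape for such a common interface (all names here are assumptions for illustration, not the GATK types): both AssemblyRegion and ReadlessAssemblyRegion would expose the interval and the activity flag, and the read-free variant would simply omit read storage.

```java
// Hypothetical common interface; names are illustrative, not GATK classes.
interface ActiveRegionSpan {
    boolean isActive();
    String getContig();
    int getStart();
    int getEnd();
}

// A read-free implementation carries only the interval and the activity flag.
final class ReadlessRegionSketch implements ActiveRegionSpan {
    private final String contig;
    private final int start;
    private final int end;
    private final boolean active;

    ReadlessRegionSketch(String contig, int start, int end, boolean active) {
        this.contig = contig;
        this.start = start;
        this.end = end;
        this.active = active;
    }

    @Override public boolean isActive() { return active; }
    @Override public String getContig() { return contig; }
    @Override public int getStart() { return start; }
    @Override public int getEnd() { return end; }
}
```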

Contributor Author

Agreed.

@tomwhite (Contributor, Author)

Thanks for the review @jamesemery. I've addressed your comments. A few outstanding issues:

  • ActivityProfileStateIterator and AssemblyRegionFromActivityProfileStateIterator duplicate parts of AssemblyRegionIterator, so it would be nice to remove the code duplication. Not totally straightforward as the latter does read caching, but the first two don't (Spark shouldn't be caching reads).
  • Downsampling needs more work. I would be OK doing that separately, since I've only ever seen it kick in when running on a full genome, and the new strict code needs more work to scale to a full genome (I've only got it running on an exome so far).
  • There are some improvements we could make to ReadlessAssemblyRegion regarding Java interface design and generics, but I'm not sure what they are yet.

I'm not sure if these are blockers, since the strict codepath is a new option (off by default), but would like to know what you and @jonn-smith think.

@jonn-smith (Collaborator) left a comment

I have some very minor comments. Most of the meat has already been addressed from what I can tell.

This will be a very good starting point for HaplotypeCallerSpark correctness.

assemblyRegionArgs.maxProbPropagationDistance, includeReadsWithDeletionsInIsActivePileups);
return Utils.stream(assemblyRegionIter).map(assemblyRegion ->
new AssemblyRegionWalkerContext(assemblyRegion,
new ReferenceContext(reference, assemblyRegion.getExtendedSpan()),
Collaborator

This is actually the same code that was running before, so it really should have been commented already...

It looks like the padding is getting passed in via assemblyRegionArgs.assemblyRegionPadding when creating assemblyRegionIter, so I don't think that it's a technical issue (the span without padding would be retrieved with assemblyRegion.getSpan()).
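A tiny illustration of the unpadded span versus the extended (padded) span being discussed above; these are stand-in types and field names, not the GATK AssemblyRegion API.

```java
// Hypothetical stand-in showing the relationship between the unpadded span
// (analogous to getSpan()) and the extended span (analogous to getExtendedSpan()).
final class RegionSpansSketch {
    final int start;
    final int end;
    final int extendedStart;
    final int extendedEnd;

    RegionSpansSketch(int start, int end, int padding, int contigLength) {
        this.start = start;
        this.end = end;
        // The extended span adds the configured padding on both sides,
        // clamped to the contig boundaries.
        this.extendedStart = Math.max(1, start - padding);
        this.extendedEnd = Math.min(contigLength, end + padding);
    }

    public static void main(String[] args) {
        RegionSpansSketch r = new RegionSpansSketch(1_000, 1_200, 100, 10_000);
        System.out.printf("span: %d-%d, extended span: %d-%d%n",
                r.start, r.end, r.extendedStart, r.extendedEnd);
    }
}
```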

@tomwhite force-pushed the tw_strict_assembly_region_walker_spark_hc branch from e4bd5da to b7eeac9 on December 3, 2018 11:47
@tomwhite (Contributor, Author) commented Dec 3, 2018

Thanks for the review @jonn-smith. I've addressed all your comments, and those of @jamesemery. Please approve so we can merge.

@jamesemery (Collaborator) left a comment

Considering that this is a beta tool anyway, the impact is relatively low and it is important to benchmark our progress. I do think running this over a full genome should be tried.

@tomwhite (Contributor, Author) commented Dec 3, 2018

Thanks @jamesemery. I've successfully run with an exome in #5475; I'll try with a genome separately.

@tomwhite force-pushed the tw_strict_assembly_region_walker_spark_hc branch from b7eeac9 to 2fcc4a7 on December 4, 2018 10:30
@tomwhite merged commit 1f6a172 into master on Dec 5, 2018
@tomwhite deleted the tw_strict_assembly_region_walker_spark_hc branch on December 5, 2018 09:47