
Add strict mode to AssemblyRegionWalkerSpark and HaplotypeCallerSpark #5416

Merged: 9 commits merged into master on Dec 5, 2018

Conversation

tomwhite
Contributor

There are tests for both the strict and non-strict modes for HaplotypeCallerSpark and for ExampleAssemblyRegionWalkerSpark. The strict modes produce identical results to the walker versions.

Strict mode is off by default and still needs work to scale to exome-sized data (another PR). This is an improvement over the current HaplotypeCallerSpark implementation since it fixes two bugs (reads overlapping more than two intervals; and editIntervals not being picked up) that caused the output to differ significantly from the walker version. While the output does still differ in non-strict mode, the difference is a lot less (compare expected.testVCFMode.gatk4.vcf and expected.testVCFMode.gatk4.nonstrict.vcf).
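For context on the first bug above (reads overlapping more than two intervals): a read spanning many shard intervals has to be assigned to every interval it overlaps, not just the first and last. The following is a minimal, hypothetical illustration of that assignment, not the actual SparkSharder code; the types and names are stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustration (not the SparkSharder implementation): a read must be
// assigned to every interval it overlaps, however many that is.
final class OverlapSketch {
    static final class Interval {
        final int start;
        final int end;
        Interval(int start, int end) { this.start = start; this.end = end; }
        boolean overlaps(int readStart, int readEnd) {
            // Closed-interval overlap test.
            return readStart <= end && readEnd >= start;
        }
    }

    /** Return every interval that the read [readStart, readEnd] overlaps. */
    static List<Interval> overlapping(List<Interval> intervals, int readStart, int readEnd) {
        List<Interval> hits = new ArrayList<>();
        for (Interval interval : intervals) {
            if (interval.overlaps(readStart, readEnd)) {
                hits.add(interval);
            }
        }
        return hits;
    }
}
```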

JavaRDD<AssemblyRegion> assemblyRegions = assemblyRegionShardedReads.map((Function<Shard<GATKRead>, AssemblyRegion>) shard -> toAssemblyRegion(shard, header));

// 7. Add reference and feature context.
return assemblyRegions.mapPartitions(getAssemblyRegionWalkerContextFunction(referenceFileName, bFeatureManager));
Contributor Author

These 7 steps make up the salient part of the strict algorithm.
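As a rough, hypothetical illustration of the shape of the final two steps quoted above (one assembly region per shard, then per-partition attachment of reference and feature context), here is a simplified Spark sketch. Only the Spark API calls (map, mapPartitions) are real; the String placeholders stand in for Shard<GATKRead>, AssemblyRegion and AssemblyRegionWalkerContext, and this is not the actual GATK code.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public final class StrictPipelineShapeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("strict-pipeline-shape").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Stand-in for the sharded reads produced by the earlier steps.
            JavaRDD<String> shards = sc.parallelize(Arrays.asList("shard1", "shard2", "shard3"));

            // Step 6 (analogue): one assembly region per shard.
            JavaRDD<String> assemblyRegions =
                    shards.map((Function<String, String>) shard -> "region(" + shard + ")");

            // Step 7 (analogue): add reference/feature context per partition, so any
            // per-partition setup (e.g. opening a reference source) happens once.
            JavaRDD<String> contexts = assemblyRegions.mapPartitions(
                    (FlatMapFunction<Iterator<String>, String>) it -> {
                        List<String> out = new ArrayList<>();
                        it.forEachRemaining(region -> out.add("context(" + region + ")"));
                        return out.iterator();
                    });

            contexts.collect().forEach(System.out::println);
        }
    }
}
```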

@tomwhite force-pushed the tw_strict_assembly_region_walker_spark_hc branch from 9deb813 to 4a7e67c on November 16, 2018 07:52
codecov-io commented Nov 16, 2018

Codecov Report

Merging #5416 into master will increase coverage by 0.032%.
The diff coverage is 86.755%.

@@               Coverage Diff               @@
##              master     #5416       +/-   ##
===============================================
+ Coverage     86.984%   87.016%   +0.032%     
- Complexity     31208     31272       +64     
===============================================
  Files           1909      1914        +5     
  Lines         144202    144440      +238     
  Branches       15954     15982       +28     
===============================================
+ Hits          125433    125686      +253     
+ Misses         12999     12963       -36     
- Partials        5770      5791       +21
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| .../hellbender/engine/spark/SparkSharderUnitTest.java | 93.846% <100%> (+0.926%) | 8 <1> (+1) ⬆️ |
| ...bender/engine/spark/AssemblyRegionWalkerSpark.java | 100% <100%> (+22.917%) | 13 <0> (-2) ⬇️ |
| ...nstitute/hellbender/engine/spark/SparkSharder.java | 90.789% <100%> (-0.367%) | 28 <3> (-3) |
| ...der/tools/HaplotypeCallerSparkIntegrationTest.java | 68.293% <100%> (+9.563%) | 16 <5> (+4) ⬆️ |
| ...nder/tools/spark/pipelines/ReadsPipelineSpark.java | 90.741% <100%> (+0.175%) | 13 <0> (ø) ⬇️ |
| ...nstitute/hellbender/engine/ShardBoundaryShard.java | 100% <100%> (ø) | 5 <1> (+1) ⬆️ |
| ...ampleAssemblyRegionWalkerSparkIntegrationTest.java | 100% <100%> (+88.889%) | 4 <1> (+2) ⬆️ |
| ...tute/hellbender/engine/ReadlessAssemblyRegion.java | 46.667% <46.667%> (ø) | 4 <4> (?) |
| ...stitute/hellbender/engine/spark/GATKSparkTool.java | 82.432% <60%> (-0.444%) | 68 <4> (ø) |
| ...stitute/hellbender/tools/HaplotypeCallerSpark.java | 82.759% <69.231%> (-5.703%) | 20 <1> (-2) |

... and 20 more

@jamesemery (Collaborator) left a comment

Some comments for now. I think we really need to think long and hard about the downsampling, as I suspect it's a key source of the differences you have observed between Spark and the base tool.


// We wrap our LocusIteratorByState inside an IntervalAlignmentContextIterator so that we get empty loci
// for uncovered locations. This is critical for reproducing GATK 3.x behavior!
LocusIteratorByState libs = new LocusIteratorByState(readShard.iterator(), DownsamplingMethod.NONE, false, ReadUtils.getSamplesFromHeader(readHeader), readHeader, includeReadsWithDeletionsInIsActivePileups);
Collaborator

DownsamplingMethod.NONE doesn't seem right here. In the AssemblyRegionWalker we use the PositionalDownsampler over each shard as we do the band-pass filtering, i.e. at the stage where the active regions are developed. Perhaps we should look into whether there is a way to keep the downsampling consistent between the two approaches, especially if we are creating assembly regions with no reads and adding them back in later. Perhaps this accounts for some of the difference.

Contributor Author

ActivityProfileStateIterator is the part of AssemblyRegionIterator that just does the iteration over activity profiles. It's the same code, and should arguably be refactored to share it (e.g. AssemblyRegionIterator might compose ActivityProfileStateIterator and AssemblyRegionFromActivityProfileStateIterator), although it's not that simple, since AssemblyRegionIterator does read caching, which we don't want in the Spark case.

Also, I don't think we see downsampling kicking in for the tests, so it's probably not the reason for the difference in the tests. That seems to come from read shard boundary artifacts.

// TODO: interfaces could be improved to avoid casting
ReadlessAssemblyRegion readlessAssemblyRegion = (ReadlessAssemblyRegion) ((ShardBoundaryShard<GATKRead>) shard).getShardBoundary();
int extension = Math.max(shard.getInterval().getStart() - shard.getPaddedInterval().getStart(), shard.getPaddedInterval().getEnd() - shard.getInterval().getEnd());
AssemblyRegion assemblyRegion = new AssemblyRegion(shard.getInterval(), Collections.emptyList(), readlessAssemblyRegion.isActive(), extension, header);
Collaborator

I see the issue with this interface... It looks like you may be able to make the regular AssemblyRegion also extend ShardBoundaryShard. I'm not sure that helps anything really... but then at least AssemblyRegion and ReadlessAssemblyRegion would be related in the class hierarchy.

Collaborator

That is an abstract interface though, of course.

Contributor Author

Thanks for the suggestion. I need to spend a bit more time trying to improve this. (Could be done in another issue or PR.)


return Utils.stream(shardedReadIterator)
.map(shardedRead -> new ShardToMultiIntervalShardAdapter<>(shardedRead))
// TODO: reinstate downsampling (not yet working)
Collaborator

We should think about how we want to do this. Notably, however, as of this implementation you would need to apply the same downsampling to the reads at step 5, since it is currently only hooked up to downsample for the band-pass filtering, not for the reads when they get re-added later. One obvious solution is persisting the underlying BAM, but that's expensive; another is to make the random seed calculable in some way, e.g. basing it on a hash of the read if necessary, so that we could get an exact match with the regular HaplotypeCaller.
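A minimal sketch of the "calculable seed" idea, assuming a simple fraction-based keep/drop decision rather than GATK's actual to-coverage downsampling; all names here are illustrative, not GATK API. The point is that the decision depends only on the read itself, so the walker and Spark code paths would make identical choices without sharing a Random instance.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.CRC32;

// Hypothetical, simplified downsampler whose decisions are deterministic per read.
final class DeterministicDownsampler {
    private final double keepFraction;

    DeterministicDownsampler(double keepFraction) {
        this.keepFraction = keepFraction;
    }

    /** Decide whether to keep a read, seeding the RNG from the read name. */
    boolean keep(String readName) {
        CRC32 crc = new CRC32();
        crc.update(readName.getBytes(StandardCharsets.UTF_8));
        // Same read name -> same seed -> same decision on every executor and
        // in the non-Spark walker, with no shared state required.
        Random rng = new Random(crc.getValue());
        return rng.nextDouble() < keepFraction;
    }
}
```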

Contributor Author

The reason I excluded this is that I found an underlying bug in the downsampling code where it would not downsample correctly unless the iterator was created twice. (This needs further investigation.)

But you are right that downsampling needs to be carried out for the reads too, which isn't currently happening in this code. Let's explore your ideas further.

* A cut-down version of {@link AssemblyRegion} that doesn't store reads, used in the strict implementation of
* {@link org.broadinstitute.hellbender.engine.spark.FindAssemblyRegionsSpark}.
*/
public class ReadlessAssemblyRegion extends ShardBoundary {
Collaborator

There should probably be a common interface linking this to AssemblyRegion, or a different name distinguishing it.
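One hypothetical shape for such a common interface (all names here are assumptions for illustration, not the GATK types): both AssemblyRegion and ReadlessAssemblyRegion would expose the interval and the activity flag, and the read-free variant would simply omit read storage.

```java
// Hypothetical common interface; names are illustrative, not GATK classes.
interface ActiveRegionSpan {
    boolean isActive();
    String getContig();
    int getStart();
    int getEnd();
}

// A read-free implementation carries only the interval and the activity flag.
final class ReadlessRegionSketch implements ActiveRegionSpan {
    private final String contig;
    private final int start;
    private final int end;
    private final boolean active;

    ReadlessRegionSketch(String contig, int start, int end, boolean active) {
        this.contig = contig;
        this.start = start;
        this.end = end;
        this.active = active;
    }

    @Override public boolean isActive() { return active; }
    @Override public String getContig() { return contig; }
    @Override public int getStart() { return start; }
    @Override public int getEnd() { return end; }
}
```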

Contributor Author

Agreed.

@tomwhite (Contributor, Author)

Thanks for the review @jamesemery. I've addressed your comments. A few outstanding issues:

  • ActivityProfileStateIterator and AssemblyRegionFromActivityProfileStateIterator duplicate parts of AssemblyRegionIterator, so it would be nice to remove the code duplication. Not totally straightforward as the latter does read caching, but the first two don't (Spark shouldn't be caching reads).
  • Downsampling needs more work. I would be OK doing that separately, since I've only ever seen it kick in when running on a full genome, and the new strict code needs more work to scale to a full genome (I've only got it running on an exome so far).
  • There are some improvements we could make to ReadlessAssemblyRegion regarding Java interface design and generics, but I'm not sure what they are yet.

I'm not sure if these are blockers, since the strict codepath is a new option (off by default), but would like to know what you and @jonn-smith think.

@jonn-smith (Collaborator) left a comment

I have some very minor comments. Most of the meat has already been addressed from what I can tell.

This will be a very good starting point for HaplotypeCallerSpark correctness.

assemblyRegionArgs.maxProbPropagationDistance, includeReadsWithDeletionsInIsActivePileups);
return Utils.stream(assemblyRegionIter).map(assemblyRegion ->
new AssemblyRegionWalkerContext(assemblyRegion,
new ReferenceContext(reference, assemblyRegion.getExtendedSpan()),
Collaborator

This is actually the same code that was running before, so it really should have been commented already...

It looks like the padding is getting passed in via assemblyRegionArgs.assemblyRegionPadding when creating assemblyRegionIter, so I don't think that it's a technical issue (the span without padding would be retrieved with assemblyRegion.getSpan()).
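A tiny illustration of the unpadded span versus the extended (padded) span being discussed above; these are stand-in types and field names, not the GATK AssemblyRegion API.

```java
// Hypothetical stand-in showing the relationship between the unpadded span
// (analogous to getSpan()) and the extended span (analogous to getExtendedSpan()).
final class RegionSpansSketch {
    final int start;
    final int end;
    final int extendedStart;
    final int extendedEnd;

    RegionSpansSketch(int start, int end, int padding, int contigLength) {
        this.start = start;
        this.end = end;
        // The extended span adds the configured padding on both sides,
        // clamped to the contig boundaries.
        this.extendedStart = Math.max(1, start - padding);
        this.extendedEnd = Math.min(contigLength, end + padding);
    }

    public static void main(String[] args) {
        RegionSpansSketch r = new RegionSpansSketch(1_000, 1_200, 100, 10_000);
        System.out.printf("span: %d-%d, extended span: %d-%d%n",
                r.start, r.end, r.extendedStart, r.extendedEnd);
    }
}
```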

@tomwhite force-pushed the tw_strict_assembly_region_walker_spark_hc branch from e4bd5da to b7eeac9 on December 3, 2018 11:47
@tomwhite (Contributor, Author) commented Dec 3, 2018

Thanks for the review @jonn-smith. I've addressed all your comments, and those of @jamesemery. Please approve so we can merge.

@jamesemery (Collaborator) left a comment

Considering that this is a beta tool anyway, the impact is relatively low and it is important to benchmark our progress. I do think running this over a full genome should be tried.

@tomwhite (Contributor, Author) commented Dec 3, 2018

Thanks @jamesemery. I've successfully run with an exome in #5475; I'll try with a genome separately.

@tomwhite force-pushed the tw_strict_assembly_region_walker_spark_hc branch from b7eeac9 to 2fcc4a7 on December 4, 2018 10:30
@tomwhite merged commit 1f6a172 into master on Dec 5, 2018
@tomwhite deleted the tw_strict_assembly_region_walker_spark_hc branch on December 5, 2018 09:47