Add strict mode to AssemblyRegionWalkerSpark and HaplotypeCallerSpark #5416
Conversation
JavaRDD<AssemblyRegion> assemblyRegions = assemblyRegionShardedReads.map((Function<Shard<GATKRead>, AssemblyRegion>) shard -> toAssemblyRegion(shard, header));

// 7. Add reference and feature context.
return assemblyRegions.mapPartitions(getAssemblyRegionWalkerContextFunction(referenceFileName, bFeatureManager));
These 7 steps make up the salient part of the strict algorithm.
Force-pushed from 9deb813 to 4a7e67c.
Codecov Report
@@ Coverage Diff @@
## master #5416 +/- ##
===============================================
+ Coverage 86.984% 87.016% +0.032%
- Complexity 31208 31272 +64
===============================================
Files 1909 1914 +5
Lines 144202 144440 +238
Branches 15954 15982 +28
===============================================
+ Hits 125433 125686 +253
+ Misses 12999 12963 -36
- Partials 5770 5791 +21
Some comments for now. I think we really need to think long and hard about the downsampling, as I suspect it's a key source of the differences you have observed between Spark and the base tool.
// We wrap our LocusIteratorByState inside an IntervalAlignmentContextIterator so that we get empty loci
// for uncovered locations. This is critical for reproducing GATK 3.x behavior!
LocusIteratorByState libs = new LocusIteratorByState(readShard.iterator(), DownsamplingMethod.NONE, false, ReadUtils.getSamplesFromHeader(readHeader), readHeader, includeReadsWithDeletionsInIsActivePileups);
DownsamplingMethod.NONE doesn't seem right here. In the AssemblyRegionWalker we use the PositionalDownsampler over each shard as we do the band-pass filtering, i.e. at the stage where the active regions are developed. Perhaps we should look into whether there is a way to keep the downsampling consistent between the two approaches, especially if we are creating assembly regions with no reads and adding them back in later. Perhaps this accounts for some of the difference.
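For context, positional downsampling caps how many reads sharing an alignment start are retained. A minimal, self-contained sketch of that idea, with reads reduced to their start positions (this is not GATK's PositionalDownsampler, which is streaming and per-sample; names here are made up):

```java
import java.util.*;

// Toy positional downsampler: cap the number of reads retained at each
// alignment start. Only illustrates the idea the comment refers to.
public class PositionalDownsampleSketch {
    public static List<Integer> downsample(List<Integer> readStarts, int maxPerStart) {
        Map<Integer, Integer> countsByStart = new HashMap<>();
        List<Integer> kept = new ArrayList<>();
        for (int start : readStarts) {
            int count = countsByStart.merge(start, 1, Integer::sum);
            if (count <= maxPerStart) {
                kept.add(start);
            }
        }
        return kept;
    }
}
```

Using DownsamplingMethod.NONE as in the snippet above skips this step entirely, which is the inconsistency being pointed out.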
ActivityProfileStateIterator is the part of AssemblyRegionIterator that just does the iteration over activity profiles. It's the same code and should arguably be refactored to share code (e.g. AssemblyRegionIterator might compose ActivityProfileStateIterator and AssemblyRegionFromActivityProfileStateIterator), although it's not that simple, since AssemblyRegionIterator does read caching, which we don't want in the Spark case.
Also, I don't think we see downsampling kicking in for the tests, so it's probably not the reason for difference in the tests. That seems to be read shard boundary artifacts.
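To make the boundary-artifact point concrete, here is an illustrative sketch (not GATK code; shard layout is hypothetical) of how a read spanning a boundary falls into more than one fixed-size shard, so independent per-shard processing sees it twice or truncates it, unlike a single linear pass:

```java
import java.util.*;

// Illustrative sketch of a read-shard boundary artifact: which shards does
// a read overlap, given fixed-size shards tiling the contig?
public class ShardOverlapSketch {
    // Shard k covers positions [k * shardSize + 1, (k + 1) * shardSize], 1-based inclusive.
    public static List<Integer> overlappingShards(int readStart, int readEnd, int shardSize) {
        int first = (readStart - 1) / shardSize;
        int last = (readEnd - 1) / shardSize;
        List<Integer> shards = new ArrayList<>();
        for (int s = first; s <= last; s++) {
            shards.add(s);
        }
        return shards;
    }
}
```

A read at [95, 105] with 100-base shards overlaps shards 0 and 1, whereas a read fully inside [1, 100] overlaps only shard 0.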
Resolved review threads:
- src/main/java/org/broadinstitute/hellbender/engine/spark/FindAssemblyRegionsSpark.java
- src/test/java/org/broadinstitute/hellbender/engine/spark/SparkSharderUnitTest.java
- src/test/java/org/broadinstitute/hellbender/engine/spark/SparkSharderUnitTest.java
- ...ava/org/broadinstitute/hellbender/engine/AssemblyRegionFromActivityProfileStateIterator.java (outdated)
- src/test/java/org/broadinstitute/hellbender/tools/HaplotypeCallerSparkIntegrationTest.java
// TODO: interfaces could be improved to avoid casting
ReadlessAssemblyRegion readlessAssemblyRegion = (ReadlessAssemblyRegion) ((ShardBoundaryShard<GATKRead>) shard).getShardBoundary();
int extension = Math.max(shard.getInterval().getStart() - shard.getPaddedInterval().getStart(), shard.getPaddedInterval().getEnd() - shard.getInterval().getEnd());
AssemblyRegion assemblyRegion = new AssemblyRegion(shard.getInterval(), Collections.emptyList(), readlessAssemblyRegion.isActive(), extension, header);
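The `extension` computed in the snippet is simply the wider of the two padding margins around the shard. A standalone recreation of that arithmetic, with plain ints in place of the shard's interval accessors (class name is made up for illustration):

```java
// Recreates the `extension` arithmetic from the snippet above: the larger
// of the left padding (start - paddedStart) and the right padding
// (paddedEnd - end).
public class ExtensionSketch {
    public static int extension(int start, int end, int paddedStart, int paddedEnd) {
        return Math.max(start - paddedStart, paddedEnd - end);
    }
}
```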
I see the issue with this interface... It looks like you may be able to make the regular AssemblyRegion also extend ShardBoundaryShard. I'm not sure that helps anything really... but then at least AssemblyRegion and ReadlessAssemblyRegion are related in the class hierarchy.
That is, through an abstract interface, of course.
Thanks for the suggestion. I need to spend a bit more time trying to improve this. (Could be done in another issue or PR.)
Resolved review thread (outdated):
- src/main/java/org/broadinstitute/hellbender/engine/spark/FindAssemblyRegionsSpark.java
return Utils.stream(shardedReadIterator)
        .map(shardedRead -> new ShardToMultiIntervalShardAdapter<>(shardedRead))
        // TODO: reinstate downsampling (not yet working)
We should think about how we want to do this. Notably, however, as of this implementation you need to add the same downsampling to the reads at step 5, as this is currently only hooked up to downsample for the band-pass filtering but not on the reads when they get re-added later. An obvious solution might involve persisting the underlying BAM, but obviously that's expensive; the other is that we make the random seed in some way calculable, like basing it on a hash of the read if necessary. That way we could get an exact match with the regular HaplotypeCaller.
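The hash-based idea can be sketched as follows: make the keep/drop decision a pure function of the read (here, its name), so two independent codepaths downsample identically without sharing any state or persisting an intermediate BAM. Illustrative only; this is not GATK's downsampler and the names are made up:

```java
// Deterministic, seedless downsampling sketch: a read's fate depends only
// on a hash of the read itself, so the walker path and the Spark path
// would make exactly the same choices.
public class HashDownsampleSketch {
    /** Keep roughly `fraction` of reads, decided deterministically per read name. */
    public static boolean keep(String readName, double fraction) {
        // Map the hash to [0, 1) and keep the read if it falls below the target fraction.
        double u = (readName.hashCode() & 0x7fffffff) / 2147483648.0;
        return u < fraction;
    }
}
```

The key property is determinism: the same read always gets the same decision, regardless of which executor or iteration order processes it.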
The reason I excluded this is that I found an underlying bug in the downsampling code: it would not downsample correctly unless the iterator was created twice. (This needs further investigation.)
But you are right that the same downsampling needs to be carried out on the reads themselves too, which isn't currently happening in this code. Let's explore your ideas further.
* A cut-down version of {@link AssemblyRegion} that doesn't store reads, used in the strict implementation of
* {@link org.broadinstitute.hellbender.engine.spark.FindAssemblyRegionsSpark}.
*/
public class ReadlessAssemblyRegion extends ShardBoundary {
There should probably be a common interface linking this to AssemblyRegion, or a different name distinguishing it.
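One shape the suggested common interface might take (all names here are hypothetical, with plain ints standing in for GATK's interval types): both region classes would implement a small shared interface, so callers that only need the span and the activity flag can accept either one.

```java
// Hypothetical sketch of a common interface linking a reads-bearing region
// and a readless one; not actual GATK code.
public class RegionInterfaceSketch {
    public interface ActivityRegion {
        boolean isActive();
        int getStart();
        int getEnd();
    }

    /** Stand-in for ReadlessAssemblyRegion: span plus activity flag, no reads. */
    public static class ReadlessRegion implements ActivityRegion {
        private final int start, end;
        private final boolean active;
        public ReadlessRegion(int start, int end, boolean active) {
            this.start = start;
            this.end = end;
            this.active = active;
        }
        @Override public boolean isActive() { return active; }
        @Override public int getStart() { return start; }
        @Override public int getEnd() { return end; }
    }
}
```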
Agreed.
Thanks for the review @jamesemery. I've addressed your comments. A few outstanding issues:
I'm not sure if these are blockers, since the strict codepath is a new option (off by default), but would like to know what you and @jonn-smith think.
I have some very minor comments. Most of the meat has already been addressed from what I can tell.
This will be a very good starting point for HaplotypeCallerSpark correctness.
Resolved review thread:
- src/main/java/org/broadinstitute/hellbender/engine/spark/FindAssemblyRegionsSpark.java
assemblyRegionArgs.maxProbPropagationDistance, includeReadsWithDeletionsInIsActivePileups);
return Utils.stream(assemblyRegionIter).map(assemblyRegion ->
        new AssemblyRegionWalkerContext(assemblyRegion,
                new ReferenceContext(reference, assemblyRegion.getExtendedSpan()),
This is actually the same code that was running before, so it really should have been commented already...

It looks like the padding is getting passed in via assemblyRegionArgs.assemblyRegionPadding when creating assemblyRegionIter, so I don't think that it's a technical issue (the span without padding would be retrieved with assemblyRegion.getSpan()).
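A minimal sketch of the span vs. extended-span distinction this comment relies on: the extended span is the raw active-region span widened by the configured padding on each side (clamped at position 1). Plain ints stand in for GATK's interval types; this is not the actual AssemblyRegion implementation:

```java
// getSpan() would correspond to [start, end]; getExtendedSpan() to the
// padded interval returned here.
public class SpanSketch {
    public static int[] extendedSpan(int start, int end, int padding) {
        return new int[] { Math.max(1, start - padding), end + padding };
    }
}
```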
Resolved review threads:
- src/main/java/org/broadinstitute/hellbender/engine/spark/FindAssemblyRegionsSpark.java
- src/main/java/org/broadinstitute/hellbender/engine/spark/FindAssemblyRegionsSpark.java (outdated)
- src/main/java/org/broadinstitute/hellbender/engine/spark/FindAssemblyRegionsSpark.java (outdated)
- src/main/java/org/broadinstitute/hellbender/engine/spark/FindAssemblyRegionsSpark.java (outdated)
Force-pushed from e4bd5da to b7eeac9.
Thanks for the review @jonn-smith. I've addressed all your comments, and those of @jamesemery. Please approve so we can merge.
Considering that this is a beta tool anyway, the impact is relatively low and it is important to benchmark our progress. I do think running this over a full genome should be tried.
Thanks @jamesemery. I've successfully run with an exome in #5475; I'll try with a genome separately.
… results to AssemblyRegionWalker.
Force-pushed from b7eeac9 to 2fcc4a7.
There are tests for both the strict and non-strict modes for HaplotypeCallerSpark and for ExampleAssemblyRegionWalkerSpark. The strict modes produce identical results to the walker versions.
Strict mode is off by default and still needs work to scale to exome-sized data (another PR). This is an improvement over the current HaplotypeCallerSpark implementation since it fixes two bugs (reads overlapping more than two intervals; and editIntervals not being picked up) that caused the output to differ significantly from the walker version. While the output does still differ in non-strict mode, the difference is a lot less (compare expected.testVCFMode.gatk4.vcf and expected.testVCFMode.gatk4.nonstrict.vcf).