Perform downsampling in AssemblyRegionWalkerSpark's strict mode #5508
I think this branch needs to wait for #5437 to go in. We can get that in quickly to unblock this one if you like.
@@ -167,8 +167,14 @@
// 5. Fill in the reads. Each shard is an assembly region, with its overlapping reads.
JavaRDD<Shard<GATKRead>> assemblyRegionShardedReads = SparkSharder.shard(ctx, reads, GATKRead.class, header.getSequenceDictionary(), assemblyRegionBoundaries, shardingArgs.readShardSize);

// 6. Convert shards to assembly regions.
It looks like you haven't reenabled the downsampling in the first step of the pipeline?
It already happens in getActivityProfileStatesFunction.
@@ -225,12 +231,16 @@ public ReadlessAssemblyRegion apply(@Nullable AssemblyRegion input) {
});
}

private static AssemblyRegion toAssemblyRegion(Shard<GATKRead> shard, SAMFileHeader header) {
Has this affected any of the test outputs? Could we try building a test where we reduce the number of reads per start position to 1 just to demonstrate this is behaving consistently?
No, it hasn't affected the tests. Will need to build something like you suggest.
// with the downsampling done in step 1, since it is deterministic by locus.
JavaRDD<AssemblyRegion> assemblyRegions = assemblyRegionShardedReads.mapPartitions((FlatMapFunction<Iterator<Shard<GATKRead>>, AssemblyRegion>) shardedReadIterator -> {
    final ReadsDownsampler readsDownsampler = assemblyRegionArgs.maxReadsPerAlignmentStart > 0 ?
            new PositionalDownsampler(assemblyRegionArgs.maxReadsPerAlignmentStart, header) : null;
When #5437 goes in, you want to build this positional downsampler with the deterministic arguments enabled.
Right.
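For anyone following the thread, here is a rough, self-contained sketch of what "deterministic by locus" could mean when capping reads per alignment start. It is not GATK's `PositionalDownsampler` and does not use the arguments from #5437. `SimpleRead`, `downsample`, and `reservoir` are hypothetical names made up for illustration, and seeding `java.util.Random` from the start position is an assumption about one way to get repeatable selections. It also assumes the reads arrive sorted by start and in the same order within each start position on every pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class PositionalDownsamplingSketch {

    /** Hypothetical stand-in for a read: just a name and an alignment start. */
    static final class SimpleRead {
        public final String name;
        public final int start;

        public SimpleRead(String name, int start) {
            this.name = name;
            this.start = start;
        }
    }

    /**
     * Keeps at most maxPerStart reads for each alignment start.
     * Input must be sorted by start. A non-positive limit disables
     * downsampling, mirroring the null downsampler in the diff above.
     */
    static List<SimpleRead> downsample(List<SimpleRead> sortedReads, int maxPerStart) {
        if (maxPerStart <= 0) {
            return new ArrayList<>(sortedReads);
        }
        List<SimpleRead> kept = new ArrayList<>();
        int i = 0;
        while (i < sortedReads.size()) {
            int start = sortedReads.get(i).start;
            int j = i;
            while (j < sortedReads.size() && sortedReads.get(j).start == start) {
                j++;
            }
            // Assumption: seed the generator from the locus so that two passes
            // over the same reads at this start make the same choices.
            kept.addAll(reservoir(sortedReads.subList(i, j), maxPerStart, new Random(start)));
            i = j;
        }
        return kept;
    }

    /** Standard reservoir sampling of up to k items from one start-position group. */
    private static List<SimpleRead> reservoir(List<SimpleRead> group, int k, Random rng) {
        List<SimpleRead> kept = new ArrayList<>(group.subList(0, Math.min(k, group.size())));
        for (int n = k; n < group.size(); n++) {
            int slot = rng.nextInt(n + 1);
            if (slot < k) {
                kept.set(slot, group.get(n));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleRead> reads = new ArrayList<>();
        for (int start = 100; start < 103; start++) {
            for (int r = 0; r < 5; r++) {
                reads.add(new SimpleRead("read_" + start + "_" + r, start));
            }
        }
        // One read per start position, as suggested for a consistency test.
        List<SimpleRead> first = downsample(reads, 1);
        List<SimpleRead> second = downsample(reads, 1);
        System.out.println("kept " + first.size() + " reads; passes agree: "
                + first.equals(second));
    }
}
```

With a limit of 1 per start position, two independent passes over the same reads should keep the same read at every locus, which is the kind of check the suggested test would make against the real pipeline.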
Codecov Report
@@ Coverage Diff @@
## master #5508 +/- ##
===============================================
+ Coverage 87.063% 87.065% +0.003%
- Complexity 31262 31266 +4
===============================================
Files 1922 1922
Lines 144312 144317 +5
Branches 15918 15918
===============================================
+ Hits 125642 125650 +8
+ Misses 12892 12888 -4
- Partials 5778 5779 +1
Superseded by #5721
Fixes #5476. This change requires #5437 to work properly.
Not sure how to test this yet.