Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform downsampling in AssemblyRegionWalkerSpark's strict mode #5508

Closed

Conversation

tomwhite
Copy link
Contributor

@tomwhite tomwhite commented Dec 10, 2018

Fixes #5476. This change requires #5437 to work properly.

Not sure how to test this yet.

@tomwhite tomwhite force-pushed the tw_strict_assembly_region_walker_spark_downsampling branch from da2d437 to 85ff38a Compare December 10, 2018 15:32
Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch needs to wait on #5437 getting in I think, we can get it in quickly to unblock that if you like.

@@ -167,8 +167,14 @@
// 5. Fill in the reads. Each shard is an assembly region, with its overlapping reads.
JavaRDD<Shard<GATKRead>> assemblyRegionShardedReads = SparkSharder.shard(ctx, reads, GATKRead.class, header.getSequenceDictionary(), assemblyRegionBoundaries, shardingArgs.readShardSize);

// 6. Convert shards to assembly regions.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you haven't reenabled the downsampling in the first step of the pipeline?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It already happens in getActivityProfileStatesFunction.

@@ -225,12 +231,16 @@ public ReadlessAssemblyRegion apply(@Nullable AssemblyRegion input) {
});
}

private static AssemblyRegion toAssemblyRegion(Shard<GATKRead> shard, SAMFileHeader header) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this affected any of the test outputs? Could we try building a test where we reduce the number of reads per start position to 1 just to demonstrate this is behaving consistently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it hasn't affected the tests. Will need to build something like you suggest.

@@ -225,12 +231,16 @@ public ReadlessAssemblyRegion apply(@Nullable AssemblyRegion input) {
});
}

private static AssemblyRegion toAssemblyRegion(Shard<GATKRead> shard, SAMFileHeader header) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this affected any of the test outputs? Could we try building a test where we reduce the number of reads per start position to 1 just to demonstrate this is behaving consistently?

// with the downsampling done in step 1, since it is deterministic by locus.
JavaRDD<AssemblyRegion> assemblyRegions = assemblyRegionShardedReads.mapPartitions((FlatMapFunction<Iterator<Shard<GATKRead>>, AssemblyRegion>) shardedReadIterator -> {
final ReadsDownsampler readsDownsampler = assemblyRegionArgs.maxReadsPerAlignmentStart > 0 ?
new PositionalDownsampler(assemblyRegionArgs.maxReadsPerAlignmentStart, header) : null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When #5437 goes in, you want to build this positional downsampler with the deterministic arguments enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

@codecov-io
Copy link

codecov-io commented Dec 10, 2018

Codecov Report

Merging #5508 into master will increase coverage by 0.003%.
The diff coverage is 85.714%.

@@               Coverage Diff               @@
##              master     #5508       +/-   ##
===============================================
+ Coverage     87.063%   87.065%   +0.003%     
- Complexity     31262     31266        +4     
===============================================
  Files           1922      1922               
  Lines         144312    144317        +5     
  Branches       15918     15918               
===============================================
+ Hits          125642    125650        +8     
+ Misses         12892     12888        -4     
- Partials        5778      5779        +1
Impacted Files Coverage Δ Complexity Δ
...lbender/engine/spark/FindAssemblyRegionsSpark.java 85.294% <85.714%> (-0.42%) 22 <3> (+1)
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java 60% <0%> (ø) 2% <0%> (ø) ⬇️
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 80.488% <0%> (+0.61%) 42% <0%> (ø) ⬇️
...utils/smithwaterman/SmithWatermanIntelAligner.java 80% <0%> (+30%) 3% <0%> (+2%) ⬆️

@droazen droazen self-requested a review January 3, 2019 19:23
@droazen droazen self-assigned this Jan 3, 2019
@tomwhite
Copy link
Contributor Author

Superceded by #5721

@tomwhite tomwhite closed this Feb 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants