implement heuristics that align haplotypes to reads without calling Smith-Waterman #6015

jonathoncohen98 · 2019-06-21T19:08:02Z

added counts for Smith-Waterman and non-Smith-Waterman calls
implemented oneMismatch heuristic that aligns reads to haplotypes given a read that only has 1 SNP
added tests for the oneMismatch heuristic
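
A minimal sketch of the idea described above, not the code in this PR: if two equal-length sequences differ at no more than one base, the alignment is all matches and Smith-Waterman can be skipped. The class name, method names, and counters here are hypothetical, and the "nM" CIGAR string is an assumed output format.

```java
/** Hypothetical sketch: skip Smith-Waterman when at most one base differs. */
final class MismatchHeuristicSketch {
    // Counters comparable to the SW / non-SW tallies mentioned in the description.
    static int heuristicAlignments = 0;
    static int smithWatermanAlignments = 0;

    /** Counts positions at which two equal-length sequences differ. */
    static int countMismatches(final byte[] a, final byte[] b) {
        int mismatches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                mismatches++;
            }
        }
        return mismatches;
    }

    /**
     * Returns an all-match CIGAR such as "8M" when the sequences have equal length
     * and differ at no more than one base; returns null when the caller would have
     * to fall back to a full Smith-Waterman alignment.
     */
    static String tryOneMismatchCigar(final byte[] haplotype, final byte[] read) {
        if (haplotype.length != read.length || countMismatches(haplotype, read) > 1) {
            smithWatermanAlignments++; // placeholder for the real Smith-Waterman call
            return null;
        }
        heuristicAlignments++;
        return read.length + "M";
    }
}
```

Keeping the two counters next to the dispatch point makes it easy to report how often the shortcut actually fires on real data.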

print number of SW and nonSW alignments; added oneMismatch heuristic and tests; added count for oneMismatch heuristic

davidbenjamin

Congratulations on your first PR @jonathoncohen98! I have a lot of comments but they're mainly about style. No fundamental changes.

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

src/test/java/org/broadinstitute/hellbender/utils/UtilsUnitTest.java

added more tests for atMostOneMismatch heuristic

codecov · 2019-06-26T14:33:40Z

Codecov Report

Merging #6015 into master will decrease coverage by 42.829%.
The diff coverage is 49.73%.

@@               Coverage Diff                @@
##              master     #6015        +/-   ##
================================================
- Coverage     86.927%   44.098%   -42.829%     
+ Complexity     32765     20104     -12661     
================================================
  Files           2016      2011         -5     
  Lines         151466    151102       -364     
  Branches       16628     16160       -468     
================================================
- Hits          131665     66633     -65032     
- Misses         13737     79480     +65743     
+ Partials        6064      4989      -1075

Impacted Files	Coverage Δ	Complexity Δ
...nstitute/hellbender/utils/read/AlignmentUtils.java	`60.701% <ø> (-17.199%)`	`126 <0> (-44)`
...broadinstitute/hellbender/utils/UtilsUnitTest.java	`0.176% <0%> (-93.432%)`	`1 <0> (-99)`
...walkers/haplotypecaller/HaplotypeCallerEngine.java	`78.767% <75%> (+0.22%)`	`74 <0> (ø)`	⬇️
...ava/org/broadinstitute/hellbender/utils/Utils.java	`48.466% <82.609%> (-33.065%)`	`115 <17> (-43)`
.../utils/smithwaterman/SmithWatermanJavaAligner.java	`91.469% <96.226%> (-2.648%)`	`50 <0> (+4)`
...ls/variant/writers/GVCFBlockCombiningIterator.java	`0% <0%> (-100%)`	`0% <0%> (-1%)`
...ls/walkers/genotyper/HeterogeneousPloidyModel.java	`0% <0%> (-100%)`	`0% <0%> (-14%)`
...nder/utils/downsampling/FractionalDownsampler.java	`0% <0%> (-100%)`	`0% <0%> (-17%)`
...park/pathseq/MarkedOpticalDuplicateReadFilter.java	`0% <0%> (-100%)`	`0% <0%> (-4%)`
...otypecaller/RandomLikelihoodCalculationEngine.java	`0% <0%> (-100%)`	`0% <0%> (-6%)`
... and 1215 more

droazen · 2019-06-26T18:43:56Z

@jonathoncohen98 Can I suggest that the new heuristics not be on by default until we've run on full-size data? (i.e., add an optional argument to turn them on)
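
A rough sketch of what an off-by-default toggle could look like. All names here are hypothetical; in GATK the flag would presumably be exposed as a command-line argument rather than a constructor parameter.

```java
/** Hypothetical sketch of an opt-in switch for the alignment heuristics. */
final class HeuristicToggleSketch {
    /** Options holder; the heuristics stay off unless explicitly enabled. */
    static final class AlignerOptions {
        final boolean useAlignmentHeuristics;

        AlignerOptions(final boolean useAlignmentHeuristics) {
            this.useAlignmentHeuristics = useAlignmentHeuristics;
        }

        /** Default configuration: heuristics disabled. */
        static AlignerOptions defaults() {
            return new AlignerOptions(false);
        }
    }

    /** Picks the heuristic only when it is both enabled and applicable. */
    static String chooseStrategy(final AlignerOptions options, final boolean heuristicApplies) {
        return options.useAlignmentHeuristics && heuristicApplies ? "HEURISTIC" : "SMITH_WATERMAN";
    }
}
```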

davidbenjamin

Looks good. I'm nearly satisfied. The main thing is to put in the toggle that @droazen requested.

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

src/main/java/org/broadinstitute/hellbender/utils/smithwaterman/SmithWatermanJavaAligner.java

src/main/java/org/broadinstitute/hellbender/utils/smithwaterman/SWNativeAlignerWrapper.java

…ct to HaplotypeCallerEngine

davidbenjamin

Back to @jonathoncohen98

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

src/main/java/org/broadinstitute/hellbender/utils/smithwaterman/SmithWatermanJavaAligner.java

…matches parameter

jamesemery · 2019-07-25T18:10:05Z

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

@@ -1041,128 +1048,334 @@ public static boolean xor(final boolean x, final boolean y) {
     * @param reference the reference sequence
     * @param query the query sequence
     */
-    public static int lastIndexOf(final byte[] reference, final byte[] query) {


I would leave lastIndexOf intact and make lastIndexOfAtMostTwoMismatches() a separate method in the same class.

Or at the very least rename it.
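
One way the split could look, as a sketch only: the exact-match `lastIndexOf` keeps its signature, and the mismatch-tolerant search lives in its own documented method. The `refIndexBound` parameter from the PR is omitted because its meaning is not stated here, and the delegation from `lastIndexOf` is an assumption about how the two might share code.

```java
/** Hypothetical sketch: exact and mismatch-tolerant searches as separate methods. */
final class LastIndexSketch {
    /**
     * Finds the last index at which query occurs exactly in reference.
     *
     * @param reference the reference sequence
     * @param query the query sequence
     * @return the last exact match index, or -1 if there is none
     */
    static int lastIndexOf(final byte[] reference, final byte[] query) {
        return lastIndexOfAtMostNMismatches(reference, query, 0);
    }

    /**
     * Finds the last index at which query lines up against reference with a
     * bounded number of substituted bases and no indels.
     *
     * @param reference the reference sequence
     * @param query the query sequence
     * @param allowedMismatches the maximum number of mismatching bases tolerated
     * @return the last qualifying start index, or -1 if there is none
     */
    static int lastIndexOfAtMostNMismatches(final byte[] reference, final byte[] query,
                                            final int allowedMismatches) {
        for (int start = reference.length - query.length; start >= 0; start--) {
            int mismatches = 0;
            for (int i = 0; i < query.length; i++) {
                // Stop scanning this start position as soon as the budget is exceeded.
                if (reference[start + i] != query[i] && ++mismatches > allowedMismatches) {
                    break;
                }
            }
            if (mismatches <= allowedMismatches) {
                return start;
            }
        }
        return -1;
    }
}
```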

jamesemery · 2019-07-25T18:10:49Z

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

@@ -1041,128 +1048,334 @@ public static boolean xor(final boolean x, final boolean y) {
     * @param reference the reference sequence
     * @param query the query sequence
     */
-    public static int lastIndexOf(final byte[] reference, final byte[] query) {
+    public static int lastIndexOfAtMostTwoMismatches(final byte[] reference, final byte[] query, final int allowedMismatches, int refIndexBound) {


Add javadoc explaining the new arguments

jamesemery · 2019-07-25T18:14:01Z

src/test/java/org/broadinstitute/hellbender/utils/UtilsUnitTest.java

            final int expected = new String(reference).lastIndexOf(new String(query));
            Assert.assertEquals(result, expected);
        }
    }

+    @Test
+    public void atMostOneIndel(){


I would change this to a data provider that has the fields reference, query, and expected result.
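
The shape being suggested can be sketched in plain Java as a table of rows. In TestNG the `cases()` method would carry `@DataProvider` and the check would be a `@Test(dataProvider = ...)` method taking the three fields as parameters. `String.lastIndexOf` stands in for the method under test, since the behavior of `atMostOneIndel` is not shown here; the class name and the rows are made up for illustration.

```java
/** Hypothetical sketch of table-driven cases: reference, query, expected result. */
final class LastIndexOfCases {
    /** Each row has the same shape a TestNG data provider would return. */
    static Object[][] cases() {
        return new Object[][] {
            {"ACGTACGT", "ACGT", 4},
            {"ACGTACGT", "GTA", 2},
            {"ACGTACGT", "TTT", -1},
            {"AAAA", "AA", 2},
        };
    }

    /** Runs every row and returns how many were checked; throws on the first failure. */
    static int runAll() {
        int checked = 0;
        for (final Object[] row : cases()) {
            final String reference = (String) row[0];
            final String query = (String) row[1];
            final int expected = (Integer) row[2];
            final int actual = reference.lastIndexOf(query); // stand-in for the method under test
            if (actual != expected) {
                throw new AssertionError(reference + " / " + query + ": expected " + expected + ", got " + actual);
            }
            checked++;
        }
        return checked;
    }
}
```

Adding a new case then means adding one row rather than another block of assertions.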

jamesemery · 2019-07-25T18:14:47Z

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

+     * Global Alignment
+     *
+     * Returns the index and size of the indel (as a 2 element int array) or -1 and 0 if an indel less than 4 bases is not found
+     *


Add a note about tiebreaking here.

Specify also that this approach is predicated on the assumption that the two strings SHOULD be the same length. Also, I assume that you account for two indels that happen to be cheaper than one?
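
To illustrate the kind of tie-breaking note being asked for, here is a sketch of single-indel detection with an explicit rule. It is not the PR's implementation: it assumes the two sequences are identical apart from one indel (no mismatches), reports a signed length (positive for an insertion in the query, negative for a deletion), and breaks ties by choosing the leftmost placement. All names are hypothetical.

```java
/** Hypothetical sketch: locate a single indel using the shared prefix and suffix. */
final class SingleIndelSketch {
    /**
     * @param reference the reference sequence
     * @param query the query sequence
     * @param maxIndelLength the maximum indel length to look for
     * @return {index, signedLength}: the leftmost index at which the indel can be placed and
     *         query.length - reference.length; or {-1, 0} if no single indel explains the difference
     */
    static int[] findSingleIndel(final byte[] reference, final byte[] query, final int maxIndelLength) {
        final int lengthDifference = query.length - reference.length;
        final int indelLength = Math.abs(lengthDifference);
        if (indelLength == 0 || indelLength > maxIndelLength) {
            return new int[] {-1, 0};
        }
        final int shorter = Math.min(reference.length, query.length);

        // Length of the common prefix, capped at the shorter sequence.
        int prefix = 0;
        while (prefix < shorter && reference[prefix] == query[prefix]) {
            prefix++;
        }
        // Length of the common suffix, capped the same way.
        int suffix = 0;
        while (suffix < shorter
                && reference[reference.length - 1 - suffix] == query[query.length - 1 - suffix]) {
            suffix++;
        }
        // Prefix and suffix must together cover the shorter sequence.
        if (prefix + suffix < shorter) {
            return new int[] {-1, 0};
        }
        // Valid placements span [shorter - suffix, prefix]; tie-break to the leftmost one.
        return new int[] {shorter - suffix, lengthDifference};
    }
}
```

In a repeat such as `ACGTTTGA` versus `ACGTGA`, the two-base deletion can start at index 3 or 4; the sketch returns 3.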

jamesemery · 2019-07-25T18:16:02Z

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

+     *
+     * @param reference the reference sequence
+     * @param query the query sequence
+     * @param maxIndelLength the maximum length indel we look for


Add @return and an explanation of the two pair values to this javadoc.

jamesemery · 2019-07-25T18:51:24Z

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

+                                int matchingBases = refIndexBack - alignmentOffset + 1;
+                                int indelSize = queryIndexBack - matchingBases + 1;
+                                if(indelSize <= insertion.getIndelSize()){
+                                    insertion.setAlignmentOffset(alignmentOffset);


Make these Indel objects immutable and just construct new ones if you find a shorter indel.
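
A sketch of the immutable variant, assuming an `Indel` holds only the two values visible in the diff (`getAlignmentOffset()` and `getIndelSize()`). The `NONE` sentinel and the `shorterOf` helper are hypothetical; the `<=` comparison mirrors the one in the diff above.

```java
/** Hypothetical sketch: immutable Indel values replaced rather than mutated. */
final class ImmutableIndelSketch {
    static final class Indel {
        /** Sentinel meaning "no indel found yet"; any real candidate is shorter. */
        static final Indel NONE = new Indel(-1, Integer.MAX_VALUE);

        private final int alignmentOffset;
        private final int indelSize;

        Indel(final int alignmentOffset, final int indelSize) {
            this.alignmentOffset = alignmentOffset;
            this.indelSize = indelSize;
        }

        int getAlignmentOffset() {
            return alignmentOffset;
        }

        int getIndelSize() {
            return indelSize;
        }
    }

    /**
     * Returns a new Indel when the candidate is no longer than the current best,
     * and the unchanged current best otherwise.
     */
    static Indel shorterOf(final Indel best, final int alignmentOffset, final int indelSize) {
        return indelSize <= best.getIndelSize() ? new Indel(alignmentOffset, indelSize) : best;
    }
}
```

Because no setter exists, a half-updated candidate (offset changed, size not) cannot occur.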

jamesemery · 2019-07-25T18:52:03Z

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

+                //check for deletion code
+                //************************************
+                if(!skipDeletion){
+                    byte[] ref = new byte[refIndexBack + 1];


Make this its own method.

jamesemery · 2019-07-25T18:52:53Z

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

+                    System.arraycopy(reference,0, ref, 0, refIndexBack + 1);
+                    byte[] que = new byte[queryIndexBack + 1];
+                    System.arraycopy(query, 0, que, 0, queryIndexBack + 1);
+                    int matchIndex = lastIndexOfAtMostTwoMismatches(ref, que, 0, ref.length - que.length - maxDeletionSize);


lastIndexOfAtMostTwoMismatches -> lastIndexOfAtMostNMismatches

jamesemery · 2019-07-25T18:54:30Z

src/main/java/org/broadinstitute/hellbender/utils/Utils.java

+            }
+        }
+
+        if(insertion.getAlignmentOffset() != -1 && deletion.getAlignmentOffset() == -1){


Add comments explaining all of this tiebreaking/failure detection

jamesemery · 2019-07-25T18:58:57Z

src/main/java/org/broadinstitute/hellbender/utils/smithwaterman/SmithWatermanJavaAligner.java

@@ -48,6 +50,10 @@ public static SmithWatermanJavaAligner getInstance() {
     */
    private SmithWatermanJavaAligner(){}

+    public SmithWatermanJavaAligner(boolean haplotypeToref){


Hmm... come to think of it, I think the safest thing to do with this code is to make an interface for SmithWatermanJavaAligner that contains all of the common code and then two subclasses, one being the current implementation and the other being your optimized version. Then you can add to SmithWatermanAligner a new implementation JAVA_OPTIMIZED that contains your optimizations. This will help encapsulate your changes and allow this branch to get into GATK much sooner.
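
The structure proposed here might be sketched as follows. This is an illustration under assumptions, not GATK's actual class layout: the shared code sits in an abstract base class, the type names and the string-returning `align` signature are invented, and only the `JAVA_OPTIMIZED` constant name comes from the comment above.

```java
import java.util.Arrays;

/** Hypothetical sketch: one shared base, a plain aligner, and an optimized aligner. */
final class AlignerHierarchySketch {
    interface Aligner {
        String align(byte[] reference, byte[] alternate);
    }

    /** Common code: try a shortcut first, then fall back to the full alignment. */
    abstract static class AbstractJavaAligner implements Aligner {
        @Override
        public final String align(final byte[] reference, final byte[] alternate) {
            final String shortcut = tryShortcut(reference, alternate);
            return shortcut != null ? shortcut : fullAlignment(reference, alternate);
        }

        /** No shortcut by default; subclasses may override. */
        protected String tryShortcut(final byte[] reference, final byte[] alternate) {
            return null;
        }

        /** Placeholder standing in for the real Smith-Waterman computation. */
        protected String fullAlignment(final byte[] reference, final byte[] alternate) {
            return "FULL_SMITH_WATERMAN";
        }
    }

    /** Current behavior: always the full alignment. */
    static final class JavaAligner extends AbstractJavaAligner { }

    /** Optimized behavior: identical sequences get an all-match CIGAR without alignment. */
    static final class OptimizedJavaAligner extends AbstractJavaAligner {
        @Override
        protected String tryShortcut(final byte[] reference, final byte[] alternate) {
            return Arrays.equals(reference, alternate) ? reference.length + "M" : null;
        }
    }

    /** Selector so callers opt in to the optimized version explicitly. */
    enum Implementation {
        JAVA {
            @Override
            Aligner create() {
                return new JavaAligner();
            }
        },
        JAVA_OPTIMIZED {
            @Override
            Aligner create() {
                return new OptimizedJavaAligner();
            }
        };

        abstract Aligner create();
    }
}
```

With this layout the existing path is untouched unless a caller selects `JAVA_OPTIMIZED`, which is the encapsulation the comment is after.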

ldgauthier · 2020-12-01T19:55:41Z

@droazen can we assign this to somebody else since Jonathon's internship is (long) over?

added counts for SW and non-SW calls

7cdde7c

print number of SW and nonSW alignments; added oneMismatch heuristic and tests; added count for oneMismatch heuristic

jonathoncohen98 requested review from davidbenjamin and jamesemery June 21, 2019 19:08

davidbenjamin requested changes Jun 24, 2019

jonathoncohen98 added 4 commits June 25, 2019 10:02

add another aligner object for testing

0ef2b14

edited for David Benjamin's comments

b456351

added more tests for atMostOneMismatch heuristic

added test for multiple oneMismatches

8ff93a2

stylistic change with brackes

323072a

jonathoncohen98 self-assigned this Jun 26, 2019

removed test code

5a7344a

jonathoncohen98 assigned davidbenjamin Jun 26, 2019

include correct paramaters before heuristic execution

e175714

davidbenjamin requested changes Jun 27, 2019

jonathoncohen98 added 7 commits July 2, 2019 15:11

edited for David Benjamin's final comments for stage 1

c9b20ec

added atMostOneIndel heuristic

21532cb

added test for atMostOneIndel

4735712

added oneIndel heuristic to alignment and haplotypeToRef aligner obje…

16753db

…ct to HaplotypeCallerEngine

completed indel alignment implementation

6713458

for indel detection, traverse backwards first and then forward

6f2fd00

removed test code

07e6968

davidbenjamin requested changes Jul 5, 2019

jonathoncohen98 added 6 commits July 8, 2019 10:52

consolidated all mismatch heuristics into one method using allowedMis…

2e5f3e9

…matches parameter

extracted private indel method to avoid code duplication

3d6cf81

edited for david Benjamin's comments on indel stage

a7e5ef4

fixed allowed indel length

056ac96

cleaned up align method

07fdbb3

added method that aligns read-bestHap given 1 indel

6eef7be

jamesemery requested changes Jul 25, 2019

jonathoncohen98 added 3 commits August 5, 2019 13:30

control code to measure runtime

ccfdb9c

fixed formatting error

c8f4af8

fixed totalCallsToSW count

46d6299

davidbenjamin removed their assignment Jun 9, 2020

droazen mentioned this pull request Aug 30, 2021

Consider restoring CigarUtils optimization for short-circuiting to M-only CIGARs. #7441

Open
