-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement heuristics that align haplotypes to reads without calling Smith-Waterman #6015
base: master
Are you sure you want to change the base?
Conversation
print number of SW and nonSW alignments added oneMismatch heuristic and tests added count for oneMismatch heuristic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Congratulations on your first PR @jonathoncohen98! I have a lot of comments but they're mainly about style. No fundamental changes.
src/test/java/org/broadinstitute/hellbender/utils/UtilsUnitTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/utils/UtilsUnitTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/utils/UtilsUnitTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/utils/UtilsUnitTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/utils/UtilsUnitTest.java
Outdated
Show resolved
Hide resolved
added more tests for atMostOneMismatch heuristic
Codecov Report
@@ Coverage Diff @@
## master #6015 +/- ##
================================================
- Coverage 86.927% 44.098% -42.829%
+ Complexity 32765 20104 -12661
================================================
Files 2016 2011 -5
Lines 151466 151102 -364
Branches 16628 16160 -468
================================================
- Hits 131665 66633 -65032
- Misses 13737 79480 +65743
+ Partials 6064 4989 -1075
|
@jonathoncohen98 Can I suggest that the new heuristics not be on by default until we've run on full-size data? (Ie., add an optional argument to turn them on) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I'm nearly satisfied. The main thing is to put in the toggle that @droazen requested.
src/main/java/org/broadinstitute/hellbender/utils/smithwaterman/SmithWatermanJavaAligner.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/smithwaterman/SWNativeAlignerWrapper.java
Outdated
Show resolved
Hide resolved
…ct to HaplotypeCallerEngine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to @jonathoncohen98
src/main/java/org/broadinstitute/hellbender/utils/smithwaterman/SmithWatermanJavaAligner.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/smithwaterman/SmithWatermanJavaAligner.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/smithwaterman/SmithWatermanJavaAligner.java
Outdated
Show resolved
Hide resolved
…matches parameter
@@ -1041,128 +1048,334 @@ public static boolean xor(final boolean x, final boolean y) { | |||
* @param reference the reference sequence | |||
* @param query the query sequence | |||
*/ | |||
public static int lastIndexOf(final byte[] reference, final byte[] query) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave lastIndexOf intact and make lastIndexOfAtMostTwoMismatches() its own method that is seperate in the same class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or at the very least rename it.
@@ -1041,128 +1048,334 @@ public static boolean xor(final boolean x, final boolean y) { | |||
* @param reference the reference sequence | |||
* @param query the query sequence | |||
*/ | |||
public static int lastIndexOf(final byte[] reference, final byte[] query) { | |||
public static int lastIndexOfAtMostTwoMismatches(final byte[] reference, final byte[] query, final int allowedMismatches, int refIndexBound) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add javadoc explaining the new arguments
final int expected = new String(reference).lastIndexOf(new String(query)); | ||
Assert.assertEquals(result, expected); | ||
} | ||
} | ||
|
||
@Test | ||
public void atMostOneIndel(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change this to a data provider that has the firlds reference, query and, expected result.
* Global Alignment | ||
* | ||
* Returns the index and size of the indel (as a 2 element int array) or -1 and 0 if an indel less than 4 bases is not found | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a note about tiebreaking here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specify also that this approach is predicated off of the assumption that the two strings SHOULD be the same length. Also I assume that you account for two indels that happen to be cheaper than one?
* | ||
* @param reference the reference sequence | ||
* @param query the query sequence | ||
* @param maxIndelLength the maximum length indel we look for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add @return and an explanation of the two pair values to this javadoc.
int matchingBases = refIndexBack - alignmentOffset + 1; | ||
int indelSize = queryIndexBack - matchingBases + 1; | ||
if(indelSize <= insertion.getIndelSize()){ | ||
insertion.setAlignmentOffset(alignmentOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make these Indel objects immutable and just construct new ones if you find a shorter indel.
//check for deletion code | ||
//************************************ | ||
if(!skipDeletion){ | ||
byte[] ref = new byte[refIndexBack + 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this its own method.
System.arraycopy(reference,0, ref, 0, refIndexBack + 1); | ||
byte[] que = new byte[queryIndexBack + 1]; | ||
System.arraycopy(query, 0, que, 0, queryIndexBack + 1); | ||
int matchIndex = lastIndexOfAtMostTwoMismatches(ref, que, 0, ref.length - que.length - maxDeletionSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lastIndexOfAtMostTwoMismatches
-> lastIndexOfAtMostNMismatches
} | ||
} | ||
|
||
if(insertion.getAlignmentOffset() != -1 && deletion.getAlignmentOffset() == -1){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comments explaining all of this tiebreaking/failure detection
@@ -48,6 +50,10 @@ public static SmithWatermanJavaAligner getInstance() { | |||
*/ | |||
private SmithWatermanJavaAligner(){} | |||
|
|||
public SmithWatermanJavaAligner(boolean haplotypeToref){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... come to think of it, i think the safest thing to do with this code is to make an interface for SmithWatermanJavaAligner
that contains all of the common code and then two subclasses, one being the current implementation and the other being your optimized version. Then you can add to SmithWatermanAligner
a new implementation JAVA_OPTIMIZED
that contains your optimizations. This will help encapsulate your changes and allow this branch to get into the gatk much sooner.
@droazen can we assign this to somebody else since Jonathon's internship is (long) over? |
added counts for Smith-Waterman and non-Smith_Waterman calls
implemented oneMismatch heuristic that aligns reads to haplotypes given a read that only has 1 SNP
added tests for the oneMismatch heuristic