implement heuristics that align haplotypes to reads without calling Smith-Waterman #6015

Open: wants to merge 23 commits into master


jonathoncohen98
Contributor

added counts for Smith-Waterman and non-Smith-Waterman calls
implemented the oneMismatch heuristic, which aligns a read to a haplotype without Smith-Waterman when the read differs from the haplotype by only 1 SNP
added tests for the oneMismatch heuristic
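
For readers unfamiliar with the idea, the oneMismatch heuristic can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, and it assumes the heuristic simply slides the read along the haplotype and accepts the first offset with at most one mismatching base. If such an offset exists, the alignment is a single run of matches (an `nM` CIGAR) and Smith-Waterman is not needed.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a one-mismatch shortcut; not the implementation from this PR. */
final class OneMismatchSketch {
    private OneMismatchSketch() {}

    /**
     * Returns the first offset in the haplotype at which the read matches with
     * at most one mismatching base, or -1 if no such offset exists.
     * Note: this takes the first acceptable offset; it does not prefer an exact
     * match that occurs later in the haplotype.
     */
    static int findOffsetWithAtMostOneMismatch(final byte[] haplotype, final byte[] read) {
        for (int offset = 0; offset + read.length <= haplotype.length; offset++) {
            int mismatches = 0;
            // Stop comparing as soon as a second mismatch is seen.
            for (int i = 0; i < read.length && mismatches <= 1; i++) {
                if (haplotype[offset + i] != read[i]) {
                    mismatches++;
                }
            }
            if (mismatches <= 1) {
                return offset;
            }
        }
        return -1;
    }

    public static void main(final String[] args) {
        final byte[] haplotype = "ACGTTGCA".getBytes(StandardCharsets.US_ASCII);
        final byte[] read = "GTAG".getBytes(StandardCharsets.US_ASCII);
        final int offset = findOffsetWithAtMostOneMismatch(haplotype, read);
        System.out.println(offset >= 0
                ? "offset " + offset + ", cigar " + read.length + "M"
                : "no shortcut found, fall back to Smith-Waterman");
    }
}
```

When the method returns -1, the caller would fall back to the full Smith-Waterman alignment.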

print number of SW and nonSW alignments

added oneMismatch heuristic and tests

added count for oneMismatch heuristic
Contributor

@davidbenjamin davidbenjamin left a comment

Congratulations on your first PR @jonathoncohen98! I have a lot of comments but they're mainly about style. No fundamental changes.

@jonathoncohen98 jonathoncohen98 self-assigned this Jun 26, 2019

codecov bot commented Jun 26, 2019

Codecov Report

Merging #6015 into master will decrease coverage by 42.829%.
The diff coverage is 49.73%.

@@               Coverage Diff                @@
##              master     #6015        +/-   ##
================================================
- Coverage     86.927%   44.098%   -42.829%     
+ Complexity     32765     20104     -12661     
================================================
  Files           2016      2011         -5     
  Lines         151466    151102       -364     
  Branches       16628     16160       -468     
================================================
- Hits          131665     66633     -65032     
- Misses         13737     79480     +65743     
+ Partials        6064      4989      -1075
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...nstitute/hellbender/utils/read/AlignmentUtils.java | 60.701% <ø> (-17.199%) | 126 <0> (-44) |
| ...broadinstitute/hellbender/utils/UtilsUnitTest.java | 0.176% <0%> (-93.432%) | 1 <0> (-99) |
| ...walkers/haplotypecaller/HaplotypeCallerEngine.java | 78.767% <75%> (+0.22%) | 74 <0> (ø) ⬇️ |
| ...ava/org/broadinstitute/hellbender/utils/Utils.java | 48.466% <82.609%> (-33.065%) | 115 <17> (-43) |
| .../utils/smithwaterman/SmithWatermanJavaAligner.java | 91.469% <96.226%> (-2.648%) | 50 <0> (+4) |
| ...ls/variant/writers/GVCFBlockCombiningIterator.java | 0% <0%> (-100%) | 0% <0%> (-1%) |
| ...ls/walkers/genotyper/HeterogeneousPloidyModel.java | 0% <0%> (-100%) | 0% <0%> (-14%) |
| ...nder/utils/downsampling/FractionalDownsampler.java | 0% <0%> (-100%) | 0% <0%> (-17%) |
| ...park/pathseq/MarkedOpticalDuplicateReadFilter.java | 0% <0%> (-100%) | 0% <0%> (-4%) |
| ...otypecaller/RandomLikelihoodCalculationEngine.java | 0% <0%> (-100%) | 0% <0%> (-6%) |

... and 1215 more

Collaborator

droazen commented Jun 26, 2019

@jonathoncohen98 Can I suggest that the new heuristics not be on by default until we've run on full-size data? (I.e., add an optional argument to turn them on.)

Contributor

@davidbenjamin davidbenjamin left a comment

Looks good. I'm nearly satisfied. The main thing is to put in the toggle that @droazen requested.

Contributor

@davidbenjamin davidbenjamin left a comment

@@ -1041,128 +1048,334 @@ public static boolean xor(final boolean x, final boolean y) {
* @param reference the reference sequence
* @param query the query sequence
*/
public static int lastIndexOf(final byte[] reference, final byte[] query) {
Collaborator

I would leave lastIndexOf intact and make lastIndexOfAtMostTwoMismatches() its own separate method in the same class.

Collaborator

Or at the very least rename it.

@@ -1041,128 +1048,334 @@ public static boolean xor(final boolean x, final boolean y) {
* @param reference the reference sequence
* @param query the query sequence
*/
public static int lastIndexOf(final byte[] reference, final byte[] query) {
public static int lastIndexOfAtMostTwoMismatches(final byte[] reference, final byte[] query, final int allowedMismatches, int refIndexBound) {
Collaborator

Add javadoc explaining the new arguments
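
A minimal sketch of what a separate, documented method could look like. It is an illustration under assumptions rather than the PR's code: it uses the `lastIndexOfAtMostNMismatches` name suggested later in this review, and it omits the `refIndexBound` parameter because its semantics are not visible in the excerpt above. With `allowedMismatches` set to 0 it behaves like an exact `lastIndexOf`.

```java
/** Hypothetical sketch; the class name and the omission of refIndexBound are assumptions. */
final class MismatchSearchSketch {
    private MismatchSearchSketch() {}

    /**
     * Finds the last occurrence of the query in the reference, tolerating a
     * bounded number of mismatching bases.
     *
     * @param reference the reference sequence
     * @param query the query sequence
     * @param allowedMismatches the maximum number of mismatching bases tolerated; must be non-negative
     * @return the greatest start index in the reference at which the query aligns with at most
     *         allowedMismatches mismatches, or -1 if there is no such index
     */
    static int lastIndexOfAtMostNMismatches(final byte[] reference, final byte[] query,
                                            final int allowedMismatches) {
        if (allowedMismatches < 0) {
            throw new IllegalArgumentException("allowedMismatches must be non-negative");
        }
        // Scan candidate start positions from right to left so the first hit is the last index.
        for (int start = reference.length - query.length; start >= 0; start--) {
            int mismatches = 0;
            for (int i = 0; i < query.length; i++) {
                if (reference[start + i] != query[i] && ++mismatches > allowedMismatches) {
                    break; // too many mismatches at this start position
                }
            }
            if (mismatches <= allowedMismatches) {
                return start;
            }
        }
        return -1;
    }

    public static void main(final String[] args) {
        final byte[] reference = "ACGTACGT".getBytes();
        System.out.println(lastIndexOfAtMostNMismatches(reference, "CGT".getBytes(), 0));
        System.out.println(lastIndexOfAtMostNMismatches(reference, "CGA".getBytes(), 1));
    }
}
```

Keeping the exact-match `lastIndexOf` untouched and delegating nothing to it means existing callers are unaffected by the new method.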

final int expected = new String(reference).lastIndexOf(new String(query));
Assert.assertEquals(result, expected);
}
}

@Test
public void atMostOneIndel(){
Collaborator

I would change this to a data provider that has the fields reference, query, and expected result.
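
The table-driven shape being asked for could look something like the sketch below. To stay dependency-free it loops over the table by hand; in TestNG the `lastIndexOfCases` method would instead be annotated with `@DataProvider` and the test method would take the three fields as parameters. All names here are hypothetical, and `String.lastIndexOf` stands in for the method under test because the `atMostOneIndel` implementation is not shown in this excerpt.

```java
/** Hypothetical, dependency-free sketch of a data-provider style test table. */
final class DataProviderSketch {
    private DataProviderSketch() {}

    /** Each row holds: reference, query, expected result. */
    static Object[][] lastIndexOfCases() {
        return new Object[][] {
            {"ACGTACGT", "CGT", 5},
            {"ACGTACGT", "TTT", -1},
            {"AAAA", "AA", 2},
        };
    }

    /** Stand-in for the method under test. */
    static int lastIndexOf(final byte[] reference, final byte[] query) {
        return new String(reference).lastIndexOf(new String(query));
    }

    /** Runs every row and returns how many rows did not produce the expected result. */
    static int countFailures() {
        int failures = 0;
        for (final Object[] row : lastIndexOfCases()) {
            final byte[] reference = ((String) row[0]).getBytes();
            final byte[] query = ((String) row[1]).getBytes();
            final int expected = (Integer) row[2];
            if (lastIndexOf(reference, query) != expected) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(final String[] args) {
        System.out.println("failing rows: " + countFailures());
    }
}
```

Adding a new case then only means adding a row, and each row is reported separately when run under a test framework.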

* Global Alignment
*
* Returns the index and size of the indel (as a 2 element int array) or -1 and 0 if an indel less than 4 bases is not found
*
Collaborator

Add a note about tiebreaking here.

Collaborator

Also specify that this approach is predicated on the assumption that the two strings SHOULD be the same length. And I assume that you account for the case where two indels happen to be cheaper than one?

*
* @param reference the reference sequence
* @param query the query sequence
* @param maxIndelLength the maximum length indel we look for
Collaborator

Add @return and an explanation of the two pair values to this javadoc.

int matchingBases = refIndexBack - alignmentOffset + 1;
int indelSize = queryIndexBack - matchingBases + 1;
if(indelSize <= insertion.getIndelSize()){
insertion.setAlignmentOffset(alignmentOffset);
Collaborator

Make these Indel objects immutable and just construct new ones if you find a shorter indel.
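
One possible shape for an immutable `Indel`, offered as a sketch only. The field names follow the getters visible in the diff (`getAlignmentOffset`, `getIndelSize`); the `NONE` sentinel, `isFound`, and `shorterOf` are hypothetical. The tie-breaking rule (a candidate of equal size replaces the current one) mirrors the `<=` comparison in the excerpt above, but that reading is an assumption.

```java
/** Hypothetical sketch of an immutable Indel value; not the class from this PR. */
final class IndelSketch {
    private IndelSketch() {}

    static final class Indel {
        /** Sentinel for "no indel found", using the -1 and 0 values from the javadoc. */
        static final Indel NONE = new Indel(-1, 0);

        private final int alignmentOffset;
        private final int indelSize;

        Indel(final int alignmentOffset, final int indelSize) {
            this.alignmentOffset = alignmentOffset;
            this.indelSize = indelSize;
        }

        int getAlignmentOffset() { return alignmentOffset; }

        int getIndelSize() { return indelSize; }

        boolean isFound() { return alignmentOffset != -1; }

        /**
         * Returns the candidate if it is at least as short as the current indel
         * (ties go to the candidate, an assumed rule); otherwise returns the current one.
         * Nothing is mutated: the caller simply keeps whichever reference is returned.
         */
        static Indel shorterOf(final Indel current, final Indel candidate) {
            if (!candidate.isFound()) {
                return current;
            }
            if (!current.isFound()) {
                return candidate;
            }
            return candidate.getIndelSize() <= current.getIndelSize() ? candidate : current;
        }
    }

    public static void main(final String[] args) {
        Indel best = Indel.NONE;
        best = Indel.shorterOf(best, new Indel(3, 2));
        best = Indel.shorterOf(best, new Indel(5, 1));
        System.out.println("offset " + best.getAlignmentOffset() + ", size " + best.getIndelSize());
    }
}
```

With this shape the setter calls in the loop become a single reassignment such as `insertion = Indel.shorterOf(insertion, new Indel(alignmentOffset, indelSize));`.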

//check for deletion code
//************************************
if(!skipDeletion){
byte[] ref = new byte[refIndexBack + 1];
Collaborator

Make this its own method.

System.arraycopy(reference,0, ref, 0, refIndexBack + 1);
byte[] que = new byte[queryIndexBack + 1];
System.arraycopy(query, 0, que, 0, queryIndexBack + 1);
int matchIndex = lastIndexOfAtMostTwoMismatches(ref, que, 0, ref.length - que.length - maxDeletionSize);
Collaborator

lastIndexOfAtMostTwoMismatches -> lastIndexOfAtMostNMismatches

}
}

if(insertion.getAlignmentOffset() != -1 && deletion.getAlignmentOffset() == -1){
Collaborator

Add comments explaining all of this tiebreaking/failure detection

@@ -48,6 +50,10 @@ public static SmithWatermanJavaAligner getInstance() {
*/
private SmithWatermanJavaAligner(){}

public SmithWatermanJavaAligner(boolean haplotypeToref){
Collaborator

Hmm... come to think of it, I think the safest thing to do with this code is to make an interface for SmithWatermanJavaAligner that contains all of the common code, with two subclasses: one being the current implementation and the other being your optimized version. Then you can add a new implementation, JAVA_OPTIMIZED, to SmithWatermanAligner that contains your optimizations. This will help encapsulate your changes and allow this branch to get into GATK much sooner.
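
A rough sketch of that structure, under stated assumptions: only `JAVA_OPTIMIZED` comes from the comment above, every other name is hypothetical, and the alignment result is reduced to a string so the example stays self-contained. The real GATK aligner API is not reproduced here. The shared flow lives in a common base type, the standard variant never takes a shortcut, and the optimized variant tries a heuristic (here just an exact-equality check standing in for the PR's heuristics) before falling back.

```java
import java.util.Arrays;
import java.util.function.Supplier;

/** Hypothetical sketch of the suggested structure; names and signatures are assumptions. */
final class AlignerStructureSketch {
    private AlignerStructureSketch() {}

    /** Simplified stand-in for the aligner contract; returns a CIGAR-like string. */
    interface Aligner {
        String align(byte[] reference, byte[] query);
    }

    /** Holds the code shared by both Java implementations. */
    abstract static class AbstractJavaAligner implements Aligner {
        @Override
        public final String align(final byte[] reference, final byte[] query) {
            final String shortcut = tryHeuristics(reference, query);
            return shortcut != null ? shortcut : fullSmithWaterman(reference, query);
        }

        /** Returns an alignment if a heuristic applies, or null to request the full algorithm. */
        protected String tryHeuristics(final byte[] reference, final byte[] query) {
            return null;
        }

        /** Placeholder for the existing dynamic-programming alignment. */
        protected String fullSmithWaterman(final byte[] reference, final byte[] query) {
            return "<full Smith-Waterman>";
        }
    }

    /** Current behavior: always runs the full algorithm. */
    static final class StandardJavaAligner extends AbstractJavaAligner { }

    /** Optimized behavior: tries a cheap check first (exact equality as a stand-in heuristic). */
    static final class OptimizedJavaAligner extends AbstractJavaAligner {
        @Override
        protected String tryHeuristics(final byte[] reference, final byte[] query) {
            return Arrays.equals(reference, query) ? query.length + "M" : null;
        }
    }

    /** Selecting JAVA_OPTIMIZED is an explicit choice, so the heuristics stay off by default. */
    enum Implementation {
        JAVA(StandardJavaAligner::new),
        JAVA_OPTIMIZED(OptimizedJavaAligner::new);

        private final Supplier<Aligner> factory;

        Implementation(final Supplier<Aligner> factory) {
            this.factory = factory;
        }

        Aligner createAligner() {
            return factory.get();
        }
    }

    public static void main(final String[] args) {
        final byte[] hap = "ACGT".getBytes();
        for (final Implementation impl : Implementation.values()) {
            System.out.println(impl + ": " + impl.createAligner().align(hap, hap));
        }
    }
}
```

This layout would also cover the earlier request for a toggle, since callers only get the heuristics by choosing the optimized implementation.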

@davidbenjamin davidbenjamin removed their assignment Jun 9, 2020
@ldgauthier
Contributor

@droazen can we assign this to somebody else since Jonathon's internship is (long) over?

5 participants