Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimization in CigarUtils to shortcut to M-only CIGAR when provably optimal #5466

Merged
merged 1 commit into from
Nov 30, 2018

Conversation

davidbenjamin
Copy link
Contributor

Closes #5459.

This change speeds up Mutect2 and HaplotypeCaller by ~5% when not using hardware Smith-Waterman acceleration. In order to show that the shortcut is only triggered when Smith-Waterman would have given the same result, I modified this branch to throw an exception whenever the shortcut condition is met but Smith-Waterman doesn't agree and ran Mutect2 using this modified version on 60 extremely deep (1000x - 20,000x) mitochondria bams using minimal pruning in order to yield a lot of very thorny haplotypes, as Laura suggested. Over the several hundred thousand haplotypes in these bams the exception was never thrown. Therefore, we can be quite confident that this change has no effect on outputs.

@ldgauthier do you mind taking a look?

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we talked I didn't realize this was going into the general purpose utility method. HC-generated haplotypes get CIGARs based on the classic SW (since you changed the params, which was wonderful), but there are lots of weird cigars in the wild. For instance, if you did get a sequence with a degenerate CIGAR (like deletion of an A and insertion of an A or some more complex things in repeats), then it would be converted to an all-M CIGAR and lose information about those events. I'd at least like a note about degenerate CIGARs. There may be future applications for which they're more interesting and we should note that we "correct" them.

@davidbenjamin
Copy link
Contributor Author

if you did get a sequence with a degenerate CIGAR (like deletion of an A and insertion of an A or some more complex things in repeats), then it would be converted to an all-M CIGAR and lose information about those events.

This PR isn't so aggressive about making all Ms -- it only does that if the all-M CIGAR yields two or fewer mismatches. For an arbitrary insertion + deletion sequence, which is frameshifted between the indels, this is extremely unlikely. Furthermore, in the event that it does occur, two substitution is almost certainly a better explanation than two indels. The only possible exception is if you had some really strange parameters where the SNV penalty was greater than the indel start penalty.

@ldgauthier
Copy link
Contributor

Please make a note. This method is making assumptions about local alignment parameters and assumptions should be noted.

@davidbenjamin
Copy link
Contributor Author

Hmmm, rather than making assumptions, the aligner could expose its SW parameters and then it comes down to a line of arithmetic.

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I've expanded the entirety of the method I can see that we use your NEW_SW_PARAMETERS for the default alignment anyway, which would prefer matches. Never mind. :shipit:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants