Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using cigar complexity to break ties in uninformative reads' best haplotypes #5359

Merged
merged 2 commits into from
Nov 28, 2018

Conversation

davidbenjamin
Copy link
Contributor

@ldgauthier this finishes what we started in #4858 and is necessary for the pileup-calls-on-bamouts MC3 validation. The cause is the same, in that Pair-HMM has a tiny bias in favor of shorter haplotypes and thus it prefers deletion haplotypes when reads end inside STRs. In #4858 we broke near-ties in favor of the reference; this PR fixes the case where two alt haplotypes share a SNV and one of them has a spurious deletion.

One important sanity check was that when I set cigarTerm to zero in AssemblyBasedCallerUtils.java no tests broke. This means that the refactoring needed to set up the change didn't affect behavior.

I looked at most of the sites where PLs and/or DPs changed in the integration test vcfs and in every case the difference was from a fake deletion that this PR fixed. I also went through the diff of the bamouts in IGV and found the same thing.

Finally, the changes to test vcfs in GenotypeGVCFsIntegrationTest and GenomicsDBImporterIntegrationTest are a consequence of changes to the HaplotypeCallerIntegrationTest vcfs.

@davidbenjamin
Copy link
Contributor Author

@ldgauthier Reminder that this is needed for the M2 paper evaluation. I forgot about this PR amid all the other MC3 stuff and would have reminded you earlier had I remembered.

@@ -421,11 +422,12 @@ private void normalizeLikelihoodsPerRead(final double maximumBestAltLikelihoodDi
* @param sampleIndex including sample index.
* @param readIndex target read index.
*
* @param useReferenceIfUninformative
* @param priorities An array of allele priorities (lower score is higher priority) to be used, if present, to break ties for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be having a dyslexic moment, but is lower score higher priority? If we prefer the reference and more complex haplotypes have more negative scores, then higher scores should have higher priorities, yes? Later in the code you update the best allele if the candidatePriority > bestPriority.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. The code is right and the comment is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it.

@codecov-io
Copy link

Codecov Report

Merging #5359 into master will increase coverage by 0.099%.
The diff coverage is 87.5%.

@@               Coverage Diff               @@
##              master     #5359       +/-   ##
===============================================
+ Coverage     86.903%   87.002%   +0.099%     
- Complexity     30311     31188      +877     
===============================================
  Files           1849      1908       +59     
  Lines         140507    144073     +3566     
  Branches       15475     15937      +462     
===============================================
+ Hits          122105    125347     +3242     
- Misses         12793     12965      +172     
- Partials        5609      5761      +152
Impacted Files Coverage Δ Complexity Δ
...kers/haplotypecaller/AssemblyBasedCallerUtils.java 76.563% <85.714%> (+0.14%) 35 <9> (+3) ⬆️
...te/hellbender/utils/genotyper/ReadLikelihoods.java 89.981% <87.879%> (-0.16%) 150 <7> (+7)
...ls/funcotator/metadata/VcfFuncotationMetadata.java 71.429% <0%> (-28.571%) 8% <0%> (+3%)
...bender/utils/GATKProtectedVariantContextUtils.java 67.005% <0%> (-3.325%) 66% <0%> (+1%)
...der/tools/HaplotypeCallerSparkIntegrationTest.java 58.73% <0%> (-3.035%) 12% <0%> (-1%)
...ols/walkers/contamination/ContaminationRecord.java 88.235% <0%> (-2.876%) 5% <0%> (-1%)
...lbender/utils/read/SAMRecordToGATKReadAdapter.java 91.606% <0%> (-2.095%) 144% <0%> (+6%)
...nder/tools/funcotator/TranscriptSelectionMode.java 89.72% <0%> (-1.869%) 1% <0%> (ø)
...tools/funcotator/DataSourceFuncotationFactory.java 86.957% <0%> (-1.68%) 17% <0%> (ø)
...hellbender/tools/walkers/mutect/Mutect2Engine.java 89.881% <0%> (-1.258%) 63% <0%> (+3%)
... and 126 more

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 :shipit:
Thanks for the test change explanation and the painstaking effort of looking at changed likelihoods.

@davidbenjamin davidbenjamin merged commit 7226ad9 into master Nov 28, 2018
@davidbenjamin davidbenjamin deleted the db_best_alleles branch November 28, 2018 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants