Make HaplotypeCaller genotype and output spanning deletions #4963

Merged: cwhelan merged 14 commits into master from cw_haplotypecaller_spanning_deletions on Sep 5, 2018

Conversation

cwhelan
Member

@cwhelan cwhelan commented Jun 28, 2018

This PR modifies HaplotypeCaller so that it can output and genotype spanning deletion alleles represented by the * allele.

Currently, the output of HaplotypeCaller will not include spanning deletion alleles when run in single sample VCF mode or in genotype given alleles mode, even when that genotype would be more appropriate. In the joint calling workflow GenotypeGVCFs adds genotypes for spanning deletions, although the input likelihoods will not be broken out to specifically account for spanning deletion alleles.

Some implementation notes:

  • I also fixed some behavior specific to GGA mode that I encountered while testing this bug. In particular, when GGA mode was run with multiple variants with the same start position or with spanning events, HaplotypeCaller used to emit the warning "Multiple valid VCF records detected in the alleles input file at site " + loc + ", only considering the first record" for each such site. This was a bit of a misleading message, since the other variants were in fact taken into account UNLESS HC decided to emit an empty variant context, for example due to zero coverage.
  • I rewrote the createAlleleMapper method in AssemblyBasedCallerGenotypingEngine. The old version had a very brittle mapping scheme that depended heavily on the ordering of alleles in the variant context created by AssemblyBasedCallerUtils.makeMergedVariantContext and getEventsAtThisLoc. That ordering proved difficult to guarantee once spanning deletions were added, and there was an ominous TODO in the old method saying that the logic was not good enough, so I ended up rewriting it from scratch. The new version is longer, but I hope it is easier to read and less fragile; let me know if I've missed anything.

Tests currently fail on this branch, so it should not be merged yet. To make them pass we need a fix for #4716, which is currently being worked on in #4645. However, since that PR is taking a while to make it through code review, I thought it might be good to start the review process for these changes.

@cwhelan
Member Author

cwhelan commented Jun 28, 2018

Not sure who the best reviewer(s) would be, but it would be good to get some experienced eyes on this because it touches a few core methods of HaplotypeCaller.

Contributor

@davidbenjamin davidbenjamin left a comment

Very nice! This is so much more correct.


@VisibleForTesting
protected static VariantContext composeGivenAllelesVariantContextFromVariantList(final List<VariantContext> variantContextsInFeatureContext, final Locatable loc, final boolean snpsOnly, final boolean keepFiltered) {
final List<VariantContext> rodVcsAtLoc = variantContextsInFeatureContext
Contributor

I feel like "ROD" is a linguistic artifact of GATK 3.

Member Author

Yes, I was trying to tread lightly on renaming variables in this PR, but since you want them gone I'm more than happy to oblige. Renamed.

.stream()
.filter(vc -> vc.getStart() == loc.getStart() &&
(keepFiltered || vc.isNotFiltered()) &&
(!snpsOnly || vc.isSNP()))
Contributor

snpsOnly strikes me as a weird argument to have, and it's only ever false. Why not eliminate it?

Member Author

Must have been a legacy thing. Removed.

}
final List<String> haplotypeSources = rodVcsAtLoc.stream().map(VariantContext::getSource).collect(Collectors.toList());
final VariantContext mergedVc = GATKVariantContextUtils.simpleMerge(rodVcsAtLoc, haplotypeSources,
GATKVariantContextUtils.FilteredRecordMergeType.KEEP_IF_ANY_UNFILTERED,
Contributor

Should this be keepFiltered ? KEEP_UNCONDITIONAL : KEEP_IF_ANY_UNFILTERED?

Member Author

Thanks for catching that. Changed to your suggestion.
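
The filter predicate and the suggested merge-type choice can be sketched together as below. This is a minimal illustration, not the GATK code: `GivenAllelesFilterSketch`, `isCandidate`, and `mergeTypeFor` are hypothetical names, and the local `MergeType` enum only mirrors the two constant names quoted in the review. The idea, as I read the suggestion, is that the merge policy should follow the same `keepFiltered` flag that the record filter uses.

```java
final class GivenAllelesFilterSketch {
    /** Local stand-in that only mirrors the two constant names quoted in the review. */
    enum MergeType { KEEP_IF_ANY_UNFILTERED, KEEP_UNCONDITIONAL }

    /** A record is a candidate when it starts at the locus and passes the filter check. */
    static boolean isCandidate(final int recordStart, final boolean recordIsFiltered,
                               final int locStart, final boolean keepFiltered) {
        return recordStart == locStart && (keepFiltered || !recordIsFiltered);
    }

    /** The merge policy follows the same flag, as suggested in the review comment. */
    static MergeType mergeTypeFor(final boolean keepFiltered) {
        return keepFiltered ? MergeType.KEEP_UNCONDITIONAL : MergeType.KEEP_IF_ANY_UNFILTERED;
    }

    public static void main(final String[] args) {
        // A filtered record at the locus is only a candidate when keepFiltered is set.
        System.out.println(isCandidate(100, true, 100, false));
        System.out.println(isCandidate(100, true, 100, true));
        System.out.println(mergeTypeFor(true));
    }
}
```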

@Override
public int hashCode() {
int result = loc;
result = 31 * result + (alleles != null ? alleles.hashCode() : 0);
Contributor

This seems better as a one-liner return 31 * loc + (alleles != null ? alleles.hashCode() : 0)

Member Author

done
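
For reference, a small self-contained sketch of a value class using the one-line `hashCode` suggested above. The class name `LocationAndAllelesSketch` and the `String` allele type are assumptions made for illustration; only the `loc` and `alleles` field names come from the reviewed snippet.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class LocationAndAllelesSketch {
    private final int loc;
    private final List<String> alleles; // may be null, as in the reviewed snippet

    LocationAndAllelesSketch(final int loc, final List<String> alleles) {
        this.loc = loc;
        this.alleles = alleles;
    }

    @Override
    public boolean equals(final Object o) {
        if (this == o) return true;
        if (!(o instanceof LocationAndAllelesSketch)) return false;
        final LocationAndAllelesSketch that = (LocationAndAllelesSketch) o;
        return loc == that.loc && Objects.equals(alleles, that.alleles);
    }

    @Override
    public int hashCode() {
        // One-liner form from the review; a null allele list contributes 0.
        return 31 * loc + (alleles != null ? alleles.hashCode() : 0);
    }

    public static void main(final String[] args) {
        final LocationAndAllelesSketch a = new LocationAndAllelesSketch(5, Arrays.asList("A", "T"));
        final LocationAndAllelesSketch b = new LocationAndAllelesSketch(5, Arrays.asList("A", "T"));
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```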

} else { // we are in GGA mode!
int compCount = 0;
for( final VariantContext compVC : activeAllelesToGenotype ) {
- if( compVC.getStart() == loc ) {
+ if( compVC.getStart() <= loc && compVC.getEnd() >= loc) {
Contributor

all "comp" names should be renamed to something clearer.

Member Author

Yeah, more weird old legacy names. Done, although I did keep the "Comp" in the vcSourceName in case anyone depended on that downstream.
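
The change in the condition under review, from "starts at this locus" to "overlaps this locus", can be illustrated with the sketch below. `OverlappingEventsSketch` and its `Event` class are hypothetical stand-ins for variant records (1-based, inclusive coordinates). The two example events use the positions and reference-allele lengths of the deletions in the chr20 test records quoted later in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OverlappingEventsSketch {
    /** Hypothetical stand-in for a variant record with 1-based inclusive coordinates. */
    static final class Event {
        final String name;
        final int start;
        final int end;

        Event(final String name, final int start, final int refLength) {
            this.name = name;
            this.start = start;
            this.end = start + refLength - 1;
        }
    }

    /** Old condition: only events that start exactly at loc. */
    static List<String> startingAt(final List<Event> events, final int loc) {
        final List<String> result = new ArrayList<>();
        for (final Event e : events) {
            if (e.start == loc) result.add(e.name);
        }
        return result;
    }

    /** New condition: events that start at loc or span it from an earlier position. */
    static List<String> overlapping(final List<Event> events, final int loc) {
        final List<String> result = new ArrayList<>();
        for (final Event e : events) {
            if (e.start <= loc && e.end >= loc) result.add(e.name);
        }
        return result;
    }

    public static void main(final String[] args) {
        final List<Event> events = Arrays.asList(
                new Event("del158", 10068158, 13),   // REF GTGTATATATATA
                new Event("del160", 10068160, 15));  // REF GTATATATATATGTA
        System.out.println(startingAt(events, 10068160));   // [del160]
        System.out.println(overlapping(events, 10068160));  // [del158, del160]
    }
}
```

With the old condition, the deletion starting at 10068158 is invisible at 10068160; with the new one it is picked up there as a spanning event.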

@@ -442,7 +442,7 @@ public void writeHeader( final VariantContextWriter vcfWriter, final SAMSequence
public ActivityProfileState isActive( final AlignmentContext context, final ReferenceContext ref, final FeatureContext features ) {

if ( hcArgs.genotypingOutputMode == GenotypingOutputMode.GENOTYPE_GIVEN_ALLELES ) {
- final VariantContext vcFromAllelesRod = GenotypingGivenAllelesUtils.composeGivenAllelesVariantContextFromRod(features, ref.getInterval(), false, hcArgs.genotypeFilteredAlleles, logger, hcArgs.alleles);
+ final VariantContext vcFromAllelesRod = GenotypingGivenAllelesUtils.composeGivenAllelesVariantContextFromRod(features, ref.getInterval(), false, hcArgs.genotypeFilteredAlleles, hcArgs.alleles);
Contributor

While you're at it you can rename variables with "rod" in their name.

Member Author

done

@@ -230,7 +227,7 @@ public void shutdown() {
@Override
public ActivityProfileState isActive(final AlignmentContext context, final ReferenceContext ref, final FeatureContext featureContext) {
if ( MTAC.genotypingOutputMode == GenotypingOutputMode.GENOTYPE_GIVEN_ALLELES ) {
- final VariantContext vcFromAllelesRod = GenotypingGivenAllelesUtils.composeGivenAllelesVariantContextFromRod(featureContext, ref.getInterval(), false, MTAC.genotypeFilteredAlleles, logger, MTAC.alleles);
+ final VariantContext vcFromAllelesRod = GenotypingGivenAllelesUtils.composeGivenAllelesVariantContextFromRod(featureContext, ref.getInterval(), false, MTAC.genotypeFilteredAlleles, MTAC.alleles);
Contributor

Feel free to change "rod" names in Mutect code, too.

Member Author

done

@@ -395,6 +396,10 @@ public String toString() {
return startPosKeySet;
}

public Iterator<VariantContext> getSpanningEvents(final int loc) {
Contributor

I would call this getOverlappingEvents because otherwise it suggests that you're only getting spanning deletions.

Member Author

done

public class GenotypingGivenAllelesUtilsUnitTest {

@Test
public void testComposeGivenAllelesVariantContextFromRod() {
Contributor

More griping about "ROD" names.

Member Author

done

@Test
public void testComposeGivenAllelesVariantContextFromRod() {

final SimpleInterval loc = new SimpleInterval("20", 10093568, 10093568);
Contributor

These loci don't need to be 8-digit numbers.

Member Author

done

@cwhelan
Member Author

cwhelan commented Jul 3, 2018

@davidbenjamin Thanks for your review; I've tried to address your comments but let me know if anything still looks like it could use some work or if you spot anything new. Still waiting for tests to pass on this update (bar the GenomicsDB tests that I expect to fail).

Contributor

@davidbenjamin davidbenjamin left a comment

Excellent! 👍 to merge. (That is, once all the tests are resolved).

final Map<Allele, Allele> spanningEventAlleleMappingToMergedVc
= GATKVariantContextUtils.createAlleleMapping(mergedVC.getReference(), spanningEvent, new ArrayList<>());
final Allele remappedSpanningEventAltAllele = spanningEventAlleleMappingToMergedVc.get(spanningEvent.getAlternateAllele(0));
// in the case of GGA mode the spanning event might not match an allele in the mergedVC
Contributor

That does sound correct, and I see that you hooked up SomaticGenotypingEngine to the version of this method that doesn't get the GGA alleles, which is correct because they have already been injected into the haplotypes. So, yes, I think everything is okay.

@cwhelan
Member Author

cwhelan commented Jul 9, 2018

I've rebased on the most recent GenomicsDB changes but am still having trouble getting tests to pass. Not sure yet what the solution is but here is the issue for documentation:

With these new changes HaplotypeCaller produces the following GVCF records on our chr20 test data:

20	10068158	.	GTGTATATATATA	G,<NON_REF>	66.73	.	BaseQRankSum=-0.652;ClippingRankSum=0.000;DP=29;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.328;RAW_MQ=93364.00;ReadPosRankSum=-0.253	GT:AD:DP:GQ:PL:SB	0/1:3,4,0:7:57:104,0,57,114,69,183:0,3,2,2
20	10068160	.	GTATATATATATGTA	G,*,<NON_REF>	697.73	.	DP=28;ExcessHet=3.0103;MLEAC=1,1,0;MLEAF=0.500,0.500,0.00;RAW_MQ=87005.00	GT:AD:DP:GQ:PL:SB	1/2:0,2,4,0:6:53:735,162,131,102,0,53,507,174,108,472:0,0,2,4

If I run this through CombineGVCFs like this:

./gatk CombineGVCFs -V src/test/resources/org/broadinstitute/hellbender/tools/haplotypecaller/expected.testGVCFMode.gatk4.g.vcf -O test_gdb_import_combine.g.vcf -R src/test/resources/large/human_g1k_v37.20.21.fasta

The resulting GVCF has these records:

20      10068158        .       GTGTATATATATA   G,<NON_REF>     .       .       BaseQRankSum=-6.520e-01;ClippingRankSum=0.00;DP=29;ExcessHet=3.01;MQRankSum=0.328;RAW_MQ=93364.00;ReadPosRankSum=-2.530e-01     GT:AD:DP:GQ:PL:SB       ./.:3,4,0:7:57:104,0,57,114,69,183:0,3,2,2
20      10068159        .       T       *,<NON_REF>     .       .       DP=29   GT:AD:DP:GQ:PL:SB       ./.:3,4,0:7:57:104,0,57,114,69,183:0,3,2,2
20      10068160        .       GTATATATATATGTA G,*,<NON_REF>   .       .       DP=28;ExcessHet=3.01;RAW_MQ=87005.00    GT:AD:DP:GQ:PL:SB       ./.:0,2,4,0:6:53:735,162,131,102,0,53,507,174,108,472:0,0,2,4
20      10068161        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068162        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068163        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068164        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068165        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068166        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068167        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068168        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068169        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068170        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068171        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068172        .       G       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068173        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068174        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,4,0:6:53:735,102,53,507,108,472:0,0,2,4
20      10068175        .       T       <NON_REF>       .       .       .       GT:DP:GQ:MIN_DP:PL      ./.:20:17:20:0,17,729

When the original GVCF is imported into GenomicsDB and then extracted:

./gatk GenomicsDBImport -R src/test/resources/large/human_g1k_v37.20.21.fasta -L 20 -V src/test/resources/org/broadinstitute/hellbender/tools/haplotypecaller/expected.testGVCFMode.gatk4.g.vcf -genomicsdb-workspace-path spanDelWorkspace
./gatk SelectVariants -V gendb://spanDelWorkspace -R src/test/resources/large/human_g1k_v37.20.21.fasta -O test.g.vcf -L 20

It contains the following records in this region:

20      10068158        .       GTGTATATATATA   G,<NON_REF>     .       .       BaseQRankSum=-6.520e-01;ClippingRankSum=0.00;DP=29;ExcessHet=3.01;MQRankSum=0.328;RAW_MQ=93364.00;ReadPosRankSum=-2.530e-01     GT:AD:DP:GQ:PL:SB       ./.:3,4,0:7:57:104,0,57,114,69,183:0,3,2,2
20      10068159        .       T       *,<NON_REF>     .       .       DP=29   GT:AD:DP:GQ:PL:SB       ./.:3,4,0:7:57:104,0,57,114,69,183:0,3,2,2
20      10068160        .       GTATATATATATGTA G,*,<NON_REF>   .       .       DP=28;ExcessHet=3.01;RAW_MQ=87005.00    GT:AD:DP:GQ:PL:SB       ./.:0,2,4,0:6:53:735,162,131,102,0,53,507,174,108,472:0,0,2,4
20      10068161        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068162        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068163        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068164        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068165        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068166        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068167        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068168        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068169        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068170        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068171        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068172        .       G       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068173        .       T       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068174        .       A       *,<NON_REF>     .       .       DP=28   GT:AD:DP:GQ:PL:SB       ./.:0,2,0:6:53:735,162,131,507,174,472:0,0,2,4
20      10068175        .       T       <NON_REF>       .       .       .       GT:DP:GQ:MIN_DP:PL      ./.:20:17:20:0,17,729

Note that the FORMAT AD and PL values differ between the two outputs for the sites from 10068161 onward (FORMAT DP is 6 in both). I'm not sure what the right answer should be here; to my eye it looks like both CombineGVCFs and GenomicsDB are wrong, and AD should be 6 for the star alleles at those sites.

@ldgauthier
Contributor

I agree that AD should be 6 for the * allele. It doesn't look entirely trivial to implement in the GATK code, since when we do the allele index remapping we expect dropped alleles to be low quality and don't aggregate their ADs. But that issue aside, I'm concerned that the CombineGVCFs and GenomicsDB results don't match. @kgururaj is it possible #4645 changed something relevant to this bug?
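
To make the proposed AD behavior concrete, here is a minimal sketch of the aggregation, not an implementation of the GATK remapping code. `SpanningDeletionAdSketch` and `collapseDeletionDepths` are hypothetical names. The sketch assumes the reference allele is first, `<NON_REF>` is last, and every allele flagged as a deletion collapses into a single `*` at downstream positions; other ALT alleles are simply dropped.

```java
import java.util.Arrays;

final class SpanningDeletionAdSketch {
    /**
     * Collapses per-allele depths at the deletion start site into [ref, *, NON_REF]
     * for a downstream position covered by the deletions.
     * Assumes ad[0] is the reference and ad[ad.length - 1] is NON_REF.
     */
    static int[] collapseDeletionDepths(final int[] ad, final boolean[] isDeletion) {
        int spanningDepth = 0;
        for (int i = 1; i < ad.length - 1; i++) {
            if (isDeletion[i]) {
                spanningDepth += ad[i]; // sum the depths instead of keeping a single allele
            }
        }
        return new int[] { ad[0], spanningDepth, ad[ad.length - 1] };
    }

    public static void main(final String[] args) {
        // AD at 20:10068160 for alleles [REF, G, *, <NON_REF>] from the record quoted above.
        final int[] ad = { 0, 2, 4, 0 };
        final boolean[] isDeletion = { false, true, true, false };
        System.out.println(Arrays.toString(collapseDeletionDepths(ad, isDeletion))); // [0, 6, 0]
    }
}
```

For comparison, the CombineGVCFs output quoted above reports AD 0,4,0 at the downstream sites and the GenomicsDB output reports 0,2,0; each keeps the depth of only one of the two deletion alleles.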

@kgururaj
Collaborator

kgururaj commented Jul 11, 2018

I'll try to explain my understanding of how spanning deletions and the associated attributes are computed in GenomicsDB - you can let me know what part of the logic I should fix.

  1. The spanning deletion allele at 20 : 10068160 corresponds to the deletion at 20 : 10068158
  2. The spanning deletion alleles at positions 20 : 10068160-10068175 correspond to the deletion GTATATATATATGTA -> G at 20 : 10068160.
  3. In GenomicsDB:
    1. ALT allele GTATATATATATGTA -> G at 20 : 10068160 is considered a deletion.
    2. GTATATATATATGTA -> * at 20 : 10068160 is NOT considered a deletion. Should it be considered a deletion ALT allele and be part of the min PL computation?
    3. Hence, GTATATATATATGTA -> G is the deletion ALT allele with min PL value. All the AD and PL values in the subsequent positions correspond to this ALT allele.

Let me know if I should fix 3.2.
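
The difference between the two outputs above can be reproduced with a small PL-subsetting sketch. The index formula for diploid genotype likelihoods (genotype j/k with j <= k sits at index k*(k+1)/2 + j) follows the VCF specification. Everything else is an assumption for illustration: the class and method names are hypothetical, and "min PL" is interpreted here as the smallest homozygous-ALT PL, which may not be the exact criterion either tool uses.

```java
import java.util.Arrays;

final class SpanningDeletionPlSketch {
    /** Index of diploid genotype j/k (j <= k) in the PL array, per the VCF ordering. */
    static int plIndex(final int j, final int k) {
        return k * (k + 1) / 2 + j;
    }

    /** Subsets diploid PLs to the kept alleles; keptAlleles must be sorted ascending. */
    static int[] subsetPls(final int[] pls, final int[] keptAlleles) {
        final int n = keptAlleles.length;
        final int[] out = new int[n * (n + 1) / 2];
        for (int k = 0; k < n; k++) {
            for (int j = 0; j <= k; j++) {
                out[plIndex(j, k)] = pls[plIndex(keptAlleles[j], keptAlleles[k])];
            }
        }
        return out;
    }

    /** Assumed criterion: the candidate allele with the smallest homozygous PL wins. */
    static int alleleWithMinHomPl(final int[] pls, final int[] candidates) {
        int best = candidates[0];
        for (final int a : candidates) {
            if (pls[plIndex(a, a)] < pls[plIndex(best, best)]) {
                best = a;
            }
        }
        return best;
    }

    public static void main(final String[] args) {
        // PLs at 20:10068160 for alleles 0=REF, 1=G, 2=*, 3=<NON_REF>.
        final int[] pls = { 735, 162, 131, 102, 0, 53, 507, 174, 108, 472 };

        // Only allele 1 (G) counts as a deletion: matches the GenomicsDB output above.
        final int onlyG = alleleWithMinHomPl(pls, new int[] { 1 });
        System.out.println(Arrays.toString(subsetPls(pls, new int[] { 0, onlyG, 3 })));
        // [735, 162, 131, 507, 174, 472]

        // Allele 2 (*) is also a candidate: matches the CombineGVCFs output above.
        final int withStar = alleleWithMinHomPl(pls, new int[] { 1, 2 });
        System.out.println(Arrays.toString(subsetPls(pls, new int[] { 0, withStar, 3 })));
        // [735, 102, 53, 507, 108, 472]
    }
}
```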

@cwhelan cwhelan force-pushed the cw_haplotypecaller_spanning_deletions branch from f5ff180 to 97d3898 Compare July 27, 2018 20:12
@ldgauthier
Contributor

Yes, the * at 10068160 should be considered a deletion. It looks like that must be necessary to match the CombineGVCFs results.

@kgururaj
Collaborator

Thanks Laura, the next release will incorporate both your suggestions.

Contributor

@ldgauthier ldgauthier left a comment

I'm not so familiar with the GGA code anymore, but nothing jumped out at me as being amiss. I looked at all the integration test changes and they're all improvements.


/**
* Compendium of utils to work in GENOTYPE_GIVEN_ALLELES mode.
*/
public final class GenotypingGivenAllelesUtils {

/**
- * Composes the given allele variant-context providing information about the rods and reference location.
+ * Composes the given allele variant-context providing information about the given allele variants and reference location.
Contributor

I think I would say "providing information about the given variant alleles"

@@ -1124,8 +1124,8 @@
20 10068149 . T <NON_REF> . . END=10068151 GT:DP:GQ:MIN_DP:PL 0/0:31:63:31:0,63,945
20 10068152 . A <NON_REF> . . END=10068155 GT:DP:GQ:MIN_DP:PL 0/0:29:57:28:0,57,855
20 10068156 . A <NON_REF> . . END=10068157 GT:DP:GQ:MIN_DP:PL 0/0:27:51:27:0,51,765
20 10068158 . GTGTATATATATA G,<NON_REF> 66.73 . AS_RAW_BaseQRankSum=|-0.7,1|NaN;AS_RAW_MQ=5282.00|8882.00|0.00;AS_RAW_MQRankSum=|0.3,1|NaN;AS_RAW_ReadPosRankSum=|0.5,1|NaN;AS_SB_TABLE=0,3|2,2|0,0;BaseQRankSum=-0.652;DP=29;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.328;RAW_MQ=93364.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:3,4,0:7:57:0|1:10068158_GTGTATATATATA_G:104,0,57,114,69,183:0,3,2,2
20 10068160 . GTATATATATATGTA G,<NON_REF> 7.57 . AS_RAW_BaseQRankSum=|-0.6,1|NaN;AS_RAW_MQ=9723.00|1682.00|0.00;AS_RAW_MQRankSum=|-0.8,1|NaN;AS_RAW_ReadPosRankSum=|0.0,1|NaN;AS_SB_TABLE=2,3|0,2|0,0;BaseQRankSum=-0.566;DP=28;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.712;RAW_MQ=87005.00;ReadPosRankSum=0.000 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:5,2,0:7:44:1|0:10068158_GTGTATATATATA_G:44,0,134,60,141,201:2,3,0,2
Contributor

These changes are all great.

kgururaj added a commit to Intel-HLS/GenomicsDB that referenced this pull request Aug 21, 2018
* A spanning deletion is also considered a deletion allele - fix as
directed by broadinstitute/gatk#4963 (comment)
@ldgauthier
Contributor

@cwhelan my understanding is that this is blocked because tests won't pass until there's a GDB update -- yes?

@cwhelan
Member Author

cwhelan commented Aug 24, 2018

@ldgauthier Yes, this was just waiting on some resolution to the double-spanning deletion issue reported above. If a new GDB update makes the output of GenomicsDBImport concordant with that of CombineGVCFs in terms of the DP and AD values listed above, I think I can update the test files to reflect the new behavior. I still think that the behavior of CombineGVCFs and GenomicsDBImport is not exactly right, as I mentioned in the last line of my long comment above, but maybe that's something that could be spun out into its own issue to fix independently, as long as we're OK with running with the current CombineGVCFs behavior in the meantime.

@ldgauthier
Contributor

I know you don't need this for your project anymore, but I'd rather get this in and leave the AD correction for overlapping deletions for another issue.

@ldgauthier ldgauthier removed their assignment Aug 24, 2018
@cwhelan cwhelan force-pushed the cw_haplotypecaller_spanning_deletions branch from 97d3898 to 0284666 Compare August 31, 2018 14:39
@cwhelan
Member Author

cwhelan commented Sep 4, 2018

@ldgauthier I've updated GenomicsDBImportIntegrationTest.testGenomicsDBImportFileInputsAgainstCombineGVCFWithNonDiploidData to use its own expected output VCF rather than running CombineGVCFs and comparing to the output of that. Let me know if you want to take one more look, or have someone else do so, before giving this a final thumbs-up.

@cwhelan cwhelan changed the title Make HaplotypeCaller genotype and output spanning deletions -- DO NOT MERGE YET Make HaplotypeCaller genotype and output spanning deletions Sep 4, 2018
@ldgauthier
Contributor

If you add an issue for the combining of AD for multiple spanning deletions (which will ultimately improve your new expected results here) then I'll consider this 👍

@cwhelan cwhelan merged commit 33af6de into master Sep 5, 2018
@cwhelan cwhelan deleted the cw_haplotypecaller_spanning_deletions branch September 6, 2018 13:48
kgururaj added a commit that referenced this pull request Nov 8, 2018
from earlier positions) as deletions in the min PL value computation.
This behavior now matches the behavior of CombineGVCFs.

A more detailed description of the issue is provided in
#4963

* Deleted a couple of files which are no longer necessary.
* Fixed the index of newMQcalc.combined.g.vcf
kgururaj added a commit that referenced this pull request Nov 16, 2018
from earlier positions) as deletions in the min PL value computation.
This behavior now matches the behavior of CombineGVCFs.

A more detailed description of the issue is provided in
#4963

* Deleted a couple of files which are no longer necessary.
* Fixed the index of newMQcalc.combined.g.vcf
kgururaj added a commit that referenced this pull request Nov 19, 2018
from earlier positions) as deletions in the min PL value computation.
This behavior now matches the behavior of CombineGVCFs.

A more detailed description of the issue is provided in
#4963

* Deleted a couple of files which are no longer necessary.
* Fixed the index of newMQcalc.combined.g.vcf
droazen pushed a commit that referenced this pull request Jan 8, 2019
from earlier positions) as deletions in the min PL value computation.
This behavior now matches the behavior of CombineGVCFs.

A more detailed description of the issue is provided in
#4963

* Deleted a couple of files which are no longer necessary.
* Fixed the index of newMQcalc.combined.g.vcf
droazen pushed a commit that referenced this pull request Jan 16, 2019
* The newest release of GenomicsDB treats spanning deletions (spanning
from earlier positions) as deletions in the min PL value computation.
This behavior now matches the behavior of CombineGVCFs.

A more detailed description of the issue is provided in
#4963

* Deleted a couple of files which are no longer necessary.

* Fixed the index of newMQcalc.combined.g.vcf

* Fix for #5300 when multiple
reader-threads are used in the importer. Not a race condition in
GenomicsDB - InitializedQueryWrapper wasn't written for multiple
intervals.
4 participants