Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The rest of the mitochondria joint calling pipeline #5673

Merged
merged 9 commits into from
Feb 15, 2019

Conversation

ldgauthier
Copy link
Contributor

Merging and genotyping "somatic" GVCFs from Mutect2

…ct2 branch

Move annotations for filtering to format; update with fixed LODs
@codecov-io
Copy link

codecov-io commented Feb 13, 2019

Codecov Report

Merging #5673 into master will decrease coverage by 49.984%.
The diff coverage is 58.663%.

@@               Coverage Diff               @@
##             master     #5673        +/-   ##
===============================================
- Coverage     87.05%   37.066%   -49.984%     
+ Complexity    31708     17626     -14082     
===============================================
  Files          1940      1940                
  Lines        146142    146585       +443     
  Branches      16128     16215        +87     
===============================================
- Hits         127216     54333     -72883     
- Misses        13041     87349     +74308     
+ Partials       5885      4903       -982
Impacted Files Coverage Δ Complexity Δ
.../tools/walkers/mutect/SomaticGenotypingEngine.java 90.05% <ø> (ø) 76 <0> (ø) ⬇️
...r/tools/walkers/annotator/PerAlleleAnnotation.java 68.75% <ø> (-6.25%) 14 <0> (-2)
...ute/hellbender/utils/variant/GATKVCFConstants.java 75% <ø> (ø) 4 <0> (ø) ⬇️
...der/tools/walkers/CombineGVCFsIntegrationTest.java 1.258% <0%> (-86.19%) 2 <0> (-22)
.../tools/walkers/annotator/ReadPositionUnitTest.java 5.882% <0%> (-94.118%) 2 <0> (-3)
...er/tools/walkers/GenotypeGVCFsIntegrationTest.java 4.186% <1.754%> (-76.827%) 2 <0> (-23)
...e/hellbender/utils/variant/GATKVCFHeaderLines.java 96.089% <100%> (+0.183%) 12 <2> (+2) ⬆️
...llbender/tools/walkers/annotator/ReadPosition.java 87.5% <100%> (ø) 8 <1> (ø) ⬇️
...bender/tools/walkers/annotator/FragmentLength.java 100% <100%> (ø) 7 <1> (ø) ⬇️
...ender/tools/walkers/annotator/AnnotationUtils.java 85.714% <100%> (ø) 3 <0> (ø) ⬇️
... and 1248 more

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to @ldgauthier

@@ -80,6 +78,9 @@

public static final String BP_RES_LONG_NAME = "convert-to-base-pair-resolution";
public static final String BREAK_BANDS_LONG_NAME = "break-bands-at-multiples-of";
public static final String USE_SOMATIC_LONG_NAME = "input-is-somatic";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SOMATIC_INPUT seems better than USE_SOMATIC, IMO.

if (!dropSomaticFilteringAnnotations) {
//standard M2 INFO annotations for filtering will get moved to FORMAT field
//TODO: I don't have a great solution for using the same descriptions other than copying
headerLines.add(new VCFFormatHeaderLine(GATKVCFConstants.MEDIAN_BASE_QUALITY_KEY, VCFHeaderLineCount.R, VCFHeaderLineType.Integer, "median base quality"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all of the PerAlleleAnnotations, in that class there is a method:

public List<VCFInfoHeaderLine> getDescriptions() {
        return Arrays.asList(new VCFInfoHeaderLine(getVcfKey(), includeRefAllele() ? VCFHeaderLineCount.R : VCFHeaderLineCount.A, VCFHeaderLineType.Integer, getDescription()));
    }

To that you could add a method

public VCFFormatHeaderLine getFormatHeaderLine() {
        return new VCFInfoHeaderLine(getVcfKey(), includeRefAllele() ? VCFHeaderLineCount.R : VCFHeaderLineCount.A, VCFHeaderLineType.Integer, getDescription());
    }

And then invoke as headerLines.add(new BaseQuality().getFormatHeaderLine()) etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the annotations that are not of that type but are in GATKVCFHeaderLines, you could add a method

public static VCFFormatHeaderLine getEquivalentFormatHeaderLine(final String infoLineKey) {
  final VCFInfoHeaderLine infoLine = getInfoLine(infoLineKey);
   return new VCFFormatHeaderLine(infoLine.getID(), infoLine.isFixedCount() ? infoLine.getCount() : infoLine.getCountType(), infoLine.getType(), infoLine.getDescription());
}

You could invoke as
headerLines.add(GATKVCFHeaderLines.getEquivalentFormatHeaderLine(EVENT_COUNT_IN_HAPLOTYPE_KEY)); etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solutions get around the issue of having two independent definitions of the corresponding INFO and FORMAT lines that could get out of sync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the getEquivalentFormatHeaderLine idea, and since that takes a string, which I already have in a list (which I moved to Mutect2FilteringEngine, but am open to suggestions) I think I have a nice elegant solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely elegant because I still had to add your header lines to the master list, but not too bad. Much better than the reflection rabbit hole I started down.


import java.io.File;
import java.util.*;
import java.util.stream.Collectors;

import static org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.ALLELE_FRACTION_DELTA_LONG_NAME;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import static is usually IntelliJ being overzealous. . .

@@ -227,8 +263,8 @@ private void setupVCFWriter(VCFHeader inputVCFHeader, SampleList samples) {
vcfWriter = createVCFWriter(outputFile);

final Set<String> sampleNameSet = samples.asSetOfSamples();
final VCFHeader vcfHeader = new VCFHeader(headerLines, new TreeSet<>(sampleNameSet));
vcfWriter.writeHeader(vcfHeader);
outputHeader = new VCFHeader(headerLines, new TreeSet<>(sampleNameSet));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you inline this and remove it from the class's members?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I need it for some of the per-allele annotation subsetting magic later on.

@@ -238,7 +274,12 @@ public void apply(final Locatable loc, List<VariantContext> variants, ReadsConte

ref.setWindow(10, 10); //TODO this matches the gatk3 behavior but may be unnecessary
final VariantContext mergedVC = merger.merge(variantsToProcess, loc, includeNonVariants ? ref.getBase() : null, !includeNonVariants, false);
final VariantContext regenotypedVC = regenotypeVC(mergedVC, ref, features, includeNonVariants);
final VariantContext regenotypedVC;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably already knew by now that I would suggest using a ternary here. That doesn't mean you have to accept the suggestion, though.

static double[] calculateSomaticLikelihoodSums(final VariantContext vc) {
final double[] likelihoodSums = new double[vc.getNAlleles()];
for (final Genotype genotype : vc.getGenotypes().iterateInSampleNameOrder()) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there something missing here?

@@ -29,12 +37,15 @@
import java.util.function.BiConsumer;
import java.util.stream.Collectors;

import static org.broadinstitute.hellbender.utils.variant.GATKVCFConstants.MEDIAN_MAPPING_QUALITY_KEY;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional?

@@ -51,7 +62,7 @@
public Object[][] gvcfsToCombine() {
return new Object[][]{
// Simple Test, spanning deletions
{new File[]{getTestFile("spanningDel.1.g.vcf"),getTestFile("spanningDel.2.g.vcf")}, getTestFile("spanningDeletionRestrictToStartExpected.vcf"), NO_EXTRA_ARGS, b37_reference_20_21},
/* {new File[]{getTestFile("spanningDel.1.g.vcf"),getTestFile("spanningDel.2.g.vcf")}, getTestFile("spanningDeletionRestrictToStartExpected.vcf"), NO_EXTRA_ARGS, b37_reference_20_21},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to stay commented out?

public void testCombineSomaticGvcfs() throws Exception {
final File output = createTempFile("combinegvcfs", ".vcf");
final ArgumentsBuilder args = new ArgumentsBuilder();
args.addReference(new File(b37Reference));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you chain the builder methods?

@@ -44,6 +50,8 @@
private static final String BASE_PAIR_EXPECTED = "gvcf.basepairResolution.gatk3.7_30_ga4f720357.output.vcf";
private static final String b38_reference_20_21 = largeFileTestDir + "Homo_sapiens_assembly38.20.21.fasta";
private static final String BASE_PAIR_GVCF = "gvcf.basepairResolution.gvcf";
private static final File MITO_REF = new File(toolsTestDir, "mutect/mito/Homo_sapiens_assembly38.mt_only.fasta");
private static final double tlodThreshold = 0.8 * M2FiltersArgumentCollection.DEFAULT_TLOD_FILTER_THRESHOLD;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T_LOD_THRESHOLD

@ldgauthier
Copy link
Contributor Author

@davidbenjamin comments addressed, but this can wait until your homoplasmic fix is in.

/**
* Rather than move the per-sample INFO annotations used for filtering to the FORMAT field, drop them entirely.
* This makes the FORMAT field more readable and reduces file sizes for large cohorts.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the current plan is to filter at the single sample level and then merge with no callset level filters, is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be callset-level filters. If any sample has a PASS variant the site will be PASS. If all variants are filtered the site will get all the filters. The original filters will go to the genotype filter (GF) field.

GATKVCFConstants.ORIGINAL_CONTIG_MISMATCH_KEY,
GATKVCFConstants.STRAND_ARTIFACT_AF_KEY,
GATKVCFConstants.STRAND_ARTIFACT_POSTERIOR_KEY,
GATKVCFConstants.HAPLOTYPE_CALLER_PHASING_GT_KEY,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are HaplotypeCaller keys in the somatic format annotations list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's a name that's now outdated. I'm not sure what else to call it. ASSEMBLY_BASED_CALLER_PHASING_GT_KEY just doesn't have a great ring to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to leave it, I was just confused.

@@ -143,7 +194,7 @@ public VariantContext merge(final List<VariantContext> vcs, final Locatable loc,
final Allele replacement;
if ( allele.equals(Allele.NON_REF_ALLELE) ) {
replacement = allele;
} else if ( allele.length() < vc.getReference().length() ) {
} else if ( !doSomaticMerge && allele.length() < vc.getReference().length() ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not expect spanning deletions in the somatic case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really want to keep track of the allele fractions for all the deletions. If Kristen or Sarah or someone else really wants them I can spend the time.

values.add(parseNumericInfoAttributeValue(vcfInputHeader, key, value.toString()));
}
else {
String[] valueArray = value.toString().split("\\[|, |\\]");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels slightly fragile. Are there constants like "," somewhere more central to VCF parsing?

if (vc.getStart() == 14872) {
Assert.assertTrue(!vc.isFiltered());
Assert.assertTrue(!vc.getGenotype(0).isFiltered());
Assert.assertTrue(vc.getGenotype(1).getFilters().contains("t_lod"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"t_lod" should be a constant from GATKVCFConstants.TUMOR_LOD_FILTER_NAME (especially if we ever plan on changing it to lod).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. It's actually about to change, in fact.

@davidbenjamin
Copy link
Contributor

@ldgauthier This can go in before the homoplasmic bug fix. I don't foresee it causing difficulties with the rebase.

@ldgauthier ldgauthier merged commit e5a6fed into master Feb 15, 2019
@ldgauthier ldgauthier deleted the ldg_mergeSomaticGVCFs branch February 15, 2019 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants