-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The rest of the mitochondria joint calling pipeline #5673
Conversation
…ct2 branch Move annotations for filtering to format; update with fixed LODs
Codecov Report
@@ Coverage Diff @@
## master #5673 +/- ##
===============================================
- Coverage 87.05% 37.066% -49.984%
+ Complexity 31708 17626 -14082
===============================================
Files 1940 1940
Lines 146142 146585 +443
Branches 16128 16215 +87
===============================================
- Hits 127216 54333 -72883
- Misses 13041 87349 +74308
+ Partials 5885 4903 -982
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to @ldgauthier
@@ -80,6 +78,9 @@ | |||
|
|||
public static final String BP_RES_LONG_NAME = "convert-to-base-pair-resolution"; | |||
public static final String BREAK_BANDS_LONG_NAME = "break-bands-at-multiples-of"; | |||
public static final String USE_SOMATIC_LONG_NAME = "input-is-somatic"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SOMATIC_INPUT
seems better than USE_SOMATIC
, IMO.
if (!dropSomaticFilteringAnnotations) { | ||
//standard M2 INFO annotations for filtering will get moved to FORMAT field | ||
//TODO: I don't have a great solution for using the same descriptions other than copying | ||
headerLines.add(new VCFFormatHeaderLine(GATKVCFConstants.MEDIAN_BASE_QUALITY_KEY, VCFHeaderLineCount.R, VCFHeaderLineType.Integer, "median base quality")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all of the PerAlleleAnnotation
s, in that class there is a method:
public List<VCFInfoHeaderLine> getDescriptions() {
return Arrays.asList(new VCFInfoHeaderLine(getVcfKey(), includeRefAllele() ? VCFHeaderLineCount.R : VCFHeaderLineCount.A, VCFHeaderLineType.Integer, getDescription()));
}
To that you could add a method
public VCFFormatHeaderLine getFormatHeaderLine() {
return new VCFInfoHeaderLine(getVcfKey(), includeRefAllele() ? VCFHeaderLineCount.R : VCFHeaderLineCount.A, VCFHeaderLineType.Integer, getDescription());
}
And then invoke as headerLines.add(new BaseQuality().getFormatHeaderLine())
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the annotations that are not of that type but are in GATKVCFHeaderLines
, you could add a method
public static VCFFormatHeaderLine getEquivalentFormatHeaderLine(final String infoLineKey) {
final VCFInfoHeaderLine infoLine = getInfoLine(infoLineKey);
return new VCFFormatHeaderLine(infoLine.getID(), infoLine.isFixedCount() ? infoLine.getCount() : infoLine.getCountType(), infoLine.getType(), infoLine.getDescription());
}
You could invoke as
headerLines.add(GATKVCFHeaderLines.getEquivalentFormatHeaderLine(EVENT_COUNT_IN_HAPLOTYPE_KEY));
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solutions get around the issue of having two independent definitions of the corresponding INFO and FORMAT lines that could get out of sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the getEquivalentFormatHeaderLine idea, and since that takes a string, which I already have in a list (which I moved to Mutect2FilteringEngine, but am open to suggestions) I think I have a nice elegant solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely elegant because I still had to add your header lines to the master list, but not too bad. Much better than the reflection rabbit hole I started down.
|
||
import java.io.File; | ||
import java.util.*; | ||
import java.util.stream.Collectors; | ||
|
||
import static org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.ALLELE_FRACTION_DELTA_LONG_NAME; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import static
is usually IntelliJ being overzealous. . .
@@ -227,8 +263,8 @@ private void setupVCFWriter(VCFHeader inputVCFHeader, SampleList samples) { | |||
vcfWriter = createVCFWriter(outputFile); | |||
|
|||
final Set<String> sampleNameSet = samples.asSetOfSamples(); | |||
final VCFHeader vcfHeader = new VCFHeader(headerLines, new TreeSet<>(sampleNameSet)); | |||
vcfWriter.writeHeader(vcfHeader); | |||
outputHeader = new VCFHeader(headerLines, new TreeSet<>(sampleNameSet)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you inline this and remove it from the class's members?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, I need it for some of the per-allele annotation subsetting magic later on.
@@ -238,7 +274,12 @@ public void apply(final Locatable loc, List<VariantContext> variants, ReadsConte | |||
|
|||
ref.setWindow(10, 10); //TODO this matches the gatk3 behavior but may be unnecessary | |||
final VariantContext mergedVC = merger.merge(variantsToProcess, loc, includeNonVariants ? ref.getBase() : null, !includeNonVariants, false); | |||
final VariantContext regenotypedVC = regenotypeVC(mergedVC, ref, features, includeNonVariants); | |||
final VariantContext regenotypedVC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably already knew by now that I would suggest using a ternary here. That doesn't mean you have to accept the suggestion, though.
static double[] calculateSomaticLikelihoodSums(final VariantContext vc) { | ||
final double[] likelihoodSums = new double[vc.getNAlleles()]; | ||
for (final Genotype genotype : vc.getGenotypes().iterateInSampleNameOrder()) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something missing here?
@@ -29,12 +37,15 @@ | |||
import java.util.function.BiConsumer; | |||
import java.util.stream.Collectors; | |||
|
|||
import static org.broadinstitute.hellbender.utils.variant.GATKVCFConstants.MEDIAN_MAPPING_QUALITY_KEY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentional?
@@ -51,7 +62,7 @@ | |||
public Object[][] gvcfsToCombine() { | |||
return new Object[][]{ | |||
// Simple Test, spanning deletions | |||
{new File[]{getTestFile("spanningDel.1.g.vcf"),getTestFile("spanningDel.2.g.vcf")}, getTestFile("spanningDeletionRestrictToStartExpected.vcf"), NO_EXTRA_ARGS, b37_reference_20_21}, | |||
/* {new File[]{getTestFile("spanningDel.1.g.vcf"),getTestFile("spanningDel.2.g.vcf")}, getTestFile("spanningDeletionRestrictToStartExpected.vcf"), NO_EXTRA_ARGS, b37_reference_20_21}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to stay commented out?
public void testCombineSomaticGvcfs() throws Exception { | ||
final File output = createTempFile("combinegvcfs", ".vcf"); | ||
final ArgumentsBuilder args = new ArgumentsBuilder(); | ||
args.addReference(new File(b37Reference)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you chain the builder methods?
@@ -44,6 +50,8 @@ | |||
private static final String BASE_PAIR_EXPECTED = "gvcf.basepairResolution.gatk3.7_30_ga4f720357.output.vcf"; | |||
private static final String b38_reference_20_21 = largeFileTestDir + "Homo_sapiens_assembly38.20.21.fasta"; | |||
private static final String BASE_PAIR_GVCF = "gvcf.basepairResolution.gvcf"; | |||
private static final File MITO_REF = new File(toolsTestDir, "mutect/mito/Homo_sapiens_assembly38.mt_only.fasta"); | |||
private static final double tlodThreshold = 0.8 * M2FiltersArgumentCollection.DEFAULT_TLOD_FILTER_THRESHOLD; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
T_LOD_THRESHOLD
@davidbenjamin comments addressed, but this can wait until your homoplasmic fix is in. |
/** | ||
* Rather than move the per-sample INFO annotations used for filtering to the FORMAT field, drop them entirely. | ||
* This makes the FORMAT field more readable and reduces file sizes for large cohorts. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the current plan is to filter at the single sample level and then merge with no callset level filters, is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will be callset-level filters. If any sample has a PASS variant the site will be PASS. If all variants are filtered the site will get all the filters. The original filters will go to the genotype filter (GF) field.
GATKVCFConstants.ORIGINAL_CONTIG_MISMATCH_KEY, | ||
GATKVCFConstants.STRAND_ARTIFACT_AF_KEY, | ||
GATKVCFConstants.STRAND_ARTIFACT_POSTERIOR_KEY, | ||
GATKVCFConstants.HAPLOTYPE_CALLER_PHASING_GT_KEY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are HaplotypeCaller
keys in the somatic format annotations list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's a name that's now outdated. I'm not sure what else to call it. ASSEMBLY_BASED_CALLER_PHASING_GT_KEY just doesn't have a great ring to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine to leave it, I was just confused.
@@ -143,7 +194,7 @@ public VariantContext merge(final List<VariantContext> vcs, final Locatable loc, | |||
final Allele replacement; | |||
if ( allele.equals(Allele.NON_REF_ALLELE) ) { | |||
replacement = allele; | |||
} else if ( allele.length() < vc.getReference().length() ) { | |||
} else if ( !doSomaticMerge && allele.length() < vc.getReference().length() ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not expect spanning deletions in the somatic case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't really want to keep track of the allele fractions for all the deletions. If Kristen or Sarah or someone else really wants them I can spend the time.
values.add(parseNumericInfoAttributeValue(vcfInputHeader, key, value.toString())); | ||
} | ||
else { | ||
String[] valueArray = value.toString().split("\\[|, |\\]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels slightly fragile. Are there constants like "," somewhere more central to VCF parsing?
if (vc.getStart() == 14872) { | ||
Assert.assertTrue(!vc.isFiltered()); | ||
Assert.assertTrue(!vc.getGenotype(0).isFiltered()); | ||
Assert.assertTrue(vc.getGenotype(1).getFilters().contains("t_lod")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"t_lod"
should be a constant from GATKVCFConstants.TUMOR_LOD_FILTER_NAME
(especially if we ever plan on changing it to lod).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. It's actually about to change, in fact.
@ldgauthier This can go in before the homoplasmic bug fix. I don't foresee it causing difficulties with the rebase. |
Merging and genotyping "somatic" GVCFs from Mutect2