Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tools, filters and annotations for mitochondria pipeline #5193

Merged
merged 1 commit into from
Oct 17, 2018

Conversation

meganshand
Copy link
Contributor

The workflow for mitochondria will include running AddOriginalAlignmentTags on a bam, realigning it to the mitochondria contig only, then running Mutect2 with --annotation OriginalAlignment and --median-autosomal-coverage to get the appropriate annotations. Then running FilterMitochondrialMutectCalls.

I don't think any of these changes should effect the somatic pipeline.

@ldgauthier I didn't change the name of TLOD or tumor sample in the mitochondria vcf. Maybe we can talk next week about how to best do that?

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't go over this with a fine-toothed comb, because I think there's some refactoring to be done before we worry about too many details. I'm leaning towards having separate tools for MT stuff, even if it's pretty much the same under the hood. We'll have the option for them to diverge and they can have more reasonable args.

String OAValue;
if(!read.isUnmapped()){
OAValue = String.format("%s,%s,%s,%s,%s,%s;",
read.getContig().replace(",", "_"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hg38 has lots of crazy names -- is comma the only parsing problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point of this replacement is just because , is the delimiter within the OA tag. I replaced the , with OA_SEPARATOR to be a little clearer. The , is also the only explicitly forbidden character from the RNAME section of OA tag in the PR in hts-specs.


final double[] tumorLods = GATKProtectedVariantContextUtils.getAttributeAsDoubleArray(vc, GATKVCFConstants.TUMOR_LOD_KEY, () -> null, -1);
if (tumorLods==null) {
warning.warn("One or more variant contexts is missing the 'TLOD' annotation, StrandArtifact will not be computed for these VariantContexts");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing you copied this from StrandArtifact?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha, oops

@DocumentedFeature(groupName= HelpConstants.DOC_CAT_ANNOTATORS, groupSummary=HelpConstants.DOC_CAT_ANNOTATORS_SUMMARY, summary="Number of alt reads with an OA tag that doesn't match the current alignment contig.")
public class OriginalAlignment extends GenotypeAnnotation implements Annotation {
protected final OneShotLogger warning = new OneShotLogger(this.getClass());
public static final String OA_NOT_CURRENT_CONTIG = "OA_NOT_CURRENT_CONTIG";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this key, but I don't have any better suggestions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ORIGINAL_CONTIG_MISMATCH any better?

return GATKProtectedVariantContextUtils.getAttributeAsIntArray(vc.getGenotype(tumorSample), key, () -> null, 0);
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(extra whitespace)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

programGroup = VariantFilteringProgramGroup.class
)
@DocumentedFeature
public final class FilterMitochondrialMutectCalls extends VariantWalker {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we spin off a new walker for MT calls then this can have a more straightforward name.

*/
@Argument(fullName = TLOD_BY_DEPTH,
doc="TLOD by depth threshold for filtering variant", optional = true)
public double tlod_by_depth = .005;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to be able to say what this number means. For example, there are a few hand wavy explanations running around for the magic 6.3 for Mutect2, like two reads with Q30 alts. That might be less intuitive for something that's odds-ratio-based rather than probability-based though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, yeah I'm not sure. We pretty much found this number empirically.

@meganshand
Copy link
Contributor Author

@ldgauthier This should be ready for more review.

@jamesemery This now has all of the argument constructors/copying, if you want to take a look.

Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss subclassing various classes in person.

private static final long serialVersionUID = 1L;
@Override public boolean test(final GATKRead read) {
boolean accept = true;
if (read.hasAttribute(AddOriginalAlignmentTags.OA_TAG_NAME) & read.hasAttribute(AddOriginalAlignmentTags.MATE_CONTIG_TAG_NAME)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

& -> &&.

}
}

private static void addMateContigTag(GATKRead read) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method can be a one-liner without compromising clarity:

read.setAttribute(MATE_CONTIG_TAG_NAME, !read.mateIsUnmapped() ? read.getMateContig() : "*");

}
}

private static void addMateContigTag(GATKRead read) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make final: final GATKRead read

}

//TODO: Once the OA tag is merged into the spec (https://github.com/samtools/hts-specs/pull/193) this should be moved to htsjdk
private static void addOATag(GATKRead read) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final GATKRead read

import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyBasedCallerArgumentCollection;

public class MitochondrialCallerArgumentCollection extends AssemblyBasedCallerArgumentCollection {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class should be a subclass of M2ArgumentCollection, instead of adding a new constructor to M2ArgumentCollection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But is it possible to override an argument? I don't see that anywhere in our code base and we don't want the MitochondrialCaller docs to talk about tumors all over the place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldgauthier Longer term I could change the Mutect docs to be more general.

Copy link
Collaborator

@cmnbroad cmnbroad Oct 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no way to override @Argument in a subclass, but @meganshand with some refactoring you could define an interface or base class with abstract methods that return values of interest, and then dynamically select the concrete implementation on a per-tool basis. GATK and Picard both use this to override doc strings or required attributes for command line args. See this and this for example.

tumorSample would require special handling because for M2 its provided in the CLI and for Mito its derived from the input header.

import org.broadinstitute.barclay.argparser.Argument;


public class MitochondrialFiltersArgumentCollection extends M2FiltersArgumentCollection {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, the way this class looks confirms my earlier suggestions to extend corresponding classes in Mutect - it's clear and clean because it's a subclass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually maybe not. Now that I look again, this is going to give the MitochondrialFiltersArgumentCollection a tumor-segmentation argument, which I don't want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh you're right, and now that I look at it, M2FiltersArgumentCollection extends AssemblyBasedCallerArgumentCollection, which doesn't make sense to me because those shouldn't be used in the filtering step (eg bamout).

But assuming that does make sense, does that mean I want the MitochondrialFiltersArgumentCollection to extend AssemblyBasedCallerArgumentCollection?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look to me like FilterMutectCalls needs any of the AssemblyBasedCaller args either. There could be a FiltersArgumentCollection that's the parent of an M2 version and a mito version. I think that arg collection wouldn't inherit from anyone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there's going to be a lot of duplication just because tumors and mitochondria need different defaults. Is there really no elegant way to do this?

@@ -224,4 +222,19 @@ public void closeTool() {
m2Engine.shutdown();
}
}

private Set<VCFHeaderLine> getMutect2VCFHeaderLines() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on board with this refactor. I think it's worth revisiting after the subclassing I suggested above.


if (MathUtils.arrayMax(tumorLods) < MTFAC.TUMOR_LOD_THRESHOLD) {
filterResult.addFilter(GATKVCFConstants.TUMOR_LOD_FILTER_NAME);
if(lodKey.equals(GATKVCFConstants.TUMOR_LOD_KEY)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a code smell. I think Mitochondria pipeline should have its own filtering engine that inherits M2FilteringEngine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored, but still have this problem. It's only to get the correct filter name in the output. Is there a way to link the filter name with an annotation key?

@@ -352,7 +386,21 @@ public FilterResult calculateFilters(final M2FiltersArgumentCollection MTFAC, fi
return filterResult;
}

private int[] getIntArrayTumorField(final VariantContext vc, final String key) {
public FilterResult calculateMitochondrialFilters(MitochondrialFiltersArgumentCollection MTFAC, VariantContext vc) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the subclass this method would be just be an override of calcualteFilters()

@meganshand
Copy link
Contributor Author

meganshand commented Sep 24, 2018

I pulled out MitochondiralFilteringEngine into its own class, but left the command line tools as is. I think this is again ready for rereview.

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments, finally.

* If original alignment and mate original alignment tags exist, filter reads that were originally chimeric (mates were on different contigs).
*/
@DocumentedFeature(groupName=HelpConstants.DOC_CAT_READFILTERS, groupSummary = HelpConstants.DOC_CAT_READFILTERS_SUMMARY, summary = "Filters reads whose original alignment was chimeric.")
public static class ChimericOAFilter extends ReadFilter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with other ReadFilters this should be named ChimericOAReadFilter.

public static class ChimericOAFilter extends ReadFilter {
private static final long serialVersionUID = 1L;
@Override public boolean test(final GATKRead read) {
boolean accept = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could simplify via:

if ( attributes are present) {
  return AddOriginalAlignmentTags. . .
}
return true;

@@ -291,5 +307,6 @@ private ReadFilterLibrary(){ /*no instance*/ }
public static final SeqIsStoredReadFilter SEQ_IS_STORED = new SeqIsStoredReadFilter();
public static final ValidAlignmentStartReadFilter VALID_ALIGNMENT_START = new ValidAlignmentStartReadFilter();
public static final ValidAlignmentEndReadFilter VALID_ALIGNMENT_END = new ValidAlignmentEndReadFilter();
public static final ChimericOAFilter CHIMERIC_OA_FILTER = new ChimericOAFilter();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not abbreviate here.

* <p>If reads were realigned to multiple references (for example the full human reference followed by just the
* mitochondria) and the original alignment tag was recorded before realigning, then we can count the number of alt
* reads that have been realigned from other contigs to this one. This can be useful for downstream filtering if an alt
* allele has all of most of its support originally from a different contig. In the mitochondria case this can be useful
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"all or most"

Utils.nonNull(vc);
Utils.nonNull(likelihoods);

final double[] lods = GATKProtectedVariantContextUtils.getAttributeAsDoubleArray(vc, GATKVCFConstants.LOD_KEY, () -> null, -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code duplication from Mutect2FilteringEngine::getDoubleArrayAttribute -- you can extract a shared method in GATKPRotectedVariantContextUtils.

public boolean dontClipITRArtifacts = true;

@Advanced
@Argument(fullName = M2ArgumentCollection.GET_AF_FROM_AD_LONG_NAME, doc="Use allelic depth to calculate allele fraction", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the changes in @MartonKN's PR (which will get merged imminently) can you delete this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

import org.broadinstitute.barclay.argparser.Argument;


public class MitochondrialFiltersArgumentCollection extends M2FiltersArgumentCollection {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there's going to be a lot of duplication just because tumors and mitochondria need different defaults. Is there really no elegant way to do this?

@@ -372,4 +365,20 @@ private void checkSampleInBamHeader(final String sample) {
private String decodeSampleNameIfNecessary(final String name) {
return samplesList.asListOfSamples().contains(name) ? name : IOUtils.urlDecode(name);
}

public String getVersion() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you add these intentionally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did, would it be better to make the fields not private?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking move getMutect2VCFHeaderLines from Mutect2 to Mutect2Engine and then these fields could remain private without needing to add the getters.

//Mitochondrial M2-related filters
addFilterLine(new VCFFilterHeaderLine(CHIMERIC_ORIGINAL_ALIGNMENT_FILTER_NAME, "NuMT variant with too many ALT reads originally from autosome"));
addFilterLine(new VCFFilterHeaderLine(LOW_AVG_ALT_QUALITY_FILTER_NAME, "Low average alt quality"));
addFilterLine(new VCFFilterHeaderLine(LOW_LOD_FILTER_NAME, "Variant does not meet likelihood threshold"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could unify some of the mito and tumor filter and info fields by making the M2 names less tumor specific. If Lee and Geraldine are okay with it I have no problem with eg TLOD -> LOD.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be helpful. Is that difficult for legacy reasons? (Will users complain about the change breaking their downstream scripts?)

Copy link
Contributor Author

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments @davidbenjamin! I fixed the easy things and I'm talking to comms now about the larger questions of if this should be a wrapper or not.

public boolean dontClipITRArtifacts = true;

@Advanced
@Argument(fullName = M2ArgumentCollection.GET_AF_FROM_AD_LONG_NAME, doc="Use allelic depth to calculate allele fraction", optional = true)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

@@ -123,4 +123,23 @@
@Argument(fullName = SMITH_WATERMAN_LONG_NAME, doc = "Which Smith-Waterman implementation to use, generally FASTEST_AVAILABLE is the right choice", optional = true)
public SmithWatermanAligner.Implementation smithWatermanImplementation = SmithWatermanAligner.Implementation.JAVA;

public AssemblyBasedCallerArgumentCollection(){}

//TODO: copy the arg collections themselves
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't. I believe @droazen and @jamesemery are aware that this todo is going in though. (If we end up going this wrapper command line tool route)

@@ -372,4 +365,20 @@ private void checkSampleInBamHeader(final String sample) {
private String decodeSampleNameIfNecessary(final String name) {
return samplesList.asListOfSamples().contains(name) ? name : IOUtils.urlDecode(name);
}

public String getVersion() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did, would it be better to make the fields not private?

//Mitochondrial M2-related filters
addFilterLine(new VCFFilterHeaderLine(CHIMERIC_ORIGINAL_ALIGNMENT_FILTER_NAME, "NuMT variant with too many ALT reads originally from autosome"));
addFilterLine(new VCFFilterHeaderLine(LOW_AVG_ALT_QUALITY_FILTER_NAME, "Low average alt quality"));
addFilterLine(new VCFFilterHeaderLine(LOW_LOD_FILTER_NAME, "Variant does not meet likelihood threshold"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be helpful. Is that difficult for legacy reasons? (Will users complain about the change breaking their downstream scripts?)

@codecov-io
Copy link

codecov-io commented Oct 2, 2018

Codecov Report

Merging #5193 into master will decrease coverage by 1.14%.
The diff coverage is 92.511%.

@@              Coverage Diff               @@
##              master     #5193      +/-   ##
==============================================
- Coverage     86.766%   85.625%   -1.14%     
+ Complexity     29944     29682     -262     
==============================================
  Files           1838      1839       +1     
  Lines         138820    138742      -78     
  Branches       15285     15272      -13     
==============================================
- Hits          120448    118798    -1650     
- Misses         12804     14441    +1637     
+ Partials        5568      5503      -65
Impacted Files Coverage Δ Complexity Δ
...ute/hellbender/utils/variant/GATKVCFConstants.java 80% <ø> (ø) 4 <0> (ø) ⬇️
.../tools/walkers/mutect/SomaticGenotypingEngine.java 90.286% <0%> (+0.342%) 70 <4> (-1) ⬇️
...ls/walkers/mutect/M2FiltersArgumentCollection.java 100% <100%> (ø) 1 <0> (ø) ⬇️
...r/tools/walkers/mutect/Mutect2IntegrationTest.java 92.795% <100%> (-0.608%) 63 <2> (-2)
...e/hellbender/utils/variant/GATKVCFHeaderLines.java 95.732% <100%> (+0.107%) 10 <0> (ø) ⬇️
...ers/annotator/ReadOrientationArtifactUnitTest.java 97.248% <100%> (ø) 12 <0> (ø) ⬇️
...der/tools/walkers/annotator/OriginalAlignment.java 100% <100%> (ø) 10 <10> (?)
...r/tools/walkers/mutect/StrandArtifactUnitTest.java 100% <100%> (ø) 18 <0> (ø) ⬇️
...ols/walkers/annotator/ReadOrientationArtifact.java 83.824% <100%> (ø) 16 <0> (ø) ⬇️
...hellbender/tools/walkers/mutect/Mutect2Engine.java 91.139% <100%> (+0.056%) 60 <0> (+1) ⬆️
... and 82 more

@SusieX
Copy link

SusieX commented Oct 4, 2018

Hi,
May I ask when this mitochondria pipeline will be available?
In the meanwhile, if I use Mutect2 for calling human MT variants (including indel), which is the best choice of parameters?
I have MT bam file to start with and read coverage is in thousand scale for MT.
Thanks

@davidbenjamin
Copy link
Contributor

I defer to @meganshand regarding @SusieX's questions.

@meganshand
Copy link
Contributor Author

@SusieX,

I'm not sure when this will be merged, but I'm hoping within the next couple of weeks.

In the meantime, the parameters we've been using with Mutect2 are:

--initial-tumor-lod 0
--tumor-lod-to-emit 0
--min-pruning ceiling(Mean coverage of your sample / 500)
(-get-af-from-ad) (you only need this if you aren't using the latest version)

The min-pruning argument is the most important.

For FilterMutectCalls we turn off a bunch of filters, the only ones we use are:

low-tlod
strand-artifact
base-quality
mapping-quality
contamination

I'd recommend manually inspecting your filters if possible.

@meganshand
Copy link
Contributor Author

I talked to comms and we agreed that a "mitochondria-mode" argument to Mutect2 was the right balance of clarity (you're really running Mutect2 not a wrapper) and simplicity (you don't need a laundry list of arguments to change which mode you're in if you just want to run with optimized defaults).

@ldgauthier @davidbenjamin @takutosato @rcmajovski Could you please take another look? Removing the wrapper tools has cleaned up the code so there are fewer changes now. I also changed TLOD to LOD in this version, but I'm happy to take that out and have that be future work if anyone is worried about it being a breaking change.

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in good shape. Consider my comments optional suggestions.

* If original alignment and mate original alignment tags exist, filter reads that were originally chimeric (mates were on different contigs).
*/
@DocumentedFeature(groupName=HelpConstants.DOC_CAT_READFILTERS, groupSummary = HelpConstants.DOC_CAT_READFILTERS_SUMMARY, summary = "Filters reads whose original alignment was chimeric.")
public static class NonChimericOaFilter extends ReadFilter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this class won't be instantiated in very many places you can safely give it a verbose name like NonChimericOriginalAlignmentReadFilter

@@ -291,5 +306,6 @@ private ReadFilterLibrary(){ /*no instance*/ }
public static final SeqIsStoredReadFilter SEQ_IS_STORED = new SeqIsStoredReadFilter();
public static final ValidAlignmentStartReadFilter VALID_ALIGNMENT_START = new ValidAlignmentStartReadFilter();
public static final ValidAlignmentEndReadFilter VALID_ALIGNMENT_END = new ValidAlignmentEndReadFilter();
public static final NonChimericOaFilter NON_CHIMERIC_OA_FILTER = new NonChimericOaFilter();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, too, I would not abbreviate.

protected final OneShotLogger warning = new OneShotLogger(this.getClass());
public static final String KEY = GATKVCFConstants.POTENTIAL_POLYMORPHIC_NUMT_KEY;
private static final double LOWER_BOUND_PROB = .1;
private long minAutosomalHet;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these can all be final.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it looks like some of them don't need to be class members at all because they are local to the constructor.

@@ -81,7 +81,7 @@ public void annotate(final ReferenceContext ref,
pi.put(ART_REV, PRIOR_PROBABILITY_OF_ARTIFACT);

// We use the allele with highest LOD score
final double[] tumorLods = GATKProtectedVariantContextUtils.getAttributeAsDoubleArray(vc, GATKVCFConstants.TUMOR_LOD_KEY, () -> null, -1);
final double[] tumorLods = GATKProtectedVariantContextUtils.getAttributeAsDoubleArray(vc, GATKVCFConstants.LOD_KEY, () -> null, -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're at it, you can change "highest LOD score" to "highest log odds."

private String getTumorSampleName(){
return getHeaderForVariants().getMetaDataLine(Mutect2Engine.TUMOR_SAMPLE_KEY_IN_VCF_HEADER).getValue();
private String getTumorSampleName() {
if (!MTFAC.mitochondria) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a one-liner: return MTFAC.mitochondria ? getMitoSampleName() : getHeaderForVariants().getMetaDataLine(Mutect2Engine.TUMOR_SAMPLE_KEY_IN_VCF_HEADER).getValue();

Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, you have my 👍 to merge

@@ -168,7 +169,9 @@ public void writeHeader(final VariantContextWriter vcfWriter, final Set<VCFHeade
headerInfo.add(GATKVCFHeaderLines.getFormatLine(GATKVCFConstants.HAPLOTYPE_CALLER_PHASING_ID_KEY));
headerInfo.add(GATKVCFHeaderLines.getFormatLine(GATKVCFConstants.HAPLOTYPE_CALLER_PHASING_GT_KEY));

headerInfo.add(new VCFHeaderLine(TUMOR_SAMPLE_KEY_IN_VCF_HEADER, tumorSample));
if (!MTAC.mitochondria) {
headerInfo.add(new VCFHeaderLine(TUMOR_SAMPLE_KEY_IN_VCF_HEADER, tumorSample));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you OK not adding the sample name to the header in mt mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because the sample name is included in the last "header line" (the one with one #), the same way it looks in a germline vcf.

* Only variants with LOD divided by depth exceeding this threshold can pass filtering.
*/
@Argument(fullName = LOD_BY_DEPTH,
doc="LOD by depth threshold for filtering variant", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove newline

* reads exceeding this threshold can pass filtering.
*/
@Argument(fullName = NON_MT_ALT_READS_BY_ALT_READS,
doc="Known NuMT alts by total alts threshold for filtering variant", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

* If original alignment and mate original alignment tags exist, filter reads that were originally chimeric (mates were on different contigs).
*/
@DocumentedFeature(groupName=HelpConstants.DOC_CAT_READFILTERS, groupSummary = HelpConstants.DOC_CAT_READFILTERS_SUMMARY, summary = "Filters reads whose original alignment was chimeric.")
public static class NonChimericOriginalAlignmentFilter extends ReadFilter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with David that this should be renamed NonChimericOriginalAlignmentReadFilter to be consistent with others

@@ -136,6 +137,8 @@
public static final String ROF_POSTERIOR_KEY = "P_RO"; // For read orientation filter
public static final String ROF_PRIOR_KEY = "P_PRIOR_RO";
public static final String ROF_TYPE_KEY = "ROF_TYPE";
public static final String ORIGINAL_CONTIG_MISMATCH_KEY = "ORIGINAL_CONTIG_MISMATCH";
public static final String POTENTIAL_POLYMORPHIC_NUMT_KEY = "POTENTIAL_POLYMORPHIC_NUMT";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spaces

private Range<Long> autosomalHomAltRange;

public PolymorphicNuMT(final double medianAutosomalCoverage){
PoissonDistribution autosomalCoverage = new PoissonDistribution(medianAutosomalCoverage);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

final Collection<ReadLikelihoods<Allele>.BestAllele> bestAlleles = likelihoods.bestAllelesBreakingTies(g.getSampleName());
final String currentContig = ref.getInterval().getContig();

long nonChrMAlt = bestAlleles.stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

}
private void applyLODFilter(final M2FiltersArgumentCollection MTFAC, final VariantContext vc, final FilterResult filterResult) {
if(vc.isBiallelic()) {
Double LOD = vc.getAttributeAsDouble(GATKVCFConstants.LOD_KEY, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOD -> lod. Also make them final

final int maxFractionIndex = MathUtils.maxElementIndex(alleleFractions);
final int[] ADs = tumorGenotype.getAD();
final int altCount = ADs[maxFractionIndex + 1];
if (tumorGenotype.hasAnyAttribute(GATKVCFConstants.ORIGINAL_CONTIG_MISMATCH_KEY) & vc.isBiallelic()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

& -> &&

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R has ruined me :/

final int[] ADs = tumorGenotype.getAD();
final int altCount = ADs[maxFractionIndex + 1];
if (tumorGenotype.hasAnyAttribute(GATKVCFConstants.ORIGINAL_CONTIG_MISMATCH_KEY) & vc.isBiallelic()) {
int nonMtOa = Integer.parseInt(tumorGenotype.getAnyAttribute(GATKVCFConstants.ORIGINAL_CONTIG_MISMATCH_KEY).toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well use the util method since we have it

final int nonMtOa = GATKProtectedVariantContextUtils.getAttributeAsInt(tumorGenotype, GATKVCFConstants.ORIGINAL_CONTIG_MISMATCH_KEY, -1);

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I had these comments pending for days. Sorry. Minor changes and questions.

* Original Alignment annotation counts the number of alt reads where the original alignment contig doesn't match the current alignment contig
*
* <p>If reads were realigned to multiple references (for example the full human reference followed by just the
* mitochondria) and the original alignment tag was recorded before realigning, then we can count the number of alt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mitochondrial -> mitochondrial contig

* <p>If reads were realigned to multiple references (for example the full human reference followed by just the
* mitochondria) and the original alignment tag was recorded before realigning, then we can count the number of alt
* reads that have been realigned from other contigs to this one. This can be useful for downstream filtering if an alt
* allele has all or most of its support originally from a different contig. In the mitochondria case this can be useful
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mitochondrial -> mitochondrial

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@meganshand meganshand force-pushed the ms_mito_original_alignments branch 3 times, most recently from 38f384f to b158b82 Compare October 15, 2018 19:27
@SusieX
Copy link

SusieX commented Oct 16, 2018

Could anyone please give an announcement when the mitochondria pipeline is officially available?

Thanks

@meganshand
Copy link
Contributor Author

@SusieX Sorry for the delay, we're just trying to finalize some potentially disruptive changes before we merge. I'll open an issue so I can ping you there once this PR has been both merged and released. Thanks!

@SusieX
Copy link

SusieX commented Oct 16, 2018 via email

@meganshand meganshand merged commit 21e11ed into master Oct 17, 2018
@meganshand meganshand deleted the ms_mito_original_alignments branch October 17, 2018 18:18
@SusieX
Copy link

SusieX commented Oct 18, 2018

Hi Megan,
Sorry I still have a few questions:

  1. If I have a lot of samples with MT coverage range 500-5000.
    Would you suggest use different ceiling for "--min-pruning" for each sample or use a same value for all samples?
    SNPs called will be compared across samples eventually.

  2. Also for MT data, do I also need to use
    -ploidy 1
    --do-not-run-physical-phasing true
    ?

  3. The FilterMutectCalls is going to be used after Mutect generated vcf files, right?

  4. Is there a similar way to generate joint calls as the gvcf in HaploCaller?

Thanks a lot for your help!

@SusieX,

I'm not sure when this will be merged, but I'm hoping within the next couple of weeks.

In the meantime, the parameters we've been using with Mutect2 are:

--initial-tumor-lod 0
--tumor-lod-to-emit 0
--min-pruning ceiling(Mean coverage of your sample / 500)
(-get-af-from-ad) (you only need this if you aren't using the latest version)

The min-pruning argument is the most important.

For FilterMutectCalls we turn off a bunch of filters, the only ones we use are:

low-tlod
strand-artifact
base-quality
mapping-quality
contamination

I'd recommend manually inspecting your filters if possible.

@meganshand
Copy link
Contributor Author

Hi @SusieX

For your questions:

  1. Yes you want --min-pruning to be different for each sample, even if you're eventually going to compare samples. It will increase your sensitivity.

  2. No need for -ploidy 1 or --do-not-run-physical-phasing

  3. FilterMutectCalls should be used on the output vcf from Mutect2

  4. There is no way to joint call these VCFs the way you do with GVCFs as of today. This is future work that is on our to do list, but I don't know when it will be done.

@SusieX
Copy link

SusieX commented Oct 18, 2018 via email

@jullee
Copy link

jullee commented Oct 18, 2018

Hi I'm very new to GATK and was looking into using HaplotypeCaller to generate a (g)vcf file with both variant and invariant sites for chloroplast and mitochondrial DNA (with the --emitRefConfidence BP_resolution flag). I found this thread and have looked up the parameters sent to SusieX. But I'm confused what should be used as the reference/normal genome?

It would be great if there was an option to "run in mitochondria mode".

@SusieX
Copy link

SusieX commented Oct 22, 2018

Hi Megan,
I'd like to get some more suggestion for pre-processing the bam for mitochondria calls.

Are remove-duplicate and base-quality-recalibration recommended before mutect2 call?

For 1st step of base quality recalibration, what reference file to use for --known-sites? My reads are mapped to hg38.

gatk BaseRecalibrator
-I my_reads.bam
-R reference.fasta
--known-sites sites_of_variation.vcf
-O recal_data.table

So the whole pipeline would be:
BAM -> remove dup -> BQ recalibrate -> Mutect2 call -> FilterMutectCalls

Am I missing anything?

Thanks a lot!

@meganshand
Copy link
Contributor Author

@SusieX

Marking duplicates: I do recommend removing duplicates (we run MarkDuplicates from Picard).

BQSR: The pipeline we're developing is for Whole Genome data, so our bams have gone through BQSR in the whole genome pipeline. We're using those recalibrated base qualities. I haven't tested running BQSR only on the mitochondria so I don't know how well that would work.

If you do need to run BQSR only on the mitochondria I'd start by using the phylotree sites as --known-sites, but you'd need to have those sites in vcf format. Again, I haven't tested this so I don't know how well it will perform.

If you end up using BQSR I think you're pipeline (BAM -> remove dup -> BQ recalibrate -> Mutect2 call -> FilterMutectCalls) is correct. Good luck!

@SusieX
Copy link

SusieX commented Oct 22, 2018 via email

@jullee
Copy link

jullee commented Oct 23, 2018

Hi

I have a few questions about using Mutect2 versus HaplotypeCaller to call mitochondrial (or chloroplast) variants (I'm working on plants).

  1. In article 11127, it says
    "The tool can run on unmatched tumors but this produces high rates of false positives. Technically speaking, somatic variants are both (i) different from the control sample and (ii) different from the reference."

In my case, there isn't a "normal" sample to compare the "tumour" (i.e. mitochondria) to, just a reference. Are my results likely to be prone to false positives then? Or is it that the case for mitochondria is different because we are not truly looking for somatic variants but rather variants in general that may not be 100% "pure" (because of heteroplasty?). Is the latter point the justification for Mutect vs HaplotypeCaller in the first place?

  1. I am interested in both variant and invariant sites (relative to the reference). Ultimately, I want to be able to go base for base and make a call (ref, alt, or "N"; where "N" is used where I have no confidence in the base call at that site). The goal is to have a full haplotype sequence for the mitochondria/chloroplast of each sample.

In HaplotypeCaller, I read that I could use --emit-ref-confidence BP_RESOLUTION to get the confidence of a site being homozygous reference. Does --out-mode EMIT_ALL_CONFIDENT_SITES give similar information?

  1. In this thread, it's said that the --min-pruning argument is very important. I'm very new to this and don't fully understand what this parameter is doing. Is the advice to set this parameter = ceiling(average coverage/500) general? Or specific to the project mentioned above?

  2. I don't why we don't have to set the sample ploidy to 1? Is this only for applications where we want to detect heteroplasmy? If we are after the majority haplotype, does it make sense to set this parameter?

Many thanks!

@meganshand
Copy link
Contributor Author

@jullee

  1. Yes, running in mitochondria-mode is somewhat like running in tumor-only mode because you don't have a matched normal. We have found that the false positive rate is very reasonable and comparable to other tools for mitochondria (including HaplotypeCaller). Your latter point (we're looking for calls at various allele fractions) is exactly why we chose to use Mutect instead of HaplotypeCaller.

  2. I don't think we can currently output bp resolution reference confidence in Mutect2. There is development being done to allow Mutect to output a GVCF (which would give you reference confidence) but this is still a work in progress. You might be able to use the "Genotype Given Alleles" argument to force Mutect to output qualities at each base, but I haven't tested this at all.

  3. We found that value for --min-pruning by looking at whole genome sequencing data. Both HaplotypeCaller and Mutect do a local reassembly at a potential variant site. This reassembly builds a graph where each path is a potential haplotype. The min-pruning argument tells you how many reads each path needs to have in order to be considered (otherwise it is pruned from the graph). By default this value is 1 which makes sense for 30-60X data, but if your coverage is very high (which it usually is for mitochondria) the graph will be too messy due to errors unless you turn that argument up. Because the depth is so high, you need more evidence to be sure a path in the graph isn't just noise.

  4. The ploidy argument isn't used by Mutect2 because it is always looking for variation at arbitrary allele fractions. If you are instead using HaplotypeCaller you would need to change that argument either to 1 (this would only allow you to call HomVar sites) or as high as you need to get desired resolution. In other words if you want to be able to call sites at 1% allele fraction, you will need to set the ploidy argument to 100 if you are running HaplotypeCaller. Mutect2 will look for sites at all allele fractions (both HomVar and low allele fraction sites) without changing that argument.

@jullee
Copy link

jullee commented Oct 24, 2018

Thanks for this clarification!

@SusieX
Copy link

SusieX commented Mar 22, 2019

Hello,
Just checking in to see if there is any new release for GATK mitochondria pipeline? Thanks

@soerenmueller
Copy link

Hey,

in the light of recent papers describing mtDNA as a great natural barcode to do lineage tracing in single human cells from scATAC-seq data, I think this is of relevance and I'd be very interested if there is an update on the pipeline as well.

Thanks!!

@meganshand
Copy link
Contributor Author

meganshand commented May 2, 2019

There is a copy of the 4.1.0.0 pipeline here: https://app.terra.bio/#workspaces/help-gatk/Mitochondria-SNPs-Indels-hg38

We are working on some adjustments to this pipeline to make the low allele fraction calls more reliable with more aggressive filtering and we will put those updates in that same location once they're complete.

@huwenrime
Copy link

Hi, I'm trying to generate a VCF with Mitochondrial mode of Mutect2 based on MT amplicon (PCR) sequencing. But I encountered two problems:

  1. The base site (Pos: MT:16320) has high depth and good quality, but the result of AF is low. In IGV, AF~=0.5
  2. The total depth(DP) is also quite different from the depth in bam
    I tried changing various parameters but nothing seems to make a difference? and even tried turn on --disable-tool-default-read-filters. Is there a parameter I'm missing? What are the filtering criteria? How to adapt to PCR data?

version: GATK4.1.4.1
java -Xmx16g -Djava.io.tmpdir=./tmp -jar gatk-package-4.1.4.1-local.jar Mutect2 -I tmp.bam -R hs37d5.fa -L MT.bed -O raw.vcf --min-pruning 5 --mitochondria-mode --max-reads-per-alignment-start 10000

MT 16182 . A AC,ACC . . DP=262;ECNT=7;MBQ=31,26,29;MFRL=0,0,0;MMQ=60,60,60;MPOS=44,44;OCM=0;POPAF=2.40,2.40;TLOD=84.81,46.04 GT:AD:AF:DP:F1R2:F2R1:SB 0/1/2:157,72,31:0.261,0.118:260:122,48,21:0,0,0:0,157,0,103
MT 16183 . A C . . DP=262;ECNT=7;MBQ=30,34;MFRL=0,0;MMQ=60,60;MPOS=45;OCM=0;POPAF=2.40;TLOD=531.07 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:58,204:0.780:262:42,190:0,0:0,58,0,204
MT 16188 . CT C . . DP=262;ECNT=7;MBQ=34,29;MFRL=0,0;MMQ=60,60;MPOS=50;OCM=0;POPAF=2.40;TLOD=19.25 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:243,19:0.070:262:207,15:0,0:0,243,0,19
MT 16189 . T C . . DP=262;ECNT=7;MBQ=35,34;MFRL=0,0;MMQ=60,60;MPOS=51;OCM=0;POPAF=2.40;TLOD=931.43 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:2,260:0.989:262:2,237:0,0:0,2,0,260
MT 16266 . C A . . DP=1073;ECNT=4;MBQ=7,32;MFRL=0,0;MMQ=60,60;MPOS=50;OCM=0;POPAF=2.40;TLOD=3575.47 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:14,1039:0.996:1053:4,469:0,446:0,14,481,558
MT 16274 . G A . . DP=1073;ECNT=4;MBQ=33,12;MFRL=0,0;MMQ=60,60;MPOS=53;OCM=0;POPAF=2.40;TLOD=1.89 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:1057,11:5.214e-03:1068:571,1:337,4:467,590,10,1
MT 16320 . C T . . DP=643;ECNT=4;MBQ=27,36;MFRL=0,0;MMQ=60,60;MPOS=21;OCM=0;POPAF=2.40;TLOD=11.29 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:564,60:0.017:624:48,60:358,0:0|1:16320_C_T:16320:482,82,0,60
MT 16336 . G GC . . DP=643;ECNT=4;MBQ=32,36;MFRL=0,0;MMQ=60,60;MPOS=5;OCM=0;POPAF=2.40;RPA=1,2;RU=C;STR;TLOD=10.99 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:564,60:0.017:624:79,59:392,0:0|1:16320_C_T:16320:482,82,0,60

IGV-MT-16320

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants