Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GVCF "reblocking" code to compress homRef blocks #4940

Merged
merged 1 commit into from
Oct 3, 2018
Merged

Conversation

ldgauthier
Copy link
Contributor

This is a port of the GATK3 version I actually ran. Some differences in the genotyping engines and the fixed median calculation in MathUtils between GATK3 and GATK4 make the results slightly different in some cases, but still within tolerances.

@lbergelson lbergelson self-requested a review June 25, 2018 17:21
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldgauthier I did a pass on this. Sorry for the deluge of stylistic comments. They're mostly very simple changes. Thanks for porting this to gakt4!

import java.util.*;

/**
* Perform joint genotyping on gVCF files produced by HaplotypeCaller
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This copied comment isn't right. Needs up to date javadoc.

@DocumentedFeature
public class ReblockGVCF extends VariantWalker {

private static String GVCF_BLOCK = "GVCFBlock";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expose this as a constant in GVCFWriter and then use that constant here, and in GenotypeGVCFs as well as in rangeToVCFHeaderLine in GVCFWriter

public class ReblockGVCF extends VariantWalker {

private static String GVCF_BLOCK = "GVCFBlock";
private static int PLOIDY_TWO = 2; //FIXME: hack to avoid extra arg collections
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it final in any case

programGroup = OtherProgramGroup.class,
omitFromCommandLine = true)
@DocumentedFeature
public class ReblockGVCF extends VariantWalker {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably be final

private static String GVCF_BLOCK = "GVCFBlock";
private static int PLOIDY_TWO = 2; //FIXME: hack to avoid extra arg collections

private VariantContextWriter vcfWriter;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put the mutable fields after the argument definitions


final ArgumentsBuilder args = new ArgumentsBuilder();
args.addReference(new File(b37_reference_20_21))
.addArgument("V", getToolTestDataDir() + "NA12878.prod.chr20snippet.g.vcf")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The large g.vcf and expected result should be put into src/test/resources/large since they're pretty big. It would be good to bgzip them too.

import java.util.Collections;

/**
* Created by gauthier on 10/2/17.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we typically don't include these since git captures it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally found where to turn that off (Preferences > Editor > File and Code Templates > Includes).

return null;
//there's a use case here related to ReblockGVCFs for overlapping deletions on different haplotypes
if ( vc.getStart() <= nextAvailableStart && vc.getContig().equals(contigOfNextAvailableStart) ) {
if (vc.getEnd() <= nextAvailableStart)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a unit case that triggers the first condition but not the second one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I have new tests in GVCFWriterUnitTests.

// Remove GCVFBlocks
headerLines.removeIf(vcfHeaderLine -> vcfHeaderLine.getKey().startsWith(GVCF_BLOCK));

headerLines.remove(GATKVCFHeaderLines.getInfoLine(GATKVCFConstants.DOWNSAMPLED_KEY));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if someone adds these in annotationsToKeep?

* @param originalVC the combined genomic VC
* @return a new VariantContext or null if the site turned monomorphic and we don't want such sites
*/
protected VariantContext regenotypeVC(final ReferenceContext ref, final VariantContext originalVC) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method named / documented correctly? It seems more like reblocking them with some minimal stripping out of alleles rather than regenotyping.

@ldgauthier
Copy link
Contributor Author

@lbergelson I addressed all your comments. I had some difficulty reconciling the list of Strings for annotation names to remove with the List for the getDefaultVariantAnnotations update, so there may be a few tests failing because the RAW_MQ header disappeared. I'll deal with those when Travis is done, but could you take a look at the other changes?

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldgauthier A few more comments minor cleanup comments and then I think it's good to go. It is failing a bunch of tests due to the change to the RAW_MQ tag. See integration tests and Unit test reports.

It looks like there's 1 place where you meant to add a test and forgot.

//obvious cases
if (vc == null || vc.getAlternateAlleles().isEmpty()) {
return false;
} else if (vc.isBiallelic()) {
return !(isSpanningDeletion(vc.getAlternateAllele(0)) || vc.isSymbolic());
return !(GATKVCFConstants.isSpanningDeletion(vc.getAlternateAllele(0)) || vc.isSymbolic());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this cleanup.

import java.util.*;

/**
* Perform joint genotyping on gVCF files produced by HaplotypeCaller
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line is still out of place?

* <pre>
* gatk ReblockGVCF \
* -R reference.fasta \
* --variant sample1.g.vcf \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just make this -V to match the others.

doc="File to which variants should be written")
private File outputFile;

@Argument(fullName=GenotypeGVCFs.ALL_SITES_LONG_NAME, shortName=GenotypeGVCFs.ALL_SITES_SHORT_NAME, doc="Include loci found to be non-variant after genotyping")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little weird to make a boolean argument default to true. Should the name be inverted so that you specify it to disable it?

public GenotypeCalculationArgumentCollection genotypeArgs = new GenotypeCalculationArgumentCollection();

@Advanced
@Argument(fullName="GVCF-GQ-band", shortName="GQB", doc="Exclusive upper bounds for reference confidence GQ bands " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should use the new GQ_BAND_LONG_NAME constants

Assert.assertTrue(!newG.hasAD());
}

//TODO: these are duplicated from PosteriorProbabilitiesUtilsUnitTest but PR #4947 modifies VariantContextTestUtils, so I'll do some refactoring before the second of the two is merged
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remember to resolve this when #4947 is completed.

}

@Test
public void testLowQualVariantToGQ0HomRef() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgotten test?

import java.util.Map;

public class ReblockGVCFUnitTest {
private final ReblockGVCF reblocker = new ReblockGVCF();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just create this in each test that uses it, otherwise we potentially get weird stateful issues across multiple tests.

import java.util.List;
import java.util.Map;

public class ReblockGVCFUnitTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these tests. Complicated functions like these have a tendency to lose bits of logic later on if there aren't direct tests for them.

final Allele alt1 = Allele.create("T", false);
final Allele ref2 = Allele.create("TACACACACACTACTA", true);
final Allele ref3 = Allele.create("C", true);
final VariantContext deletion1 = new VariantContextBuilder(null, "1", 10000, 10020, Arrays.asList(ref1, alt1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setting these things up is always such a pain, thank you for writing it

@lbergelson lbergelson assigned ldgauthier and unassigned lbergelson Jul 6, 2018
@codecov-io
Copy link

codecov-io commented Jul 9, 2018

Codecov Report

Merging #4940 into master will decrease coverage by 0.098%.
The diff coverage is 87.733%.

@@               Coverage Diff               @@
##              master     #4940       +/-   ##
===============================================
- Coverage     86.753%   86.655%   -0.098%     
+ Complexity     29767     29110      -657     
===============================================
  Files           1825      1811       -14     
  Lines         137744    135032     -2712     
  Branches       15181     14982      -199     
===============================================
- Hits          119497    117012     -2485     
+ Misses         12729     12576      -153     
+ Partials        5518      5444       -74
Impacted Files Coverage Δ Complexity Δ
...otypecaller/HaplotypeCallerArgumentCollection.java 100% <ø> (ø) 2 <0> (ø) ⬇️
...org/broadinstitute/hellbender/utils/MathUtils.java 78.17% <ø> (ø) 209 <0> (ø) ⬇️
...ute/hellbender/utils/variant/GATKVCFConstants.java 80% <ø> (ø) 4 <0> (ø) ⬇️
...der/tools/walkers/annotator/RMSMappingQuality.java 93.814% <100%> (ø) 40 <2> (ø) ⬇️
...e/hellbender/utils/variant/writers/GVCFWriter.java 93.333% <100%> (+0.075%) 34 <1> (+1) ⬆️
.../hellbender/utils/variant/writers/HomRefBlock.java 93.243% <100%> (+0.188%) 36 <4> (+1) ⬆️
...ellbender/tools/walkers/annotator/QualByDepth.java 97.436% <100%> (+0.139%) 18 <2> (+1) ⬆️
...plotypecaller/HaplotypeCallerGenotypingEngine.java 78.014% <100%> (-1.856%) 32 <4> (-5)
...ellbender/tools/walkers/GenotypeGVCFsUnitTest.java 100% <100%> (ø) 21 <1> (ø) ⬇️
...s/walkers/annotator/RMSMappingQualityUnitTest.java 96.859% <100%> (ø) 26 <0> (ø) ⬇️
... and 290 more

@ldgauthier
Copy link
Contributor Author

@lbergelson this is good to go, unless my other PR gets merged first, in which case I'll need to update the expected test data.

@lbergelson
Copy link
Member

That person might have been me...

*
* <h3>Caveats</h3>
* <p>Only single-sample gVCF files produced by HaplotypeCaller can be used as input for this tool.</p>
* <p>By default this tool only passes through annotations used by VQSR. A different set of annotations can be specified with the usual -A argument. Annotation groups are not supported by this tool.</p>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think annotation groups are automatically supported and propagated to the annotation engine via makeVariantAnnotations, unless the tool manually intervenes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- that was probably just a leftover from the GATK3 version.

@lbergelson
Copy link
Member

@ldgauthier I think there's an newly introduce compile warning because of the method names that I deprecated in htsjdk.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me with a quick re-review. There's a compilation warning that's failing things due to that deprecated Indeces function, quick fix to that and I think you're good to merge 👍

@lbergelson lbergelson assigned ldgauthier and unassigned lbergelson Oct 3, 2018
@lbergelson lbergelson merged commit 7368c62 into master Oct 3, 2018
@lbergelson lbergelson deleted the ldg_ReblockGVCF branch October 3, 2018 20:11
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
* adds ReblockGVCF a tool to merge reference blocks in single-sample GVCFs for smaller file size
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants