M2 pon created with GenomicsDB, skips germline, and reports stats #5675

davidbenjamin · 2019-02-14T02:07:39Z

@takutosato This PR does a few things:

* The PoN creation workflow now uses GenomicsDB import in order to scale to more and larger inputs. This gets around problems that @LeeTL1220 encountered while making a REBC PoN from 250 noisy WGS samples.
* PoN creation by default ignores samples in which an allele appears as a germline variant.
* The PoN includes a summary beta distribution of allele frequency. This is not yet hooked up to M2, but it might be later, and some users have asked for it.

codecov-io · 2019-02-14T02:50:44Z

Codecov Report

Merging #5675 into master will decrease coverage by 6.723%.
The diff coverage is 94.118%.

@@              Coverage Diff               @@
##              master    #5675       +/-   ##
==============================================
- Coverage     87.043%   80.32%   -6.723%     
+ Complexity     31707    30118     -1589     
==============================================
  Files           1940     1940               
  Lines         146172   146285      +113     
  Branches       16130    16137        +7     
==============================================
- Hits          127233   117496     -9737     
- Misses         13051    23101    +10050     
+ Partials        5888     5688      -200

Impacted Files	Coverage Δ	Complexity Δ
.../basicshortmutpileup/BetaBinomialDistribution.java	`65.385% <50%> (-2.797%)`	`5 <1> (+1)`
...itute/hellbender/tools/walkers/SplitIntervals.java	`85.714% <75%> (-3.175%)`	`7 <1> (+1)`
...ls/walkers/mutect/CreateSomaticPanelOfNormals.java	`93.846% <93.651%> (+6.346%)`	`21 <20> (+13)`	⬆️
...ct/CreateSomaticPanelOfNormalsIntegrationTest.java	`97.701% <97.561%> (-2.299%)`	`10 <7> (+7)`
...rs/variantutils/SelectVariantsIntegrationTest.java	`0.255% <0%> (-99.745%)`	`1% <0%> (-70%)`
...kers/filters/VariantFiltrationIntegrationTest.java	`0.826% <0%> (-99.174%)`	`1% <0%> (-25%)`
...dorientation/CollectF1R2CountsIntegrationTest.java	`0.917% <0%> (-99.083%)`	`1% <0%> (-12%)`
.../walkers/bqsr/BaseRecalibratorIntegrationTest.java	`1.031% <0%> (-98.969%)`	`1% <0%> (-7%)`
...ers/vqsr/FilterVariantTranchesIntegrationTest.java	`1.053% <0%> (-98.947%)`	`1% <0%> (-5%)`
...s/variantutils/VariantsToTableIntegrationTest.java	`1.205% <0%> (-98.795%)`	`1% <0%> (-20%)`
... and 168 more

takutosato

This looks great

takutosato · 2019-02-14T16:56:41Z

...ain/java/org/broadinstitute/hellbender/tools/walkers/mutect/CreateSomaticPanelOfNormals.java

+    }
+
+    private static final double germlineProbability(final double alleleFrequency, final int altCount, final int totalCount) {
+        final double hetPrior = alleleFrequency * (1 - alleleFrequency) / 2;


*2 instead of /2 here?

davidbenjamin · 2019-02-14

quite right!
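For reference, the factor of 2 comes from Hardy-Weinberg proportions: at population allele frequency f, the genotype priors are (1 - f)^2 for hom-ref, 2 f (1 - f) for het, and f^2 for hom-var, and the three sum to 1. The sketch below only illustrates that arithmetic. It is not the code from this PR, and the class and method names are hypothetical.

```java
public final class HardyWeinbergPriors {
    private HardyWeinbergPriors() {}

    /** Prior of a homozygous-reference genotype: (1 - f)^2. */
    public static double homRefPrior(final double alleleFrequency) {
        return (1 - alleleFrequency) * (1 - alleleFrequency);
    }

    /** Prior of a heterozygous genotype: 2 f (1 - f). Note the multiplication by 2, not division. */
    public static double hetPrior(final double alleleFrequency) {
        return 2 * alleleFrequency * (1 - alleleFrequency);
    }

    /** Prior of a homozygous-variant genotype: f^2. */
    public static double homVarPrior(final double alleleFrequency) {
        return alleleFrequency * alleleFrequency;
    }

    public static void main(final String[] args) {
        final double f = 0.1; // illustrative allele frequency, not taken from the PR
        final double total = homRefPrior(f) + hetPrior(f) + homVarPrior(f);
        System.out.println("het prior = " + hetPrior(f));
        System.out.println("sum of genotype priors = " + total);
    }
}
```

With `/ 2` in place of `* 2` the het prior would be four times too small, and the three genotype priors would no longer sum to 1.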


ldgauthier · 2019-02-14T19:26:54Z

...ain/java/org/broadinstitute/hellbender/tools/walkers/mutect/CreateSomaticPanelOfNormals.java

- * <h4>Step 2. Create a file ending with .args or .list extension with the paths to the VCFs from step 1, one per line.</h4>
- * <p>This approach is optional. Other extensions will error the run. </p>
+ * <h4>Step 2. Create a GenomicsDB from the normal Mutect2 calls.</h4>
+ * Note that GenomicsDBImport is currently (as of February 2019) inefficient when processing multiple intervals.  Therefore,


Depends on what your data looks like. For exomes it should be fine now if you use --merge-input-intervals #5540

davidbenjamin · 2019-02-15

@ldgauthier The M2 pon wdl I wrote for this PR scatters over contigs, making a chr1 pon from a chr1 GenomicsDB, a chr2 pon from a chr2 GenomicsDB, and then merging. Are you saying that instead I should make a single GenomicsDB with --merge-input-intervals and create a single pon from that DB? Is this true even if it's a WGS pon and there are lots of variants?

ldgauthier · 2019-02-15

~24 contigs is probably not a big deal. (That arg also won't help WGS because each contig has to get its own interval anyway.) Exomes were the real pathological case because there's a very significant startup cost for each interval and even for scattered exomes we had O(1000) intervals.
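To make the interval-count argument concrete, here is a rough sketch of collapsing many small intervals into one spanning interval per contig, which is my understanding of the effect --merge-input-intervals has on the import. It is an illustration only, not GenomicsDBImport's implementation; the `Interval` class and `spanPerContig` method are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class SpanningIntervals {
    private SpanningIntervals() {}

    /** Minimal 1-based closed interval; a stand-in for a real interval type. */
    public static final class Interval {
        public final String contig;
        public final int start;
        public final int end;

        public Interval(final String contig, final int start, final int end) {
            this.contig = contig;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return contig + ":" + start + "-" + end;
        }
    }

    /** Collapses all intervals on a contig into one interval from the smallest start to the largest end. */
    public static Map<String, Interval> spanPerContig(final List<Interval> intervals) {
        final Map<String, Interval> result = new LinkedHashMap<>();
        for (final Interval interval : intervals) {
            result.merge(interval.contig, interval, (a, b) ->
                    new Interval(a.contig, Math.min(a.start, b.start), Math.max(a.end, b.end)));
        }
        return result;
    }

    public static void main(final String[] args) {
        // Made-up exome-like targets: many small intervals, few contigs.
        final List<Interval> targets = Arrays.asList(
                new Interval("chr1", 100, 200),
                new Interval("chr1", 500, 900),
                new Interval("chr1", 300, 400),
                new Interval("chr2", 50, 60));
        System.out.println(spanPerContig(targets).values());
    }
}
```

The number of intervals after merging is bounded by the number of contigs, which is why a per-interval startup cost matters for O(1000) exome targets but not for ~24 WGS contigs.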

ldgauthier · 2019-02-14T19:28:19Z

...ain/java/org/broadinstitute/hellbender/tools/walkers/mutect/CreateSomaticPanelOfNormals.java

+ *       --genomicsdb-workspace-path pon_db \
+ *       -V normal1.vcf.gz \
+ *       -V normal2.vcf.gz \
+ *       -V normal3.vcf.gz


you could also use a sample_map file if you have a lot of samples like Lee

davidbenjamin · 2019-02-14

I would hope that anyone doing this on a large scale would be running via the WDL and wouldn't have to worry about an unwieldy command. But it can't hurt to mention, I suppose.
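As a sketch of the sample map alternative: GenomicsDBImport accepts a tab-delimited file via --sample-name-map, with one sample name and one VCF path per line, in place of repeated -V arguments. The helper below generates such lines. The class name, the sample names, and the output file name are assumptions for illustration; in practice each sample name should match the one in the corresponding VCF.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class SampleMapWriter {
    private SampleMapWriter() {}

    /** Builds one "sampleName<TAB>vcfPath" line per entry, preserving the map's iteration order. */
    public static List<String> sampleMapLines(final Map<String, String> sampleToVcf) {
        final List<String> lines = new ArrayList<>();
        for (final Map.Entry<String, String> entry : sampleToVcf.entrySet()) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        return lines;
    }

    public static void main(final String[] args) throws IOException {
        // Hypothetical sample names paired with the VCF names from the doc example above.
        final Map<String, String> sampleToVcf = new LinkedHashMap<>();
        sampleToVcf.put("normal1", "normal1.vcf.gz");
        sampleToVcf.put("normal2", "normal2.vcf.gz");
        sampleToVcf.put("normal3", "normal3.vcf.gz");

        // Hypothetical output file name.
        Files.write(Paths.get("pon_samples.sample_map"), sampleMapLines(sampleToVcf), StandardCharsets.UTF_8);
    }
}
```

The resulting file would then be passed as `--sample-name-map pon_samples.sample_map` instead of listing each normal with `-V`.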

davidbenjamin added the Mutect label Feb 14, 2019

davidbenjamin assigned takutosato Feb 14, 2019

davidbenjamin requested a review from takutosato February 14, 2019 02:07

davidbenjamin mentioned this pull request Feb 14, 2019

Allow CreateSomaticPanelOfNormals to pass --sites-only-vcf-output=false for mapping bias calculations #5649

Closed

takutosato approved these changes Feb 14, 2019


Mutect2 pon is created with genomics DB, removes germline variants, and reports frequency stats

98fbe29

davidbenjamin force-pushed the db_m2_pon branch from a61ad59 to 98fbe29 Compare February 14, 2019 17:08

ldgauthier reviewed Feb 14, 2019


docs

2ed4c95

davidbenjamin merged commit 7083141 into master Feb 15, 2019

davidbenjamin deleted the db_m2_pon branch February 15, 2019 21:05

davidbenjamin mentioned this pull request Feb 25, 2019

Add progress meter to CreateSomaticPanelOfNormals #5629

Closed
