Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

M2 pon created with GenomicsDB, skips germline, and reports stats #5675

Merged
merged 2 commits into from
Feb 15, 2019

Conversation

davidbenjamin
Copy link
Contributor

@takutosato This PR does a few things:

  • The PoN creation workflow now uses GenomicsDB import in order to scale to more and larger inputs. This gets around problems that @LeeTL1220 encountered while making a REBC PoN from 250 noisy WGS samples.
  • PoN creation by default ignores samples that have an allele as a germline variant.
  • The PoN includes a summary beta distribution of allele frequency. This is not yet hooked up to M2 but it might be later and some users have asked for it.

@codecov-io
Copy link

codecov-io commented Feb 14, 2019

Codecov Report

Merging #5675 into master will decrease coverage by 6.723%.
The diff coverage is 94.118%.

@@              Coverage Diff               @@
##              master    #5675       +/-   ##
==============================================
- Coverage     87.043%   80.32%   -6.723%     
+ Complexity     31707    30118     -1589     
==============================================
  Files           1940     1940               
  Lines         146172   146285      +113     
  Branches       16130    16137        +7     
==============================================
- Hits          127233   117496     -9737     
- Misses         13051    23101    +10050     
+ Partials        5888     5688      -200
Impacted Files Coverage Δ Complexity Δ
.../basicshortmutpileup/BetaBinomialDistribution.java 65.385% <50%> (-2.797%) 5 <1> (+1)
...itute/hellbender/tools/walkers/SplitIntervals.java 85.714% <75%> (-3.175%) 7 <1> (+1)
...ls/walkers/mutect/CreateSomaticPanelOfNormals.java 93.846% <93.651%> (+6.346%) 21 <20> (+13) ⬆️
...ct/CreateSomaticPanelOfNormalsIntegrationTest.java 97.701% <97.561%> (-2.299%) 10 <7> (+7)
...rs/variantutils/SelectVariantsIntegrationTest.java 0.255% <0%> (-99.745%) 1% <0%> (-70%)
...kers/filters/VariantFiltrationIntegrationTest.java 0.826% <0%> (-99.174%) 1% <0%> (-25%)
...dorientation/CollectF1R2CountsIntegrationTest.java 0.917% <0%> (-99.083%) 1% <0%> (-12%)
.../walkers/bqsr/BaseRecalibratorIntegrationTest.java 1.031% <0%> (-98.969%) 1% <0%> (-7%)
...ers/vqsr/FilterVariantTranchesIntegrationTest.java 1.053% <0%> (-98.947%) 1% <0%> (-5%)
...s/variantutils/VariantsToTableIntegrationTest.java 1.205% <0%> (-98.795%) 1% <0%> (-20%)
... and 168 more

Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great

}

private static final double germlineProbability(final double alleleFrequency, final int altCount, final int totalCount) {
final double hetPrior = alleleFrequency * (1 - alleleFrequency) / 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*2 instead of /2 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quite right!

* <h4>Step 2. Create a file ending with .args or .list extension with the paths to the VCFs from step 1, one per line.</h4>
* <p>This approach is optional. Other extensions will error the run. </p>
* <h4>Step 2. Create a GenomicsDB from the normal Mutect2 calls.</h4>
* Note that GenomicsDBImport is currently (as of February 2019) inefficient when processing multiple intervals. Therefore,
Copy link
Contributor

@ldgauthier ldgauthier Feb 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on what your data looks like. For exomes it should be fine now if you use --merge-input-intervals #5540

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldgauthier The M2 pon wdl I wrote for this PR scatters over contigs, making a chr1 pon from a chr1 GenomicsDB, a chr2 pon from a chr2 GenomicsDB, and then merging. Are you saying that instead I should make a single GenomicsDB with --merge-input-intervals and create a single pon from that DB? Is this true even if it's a WGS pon and there are lots of variants?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~24 contigs is probably not a big deal. (That arg also won't help WGS because each contig has to get its own interval anyway.) Exomes were the real pathological case because there's a very significant startup cost for each interval and even for scattered exomes we had O(1000) intervals.

* --genomicsdb-workspace-path pon_db \
* -V normal1.vcf.gz \
* -V normal2.vcf.gz \
* -V normal3.vcf.gz
Copy link
Contributor

@ldgauthier ldgauthier Feb 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could also use a sample_map file if you have a lot of samples like Lee

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope that anyone doing this on a large scale would be running via the WDL and wouldn't have to worry about an unwieldy command. But it can't hurt to mention, I suppose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants