M2 pon created with GenomicsDB, skips germline, and reports stats #5675
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5675 +/- ##
==============================================
- Coverage 87.043% 80.32% -6.723%
+ Complexity 31707 30118 -1589
==============================================
Files 1940 1940
Lines 146172 146285 +113
Branches 16130 16137 +7
==============================================
- Hits 127233 117496 -9737
- Misses 13051 23101 +10050
+ Partials 5888 5688 -200
This looks great
private static final double germlineProbability(final double alleleFrequency, final int altCount, final int totalCount) {
    final double hetPrior = alleleFrequency * (1 - alleleFrequency) / 2;
`*2` instead of `/2` here?
quite right!
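The fix the reviewers agree on is the Hardy-Weinberg heterozygote frequency, 2p(1-p), rather than p(1-p)/2. A minimal sketch of just that prior (the class and method names here are hypothetical, not the actual GATK code):

```java
// Hypothetical sketch of the corrected heterozygote prior, not the actual GATK method.
// Under Hardy-Weinberg equilibrium, the probability of a het genotype at allele
// frequency p is 2 * p * (1 - p); the code under review divided by 2 instead.
public class HetPriorSketch {
    static double hetPrior(final double alleleFrequency) {
        return 2 * alleleFrequency * (1 - alleleFrequency);
    }

    public static void main(String[] args) {
        // The het prior is maximal at p = 0.5: 2 * 0.5 * 0.5 = 0.5
        System.out.println(hetPrior(0.5));
    }
}
```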
Force-pushed the `…nd reports frequency stats` branch from a61ad59 to 98fbe29
* <h4>Step 2. Create a file ending with a .args or .list extension containing the paths to the VCFs from step 1, one per line.</h4>
* <p>This step is optional. Other extensions will cause the run to error.</p>
* <h4>Step 2. Create a GenomicsDB from the normal Mutect2 calls.</h4>
* Note that GenomicsDBImport is currently (as of February 2019) inefficient when processing multiple intervals. Therefore,
Depends on what your data looks like. For exomes it should be fine now if you use --merge-input-intervals
#5540
@ldgauthier The M2 pon wdl I wrote for this PR scatters over contigs, making a chr1 pon from a chr1 GenomicsDB, a chr2 pon from a chr2 GenomicsDB, and then merging. Are you saying that instead I should make a single GenomicsDB with --merge-input-intervals
and create a single pon from that DB? Is this true even if it's a WGS pon and there are lots of variants?
~24 contigs is probably not a big deal. (That arg also won't help WGS because each contig has to get its own interval anyway.) Exomes were the real pathological case because there's a very significant startup cost for each interval and even for scattered exomes we had O(1000) intervals.
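The single-workspace exome approach @ldgauthier describes might look roughly like the following (a sketch only; the interval file and VCF names are placeholders, and `--merge-input-intervals` is the flag from #5540 that collapses the many exome intervals into one import, avoiding the per-interval startup cost mentioned above):

```shell
# Sketch of a single GenomicsDB import over merged exome intervals (placeholder paths).
gatk GenomicsDBImport \
    --genomicsdb-workspace-path pon_db \
    -L exome_targets.interval_list \
    --merge-input-intervals \
    -V normal1.vcf.gz \
    -V normal2.vcf.gz
```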
*   --genomicsdb-workspace-path pon_db \
*   -V normal1.vcf.gz \
*   -V normal2.vcf.gz \
*   -V normal3.vcf.gz
You could also use a sample_map file if you have a lot of samples, like Lee.
I would hope that anyone doing this on a large scale would be running via the WDL and wouldn't have to worry about an unwieldy command. But it can't hurt to mention, I suppose.
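For reference, the sample map mentioned above is a two-column, tab-separated file (sample name, then VCF path) passed to GenomicsDBImport's `--sample-name-map` option in place of many `-V` arguments. A hypothetical example (file and sample names are placeholders):

```shell
# Hypothetical sample_map for GenomicsDBImport: sample name, tab, path to VCF.
cat > pon_sample_map.tsv <<'EOF'
normal1	normal1.vcf.gz
normal2	normal2.vcf.gz
normal3	normal3.vcf.gz
EOF

# The import would then take the map instead of repeated -V arguments, e.g.:
#   gatk GenomicsDBImport --genomicsdb-workspace-path pon_db \
#       -L intervals.interval_list --sample-name-map pon_sample_map.tsv
wc -l < pon_sample_map.tsv
```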
@takutosato This PR does a few things, among them moving the PoN workflow to GenomicsDB import in order to scale to more and larger inputs. This gets around problems that @LeeTL1220 encountered while making a REBC PoN from 250 noisy WGS samples.