Added QC metrics to the Germline CNV workflow #6017
Conversation
Can we take this opportunity to add an example inputs json and make sure the parameter defaults match what Jack has been using? My goal is to be able to cut a release and then drop the scripts from the repo right into a featured workspace.
@ldgauthier I think at some point we removed example JSONs in both the CNV and M2 WDL directories. I believe the reasoning was that those JSONs were mostly non-informative templates that could just as easily be generated automatically. In contrast, providing Jack's hyperparameters for WES via JSONs will actually be informative! However, we will inevitably run into some issues touched upon in #4719. I agree that it would be desirable to set some default WES/WGS hyperparameters in the featured workspace. However, I hope this wouldn't require two separate workspaces for WES/WGS or any shenanigans like that. Ideally, this sort of thing could be covered at the tool level with argsets, as mentioned in that issue. @droazen any updates there? In any case, I'm not sure having the JSON in this repo and not covered by any tests is what we want.
@cmnbroad Could you comment on @samuelklee's question about argsets above? Thanks!
Sadly, no progress has been made on the argset idea at all.
Codecov Report
```
@@             Coverage Diff              @@
##            master     #6017     +/-   ##
===========================================
+ Coverage   86.927%   87.212%   +0.285%
+ Complexity   32765     32721       -44
===========================================
  Files         2016      2011        -5
  Lines       151466    150954      -512
  Branches     16628     16133      -495
===========================================
- Hits        131665    131650       -15
+ Misses       13737     13692       -45
+ Partials      6064      5612      -452
```
Force-pushed from 83212d3 to e20d178.
@bshifaw related to what Sam was saying - we also have a few standard resources needed to run the workflows that we would like to share with users. What is the standard procedure for doing so? Ideally they would be bundled with featured workspaces, but also be accessible from outside of Terra.
@ldgauthier @mwalker174 Could you please review?
Mostly looks good, just a couple of questions.
scripts/cnv_wdl/cnv_common_tasks.wdl (Outdated)
```
Array[File] genotyped_segments_vcf
Array[String] entity_ids

Int? maximum_number_events = 120
```
For general use, I think this should be a required argument with no default value, since it will depend heavily on experimental design.
@ldgauthier For production, is it typical to provide PASS/FAIL rather than reporting the raw metric and letting users interpret it?
That's a fair point about experimental design, especially when it comes to exome interval lists. For a small panel 60 events might still be outrageous. I would be more comfortable with a default value if there was a way to tie the maximum event number to the interval list. Otherwise I guess we can provide recommendations in a readme somewhere.
This is a little different from production, but we do have some hard pass/fail cutoffs, though those are things like coverage, contamination, and percent chimeric reads, which won't vary based on capture.
Okay, I moved the maximum_number_events argument up to the workflow-level inputs.
scripts/cnv_wdl/cnv_common_tasks.wdl (Outdated)
```
entity_ids=(${sep=" " entity_ids})
for index in ${dollar}{!genotyped_segments_vcfs_array[@]}; do
    NUM_SEGMENTS=$(grep -v '@' ${dollar}{genotyped_segments_vcfs_array[$index]} | wc -l)
    if [ $NUM_SEGMENTS -lt ${maximum_number_events} ]; then
```
How are we defining "event"? If that includes copy-neutral segments then this script is fine, but when I hear "event" I think of DELs/DUPs.
It's the number of non-comment lines in the genotyped segments VCF file, so everything including copy-neutral segments (so there are at least 24 events in each sample).
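To make the counting behavior concrete, here is a minimal shell sketch of the same logic. The file name and contents are invented for illustration; the only assumption carried over from the task is that `grep -v '@'` is the header filter, which drops any line containing an `@` character and counts everything else, copy-neutral segments included.

```shell
# Toy stand-in for one sample's genotyped segments file. Only the first
# line contains '@', so only it is dropped by the filter below.
cat > toy_segments.vcf <<'EOF'
@HD	header-line-dropped-by-the-filter
chr1	copy-neutral	segment
chr1	DEL	segment
chr2	DUP	segment
EOF

# Count every remaining line, exactly as the task's command block does.
NUM_SEGMENTS=$(grep -v '@' toy_segments.vcf | wc -l)
echo "counted segments: $NUM_SEGMENTS"
```

Note that this counts records, not just DEL/DUP calls, which is exactly the naming confusion raised above.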
I think this could be a point of confusion for users. Can you change maximum_number_of_events to maximum_number_of_segments and clarify how it is defined (as you have here) in the comment at the workflow input?
```
@@ -242,6 +250,7 @@ workflow CNVGermlineCaseWorkflow {
        Array[File] gcnv_tracking_tars = GermlineCNVCallerCaseMode.gcnv_tracking_tar
        Array[File] genotyped_intervals_vcf = PostprocessGermlineCNVCalls.genotyped_intervals_vcf
        Array[File] genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf
+       Array[File] qc_status_files = CollectSampleQualityMetrics.qc_status_files
```
Ideally I'd want to be able to flag failing samples in an obvious way in the workspace, like having new fields in the data model called "sample_quality" and "model_quality" with the QC status reported there. Are we violently opposed to having a Cromwell version and a Firecloud version of this WDL? (@LeeTL1220)
I've done what Lee recommended; however, note that I had to make the task process one sample at a time.
I do not think you should have two versions. Here is a task example from another workflow:
```
output {
    File cnv_acs_conversion_skew = "${output_skew_filename}"
    Float cnv_acs_conversion_skew_float = read_float(output_skew_filename)
    String cnv_acs_conversion_skew_string = read_string(output_skew_filename)
}
```
While this adds some clutter, at least the task (and workflow) produces a file, float, or string. Then you can decide which you actually want to attach to the data model via the method configuration output. Clutter vs. fork? I say "clutter". Also, you may only need one alternate type.
Sorry, it's difficult for me to spot GitHub notifications in my email.
Example JSONs with input test data are usually introduced in the gatk-workflows git repos and carried over to the featured workspaces. That isn't to say they wouldn't be welcome in the gatk repo.
Workflow resource files that are not already in broad-references would be saved in the gatk-best-practices bucket. In the past I've separated the resource files per workflow directory (e.g. pathseq, cnn-hg38), but you can organize them a different way if the resource files would be shared by other workflows (e.g. somatic-hg38, somatic-b37).
Thanks @bshifaw, I see that the contig ploidy prior file is already in the best practices bucket!
LGTM 👍
This adds two new tasks that perform QC checks on gCNV output. Both tasks output a *qc_status.txt file, for each sample and model respectively, and the file contains either the string "PASS" or a string describing the fail condition.
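A minimal shell sketch of that per-sample PASS/FAIL decision, assembled from the snippets quoted in this thread. The input file, its contents, and the fail-condition string are assumptions for illustration; only the 120 cutoff and the `grep -v '@' | wc -l` count come from the PR itself.

```shell
# Assumed cutoff, matching the default discussed in this PR.
maximum_number_events=120

# Invented toy input standing in for one sample's genotyped segments file;
# the '@' header line is dropped by the filter below.
cat > sample1.segments.vcf <<'EOF'
@header	line
chr1	100	1000	segment
chr2	200	2000	segment
EOF

# Count non-header lines, then write PASS or a hypothetical fail-condition
# string to the per-sample qc_status file.
NUM_SEGMENTS=$(grep -v '@' sample1.segments.vcf | wc -l)
if [ "$NUM_SEGMENTS" -lt "$maximum_number_events" ]; then
    echo "PASS" > sample1.qc_status.txt
else
    echo "EXCESSIVE_NUMBER_OF_EVENTS" > sample1.qc_status.txt
fi
```

Because the status lives in a one-line file, the workflow can expose it as a File output (and, per Lee's suggestion above, also as a String via read_string) for attachment to the Terra data model.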