Added QC metrics to the Germline CNV workflow #6017
Conversation
Can we take this opportunity to add an example inputs json and make sure the parameter defaults match what Jack has been using? My goal is to be able to cut a release and then drop the scripts from the repo right into a featured workspace.
@ldgauthier I think at some point we removed example JSONs in both the CNV and M2 WDL directories. I believe the reasoning was that those JSONs were mostly non-informative templates that could just as easily be generated automatically. In contrast, providing Jack's hyperparameters for WES via JSONs will actually be informative! However, we will inevitably run into some issues touched upon in #4719. I agree that it would be desirable to set some default WES/WGS hyperparameters in the featured workspace. However, I hope this wouldn't require two separate workspaces for WES/WGS or any shenanigans like that. Ideally, this sort of thing could be covered at the tool level with argsets, as mentioned in that issue. @droazen any updates there? In any case, I'm not sure having the JSON in this repo and not covered by any tests is what we want.
@cmnbroad Could you comment on @samuelklee's question about argsets above? Thanks!
Sadly, no progress has been made on the argset idea at all.
Codecov Report
```
@@             Coverage Diff              @@
##            master     #6017     +/-   ##
===========================================
+ Coverage   86.927%   87.212%   +0.285%
+ Complexity   32765     32721       -44
===========================================
  Files         2016      2011        -5
  Lines       151466    150954      -512
  Branches     16628     16133      -495
===========================================
- Hits        131665    131650       -15
+ Misses       13737     13692       -45
+ Partials      6064      5612      -452
```
Force-pushed from 83212d3 to e20d178.
@bshifaw related to what Sam was saying - we also have a few standard resources needed to run the workflows that we would like to share with users. What is the standard procedure for doing so? Ideally they would be bundled with featured workspaces, but also be accessible from outside of Terra.
@ldgauthier @mwalker174 Could you please review?
Mostly looks good, just a couple of questions.
scripts/cnv_wdl/cnv_common_tasks.wdl (Outdated)
```
Array[File] genotyped_segments_vcf
Array[String] entity_ids

Int? maximum_number_events = 120
```
For general use, I think this should be a required argument with no default value, since it will depend heavily on experimental design.
@ldgauthier For production, is it typical to provide PASS/FAIL rather than reporting the raw metric and letting users interpret it?
That's a fair point about experimental design, especially when it comes to exome interval lists. For a small panel 60 events might still be outrageous. I would be more comfortable with a default value if there was a way to tie the maximum event number to the interval list. Otherwise I guess we can provide recommendations in a readme somewhere.
This is a little different from production, but we do have some hard pass/fail cutoffs, though those are things like coverage, contamination, and percent chimeric reads, which won't vary based on capture.
Okay, I moved the maximum_number_events argument up to the workflow-level inputs.
scripts/cnv_wdl/cnv_common_tasks.wdl (Outdated)
```
entity_ids=(${sep=" " entity_ids})
for index in ${dollar}{!genotyped_segments_vcfs_array[@]}; do
    NUM_SEGMENTS=$(grep -v '@' ${dollar}{genotyped_segments_vcfs_array[$index]} | wc -l)
    if [ $NUM_SEGMENTS -lt ${maximum_number_events} ]; then
```
How are we defining "event"? If that includes copy-neutral segments then this script is fine, but when I hear "event" I think of DELs/DUPs.
It's the number of non-comment lines in the genotyped segments VCF file, so everything including copy-neutral segments (so there are at least 24 events in each sample).
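To make the counting behavior concrete, here is a minimal shell sketch of the same logic. The file name and contents are invented for illustration; the only assumption carried over from the task is that `grep -v '@'` is the header filter, which drops any line containing an `@` character and counts everything else, copy-neutral segments included.

```shell
# Toy stand-in for one sample's genotyped segments file. Only the first
# line contains '@', so only it is dropped by the filter below.
cat > toy_segments.vcf <<'EOF'
@HD	header-line-dropped-by-the-filter
chr1	copy-neutral	segment
chr1	DEL	segment
chr2	DUP	segment
EOF

# Count every remaining line, exactly as the task's command block does.
NUM_SEGMENTS=$(grep -v '@' toy_segments.vcf | wc -l)
echo "counted segments: $NUM_SEGMENTS"
```

Note that this counts records, not just DEL/DUP calls, which is exactly the naming confusion raised above.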
I think this could be a point of confusion for users. Can you change maximum_number_of_events to maximum_number_of_segments and clarify how it is defined (as you have here) in the comment at the workflow input?
```
@@ -242,6 +250,7 @@ workflow CNVGermlineCaseWorkflow {
        Array[File] gcnv_tracking_tars = GermlineCNVCallerCaseMode.gcnv_tracking_tar
        Array[File] genotyped_intervals_vcf = PostprocessGermlineCNVCalls.genotyped_intervals_vcf
        Array[File] genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf
+       Array[File] qc_status_files = CollectSampleQualityMetrics.qc_status_files
```
Ideally I'd want to be able to flag failing samples in an obvious way in the workspace, like having new fields in the data model called "sample_quality" and "model_quality" with the QC status reported there. Are we violently opposed to having a Cromwell version and a Firecloud version of this WDL? (@LeeTL1220)
I've done what Lee recommended; however, note that I had to make the task process one sample at a time.
I do not think you should have two versions. Here is a task example from another workflow:
```
output {
    File cnv_acs_conversion_skew = "${output_skew_filename}"
    Float cnv_acs_conversion_skew_float = read_float(output_skew_filename)
    String cnv_acs_conversion_skew_string = read_string(output_skew_filename)
}
```
While this adds some clutter, at least the task (and workflow) produces a file, float, or string. Then you can decide which you actually want to attach to the data model via the method configuration output. Clutter vs. fork? I say "clutter". Also, you may only need one alternate type.
Sorry, it's difficult for me to spot GitHub notifications in my email.
Example JSONs with input test data are usually introduced in the gatk-workflows git repos and carried over to the featured workspaces. That isn't to say they wouldn't be welcome in the gatk repo.
Workflow resource files that are not already in broad-references would be saved in the gatk-best-practices bucket. In the past I've separated the resource files per workflow directory (e.g. pathseq, cnn-hg38), but you can organize them a different way if the resource files would be shared by other workflows (e.g. somatic-hg38, somatic-b37).
Thanks @bshifaw, I see that the contig ploidy prior file is already in the best practices bucket!
LGTM 👍
This adds two new tasks that perform QC checks on gCNV output. Both tasks output a *qc_status.txt file, for each sample and model respectively, and the file contains either the string "PASS" or a string describing the fail condition.
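A minimal shell sketch of that per-sample PASS/FAIL decision, assembled from the snippets quoted in this thread. The input file, its contents, and the fail-condition string are assumptions for illustration; only the 120 cutoff and the `grep -v '@' | wc -l` count come from the PR itself.

```shell
# Assumed cutoff, matching the default discussed in this PR.
maximum_number_events=120

# Invented toy input standing in for one sample's genotyped segments file;
# the '@' header line is dropped by the filter below.
cat > sample1.segments.vcf <<'EOF'
@header	line
chr1	100	1000	segment
chr2	200	2000	segment
EOF

# Count non-header lines, then write PASS or a hypothetical fail-condition
# string to the per-sample qc_status file.
NUM_SEGMENTS=$(grep -v '@' sample1.segments.vcf | wc -l)
if [ "$NUM_SEGMENTS" -lt "$maximum_number_events" ]; then
    echo "PASS" > sample1.qc_status.txt
else
    echo "EXCESSIVE_NUMBER_OF_EVENTS" > sample1.qc_status.txt
fi
```

Because the status lives in a one-line file, the workflow can expose it as a File output (and, per Lee's suggestion above, also as a String via read_string) for attachment to the Terra data model.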