Added QC metrics to the Germline CNV workflow #6017

Merged: 2 commits, Aug 6, 2019

Conversation

asmirnov239
Collaborator

This adds two new tasks that perform QC checks on gCNV output, namely:

  • Check that the number of events in the segments VCFs does not exceed a preset value
  • Check that not all PCs are used by the model (if even one of the shards fails, the QC status will be negative)

Both tasks output a *qc_status.txt file, per sample and per model respectively; the file contains the string "PASS" or a string describing the failure condition.
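
As a rough illustration of the "one failing shard makes the model status negative" rule described above, here is a minimal bash sketch. The function name `aggregate_qc_status` and the failure string `ALL_PRINCIPAL_COMPONENTS_USED` are hypothetical and are not taken from the PR; the only behaviour borrowed from the description is that the status is "PASS" or a string describing the failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): combine per-shard QC strings into a
# single model-level status. Prints "PASS" only if every shard reported
# "PASS"; otherwise prints the first non-PASS message it encounters.
aggregate_qc_status() {
    local status
    for status in "$@"; do
        if [ "$status" != "PASS" ]; then
            printf '%s\n' "$status"
            return 0
        fi
    done
    printf 'PASS\n'
}

# Example: the second shard fails, so the model-level status is negative.
# The failure string here is an illustrative placeholder.
aggregate_qc_status "PASS" "ALL_PRINCIPAL_COMPONENTS_USED" "PASS"
```

In a task, the printed status could be redirected into a file matching the *qc_status.txt pattern mentioned above.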

@ldgauthier
Contributor

Can we take this opportunity to add an example inputs JSON and make sure the parameter defaults match what Jack has been using? My goal is to be able to cut a release and then drop the scripts from the repo right into a featured workspace.

@samuelklee
Contributor

@ldgauthier I think at some point we removed example JSONs in both the CNV and M2 WDL directories. I believe the reasoning was that those JSONs were mostly non-informative templates that could just as easily be generated with womtool inputs; since they were also not tested (in contrast to the JSONs used by the Travis WDL tests), they had to be kept in sync manually. @davidbenjamin @LeeTL1220 can correct me if I'm wrong.

In contrast, providing Jack's hyperparameters for WES via JSONs will actually be informative! However, we will inevitably run into some issues touched upon in #4719. I agree that it would be desirable to set some default WES/WGS hyperparameters in the featured workspace. However, I hope this wouldn't require two separate workspaces for WES/WGS or any shenanigans like that. Ideally, this sort of thing could be covered at the tool level with argsets, as mentioned in that issue. @droazen any updates there?

In any case, I'm not sure having the JSON in this repo and not covered by any tests is what we want.
Maybe @bshifaw can chime in? Are the featured workspaces covered by tests elsewhere? What is the current SOP for taking workflows from this repo, turning them into featured workspaces, and populating their configurations?

@droazen
Collaborator

droazen commented Jun 26, 2019

@cmnbroad Could you comment on @samuelklee's question about argsets above? Thanks!

@cmnbroad
Collaborator

Sadly, no work or progress has been done on the argset idea at all.

@codecov

codecov bot commented Jun 26, 2019

Codecov Report

Merging #6017 into master will increase coverage by 0.285%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #6017       +/-   ##
===============================================
+ Coverage     86.927%   87.212%   +0.285%     
+ Complexity     32765     32721       -44     
===============================================
  Files           2016      2011        -5     
  Lines         151466    150954      -512     
  Branches       16628     16133      -495     
===============================================
- Hits          131665    131650       -15     
+ Misses         13737     13692       -45     
+ Partials        6064      5612      -452

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...er/utils/runtime/CapturedStreamOutputSnapshot.java | 76.923% <0%> (-1.648%) | 4% <0%> (ø) |
| ...t/java/org/broadinstitute/hellbender/MainTest.java | 84.746% <0%> (-0.969%) | 15% <0%> (ø) |
| ...der/cmdline/CommandLineProgramIntegrationTest.java | 90.909% <0%> (-0.758%) | 5% <0%> (ø) |
| ...ellbender/tools/walkers/vqsr/CNNScoreVariants.java | 79.736% <0%> (-0.709%) | 45% <0%> (ø) |
| ...ls/walkers/varianteval/util/EvaluationContext.java | 75.676% <0%> (-0.64%) | 12% <0%> (ø) |
| ...te/hellbender/utils/runtime/ProcessController.java | 56.338% <0%> (-0.606%) | 8% <0%> (ø) |
| ...der/utils/solver/SynchronizedUnivariateSolver.java | 81.609% <0%> (-0.413%) | 11% <0%> (ø) |
| ...ools/walkers/haplotypecaller/graphs/TestGraph.java | 89.286% <0%> (-0.369%) | 6% <0%> (ø) |
| ...spark/sv/utils/SingleSequenceReferenceAligner.java | 80.556% <0%> (-0.266%) | 16% <0%> (ø) |
| ...gine/spark/datasources/ReadsSparkSinkUnitTest.java | 75.41% <0%> (-0.2%) | 23% <0%> (ø) |

... and 312 more

@asmirnov239
Collaborator Author

@bshifaw related to what Sam was saying: we also have a few standard resources needed to run the workflows that we would like to share with users. What is the standard procedure for doing so? Ideally they would be bundled with featured workspaces, but also accessible from outside of Terra.

@asmirnov239
Collaborator Author

@ldgauthier @mwalker174 Could you please review?

Contributor

@mwalker174 mwalker174 left a comment

Mostly looks good, just a couple of questions.

Array[File] genotyped_segments_vcf
Array[String] entity_ids

Int? maximum_number_events = 120
Contributor

For general use, I think this should be a required argument with no default value, as it will depend heavily on experimental design.

@ldgauthier For production, is it typical to provide PASS/FAIL rather than reporting the raw metric and letting users interpret it?

Contributor

That's a fair point about experimental design, especially when it comes to exome interval lists. For a small panel 60 events might still be outrageous. I would be more comfortable with a default value if there was a way to tie the maximum event number to the interval list. Otherwise I guess we can provide recommendations in a readme somewhere.

This is a little different from production, but we do have some hard pass/fail cutoffs, though those are things like coverage, contamination, and percent chimeric reads, which won't vary based on capture.

Collaborator Author

Okay, I moved the maximum_number_events argument up to the workflow-level input.

entity_ids=(${sep=" " entity_ids})
for index in ${dollar}{!genotyped_segments_vcfs_array[@]}; do
NUM_SEGMENTS=$(grep -v '@' ${dollar}{genotyped_segments_vcfs_array[$index]} | wc -l)
if [ $NUM_SEGMENTS -lt ${maximum_number_events} ]; then
Contributor

How are we defining "event"? If that includes copy-neutral segments then this script is fine, but when I hear "event" I think of DELs/DUPs.

Collaborator Author

It's the number of non-comment lines in the genotyped segments VCF file, so everything, including copy-neutral segments (which means there are at least 24 events in each sample).

Contributor

I think this could be a point of confusion for users. Can you change maximum_number_of_events to maximum_number_of_segments and clarify how it is defined (as you have here) in the comment at the workflow input?
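
To make the definition discussed in this thread concrete, here is a minimal bash sketch that counts every record in a segments VCF (copy-neutral segments included) and compares the count to a limit. It is a sketch under stated assumptions, not the task's actual code: the function names and the failure string are hypothetical, the VCF is assumed to be uncompressed, header lines are recognised by a leading `#` (the VCF header marker, which differs from the `'@'` pattern in the quoted diff), and the comparison uses `-le` to match the "does not exceed" wording, whereas the quoted diff uses `-lt`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count records in an uncompressed VCF. Header lines
# start with '#'; every other non-empty line is one segment, including
# copy-neutral ones. grep exits non-zero on zero matches, hence '|| true'.
count_vcf_records() {
    grep -c '^[^#]' "$1" || true
}

# Hypothetical helper: print "PASS" when the segment count does not exceed
# the limit, otherwise a message describing the failure. The message text
# is illustrative only.
segment_count_qc() {
    local vcf="$1" max_segments="$2" num_segments
    num_segments=$(count_vcf_records "$vcf")
    if [ "$num_segments" -le "$max_segments" ]; then
        echo "PASS"
    else
        echo "EXCESSIVE_NUMBER_OF_SEGMENTS"
    fi
}

# Usage (paths and limit are placeholders):
#   segment_count_qc sample.genotyped_segments.vcf 120 > sample.qc_status.txt
```

Because copy-neutral segments are counted, any limit chosen this way has to sit above the baseline number of segments a clean sample produces.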

@@ -242,6 +250,7 @@ workflow CNVGermlineCaseWorkflow {
Array[File] gcnv_tracking_tars = GermlineCNVCallerCaseMode.gcnv_tracking_tar
Array[File] genotyped_intervals_vcf = PostprocessGermlineCNVCalls.genotyped_intervals_vcf
Array[File] genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf
Array[File] qc_status_files = CollectSampleQualityMetrics.qc_status_files
Contributor

Ideally I'd want to be able to flag failing samples in an obvious way in the workspace, like having new fields in the data model called "sample_quality" and "model_quality" with the QC status reported there. Are we violently opposed to having a Cromwell version and a Firecloud version of this WDL? (@LeeTL1220)

Collaborator Author

I've done what Lee recommended. Note, however, that I had to make the task process one sample at a time.

@LeeTL1220
Contributor

LeeTL1220 commented Jul 2, 2019 via email

@bshifaw

bshifaw commented Jul 2, 2019

Sorry, it's difficult for me to spot git notifications in my email.

> Maybe @bshifaw can chime in? Are the featured workspaces covered by tests elsewhere? What is the current SOP for taking workflows from this repo, turning them into featured workspaces, and populating their configurations?

Example JSONs with input test data are usually introduced in the gatk-workflows git repos and carried over to the featured workspaces. That isn't to say they are not welcomed from the gatk repo.

> @bshifaw related to what Sam was saying - we also have a few standard resources needed to run the workflows that we would like to share with users. What is the standard procedure for doing so? Ideally they would be bundled with featured workspaces, but also accessible from outside of Terra.

Workflow resource files that are not already in broad-references would be saved in the gatk-best-practices bucket. In the past I've separated the resource files by workflow directory (e.g. pathseq, cnn-hg38), but you can organize them differently if the resource files would be shared by other workflows (e.g. somatic-hg38, somatic-b37).

@asmirnov239
Collaborator Author

Thanks @bshifaw, I see that the contig ploidy prior file is already in the best practices bucket!

Contributor

@ldgauthier ldgauthier left a comment

LGTM 👍

@asmirnov239 asmirnov239 merged commit e9dec18 into master Aug 6, 2019
8 participants