
Added QC metrics to the Germline CNV workflow #6017

Merged
2 commits, merged Aug 6, 2019
Changes from 1 commit
95 changes: 95 additions & 0 deletions scripts/cnv_wdl/cnv_common_tasks.wdl
@@ -453,3 +453,98 @@ task PostprocessGermlineCNVCalls {
File genotyped_segments_vcf = genotyped_segments_vcf_filename
}
}

task CollectSampleQualityMetrics {
Array[File] genotyped_segments_vcf
Array[String] entity_ids

Int? maximum_number_events = 120
Contributor: For general use, I think this should be a required argument with no default value, as it will depend heavily on experimental design.

@ldgauthier For production, is it typical to provide PASS/FAIL rather than reporting the raw metric and letting users interpret it?

Contributor: That's a fair point about experimental design, especially when it comes to exome interval lists. For a small panel, 60 events might still be outrageous. I would be more comfortable with a default value if there were a way to tie the maximum event number to the interval list. Otherwise, I guess we can provide recommendations in a readme somewhere.

This is a little different from production, but we do have some hard pass/fail cutoffs, though those are things like coverage, contamination, and percent chimeric reads, which won't vary based on capture.
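As an aside, a minimal sketch (not part of this PR) of how a task could emit the raw segment count alongside the PASS/FAIL status, so users can apply their own threshold downstream. The file names and the MAXIMUM_NUMBER_EVENTS variable here are hypothetical:

#!/usr/bin/env bash
# Hedged sketch only: write the raw metric next to the status (hypothetical file names).
set -e
MAXIMUM_NUMBER_EVENTS=120
NUM_SEGMENTS=$(grep -v '@' sample.genotyped_segments.vcf | wc -l)
printf "sample\tnum_segments\t%s\n" "$NUM_SEGMENTS" > sample.qcMetrics.txt
if [ "$NUM_SEGMENTS" -lt "$MAXIMUM_NUMBER_EVENTS" ]; then
    echo "PASS" > sample.qcStatus.txt
else
    echo "EXCESSIVE_NUMBER_OF_EVENTS" > sample.qcStatus.txt
fi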

Collaborator (Author): Okay, I moved the maximum_number_events argument up to the workflow-level inputs.
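For illustration, a rough sketch of what that workflow-level wiring could look like; the input names, import path, and omissions below are assumptions, not necessarily what the merged commit contains:

# Hypothetical sketch of a workflow-level QC threshold passed down to the task.
import "cnv_common_tasks.wdl" as CNVTasks  # import path may differ in the real workflow

workflow CNVGermlineCaseWorkflow {
    Int maximum_number_events
    String gatk_docker
    # ... other inputs and the upstream CollectCounts / PostprocessGermlineCNVCalls calls omitted ...

    call CNVTasks.CollectSampleQualityMetrics {
        input:
            genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf,
            entity_ids = CollectCounts.entity_id,
            maximum_number_events = maximum_number_events,
            gatk_docker = gatk_docker
    }
}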


# Runtime parameters
String gatk_docker
Int? mem_gb
Int? disk_space_gb
Boolean use_ssd = false
Int? cpu
Int? preemptible_attempts

Int machine_mem_mb = select_first([mem_gb, 1]) * 1000

String dollar = "$" #WDL workaround for using array[@], see https://github.com/broadinstitute/cromwell/issues/1819

command <<<
set -e

genotyped_segments_vcfs_array=(${sep=" " genotyped_segments_vcf})
entity_ids=(${sep=" " entity_ids})
for index in ${dollar}{!genotyped_segments_vcfs_array[@]}; do
NUM_SEGMENTS=$(grep -v '@' ${dollar}{genotyped_segments_vcfs_array[$index]} | wc -l)
if [ $NUM_SEGMENTS -lt ${maximum_number_events} ]; then
Contributor: How are we defining "event"? If that includes copy-neutral segments then this script is fine, but when I hear "event" I think of DELs/DUPs.

Collaborator (Author): It's the number of non-comment lines in the genotyped segments VCF file, so everything including copy-neutral segments (so there are at least 24 events in each sample).

Contributor: I think this could be a point of confusion for users. Can you change maximum_number_of_events to maximum_number_of_segments and clarify how it is defined (as you have here) in the comment at the workflow input?

echo "PASS" >> ./${dollar}{entity_ids[$index]}.qcStatus.txt
else
echo "EXCESSIVE_NUMBER_OF_EVENTS" >> ./${dollar}{entity_ids[$index]}.qcStatus.txt
fi
done
>>>

runtime {
docker: "${gatk_docker}"
memory: machine_mem_mb + " MB"
disks: "local-disk " + select_first([disk_space_gb, 40]) + if use_ssd then " SSD" else " HDD"
cpu: select_first([cpu, 1])
preemptible: select_first([preemptible_attempts, 5])
}

output {
Array[File] qc_status_files = glob("*.qcStatus.txt")
}
}

task CollectModelQualityMetrics {
Array[File] gcnv_model_tars

# Runtime parameters
String gatk_docker
Int? mem_gb
Int? disk_space_gb
Boolean use_ssd = false
Int? cpu
Int? preemptible_attempts

Int machine_mem_mb = select_first([mem_gb, 1]) * 1000

String dollar = "$" #WDL workaround for using array[@], see https://github.com/broadinstitute/cromwell/issues/1819

command <<<
set -e
qc_status="PASS"

gcnv_model_tar_array=(${sep=" " gcnv_model_tars})
for index in ${dollar}{!gcnv_model_tar_array[@]}; do
gcnv_model_tar=${dollar}{gcnv_model_tar_array[$index]}
mkdir MODEL_$index
tar xzf $gcnv_model_tar -C MODEL_$index
ard_file=MODEL_$index/mu_ard_u_log__.tsv

#check whether all values for ARD components are negative
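#(the mu_ard_u_log__.tsv values are log-scale ARD precisions for the model's bias factors;
# a non-positive value suggests that bias factor is still active, so finding no positive
# values implies the model is effectively using all of its principal components)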
NUM_POSITIVE_VALUES=$(awk '{ if (index($0, "@") == 0) {if ($1 > 0.0) {print $1} }}' MODEL_$index/mu_ard_u_log__.tsv | wc -l)
if [ $NUM_POSITIVE_VALUES -eq 0 ]; then
qc_status="ALL_PRINCIPAL_COMPONENTS_USED"
break
fi
done
echo $qc_status >> qcStatus.txt
>>>

runtime {
docker: "${gatk_docker}"
memory: machine_mem_mb + " MB"
disks: "local-disk " + select_first([disk_space_gb, 40]) + if use_ssd then " SSD" else " HDD"
cpu: select_first([cpu, 1])
preemptible: select_first([preemptible_attempts, 5])
}

output {
File qc_status = "qcStatus.txt"
}
}
@@ -191,6 +191,7 @@ workflow CNVGermlineCaseScatteredWorkflow {
Array[Array[File]] gcnv_tracking_tars = CNVGermlineCaseWorkflow.gcnv_tracking_tars
Array[Array[File]] genotyped_intervals_vcf = CNVGermlineCaseWorkflow.genotyped_intervals_vcf
Array[Array[File]] genotyped_segments_vcf = CNVGermlineCaseWorkflow.genotyped_segments_vcf
Array[Array[File]] qc_status_files = CNVGermlineCaseWorkflow.qc_status_files
}
}

9 changes: 9 additions & 0 deletions scripts/cnv_wdl/germline/cnv_germline_case_workflow.wdl
@@ -233,6 +233,14 @@ workflow CNVGermlineCaseWorkflow {
}
}

call CNVTasks.CollectSampleQualityMetrics {
input:
genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf,
entity_ids = CollectCounts.entity_id,
gatk_docker = gatk_docker,
preemptible_attempts = preemptible_attempts
}

output {
File preprocessed_intervals = PreprocessIntervals.preprocessed_intervals
Array[File] read_counts_entity_id = CollectCounts.entity_id
@@ -242,6 +250,7 @@
Array[File] gcnv_tracking_tars = GermlineCNVCallerCaseMode.gcnv_tracking_tar
Array[File] genotyped_intervals_vcf = PostprocessGermlineCNVCalls.genotyped_intervals_vcf
Array[File] genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf
Array[File] qc_status_files = CollectSampleQualityMetrics.qc_status_files
Contributor: Ideally I'd want to be able to flag failing samples in an obvious way in the workspace, like having new fields in the data model called "sample_quality" and "model_quality" with the QC status reported there. Are we violently opposed to having a Cromwell version and a Firecloud version of this WDL? (@LeeTL1220)

Collaborator (Author): I've done what Lee recommended; however, note that I had to make the task process one sample at a time.
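For illustration only, a rough sketch of what a per-sample version of the QC task might look like; the task name, inputs, and runtime block below are hypothetical and not necessarily what the follow-up commit contains:

# Hypothetical per-sample variant of CollectSampleQualityMetrics (illustrative only).
task CollectSampleQualityMetricsSingleSample {
    File genotyped_segments_vcf
    String entity_id
    Int maximum_number_events
    String gatk_docker

    command <<<
        set -e
        NUM_SEGMENTS=$(grep -v '@' ${genotyped_segments_vcf} | wc -l)
        if [ $NUM_SEGMENTS -lt ${maximum_number_events} ]; then
            echo "PASS" > ${entity_id}.qcStatus.txt
        else
            echo "EXCESSIVE_NUMBER_OF_EVENTS" > ${entity_id}.qcStatus.txt
        fi
    >>>

    runtime {
        docker: "${gatk_docker}"
        memory: "1000 MB"
        disks: "local-disk 40 HDD"
    }

    output {
        File qc_status_file = "${entity_id}.qcStatus.txt"
    }
}

A call to this task would presumably sit inside the existing per-sample scatter alongside PostprocessGermlineCNVCalls, so each shard writes its own qcStatus.txt.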

}
}

17 changes: 17 additions & 0 deletions scripts/cnv_wdl/germline/cnv_germline_cohort_workflow.wdl
@@ -326,6 +326,21 @@ workflow CNVGermlineCohortWorkflow {
}
}

call CNVTasks.CollectSampleQualityMetrics {
input:
genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf,
entity_ids = CollectCounts.entity_id,
gatk_docker = gatk_docker,
preemptible_attempts = preemptible_attempts
}

call CNVTasks.CollectModelQualityMetrics {
input:
gcnv_model_tars = GermlineCNVCallerCohortMode.gcnv_model_tar,
gatk_docker = gatk_docker,
preemptible_attempts = preemptible_attempts
}

output {
File preprocessed_intervals = PreprocessIntervals.preprocessed_intervals
Array[File] read_counts_entity_ids = CollectCounts.entity_id
@@ -339,6 +354,8 @@
Array[File] gcnv_tracking_tars = GermlineCNVCallerCohortMode.gcnv_tracking_tar
Array[File] genotyped_intervals_vcfs = PostprocessGermlineCNVCalls.genotyped_intervals_vcf
Array[File] genotyped_segments_vcfs = PostprocessGermlineCNVCalls.genotyped_segments_vcf
Array[File] sample_qc_status_files = CollectSampleQualityMetrics.qc_status_files
File model_qc_status = CollectModelQualityMetrics.qc_status
}
}
