-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added QC metrics to the Germline CNV workflow #6017
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -453,3 +453,98 @@ task PostprocessGermlineCNVCalls { | |
File genotyped_segments_vcf = genotyped_segments_vcf_filename | ||
} | ||
} | ||
|
||
task CollectSampleQualityMetrics { | ||
Array[File] genotyped_segments_vcf | ||
Array[String] entity_ids | ||
|
||
Int? maximum_number_events = 120 | ||
|
||
# Runtime parameters | ||
String gatk_docker | ||
Int? mem_gb | ||
Int? disk_space_gb | ||
Boolean use_ssd = false | ||
Int? cpu | ||
Int? preemptible_attempts | ||
|
||
Int machine_mem_mb = select_first([mem_gb, 1]) * 1000 | ||
|
||
String dollar = "$" #WDL workaround for using array[@], see https://github.com/broadinstitute/cromwell/issues/1819 | ||
|
||
command <<< | ||
set -e | ||
|
||
genotyped_segments_vcfs_array=(${sep=" " genotyped_segments_vcf}) | ||
entity_ids=(${sep=" " entity_ids}) | ||
for index in ${dollar}{!genotyped_segments_vcfs_array[@]}; do | ||
NUM_SEGMENTS=$(grep -v '@' ${dollar}{genotyped_segments_vcfs_array[$index]} | wc -l) | ||
if [ $NUM_SEGMENTS -lt ${maximum_number_events} ]; then | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How are we defining "event"? If that includes copy-neutral segments then this script is fine, but when I hear "event" I think of DELs/DUPs. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's number of non comment lines in the genotyped segments VCF file, so everything including copy-neutral segments (so there are at least 24 events in each sample) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this could be a point of confusion for users. Can you change |
||
echo "PASS" >> ./${dollar}{entity_ids[$index]}.qcStatus.txt | ||
else | ||
echo "EXCESSIVE_NUMBER_OF_EVENTS" >> ./${dollar}{entity_ids[$index]}.qcStatus.txt | ||
fi | ||
done | ||
>>> | ||
|
||
runtime { | ||
docker: "${gatk_docker}" | ||
memory: machine_mem_mb + " MB" | ||
disks: "local-disk " + select_first([disk_space_gb, 40]) + if use_ssd then " SSD" else " HDD" | ||
cpu: select_first([cpu, 1]) | ||
preemptible: select_first([preemptible_attempts, 5]) | ||
} | ||
|
||
output { | ||
Array[File] qc_status_files = glob("*.qcStatus.txt") | ||
} | ||
} | ||
|
||
task CollectModelQualityMetrics { | ||
Array[File] gcnv_model_tars | ||
|
||
# Runtime parameters | ||
String gatk_docker | ||
Int? mem_gb | ||
Int? disk_space_gb | ||
Boolean use_ssd = false | ||
Int? cpu | ||
Int? preemptible_attempts | ||
|
||
Int machine_mem_mb = select_first([mem_gb, 1]) * 1000 | ||
|
||
String dollar = "$" #WDL workaround for using array[@], see https://github.com/broadinstitute/cromwell/issues/1819 | ||
|
||
command <<< | ||
sed -e | ||
qc_status="PASS" | ||
|
||
gcnv_model_tar_array=(${sep=" " gcnv_model_tars}) | ||
for index in ${dollar}{!gcnv_model_tar_array[@]}; do | ||
gcnv_model_tar=${dollar}{gcnv_model_tar_array[$index]} | ||
mkdir MODEL_$index | ||
tar xzf $gcnv_model_tar -C MODEL_$index | ||
ard_file=MODEL_$index/mu_ard_u_log__.tsv | ||
|
||
#check whether all values for ARD components are negative | ||
NUM_POSITIVE_VALUES=$(awk '{ if (index($0, "@") == 0) {if ($1 > 0.0) {print $1} }}' MODEL_$index/mu_ard_u_log__.tsv | wc -l) | ||
if [ $NUM_POSITIVE_VALUES -eq 0 ]; then | ||
qc_status="ALL_PRINCIPAL_COMPONENTS_USED" | ||
break | ||
fi | ||
done | ||
echo $qc_status >> qcStatus.txt | ||
>>> | ||
|
||
runtime { | ||
docker: "${gatk_docker}" | ||
memory: machine_mem_mb + " MB" | ||
disks: "local-disk " + select_first([disk_space_gb, 40]) + if use_ssd then " SSD" else " HDD" | ||
cpu: select_first([cpu, 1]) | ||
preemptible: select_first([preemptible_attempts, 5]) | ||
} | ||
|
||
output { | ||
File qc_status = "qcStatus.txt" | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -233,6 +233,14 @@ workflow CNVGermlineCaseWorkflow { | |
} | ||
} | ||
|
||
call CNVTasks.CollectSampleQualityMetrics { | ||
input: | ||
genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf, | ||
entity_ids = CollectCounts.entity_id, | ||
gatk_docker = gatk_docker, | ||
preemptible_attempts = preemptible_attempts | ||
} | ||
|
||
output { | ||
File preprocessed_intervals = PreprocessIntervals.preprocessed_intervals | ||
Array[File] read_counts_entity_id = CollectCounts.entity_id | ||
|
@@ -242,6 +250,7 @@ workflow CNVGermlineCaseWorkflow { | |
Array[File] gcnv_tracking_tars = GermlineCNVCallerCaseMode.gcnv_tracking_tar | ||
Array[File] genotyped_intervals_vcf = PostprocessGermlineCNVCalls.genotyped_intervals_vcf | ||
Array[File] genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf | ||
Array[File] qc_status_files = CollectSampleQualityMetrics.qc_status_files | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ideally I'd want to be able to flag failing samples in an obvious way in the workspace, like having new fields in the data model called "sample_quality" and "model_quality" with the QC status reported there. Are we violently opposed to having a Cromwell version and a Firecloud version of this WDL? (@LeeTL1220) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've done what Lee recommended, however note that I had to make the task process one sample at a time |
||
} | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the general use, I think this should be a required argument with no default value, as it will depend heavily on experimental design.
@ldgauthier For production, is it typical to provide PASS/FAIL rather reporting the raw metric and letting users interpret it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point about experimental design, especially when it comes to exome interval lists. For a small panel 60 events might still be outrageous. I would be more comfortable with a default value if there was a way to tie the maximum event number to the interval list. Otherwise I guess we can provide recommendations in a readme somewhere.
This is a little different from production, but we do have some hard pass/fail cutoffs, though those are things like coverage, contamination, and percent chimeric reads, which won't vary based on capture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I moved the
maximum_number_events
argument up to the workflow level input