Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added QC metrics to the Germline CNV workflow #6017

Merged
merged 2 commits into from
Aug 6, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@
"CNVGermlineCaseScatteredWorkflow.gcnv_max_advi_iter_subsequent_epochs": 1,
"CNVGermlineCaseScatteredWorkflow.gcnv_max_copy_number": 3,
"CNVGermlineCaseScatteredWorkflow.gcnv_max_calling_iters": 1,
"CNVGermlineCaseScatteredWorkflow.maximum_number_events_per_sample": 120,
"CNVGermlineCaseScatteredWorkflow.ref_fasta_dict": "/home/travis/build/broadinstitute/gatk/src/test/resources/large/cnv_germline_workflows_test_files/Homo_sapiens_assembly19.truncated.dict",
"CNVGermlineCaseScatteredWorkflow.ref_fasta_fai": "/home/travis/build/broadinstitute/gatk/src/test/resources/large/cnv_germline_workflows_test_files/Homo_sapiens_assembly19.truncated.fasta.fai",
"CNVGermlineCaseScatteredWorkflow.ref_fasta": "/home/travis/build/broadinstitute/gatk/src/test/resources/large/cnv_germline_workflows_test_files/Homo_sapiens_assembly19.truncated.fasta"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -34,5 +34,6 @@
"CNVGermlineCohortWorkflow.maximum_mappability": 1.0,
"CNVGermlineCohortWorkflow.minimum_segmental_duplication_content": 0.0,
"CNVGermlineCohortWorkflow.maximum_segmental_duplication_content": 1.0,
"CNVGermlineCohortWorkflow.low_count_filter_count_threshold": 0
"CNVGermlineCohortWorkflow.low_count_filter_count_threshold": 0,
"CNVGermlineCohortWorkflow.maximum_number_events_per_sample": 120
}
87 changes: 87 additions & 0 deletions scripts/cnv_wdl/cnv_common_tasks.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -453,3 +453,90 @@ task PostprocessGermlineCNVCalls {
File genotyped_segments_vcf = genotyped_segments_vcf_filename
}
}

task CollectSampleQualityMetrics {
File genotyped_segments_vcf
String entity_id
Int maximum_number_events

# Runtime parameters
Int? mem_gb
Int? disk_space_gb
Boolean use_ssd = false
Int? cpu
Int? preemptible_attempts

Int machine_mem_mb = select_first([mem_gb, 1]) * 1000

String dollar = "$" #WDL workaround for using array[@], see https://github.com/broadinstitute/cromwell/issues/1819

command <<<
set -e
NUM_SEGMENTS=$(gunzip -c ${genotyped_segments_vcf} | grep -v '#' | wc -l)
if [ $NUM_SEGMENTS -lt ${maximum_number_events} ]; then
echo "PASS" >> ${entity_id}.qcStatus.txt
else
echo "EXCESSIVE_NUMBER_OF_EVENTS" >> ${entity_id}.qcStatus.txt
fi
>>>

runtime {
memory: machine_mem_mb + " MB"
disks: "local-disk " + select_first([disk_space_gb, 20]) + if use_ssd then " SSD" else " HDD"
cpu: select_first([cpu, 1])
preemptible: select_first([preemptible_attempts, 5])
}

output {
File qc_status_file = "${entity_id}.qcStatus.txt"
String qc_status_string = read_string("${entity_id}.qcStatus.txt")
}
}

task CollectModelQualityMetrics {
Array[File] gcnv_model_tars

# Runtime parameters
Int? mem_gb
Int? disk_space_gb
Boolean use_ssd = false
Int? cpu
Int? preemptible_attempts

Int machine_mem_mb = select_first([mem_gb, 1]) * 1000

String dollar = "$" #WDL workaround for using array[@], see https://github.com/broadinstitute/cromwell/issues/1819

command <<<
sed -e
qc_status="PASS"

gcnv_model_tar_array=(${sep=" " gcnv_model_tars})
for index in ${dollar}{!gcnv_model_tar_array[@]}; do
gcnv_model_tar=${dollar}{gcnv_model_tar_array[$index]}
mkdir MODEL_$index
tar xzf $gcnv_model_tar -C MODEL_$index
ard_file=MODEL_$index/mu_ard_u_log__.tsv

#check whether all values for ARD components are negative
NUM_POSITIVE_VALUES=$(awk '{ if (index($0, "@") == 0) {if ($1 > 0.0) {print $1} }}' MODEL_$index/mu_ard_u_log__.tsv | wc -l)
if [ $NUM_POSITIVE_VALUES -eq 0 ]; then
qc_status="ALL_PRINCIPAL_COMPONENTS_USED"
break
fi
done
echo $qc_status >> qcStatus.txt
>>>

runtime {
memory: machine_mem_mb + " MB"
disks: "local-disk " + select_first([disk_space_gb, 40]) + if use_ssd then " SSD" else " HDD"
cpu: select_first([cpu, 1])
preemptible: select_first([preemptible_attempts, 5])
}

output {
File qc_status_file = "qcStatus.txt"
String qc_status_string = read_string("qcStatus.txt")
}
}
2 changes: 2 additions & 0 deletions scripts/cnv_wdl/germline/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ The reference used must be the same between PoN and case samples.
- ``CNVGermlineCohortWorkflow.ref_fasta_dict`` -- Path to reference dict file.
- ``CNVGermlineCohortWorkflow.ref_fasta_fai`` -- Path to reference fasta fai file.
- ``CNVGermlineCohortWorkflow.ref_fasta`` -- Path to reference fasta file.
- ``CNVGermlineCohortWorkflow.maximum_number_events_per_sample`` -- Maximum number of events threshold for doing sample QC (recommended for WES is ~100)

In additional, there are optional workflow-level and task-level parameters that may be set by advanced users; for example:

Expand All @@ -49,6 +50,7 @@ The reference, number of intervals per scatter, and bins (if specified) must be
- ``CNVGermlineCaseWorkflow.ref_fasta_dict`` -- Path to reference dict file.
- ``CNVGermlineCaseWorkflow.ref_fasta_fai`` -- Path to reference fasta fai file.
- ``CNVGermlineCaseWorkflow.ref_fasta`` -- Path to reference fasta file.
- ``CNVGermlineCohortWorkflow.maximum_number_events_per_sample`` -- Maximum number of events threshold for doing sample QC (recommended for WES is ~100)

In additional, there are several task-level parameters that may be set by advanced users as above.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,11 @@ workflow CNVGermlineCaseScatteredWorkflow {
Int ref_copy_number_autosomal_contigs
Array[String]? allosomal_contigs

##########################
#### arguments for QC ####
##########################
Int maximum_number_events_per_sample

call SplitInputArray as SplitInputBamsList {
input:
input_array = normal_bams,
Expand Down Expand Up @@ -178,7 +183,8 @@ workflow CNVGermlineCaseScatteredWorkflow {
gcnv_caller_external_admixing_rate = gcnv_caller_external_admixing_rate,
gcnv_disable_annealing = gcnv_disable_annealing,
ref_copy_number_autosomal_contigs = ref_copy_number_autosomal_contigs,
allosomal_contigs = allosomal_contigs
allosomal_contigs = allosomal_contigs,
maximum_number_events_per_sample = maximum_number_events_per_sample
}
}

Expand All @@ -191,6 +197,8 @@ workflow CNVGermlineCaseScatteredWorkflow {
Array[Array[File]] gcnv_tracking_tars = CNVGermlineCaseWorkflow.gcnv_tracking_tars
Array[Array[File]] genotyped_intervals_vcf = CNVGermlineCaseWorkflow.genotyped_intervals_vcf
Array[Array[File]] genotyped_segments_vcf = CNVGermlineCaseWorkflow.genotyped_segments_vcf
Array[Array[File]] qc_status_files = CNVGermlineCaseWorkflow.qc_status_files
Array[Array[String]] qc_status_strings = CNVGermlineCaseWorkflow.qc_status_strings
}
}

Expand Down
15 changes: 15 additions & 0 deletions scripts/cnv_wdl/germline/cnv_germline_case_workflow.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,11 @@ workflow CNVGermlineCaseWorkflow {
Int ref_copy_number_autosomal_contigs
Array[String]? allosomal_contigs

##########################
#### arguments for QC ####
##########################
Int maximum_number_events_per_sample

Array[Pair[String, String]] normal_bams_and_bais = zip(normal_bams, normal_bais)

call CNVTasks.PreprocessIntervals {
Expand Down Expand Up @@ -231,6 +236,14 @@ workflow CNVGermlineCaseWorkflow {
gatk_docker = gatk_docker,
preemptible_attempts = preemptible_attempts
}

call CNVTasks.CollectSampleQualityMetrics {
input:
genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf,
entity_id = CollectCounts.entity_id[sample_index],
maximum_number_events = maximum_number_events_per_sample,
preemptible_attempts = preemptible_attempts
}
}

output {
Expand All @@ -242,6 +255,8 @@ workflow CNVGermlineCaseWorkflow {
Array[File] gcnv_tracking_tars = GermlineCNVCallerCaseMode.gcnv_tracking_tar
Array[File] genotyped_intervals_vcf = PostprocessGermlineCNVCalls.genotyped_intervals_vcf
Array[File] genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf
Array[File] qc_status_files = CollectSampleQualityMetrics.qc_status_file
Array[String] qc_status_strings = CollectSampleQualityMetrics.qc_status_string
}
}

Expand Down
22 changes: 22 additions & 0 deletions scripts/cnv_wdl/germline/cnv_germline_cohort_workflow.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -150,6 +150,11 @@ workflow CNVGermlineCohortWorkflow {
Int ref_copy_number_autosomal_contigs
Array[String]? allosomal_contigs

##########################
#### arguments for QC ####
##########################
Int maximum_number_events_per_sample

Array[Pair[String, String]] normal_bams_and_bais = zip(normal_bams, normal_bais)

call CNVTasks.PreprocessIntervals {
Expand Down Expand Up @@ -324,6 +329,20 @@ workflow CNVGermlineCohortWorkflow {
gatk_docker = gatk_docker,
preemptible_attempts = preemptible_attempts
}

call CNVTasks.CollectSampleQualityMetrics {
input:
genotyped_segments_vcf = PostprocessGermlineCNVCalls.genotyped_segments_vcf,
entity_id = CollectCounts.entity_id[sample_index],
maximum_number_events = maximum_number_events_per_sample,
preemptible_attempts = preemptible_attempts
}
}

call CNVTasks.CollectModelQualityMetrics {
input:
gcnv_model_tars = GermlineCNVCallerCohortMode.gcnv_model_tar,
preemptible_attempts = preemptible_attempts
}

output {
Expand All @@ -339,6 +358,9 @@ workflow CNVGermlineCohortWorkflow {
Array[File] gcnv_tracking_tars = GermlineCNVCallerCohortMode.gcnv_tracking_tar
Array[File] genotyped_intervals_vcfs = PostprocessGermlineCNVCalls.genotyped_intervals_vcf
Array[File] genotyped_segments_vcfs = PostprocessGermlineCNVCalls.genotyped_segments_vcf
Array[File] sample_qc_status_files = CollectSampleQualityMetrics.qc_status_file
File model_qc_status_file = CollectModelQualityMetrics.qc_status_file
String model_qc_string = CollectModelQualityMetrics.qc_status_string
}
}

Expand Down