TSPS-326 add wdl to mask or subset a vcf given a bed file (#136)
Co-authored-by: Jose Soto <jsoto@broadinstitute.org>
jsotobroad and Jose Soto authored Sep 25, 2024
1 parent 20ff8be commit 7a5039e
Showing 4 changed files with 89 additions and 1 deletion.
7 changes: 7 additions & 0 deletions .dockstore.yml
@@ -35,3 +35,10 @@ workflows:
    authors:
      - name: Terra Scientific Services
        email: teaspoons-developers@broadinstitute.org

  - subclass: WDL
    name: SubsetVcfByBedFile
    primaryDescriptorPath: /pipelines/imputation/simulatedData/SubsetVcfByBedFile.wdl
    authors:
      - name: Terra Scientific Services
        email: teaspoons-developers@broadinstitute.org
20 changes: 20 additions & 0 deletions pipelines/imputation/scientificValidation/README.md
@@ -13,3 +13,23 @@ This wdl is basically a wrapper around that tool/image
#### Outputs
* recombined_reference_panel - output vcf after mitigation
algorithm has been run


## SubsetVcfByBedFile
### Purpose
This WDL subsets a VCF down to the sites listed in a
BED file. It does not modify headers or annotations:
the only truly required header is the sequence
dictionary, which is carried over unchanged, and the
imputation tool reads only GT and no other INFO/FORMAT
fields, so the rest can be left as is without
affecting imputation.
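
To make "subset down to sites in a bed file" concrete, here is a minimal Python sketch of the interval test. It is not the workflow's implementation (the task calls `bcftools view -R`); it only illustrates the coordinate convention, assuming BED intervals are 0-based and half-open while VCF positions are 1-based. It also looks only at each record's POS, whereas `bcftools view -R` can additionally match records that overlap a region. `load_bed` and `site_in_bed` are hypothetical helper names.

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals[chrom].append((int(start), int(end)))
    return intervals


def site_in_bed(chrom, pos, intervals):
    """Return True if a 1-based VCF position falls inside any BED interval."""
    # BED [start, end) covers the 1-based positions start + 1 through end.
    return any(start < pos <= end for start, end in intervals.get(chrom, []))


bed = load_bed(["chr1\t100\t200", "chr2\t0\t50"])
print(site_in_bed("chr1", 100, bed))  # position just before the interval
print(site_in_bed("chr1", 101, bed))  # first position the interval covers
print(site_in_bed("chr1", 200, bed))  # last position the interval covers
```

The off-by-one at the interval start is the usual source of surprises when a BED file is built from 1-based site lists.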

#### Inputs
* input_vcf - input file to be subset
* input_vcf_index - index for input_vcf
* bed_file - bed file containing intervals to subset by

#### Outputs
* subset_vcf - vcf subset to the sites in bed_file
* subset_vcf_index - tabix index (.tbi) for subset_vcf
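
A run needs the three inputs listed above. The following Python sketch builds an inputs JSON using the `WorkflowName.input_name` key form that Cromwell-style runners expect. The `gs://` paths are placeholders for illustration, not files from this repository, and `build_inputs` is a hypothetical helper name.

```python
import json


def build_inputs(input_vcf, input_vcf_index, bed_file):
    """Return the workflow inputs keyed as SubsetVcfByBedFile.<input name>."""
    return {
        "SubsetVcfByBedFile.input_vcf": input_vcf,
        "SubsetVcfByBedFile.input_vcf_index": input_vcf_index,
        "SubsetVcfByBedFile.bed_file": bed_file,
    }


# Placeholder paths (assumptions for illustration only).
inputs = build_inputs(
    "gs://example-bucket/sample.vcf.gz",
    "gs://example-bucket/sample.vcf.gz.tbi",
    "gs://example-bucket/sites.bed",
)

inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

The printed JSON can be saved to a file and passed to whichever runner is used to launch the workflow.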
@@ -31,7 +31,7 @@ task ReshapeReferencePanel {

      Int disk_size_gb = ceil(3*size(ref_panel_vcf, "GiB")) + 20
      Int cpu = 1
-     Int memory_mb = 8000
+     Int memory_mb = 6000
    }

    command {
61 changes: 61 additions & 0 deletions pipelines/imputation/scientificValidation/SubsetVcfByBedFile.wdl
@@ -0,0 +1,61 @@
version 1.0

# This script is under review. It is not actively tested or maintained at this time.
workflow SubsetVcfByBedFile {
    input {
        File input_vcf
        File input_vcf_index
        File bed_file
    }

    call BcftoolsSubsetVcf {
        input:
            input_vcf = input_vcf,
            input_vcf_index = input_vcf_index,
            bed_file = bed_file
    }

    output {
        File subset_vcf = BcftoolsSubsetVcf.output_vcf
        File subset_vcf_index = BcftoolsSubsetVcf.output_vcf_index
    }
}

task BcftoolsSubsetVcf {
    input {
        File input_vcf
        File input_vcf_index
        File bed_file

        Int disk_size_gb = ceil(3 * size(input_vcf, "GiB")) + 20
        Int cpu = 1
        Int memory_mb = 6000
    }

    String basename = basename(input_vcf, '.vcf.gz')

    command {
        set -e -o pipefail

        bcftools view \
            -R ~{bed_file} \
            -O z \
            -o ~{basename}.subset.vcf.gz \
            ~{input_vcf}

        bcftools index -t ~{basename}.subset.vcf.gz

    }

    output {
        File output_vcf = "~{basename}.subset.vcf.gz"
        File output_vcf_index = "~{basename}.subset.vcf.gz.tbi"
    }

    runtime {
        docker: "us.gcr.io/broad-gatk/gatk:4.5.0.0"
        disks: "local-disk ${disk_size_gb} HDD"
        memory: "${memory_mb} MiB"
        cpu: cpu
    }
}
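
The task's defaults can be checked by hand. The Python sketch below mirrors two expressions from the WDL above: the disk request `ceil(3 * size(input_vcf, "GiB")) + 20` and the output names derived from `basename(input_vcf, '.vcf.gz')`. The function names and the example path are hypothetical, and WDL's `size` and `basename` are approximated here with byte arithmetic and `os.path.basename`.

```python
import math
import os

GIB = 1024 ** 3  # bytes per GiB


def disk_size_gb(vcf_size_bytes):
    """Mirror ceil(3 * size(input_vcf, "GiB")) + 20 from the task inputs."""
    return math.ceil(3 * vcf_size_bytes / GIB) + 20


def output_names(input_vcf):
    """Mirror basename(input_vcf, '.vcf.gz') plus the suffixes the task appends."""
    name = os.path.basename(input_vcf)
    # The suffix is stripped only when the file name actually ends with it.
    if name.endswith(".vcf.gz"):
        name = name[: -len(".vcf.gz")]
    subset_vcf = f"{name}.subset.vcf.gz"
    return subset_vcf, f"{subset_vcf}.tbi"


print(disk_size_gb(5 * GIB))
print(output_names("/data/panel.chr20.vcf.gz"))
```

Under this reading, an input whose name does not end in `.vcf.gz` (for example an uncompressed `.vcf`) keeps its full name as the prefix, so the output name would contain the original extension as well.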
