Adding scripts to liftover gnomAD. Also bugfixes for Funcotator NIO. #5514

Merged
merged 5 commits into from
Dec 17, 2018
567 changes: 567 additions & 0 deletions scripts/funcotator/data_sources/createLiftoverForB37ToHg38.sh

Large diffs are not rendered by default.

56,506 changes: 56,506 additions & 0 deletions scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain

Large diffs are not rendered by default.

@@ -0,0 +1,10 @@
{
"CreateGnomadAlleleFreqTsv.gatk_docker": "broadinstitute/gatk:4.0.11.0",

"CreateGnomadAlleleFreqTsv.gnomAD_file": "gs://gnomad-public/release/2.1/vcf/genomes/gnomad.genomes.r2.1.sites.vcf.bgz",
"CreateGnomadAlleleFreqTsv.out_file_name": "gnomad.genomes.r2.1.sites.alleleFreqs.tsv",

"CreateGnomadAlleleFreqTsv.mem_gb": "128",
"CreateGnomadAlleleFreqTsv.disk_space_gb": "16384",
"CreateGnomadAlleleFreqTsv.boot_disk_size_gb": "100"
}
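As a rough illustration of how the `mem_gb` input above is consumed: the task in createGnomadAlleleFreqTsv.wdl multiplies it by 1024 to get megabytes and subtracts 1024 MB for `command_mem`. The shell sketch below reproduces that arithmetic. The variable names mirror the WDL, but the script itself is only an illustration and is not part of this PR.

```shell
#!/bin/sh
# Illustration only: mirrors the machine_mem / command_mem arithmetic in the WDL task.
mem_gb=128                     # value taken from the inputs JSON
default_ram_mb=$((1024 * 3))   # fallback used when mem_gb is not supplied

if [ -n "$mem_gb" ]; then
    machine_mem=$((mem_gb * 1024))
else
    machine_mem=$default_ram_mb
fi

# Memory left for the command after reserving 1024 MB of headroom.
command_mem=$((machine_mem - 1024))

echo "machine_mem=${machine_mem} MB, command_mem=${command_mem} MB"
```

With `mem_gb` unset, the same logic falls back to the 3072 MB default.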
114 changes: 114 additions & 0 deletions scripts/funcotator/data_sources/gnomAD/createGnomadAlleleFreqTsv.wdl
@@ -0,0 +1,114 @@
# Create a TSV containing genomic position, dbSNP ID, alleles, and the allele frequency from v2.1 of gnomAD (hg19/b37).
#
# Description of inputs:
#
# Required:
# String gatk_docker - GATK Docker image in which to run
# File gnomAD_file - gnomAD VCF file to process
# String out_file_name - Output file name.
#
# Optional:
# File gatk4_jar_override - Override Jar file containing GATK 4. Use this when overriding the docker JAR or when using a backend without docker.
# Int mem_gb - Amount of memory (in GB) to give to the machine running each task in this workflow.
# Int preemptible_attempts - Number of times to allow each task in this workflow to be preempted.
# Int disk_space_gb - Amount of storage disk space (in Gb) to give to each machine running each task in this workflow.
# Int cpu - Number of CPU cores to give to each machine running each task in this workflow.
# Int boot_disk_size_gb - Amount of boot disk space (in Gb) to give to each machine running each task in this workflow.
#
# This WDL needs to decide whether to use the ``gatk_jar`` or ``gatk_jar_override`` for the jar location. As of cromwell-0.24,
# this logic *must* go into each task. Therefore, there is a lot of duplicated code. This allows users to specify a jar file
# independent of what is in the docker file. See the README.md for more info.
#
workflow CreateGnomadAlleleFreqTsv {

File gnomAD_file = "gs://gnomad-public/release/2.1/vcf/genomes/gnomad.genomes.r2.1.sites.vcf.bgz"
String out_file_name

String gatk_docker

Int? mem_gb
Int? preemptible_attempts
Int? disk_space_gb
Int? cpu
Int? boot_disk_size_gb

call CreateGnomadAlleleFreqTsvTask {
input:
gnomAD_file = gnomAD_file,
out_file_name = out_file_name,
gatk_docker = gatk_docker,
mem_gb = mem_gb,
preemptible_attempts = preemptible_attempts,
disk_space_gb = disk_space_gb,
cpu = cpu,
boot_disk_size_gb = boot_disk_size_gb
}

output {
File vcf_file = CreateGnomadAlleleFreqTsvTask.tsvOut
}
}


task CreateGnomadAlleleFreqTsvTask {

File gnomAD_file
String out_file_name

# ------------------------------------------------
# runtime
String gatk_docker
Int? mem_gb
Int? preemptible_attempts
Int? disk_space_gb
Int? cpu
Int? boot_disk_size_gb

# ------------------------------------------------
# Get machine settings:
Boolean use_ssd = false

# You may have to change the following two parameter values depending on the task requirements
Int default_ram_mb = 1024 * 3
# WARNING: In the workflow, you should calculate the disk space as an input to this task (disk_space_gb). Please see [TODO: Link from Jose] for examples.
Int default_disk_space_gb = 100

Int default_boot_disk_size_gb = 15

# Mem is in units of GB but our command and memory runtime values are in MB
Int machine_mem = if defined(mem_gb) then mem_gb * 1024 else default_ram_mb
Int command_mem = machine_mem - 1024

# ------------------------------------------------
# Run our command:
command <<<
set -e

startTime=`date +%s.%N`
echo "StartTime: $startTime" > timingInformation.txt

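# The gnomAD sites file is bgzip-compressed, so decompress it before filtering.
# Fields are written tab-separated so the output is a real TSV (the \t escape requires GNU sed).
gzip -dc ${gnomAD_file} | sed 's#^\([0-9X]*\)\t\([0-9]*\)\t\(.*\)\t\([ATGCN]*\)\t\([ATGCN,]*\)\t.*;AF=\([e0-9\.+\-]*\);.*#\1\t\2\t\3\t\4\t\5\t\6#g' > ${out_file_name}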

endTime=`date +%s.%N`
echo "EndTime: $endTime" >> timingInformation.txt
elapsedTime=`echo "scale=5;$endTime - $startTime" | bc`
echo "Elapsed Time: $elapsedTime" >> timingInformation.txt
>>>

# ------------------------------------------------
# Runtime settings:
runtime {
docker: gatk_docker
memory: machine_mem + " MB"
disks: "local-disk " + select_first([disk_space_gb, default_disk_space_gb]) + if use_ssd then " SSD" else " HDD"
bootDiskSizeGb: select_first([boot_disk_size_gb, default_boot_disk_size_gb])
preemptible: 0
cpu: select_first([cpu, 1])
}

# ------------------------------------------------
# Outputs:
output {
File tsvOut = "${out_file_name}"
}
}
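To make the `sed` expression in the task easier to follow, here is a minimal shell sketch that runs it on a single made-up record. The helper name `extract_allele_freqs` and the sample values are assumptions for illustration. Unlike the task, this variant also drops header lines and prints only matching records (`-n` with the `p` flag). It assumes GNU sed, which understands `\t`.

```shell
#!/bin/sh
# Hypothetical helper: read VCF records on stdin and emit
# CHROM, POS, ID, REF, ALT and AF as tab-separated columns.
extract_allele_freqs() {
    # Drop header lines, then print only the records where the substitution matched.
    sed -n -e '/^#/d' \
        -e 's#^\([0-9X]*\)\t\([0-9]*\)\t\(.*\)\t\([ATGCN]*\)\t\([ATGCN,]*\)\t.*;AF=\([e0-9\.+\-]*\);.*#\1\t\2\t\3\t\4\t\5\t\6#p'
}

# One made-up record in a gnomAD-like column layout (not real data).
sample_record=$(printf '1\t10067\trs1234\tT\tTA\t30.35\tPASS\tAC=3;AN=1000;AF=3.00000e-03;n_alt_alleles=1')

result=$(printf '##fileformat=VCFv4.2\n%s\n' "$sample_record" | extract_allele_freqs)
printf '%s\n' "$result"
```

Note that the expression only matches when `AF=` is preceded and followed by a semicolon in the INFO column, so records with `AF` as the first or last INFO key would not be extracted.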
111 changes: 111 additions & 0 deletions scripts/funcotator/data_sources/gnomAD/gatherVcfsCloud.wdl
@@ -0,0 +1,111 @@
# Run GatherVcfsCloud on a list of VCF files.
#
# Description of inputs:
#
# Required:
# gatk_docker - GATK Docker image in which to run
# variant_vcfs - Array of Variant Call Format (VCF) files containing the variants.
# output_vcf_file_name - Desired name of the resulting VCF output file.
# output_vcf_index_name - Desired name of the resulting VCF index output file.
#
# Optional:
# File gatk4_jar_override - Override Jar file containing GATK 4. Use this when overriding the docker JAR or when using a backend without docker.
# Int mem - Amount of memory (in GB) to give to the machine running each task in this workflow.
# Int preemptible_attempts - Number of times to allow each task in this workflow to be preempted.
# Int disk_space_gb - Amount of storage disk space (in Gb) to give to each machine running each task in this workflow.
# Int cpu - Number of CPU cores to give to each machine running each task in this workflow.
# Int boot_disk_size_gb - Amount of boot disk space (in Gb) to give to each machine running each task in this workflow.
#
# This WDL needs to decide whether to use the ``gatk_jar`` or ``gatk_jar_override`` for the jar location. As of cromwell-0.24,
# this logic *must* go into each task. Therefore, there is a lot of duplicated code. This allows users to specify a jar file
# independent of what is in the docker file. See the README.md for more info.
#
workflow GatherVcfsCloudWorkflow {
String gatk_docker
Array[File] variant_vcfs
String output_vcf_file_name
String output_vcf_index_name

File? gatk4_jar_override
Int? mem
Int? preemptible_attempts
Int? disk_space_gb
Int? cpu
Int? boot_disk_size_gb

call GatherVcfsCloud {
input:
input_vcfs = variant_vcfs,
output_vcf_file = output_vcf_file_name,
output_vcf_index = output_vcf_index_name,
gatk_docker = gatk_docker,
gatk_override = gatk4_jar_override,
mem = mem,
preemptible_attempts = preemptible_attempts,
disk_space_gb = disk_space_gb,
cpu = cpu,
boot_disk_size_gb = boot_disk_size_gb
}

output {
File vcf_file = GatherVcfsCloud.vcf_file
File vcf_index = GatherVcfsCloud.vcf_index
}
}


task GatherVcfsCloud {
# inputs
Array[File] input_vcfs

# outputs
String output_vcf_file
String output_vcf_index

# runtime
String gatk_docker
File? gatk_override
Int? mem
Int? preemptible_attempts
Int? disk_space_gb
Int? cpu
Int? boot_disk_size_gb

Boolean use_ssd = false

# You may have to change the following two parameter values depending on the task requirements
Int default_ram_mb = 3000
# WARNING: In the workflow, you should calculate the disk space as an input to this task (disk_space_gb). Please see [TODO: Link from Jose] for examples.
Int default_disk_space_gb = 100

Int default_boot_disk_size_gb = 15

# Mem is in units of GB but our command and memory runtime values are in MB
Int machine_mem = if defined(mem) then mem *1000 else default_ram_mb
Int command_mem = machine_mem - 1000

command <<<
set -e
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}

gatk --java-options "-Xmx${command_mem}m -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
GatherVcfsCloud \
--create-output-variant-index true \
-I ${sep=' -I ' input_vcfs} \
-O ${output_vcf_file}
>>>

runtime {
docker: gatk_docker
memory: machine_mem + " MB"
disks: "local-disk " + select_first([disk_space_gb, default_disk_space_gb]) + if use_ssd then " SSD" else " HDD"
bootDiskSizeGb: select_first([boot_disk_size_gb, default_boot_disk_size_gb])
preemptible: 0
cpu: select_first([cpu, 1])
}

output {
File vcf_file = "${output_vcf_file}"
File vcf_index = "${output_vcf_index}"
}
}
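The `-I ${sep=' -I ' input_vcfs}` placeholder in the command joins the input array with ` -I `, so every file ends up with its own `-I` flag. A minimal shell sketch of the same expansion, using made-up file names and only echoing the command line rather than running GATK:

```shell
#!/bin/sh
# Made-up input names standing in for the localized input_vcfs array.
input_vcfs="chr1.vcf.gz chr2.vcf.gz chrX.vcf.gz"

# Build "-I a -I b -I c", which is what the WDL sep placeholder renders.
gather_args=""
for vcf in $input_vcfs; do
    gather_args="$gather_args -I $vcf"
done
gather_args=${gather_args# }   # strip the leading space

# Show the resulting command line without executing it.
echo "gatk GatherVcfsCloud --create-output-variant-index true $gather_args -O merged.vcf.gz"
```

This simple loop assumes file names without whitespace.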
102 changes: 102 additions & 0 deletions scripts/funcotator/data_sources/gnomAD/indexFeatureFile.wdl
@@ -0,0 +1,102 @@
# Run IndexFeatureFile on a VCF file.
#
# Description of inputs:
#
# Required:
# String gatk_docker - GATK Docker image in which to run
# Array[File] variant_vcfs - Array of Variant Call Format (VCF) files containing the variants to index.
#
# Optional:
# File gatk4_jar_override - Override Jar file containing GATK 4. Use this when overriding the docker JAR or when using a backend without docker.
# Int mem - Amount of memory (in GB) to give to the machine running each task in this workflow.
# Int preemptible_attempts - Number of times to allow each task in this workflow to be preempted.
# Int disk_space_gb - Amount of storage disk space (in Gb) to give to each machine running each task in this workflow.
# Int cpu - Number of CPU cores to give to each machine running each task in this workflow.
# Int boot_disk_size_gb - Amount of boot disk space (in Gb) to give to each machine running each task in this workflow.
#
# This WDL needs to decide whether to use the ``gatk_jar`` or ``gatk_jar_override`` for the jar location. As of cromwell-0.24,
# this logic *must* go into each task. Therefore, there is a lot of duplicated code. This allows users to specify a jar file
# independent of what is in the docker file. See the README.md for more info.
#
workflow IndexFeatureFile {
String gatk_docker
Array[File] variant_vcfs

File? gatk4_jar_override
Int? mem
Int? preemptible_attempts
Int? disk_space_gb
Int? cpu
Int? boot_disk_size_gb

scatter ( vcf in variant_vcfs ) {
call DoIndex {
input:
input_vcf = vcf,
gatk_docker = gatk_docker,
gatk_override = gatk4_jar_override,
mem = mem,
preemptible_attempts = preemptible_attempts,
disk_space_gb = disk_space_gb,
cpu = cpu,
boot_disk_size_gb = boot_disk_size_gb
}
}

output {
Array[File] vcf_out_idxs = DoIndex.vcf_index
}
}


task DoIndex {
# inputs
File input_vcf

# runtime
String gatk_docker
File? gatk_override
Int? mem
Int? preemptible_attempts
Int? disk_space_gb
Int? cpu
Int? boot_disk_size_gb

String index_format = if sub(input_vcf, ".*\\.", "") == "vcf" then "idx" else "tbi"

Boolean use_ssd = false

# You may have to change the following two parameter values depending on the task requirements
Int default_ram_mb = 3000
# WARNING: In the workflow, you should calculate the disk space as an input to this task (disk_space_gb). Please see [TODO: Link from Jose] for examples.
Int default_disk_space_gb = 100

Int default_boot_disk_size_gb = 15

# Mem is in units of GB but our command and memory runtime values are in MB
Int machine_mem = if defined(mem) then mem *1000 else default_ram_mb
Int command_mem = machine_mem - 1000

command <<<
set -e
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}

gatk --java-options "-Xmx${command_mem}m -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
IndexFeatureFile \
-F ${input_vcf}
>>>

runtime {
docker: gatk_docker
memory: machine_mem + " MB"
disks: "local-disk " + select_first([disk_space_gb, default_disk_space_gb]) + if use_ssd then " SSD" else " HDD"
bootDiskSizeGb: select_first([boot_disk_size_gb, default_boot_disk_size_gb])
preemptible: 0
cpu: select_first([cpu, 1])
}

output {
File vcf_index = "${input_vcf}.${index_format}"
}
}
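The `index_format` expression in `DoIndex` strips everything up to the last dot and picks `idx` for a plain `.vcf` and `tbi` for anything else (for example a block-compressed `.vcf.gz`). A minimal shell sketch of the same rule follows. The function name `index_path_for` and the file names are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: return the index path the task expects for a given VCF.
index_path_for() {
    # Keep only the text after the last dot, like sub(input_vcf, ".*\\.", "").
    ext=${1##*.}
    if [ "$ext" = "vcf" ]; then
        echo "$1.idx"
    else
        echo "$1.tbi"
    fi
}

plain_index=$(index_path_for "sample.vcf")
gz_index=$(index_path_for "sample.vcf.gz")
echo "$plain_index"
echo "$gz_index"
```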
@@ -0,0 +1,9 @@
{
"IndexFeatureFile.gatk_docker": "broadinstitute/gatk:4.0.11.0",

"IndexFeatureFile.variant_vcfs": [ "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr1.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr10.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr11.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr12.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr13.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr14.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr15.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr16.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr17.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr18.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr19.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr2.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr20.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr21.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr22.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr3.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr4.liftoverToHg38.vcf.gz", 
"gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr5.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr6.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr7.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr8.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chr9.liftoverToHg38.vcf.gz", "gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38/gnomad.genomes.r2.1.sites.chrX.liftoverToHg38.vcf.gz" ],

"IndexFeatureFile.mem": "128",
"IndexFeatureFile.disk_space_gb": "2048",
"IndexFeatureFile.boot_disk_size_gb": "100"
}
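The per-chromosome list in `IndexFeatureFile.variant_vcfs` follows a regular naming pattern, so it can be generated rather than typed by hand. The sketch below is one way to do that in shell. It reuses the bucket path and file-name pattern from the JSON above, emits the entries in numeric order (the JSON lists them lexicographically), and is an illustration rather than a script from this PR.

```shell
#!/bin/sh
# Bucket path and file-name pattern copied from the inputs JSON.
bucket="gs://broad-dsde-methods-jonn/gnomAD_2.1_Liftover_hg38"

# Build the chromosome names: 1..22 followed by X.
chroms=""
i=1
while [ "$i" -le 22 ]; do
    chroms="$chroms $i"
    i=$((i + 1))
done
chroms="$chroms X"

# One path per line, each terminated by a newline.
vcf_list=""
for chrom in $chroms; do
    vcf_list="${vcf_list}${bucket}/gnomad.genomes.r2.1.sites.chr${chrom}.liftoverToHg38.vcf.gz
"
done

vcf_count=$(printf '%s' "$vcf_list" | wc -l | tr -d ' ')
first_vcf=$(printf '%s' "$vcf_list" | head -n 1)
printf '%s' "$vcf_list"
```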
@@ -0,0 +1,9 @@
{
"IndexFeatureFile.gatk_docker": "broadinstitute/gatk:4.0.11.0",

"IndexFeatureFile.variant_vcfs": [ "gs://broad-dsde-methods/cromwell-execution-36/MergeVcfsWorkflow/dfe2553a-fe37-4717-9d53-56054708c564/call-MergeVcfs/gnomad.genomes.r2.1.sites.liftoverToHg38.vcf.gz" ],

"IndexFeatureFile.mem": "128",
"IndexFeatureFile.disk_space_gb": "2048",
"IndexFeatureFile.boot_disk_size_gb": "100"
}