For details about the support for haploid contigs, please read DeepVariant haploid support.
In this case study, we describe how to call variants in non-autosomal regions
like X, Y chromosomes. Then we assess the quality of the DeepVariant variant
calls with hap.py
.
The dataset used in this case-study has following attributes:
Sample: HG002
Region: ChrX, ChrY
Platform: PacBio
Sample Karyotype: X, Y
In this case study, we will use Docker to run DeepVariant for variant calling and hap.py for benchmarking.
If you want to run on GPU machines, or use Singularity
instead of Docker
,
please follow Quick Start documentation.
BASE="${HOME}/XY-walkthrough"
# Set up input and output directory data
INPUT_DIR="${BASE}/input"
OUTPUT_DIR="${BASE}/output"
## Create local directory structure
mkdir -p "${INPUT_DIR}"
mkdir -p "${OUTPUT_DIR}/data"
# Download reference to input directory
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > ${INPUT_DIR}/GRCh38_no_alt_analysis_set.fasta
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > ${INPUT_DIR}/GRCh38_no_alt_analysis_set.fasta.fai
# Download bam file to input directory
HTTPDIR=https://storage.googleapis.com/deepvariant/xy-case-study-testdata
curl ${HTTPDIR}/HG002.pfda_challenge.grch38.chrXY.bam > ${INPUT_DIR}/HG002.pfda_challenge.grch38.chrXY.bam
curl ${HTTPDIR}/HG002.pfda_challenge.grch38.chrXY.bam.bai > ${INPUT_DIR}/HG002.pfda_challenge.grch38.chrXY.bam.bai
HTTPDIR=https://storage.googleapis.com/deepvariant/case-study-testdata
curl ${HTTPDIR}/GRCh38_PAR.bed > ${INPUT_DIR}/GRCh38_PAR.bed
# Set up input variables
REF="GRCh38_no_alt_analysis_set.fasta"
BAM="HG002.pfda_challenge.grch38.chrXY.bam"
THREADS=$(nproc)
REGION="chrX chrY"
HAPLOID_CONTIGS="chrX,chrY"
PAR_BED="GRCh38_PAR.bed"
# Set up output variable
OUTPUT_VCF="HG002_pacbio_hifi.chrXY.output.vcf.gz"
OUTPUT_GVCF="HG002_pacbio_hifi.chrXY.output.g.vcf.gz"
INTERMEDIATE_DIRECTORY="intermediate_results_dir"
mkdir -p "${OUTPUT_DIR}/${INTERMEDIATE_DIRECTORY}"
We will run DeepVariant from docker using the run_deepvariant
script.
BIN_VERSION="1.8.0"
sudo docker pull google/deepvariant:"${BIN_VERSION}"
sudo docker run \
-v "${INPUT_DIR}":"${INPUT_DIR}" \
-v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type PACBIO \
--ref "${INPUT_DIR}/${REF}" \
--reads "${INPUT_DIR}/${BAM}" \
--output_vcf "${OUTPUT_DIR}/${OUTPUT_VCF}" \
--output_gvcf "${OUTPUT_DIR}/${OUTPUT_GVCF}" \
--num_shards "${THREADS}" \
--haploid_contigs "${HAPLOID_CONTIGS}" \
--par_regions_bed "${INPUT_DIR}/${PAR_BED}" \
--regions "${REGION}" \
--intermediate_results_dir "${OUTPUT_DIR}/${INTERMEDIATE_DIRECTORY}"
We will use Genome-in-a-Bottle (GIAB) dataset to evaluate the performance of DeepVariant.
We will benchmark our variant calls against v1.0 of the Genome in a Bottle small variant benchmarks for HG002_chrXY.
FTPDIR=https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/chrXY_v1.0/GRCh38/SmallVariant
curl ${FTPDIR}/HG002_GRCh38_chrXY_smallvar_v1.0.bed > ${INPUT_DIR}/HG002_GRCh38_chrXY_smallvar_v1.0.bed
curl ${FTPDIR}/HG002_GRCh38_chrXY_smallvar_v1.0.vcf.gz > ${INPUT_DIR}/HG002_GRCh38_chrXY_smallvar_v1.0.vcf.gz
curl ${FTPDIR}/HG002_GRCh38_chrXY_smallvar_v1.0.vcf.gz.tbi > ${INPUT_DIR}/HG002_GRCh38_chrXY_smallvar_v1.0.vcf.gz.tbi
TRUTH_VCF="HG002_GRCh38_chrXY_smallvar_v1.0.vcf.gz"
TRUTH_BED="HG002_GRCh38_chrXY_smallvar_v1.0.bed"
sudo docker pull jmcdani20/hap.py:v0.3.12
REGION="chrX,chrY"
sudo docker run \
-v "${INPUT_DIR}":"${INPUT_DIR}" \
-v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
"${INPUT_DIR}/${TRUTH_VCF}" \
"${OUTPUT_DIR}/${OUTPUT_VCF}" \
-f "${INPUT_DIR}/${TRUTH_BED}" \
-r "${INPUT_DIR}/${REF}" \
-o "${OUTPUT_DIR}/hg002.chrXY.happy.output" \
--engine=vcfeval \
--pass-only \
-l "${REGION}"
Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 24273 23289 984 31014 644 6687 49 465 0.959461 0.973527 0.215612 0.966443 NaN NaN 1.559454 0.068240
INDEL PASS 24273 23289 984 31014 644 6687 49 465 0.959461 0.973527 0.215612 0.966443 NaN NaN 1.559454 0.068240
SNP ALL 87443 86918 525 132164 1449 43958 11 234 0.993996 0.983573 0.332602 0.988757 1.937122 1.541799 1.825434 0.046716
SNP PASS 87443 86918 525 132164 1449 43958 11 234 0.993996 0.983573 0.332602 0.988757 1.937122 1.541799 1.825434 0.046716