Skip to content

simonsfoundation/pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

pipeline

Overview

pipeline is a computational engine for genetic variant detection in a single sample, or in a batch of samples. Almost every step in the pipeline is done via a Makefile (GNU make). These makefiles can be used on their own to accomplish common bioinformatics operations, or they can be used together in a shell script to compose a pipeline.

pipeline is well suited for processing large number of familiar cohorts, and has been deployed on 462 families from SPARK collection at Simons Foundation.

From BAM files to de novo germline mutations

  • Input: BAM file(s)
  • Optionally process BAM files according to GATK best practices
  • Compute callable regions, and subdivide genome into bins of approximately equal size for parallelization
  • Call variants with a choice of GATK HaplotypeCaller in GVCF mode, Freebayes, Platypus
  • Apply GATK variant recalibration
  • Apply hard variant filters

Validation was done against CEUTrio, NA12878

Dependencies

Besides Python 2.7, JDK, GNU make, the following packages are required

  1. GATK
  2. freebayes
  3. platypus
  4. piccard
  5. samtools
  6. sambamba
  7. bedtools
  8. bedops
  9. bgzip, tabix
  10. bcftools
  11. vcflib
  12. SnpEff

All (except GATK) can be installed with an install of bcbio-nextgen.

Installation

cd ~
git clone https://github.com/simonsfoundation/pipeline.git

Running

To setup the configuration for the pipeline, you need to edit include_example.mk file that defines Makefile variables.

The pipeline is run as follows:

~/pipeline/ppln/pipe03.sh     \
/path/to/input/bams/       \ #dir with bam file(s)
/path/to/output/dir        \ #will be created for final output, metrics, log. 
familycode                 \ # common prefix for a group of BAM files (familycode*.bam): family, batch etc;If your BAM files do not have a common prefix, create one via symbolic links
WG                         \ #binning method EX, WG(recommended)
0                          \ #set to 0; if set to 1, pipeline will use existing regions (testing purposes)
tmp                        \ #if "tmp", work in /tmp, else work in output dir
~/pipeline/ppln/include_example.mk \ #makefile with variable definition
YES                         \ #if YES/NO - delete/don't delete intermediate files
,Reorder,FixGroups,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,Freebayes,Platypus,HaplotypeCallerGVCF, \
1                          \#if 1, remove working dir on exit
/path/to/pipeline/ppln     \
20                  \#max number of physical cpu cores to utilize
all  \ # 1-12 if process only region defined in /ppln/data, all - work on full file 
NO    # YES/NO delete/not delete input bam files

Cluster environments

  • Grid Engine
  • Slurm

Submitting via sbatch

sbatch -N1 --exclusive -J batch2 -e batch2.err -o batch2.out --wrap="/mnt/xfs1/home/asalomatov/projects/pipeline/ppln/pipe03.sh /mnt/xfs1/scratch/asalomatov/data/SPARK/bam/batch_2 /mnt/xfs1/scratch/asalomatov/data/SPARK/vars/b2/all batch2 WG 0 work /mnt/xfs1/home/asalomatov/projects/pipeline/ppln/include_example.mk YES ,FixGroups,HaplotypeCallerGVCF,Platypus,Freebayes, 0 /mnt/xfs1/home/asalomatov/projects/pipeline/ppln 25 all NO"

Validation

(freebayes tutorial, bcbio blog post)

NA12878

  1. Download chromosome 20 high coverage bam file, Broad Institute's truth set, and NIST Genome in a Bottle target regions:
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130103_high_cov_trio_bams/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130103_high_cov_trio_bams/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai
wget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz
wget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz.tbi
wget -O NA12878-callable.bed.gz ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST/union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed.gz
  1. Run the pipeline:
sbatch -J NA12878 -N 1 --exclusive ~/pipeline/ppln/pipe03.sh ./ ./NA12878 NA12878 WG 0 tmp ~/pipeline/ppln/include_example.mk 0 ,Reorder,FixGroups,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,HaplotypeCaller,Freebayes,Platypus,HaplotypeCallerGVCF,RecalibVariants, 1 ~/pipeline/ppln/ 20 all
  1. Restrict our consideration to chromosome 20, and to the confidently callable regions:
mkdir chr20
tabix -h NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz 20 | vcfintersect -b NA12878-callable.bed | bgzip -c > chr20/NA12878.wgs.broad_truth_set.20131119-chr20.vcf.gz
tabix -p vcf chr20/NA12878.wgs.broad_truth_set.20131119-chr20.vcf.gz

Do the same to our variant calls.

  1. Use vcf-compare to gauge the concordance between our calls and the true positives from the truth set:
zcat NA12878.wgs.broad_truth_set.20131119-chr20.vcf.gz | grep "^#\|TruthStatus=TRUE_POSITIVE" | bgzip -c > NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz
tabix -p vcf NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz

For Haplotype Caller:

vcf-compare NA12878-HC-vars-flr-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN
VN	1298	NA12878-HC-vars-flr-call.vcf.gz (1.7%)
VN	1740	../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (2.3%)
VN	73287	../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (97.7%)	NA12878-HC-vars-flr-call.vcf.gz (98.3%)

For Haplotype Caller in GVCF mode:

vcf-compare NA12878-JHC-vars-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN
VN	1400	NA12878-JHC-vars-call.vcf.gz (1.9%)
VN	1729	../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (2.3%)
VN	73298	../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (97.7%)	NA12878-JHC-vars-call.vcf.gz (98.1%)

For Freebayes:

vcf-compare NA12878-FB-vars-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN
VN	445	NA12878-FB-vars-call.vcf.gz (0.6%)
VN	3131	../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (4.2%)
VN	71896	../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (95.8%)	NA12878-FB-vars-call.vcf.gz (99.4%)

For Platypus:

vcf-compare NA12878-PL-vars-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN
VN	2486	../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (3.3%)
VN	3071	NA12878-PL-vars-call.vcf.gz (4.1%)
VN	72541	../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (96.7%)	NA12878-PL-vars-call.vcf.gz (95.9%)

CEU Trio

  1. Download chromosome 20 alignments for NA1278, NA12891, NA12892, and create symbolic links:
ln -s NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam CEUTrio.NA12878.chr20.20120522.bam
ln -s NA12891.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam CEUTrio.NA12891.chr20.20120522.bam
ln -s NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam CEUTrio.NA12892.chr20.20120522.bam
ln -s NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai CEUTrio.NA12878.chr20.20120522.bam.bai
ln -s NA12891.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai CEUTrio.NA12891.chr20.20120522.bam.bai
ln -s NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai CEUTrio.NA12892.chr20.20120522.bam.bai
  1. Download a set of high confidence calls for the trio:
wget -O GiaB_NIST_RTG_v0_2.vcf.gz ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/GIAB_integration/NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz
tabix -f -p vcf GiaB_NIST_RTG_v0_2.vcf.gz
  1. Run the pipeline
sbatch -J CEUTrio -N 1 --exclusive ~/pipeline/ppln/pipe03.sh ./ ./CEUTrio CEUTrio WG 0 tmp ~/pipeline/ppln/include_example.mk 0 ,Reorder,FixGroups,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,HaplotypeCaller,Freebayes,Platypus,HaplotypeCallerGVCF,RecalibVariants, 1 ~/pipeline/ppln/ 20 all
  1. Filter out loci where the proband is HomRef, one can use SnpSift to accomplish this task.
vt normalize -r $GENOMEREF CEUTrio-FB-vars.vcf.gz | java -jar SnpSift.jar filter " GEN[0].GT != '0/0' & GEN[0].GT != '0|0' " | vcfintersect -b ../../NA12878-callable.bed | bgzip -c > CEUTrio-FB-vars-NoHomRef-call.vcf.gz
  1. Compare with vcf-compare.

Haplotype Caller:

vcf-compare CEUTrio/CEUTrio-HC-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz  | grep ^VN
VN  456 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (0.7%)
VN  5978    CEUTrio/CEUTrio-HC-vars-NoHomRef-call.vcf.gz (7.9%)
VN  69434   CEUTrio/CEUTrio-HC-vars-NoHomRef-call.vcf.gz (92.1%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (99.3%)

Haplotype Caller in GVCF mode:

vcf-compare CEUTrio/CEUTrio-JHC-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz  | grep ^VN
VN  470 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (0.7%)
VN  5653    CEUTrio/CEUTrio-JHC-vars-NoHomRef-call.vcf.gz (7.5%)
VN  69420   CEUTrio/CEUTrio-JHC-vars-NoHomRef-call.vcf.gz (92.5%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (99.3%)

Freebayes:

vcf-compare CEUTrio/CEUTrio-FB-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz  | grep ^VN
VN  791 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (1.1%)
VN  3784    CEUTrio/CEUTrio-FB-vars-NoHomRef-call.vcf.gz (5.2%)
VN  69099   CEUTrio/CEUTrio-FB-vars-NoHomRef-call.vcf.gz (94.8%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (98.9%)

Platypus:

vcf-compare CEUTrio/CEUTrio-PL-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz  | grep ^VN
VN  857 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (1.2%)
VN  7769    CEUTrio/CEUTrio-PL-vars-NoHomRef-call.vcf.gz (10.1%)
VN  69033   CEUTrio/CEUTrio-PL-vars-NoHomRef-call.vcf.gz (89.9%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (98.8%)
`,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,HaplotypeCaller,Freebayes,Platypus,HaplotypeCallerGVCF,``

About

validated, scalable, minimalist variant detection pipeline

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published