Releases: hartwigmedical/pipeline-perl
Pipeline version 4.8
Summary
- BAM links are now also generated in the final links.json when running from FASTQ (bug fix)
- An additional config and parameterisation have been added to run PURPLE in SHALLOW_MODE
Pipeline version 4.7
Summary
- Run post stats prior to indel realignment (bug fix)
Pipeline version 4.5
Summary
This release has been made in preparation for pipeline v5, which is built on a completely new architecture and infrastructure. This release only contains some cleanups and bug fixes compared to v4.4.
Various resources and JARs used by the pipeline can be found on https://resources.hartwigmedicalfoundation.nl.
Improvements
- Added a GRIDSS somatic filter step which filters the raw GRIDSS output down into a filtered VCF (using the GRIDSS PON)
- The filtered GRIDSS VCF is fed into PURPLE, which uses the structural variants as usual but also tries to recover structural variants which were not previously called.
Cleanups
- We generated a new amber BAF BED file to filter for likely heterozygous germline positions. This new BED file effectively leads to more BAF points, plus this file is now publicly shared on our resources page.
- Manta and BPI have been removed
- FastQC has been removed
- The mappability tracks HDR file (used to annotate somatic variants with a mappability score) has been changed (bug fix).
Version changes
- Purple to v2.17
- New Rlibs dependencies (mainly for GRIDSS somatic filter), not publicly available. Tested on Rscript version v3.5.0
Somatic precision & sensitivity
The somatic precision and sensitivity of SNVs and Indels is determined on an internally sequenced GIAB-mix of 70% NA24385 and 30% NA12878 against 100% of NA24385 as reference sample. Results are identical to pipeline v4.0:
Type | Algo | TP | FP | FN | Prec | Sens | Δ Prec | Δ Sens |
---|---|---|---|---|---|---|---|---|
INDEL | Strelka | 74360 | 641 | 22412 | 99,1% | 76,8% | 0% | 0% |
SNV | Strelka | 955590 | 1253 | 38084 | 99,9% | 96,2% | 0% | 0% |
MNV | Strelka | 6868 | 21 | 0 | 99,7% | 100,0% | 0% | 0% |
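
The Prec and Sens columns are consistent with the standard definitions, precision = TP / (TP + FP) and sensitivity = TP / (TP + FN). The release notes do not spell these formulas out, so treat them as an assumption; the sketch below simply recomputes the percentages from the counts in the table. The function name is illustrative and not part of the pipeline.

```python
def precision_sensitivity(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, sensitivity) as fractions between 0 and 1."""
    precision = tp / (tp + fp)      # share of calls that are true
    sensitivity = tp / (tp + fn)    # share of true variants that are called
    return precision, sensitivity


# Counts copied from the table above: (TP, FP, FN) per variant type.
counts = {
    "INDEL": (74360, 641, 22412),
    "SNV": (955590, 1253, 38084),
    "MNV": (6868, 21, 0),
}

for variant_type, (tp, fp, fn) in counts.items():
    prec, sens = precision_sensitivity(tp, fp, fn)
    print(f"{variant_type}: precision {prec:.1%}, sensitivity {sens:.1%}")
```

Rounded to one decimal, the INDEL row gives 99.1% precision and 76.8% sensitivity, matching the table.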
Pipeline version 4.4
Upgrade GRIDSS to v2.0.1
Pipeline version 4.3
- Configuration changes in GRIDSS compared to pipeline v4.2
Pipeline version 4.2
Summary
This release upgrades GRIDSS from v1.8.0 (as used in v4.0) to v1.9.0.
Various improvements to the GRIDSS somatic SV calling algorithm have been made based on 163 GRIDSS runs done with pipeline v4.0, and have been released as part of GRIDSS v1.9.0.
See also https://github.com/PapenfussLab/gridss/
Other changes
- We retain the metrics generated by the GRIDSS PreProcess steps. These metrics used to be cleaned up after a successful v4.0 run but can be useful for debugging.
- BPI is upgraded from v1.6 to v1.7 (bug fix release)
- Amber is upgraded from v1.5 to v1.6 (bug fix release)
Pipeline version 4.0
Summary
Many minor changes to all somatic algorithms plus addition of GRIDSS structural variant caller.
Removal of KG pipeline and removal of tumor GATK calling.
Various resources and JARs used by the pipeline can be found on https://resources.hartwigmedicalfoundation.nl.
Improvements to somatic SNV / Indel calling
- To improve sensitivity, variants on known pathogenic locations are retained all the way through Strelka if they are called by the initial Strelka (raw) caller. The list used by HMF can be found on the resources page and is based on CIViC, CGI and OncoKB, appended with a few promoter positions in the TERT gene.
- Post-strelka, variants are annotated with a mapping probability based on information known about the mappability of positions in the ref genome.
- Switched from Germline PON v1.1 to Germline PON v2.0
- Added a Somatic PON which filters out specific Strelka artefacts.
- Added MNV merging. Variants that potentially affect the same codon(s) are checked for phasing and merged if they are phased. This is done within the Strelka Post Process JAR.
- COSMIC annotation has been adjusted such that the COSMIC ID for every transcript affected by a variant is included, not just a random single COSMIC ID. Information is provided in the INFO field to pick the COSMIC ID for a specific transcript.
Added GRIDSS as an additional somatic structural variant caller
- GRIDSS is implemented next to Manta/BPI and our intention is to eventually replace Manta/BPI since we expect it to perform better across our cohort of samples. All documentation on GRIDSS can be found on https://github.com/PapenfussLab/gridss.
Other changes
- Germline calling is now only performed on the reference sample and hence the germline VCF contains the calls for just one sample.
- Every final VCF (germline, somatic, sv, etc) is gzipped and a tabix index is provided along with the gzipped VCF.
- The kinship test to detect sample swaps is replaced by a test based on BAF scores. The main reason is that kinship penalises het-to-hom transitions, which happen in relation to the degree of LOH. Using BAFs, we can detect sample swaps by observing a mean BAF that significantly deviates from 0.5, which is independent of degree of LOH in the tumor.
- The QC checks are now run as part of the pipeline while they previously used to be a post-pipeline step.
- KG configuration is no longer supported, but there is an INI to analyse just a single sample. This INI runs the algorithms that would normally be run on the reference sample of a somatic pair of samples.
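
The BAF-based sample swap test mentioned above can be illustrated with a minimal sketch. It assumes the input is a list of BAF values measured at positions expected to be heterozygous; the function name and the `max_deviation` tolerance are hypothetical and not taken from the pipeline, which may define "significantly deviates" differently.

```python
from statistics import mean


def looks_like_sample_swap(bafs: list[float], max_deviation: float = 0.03) -> bool:
    """Flag a possible swap when the mean BAF strays too far from 0.5.

    max_deviation is a hypothetical tolerance, not a value from the pipeline.
    """
    if not bafs:
        raise ValueError("need at least one BAF value")
    return abs(mean(bafs) - 0.5) > max_deviation


# Toy values with LOH: individual BAFs move towards 0 or 1,
# but they balance out so the mean stays close to 0.5.
matched_with_loh = [0.1, 0.9, 0.48, 0.52, 0.05, 0.95]
# Toy values for a mismatched sample: the calls do not balance around 0.5.
mismatched = [0.0, 0.0, 1.0, 0.0, 0.5, 0.0]

print(looks_like_sample_swap(matched_with_loh))
print(looks_like_sample_swap(mismatched))
```

The first list shows why the mean is insensitive to LOH: each shifted BAF is offset by one shifted the other way, whereas a per-position het-to-hom comparison would count every one of them as a transition.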
New tool versions
- GRIDSS introduced at version v1.8.0 (using bwa v0.7.17)
Version changes
- Purple v1.2 to v2.14
- Cobalt v1.0 to v1.4
- Amber v1.0 to v1.5
- BPI v1.2 to v1.6
- Strelka Post Process v1.0 to v1.4
- HealthChecker v2.1 to v2.4
- GATK v3.4.46 to v3.8
- snpEff v4.1h to v4.3s
Quality
Since we don't have a KG pipeline anymore we don't report germline precision and sensitivity.
The somatic precision and sensitivity of SNVs and Indels is determined on an internally sequenced GIAB-mix of 70% NA24385 and 30% NA12878 against 100% of NA24385 as reference sample. Results are as follows:
Somatic precision & sensitivity
Type | Algo | TP | FP | FN | Prec | Sens | Δ Prec | Δ Sens |
---|---|---|---|---|---|---|---|---|
INDEL | Strelka | 74360 | 641 | 22412 | 99,1% | 76,8% | -0,1% | -0,2% |
SNV | Strelka | 955590 | 1253 | 38084 | 99,9% | 96,2% | 0% | 0% |
MNV | Strelka | 6868 | 21 | 0 | 99,7% | 100,0% | - | - |
- Note: The differences compared to v3 are entirely attributable to changes we made in the way we measure the above numbers. Applying the same method to v3 and v4 yields no differences, which is as expected since we made no changes that significantly affect either sensitivity or precision.
In addition, to measure exact false positive rate, we analyse a sample against itself in roughly 30x/100x coverage. With pipeline v4.0 release we find 136 false positives in total across the whole genome (109 SNVs and 27 INDELs).
Pipeline version v3.0
Summary
Major overhaul of all somatic analyses (SNVs, INDELs, implied purity, BAF, copy numbers and structural variation (SVs)).
Improvements to somatic SNV / Indel calling
- Consensus method has been replaced by Strelka only calling with custom post processing. Mutect, Freebayes and Varscan callers have been removed from the pipeline.
- BQSR is now run prior to somatic calling (to improve Strelka precision).
- Strelka REPEAT filter is switched off to improve INDEL sensitivity.
- A new post calling filter is added to Strelka output: variants in the low confidence regions are hard filtered unless they have > 10% AF and Strelka Somatic score > 20.
- A soft PON (pool of normals) filter is applied to the final Strelka output to improve precision.
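
The low confidence region rule above can be read as a simple predicate. The sketch below is one interpretation of that sentence, not the pipeline's post-processing code; the function and parameter names are hypothetical, and AF is assumed to be expressed as a fraction (0.10 for 10%).

```python
def passes_low_confidence_filter(in_low_confidence_region: bool,
                                 allele_frequency: float,
                                 somatic_score: float) -> bool:
    """Return True if the variant survives the hard filter described above."""
    if not in_low_confidence_region:
        # The rule only applies inside low confidence regions.
        return True
    # Inside a low confidence region both conditions must hold.
    return allele_frequency > 0.10 and somatic_score > 20


print(passes_low_confidence_filter(True, 0.25, 35))   # high AF and high score
print(passes_low_confidence_filter(True, 0.25, 15))   # score too low
print(passes_low_confidence_filter(True, 0.08, 35))   # AF too low
print(passes_low_confidence_filter(False, 0.02, 5))   # outside low confidence regions
```

Only the second and third calls are filtered; variants outside low confidence regions are left to the other filters in the list.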
Improvements to somatic SV calling
- ‘Alfredi’ (BPI) is run on Manta output and performs the following functions:
  - Applies a set of 8 filters to Manta output to remove obvious false positives / improve precision.
  - Determines the accurate break point of each variant.
  - Calculates an AF for each breakpoint end on each variant.
Tumor purity / BAF / copy numbers
- ‘PURPLE’ (PURity & PLoidy Estimator) replaces FreeC as the primary copy number tool
Key features of PURPLE:
- ‘COBALT’ (COunt BAm Lines of Tumor) counts the # of reads per kb window for both normal and tumor.
- GC bias is fit for both normal and tumor.
- ‘AMBER’ (A Minipileup Baf EstimatoR) calculates BAF for a set of HC common heterozygous SNPs.
- A set of candidate copy number breakpoints is determined using a PCF (piecewise constant fitting) algorithm on tumor, normal and BAF.
- Sample ploidy and purity are jointly fit by minimising a penalty function using an integer ploidy and minor allele ploidy model.
- Absolute copy number is determined for each segment.
- Candidate breakpoints are smoothed into a set of final copy number breakpoints.
- CIRCOS and QC plots are produced.
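
As an illustration of the read counting step in the list above, the sketch below bins read start positions into fixed windows. It assumes "per kb window" means consecutive 1,000-base windows and that positions are 0-based; the function name and input format are hypothetical, and this is not COBALT's actual implementation, which reads BAM files directly.

```python
from collections import Counter

WINDOW_SIZE = 1000  # assumed meaning of "per kb window"


def count_reads_per_window(read_starts):
    """Count reads per window, keyed by (chromosome, window start)."""
    counts = Counter()
    for chromosome, position in read_starts:
        # Integer division snaps the position to the start of its window.
        window_start = (position // WINDOW_SIZE) * WINDOW_SIZE
        counts[(chromosome, window_start)] += 1
    return counts


# Hypothetical read starts as (chromosome, 0-based position) pairs.
tumor_reads = [("1", 10), ("1", 999), ("1", 1000), ("2", 2500)]
for window, n_reads in sorted(count_reads_per_window(tumor_reads).items()):
    print(window, n_reads)
```

Running the same function on the normal sample gives the second set of counts that the later steps compare against.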
Other changes
- Introduce damage estimator tool to estimate DNA damage.
- BQSR no longer produces a QC report, cutting out 40% of total runtime.
- BQSR writes its BAM using lower zip compression leading to faster compute time but bigger recalibrated BAM files.
- A 26 SNP filter is added for changes to SNP check design.
- Health checker is now run as part of the pipeline.
- CPCT Slicing is removed as it is no longer used.
Changes to versions
- dx_tracks updated for KG from v1 to v1.2.1
Quality
- For assessing the quality of the pipeline we do the following checks:
  - Determine germline precision & sensitivity on an internally sequenced NA12878
  - Determine somatic precision & sensitivity on an internally sequenced GIAB-mix of 70% NA24385 and 30% NA12878, against 100% of NA24385 as reference sample.
Germline precision & sensitivity
Type | Config | Algo | TP | FP | FN | Prec | Sens | Δ Prec | Δ Sens |
---|---|---|---|---|---|---|---|---|---|
SNV | KG | GATK | 3115316 | 10421 | 38943 | 99,7% | 98,8% | 0,0% | 0,0% |
Somatic precision & sensitivity
Type | Algo | TP | FP | FN | Prec | Sens | Δ Prec | Δ Sens |
---|---|---|---|---|---|---|---|---|
INDEL | Strelka | 74432 | 577 | 22184 | 99,2% | 77,0% | -0,4% | 7,1% |
SNV | Strelka | 969786 | 1247 | 38058 | 99,9% | 96,2% | 0,1% | 0,0% |
Note: The soft PON filter has not been applied to the GIAB-mix results, as that is not possible due to the nature of the sample.
Pipeline version 1.12
Summary
- Somatic calling improved through improved Strelka and Freebayes filtering
- FreeC now uses GC-normalization and produces BAF analysis.
- First implementation of SV-calling using Manta.
Improvements to CNV calling
- FreeC copy number output is now based on GC content normalisation and assesses BAF, to remove the earlier observed "wave effect"
- Many corner cases solved and new tool version put in place.
Improvements to somatic calling
- Freebayes normalisation and filtering is substantially improved:
  - No calls without normal coverage added in the final VCF.
  - No SNPs with length > 1 due to INDELs on the same line/position
  - Improvement of left-aligned, single-padded INDEL representation
- Improved Strelka filtering by new allelic frequency-based filtering:
  - We accept lower quality variants provided they have sufficient frequency
Technical changes
- Fixes in tools
  - Fix a corner-case where VarScan would error due to normal input being longer than tumor input for a given chromosome.
  - Fix a bug in GATK where BQSR statistics are not flushed before producing the report.
- Completely standardise job creation and submission
  - All jobs have their name, template, job ID, script name, log files and .done files standardised and unified; a single function is responsible for submitting jobs to SGE
  - As a consequence, every job has its own .done file, which is more granular but means more files to delete when re-running
  - Greater job re-use, e.g. for concatenating VCFs.
  - Backwards compatibility with previously-inconsistent .done file names is provided, so re-running previous samples/parts of samples should work seamlessly
- Cleanup of template structure
  - Provide helper functions and consistency around standard operations like logging timing, logging status to dashboard, validating that output files exist etc.
  - For the most part templates should be like functions: they are told their inputs and where to store their output.
  - Reduce implicit (potentially inconsistent) duplication of paths/filenames.
- Take qsub options from a file instead of the command line, eliminating the need to fit hold job IDs within the maximum command line length.
- Fix status reporting to dashboard.
- Validate that FASTQ name does not begin with a dash (likely to be interpreted as a switch to commands).
- Remove secondary (unused) mappability track configuration in FREEC
- Add an INI option to retain the recalibrated BAM file (useful for testing/experiments).
- links.json now has relative paths (relative to the run directory). This makes the paths portable as the run dir is copied around.
- New additions to the extras.tar in the portal:
  - Final INI used when running the pipeline
  - ExonCov preferred transcripts
  - Germline and potentially somatic SVs from Manta (depending on INI used).
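
To illustrate why relative paths make links.json portable, the sketch below rewrites absolute paths relative to a run directory. The flat name-to-path mapping, the key names and the example paths are hypothetical; the real links.json layout is not described in these notes.

```python
import json
import os


def relativise_links(links: dict, run_dir: str) -> dict:
    """Return a copy of links with every path made relative to run_dir."""
    return {name: os.path.relpath(path, start=run_dir)
            for name, path in links.items()}


# Hypothetical absolute paths inside a run directory.
links = {
    "ref_sample_bam": "/data/runs/run_001/sample_R/mapping/sample_R.bam",
    "somatic_vcf": "/data/runs/run_001/somaticVariants/sample.vcf",
}
print(json.dumps(relativise_links(links, "/data/runs/run_001"), indent=2))
```

Once the entries no longer mention `/data/runs/run_001`, the run directory can be copied elsewhere and each path still resolves when joined with the new location.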
Tools & version changes
- FreeC upgraded to 10.3
- FreeC BAF uses dbSNP v149 sliced using CytoscanHD positions
- ExonCov upgraded to 2.1.3
- Manta introduced at 1.0.3
- BCFtools 1.3.1 used by Freebayes post process
Quality
- For assessing the quality of the pipeline we do the following checks:
  - Determine germline precision & sensitivity on an internally sequenced NA12878
  - Determine somatic precision & sensitivity on an internally sequenced GIAB-mix of 70% NA24385 and 30% NA12878, against 100% of NA24385 as reference sample.
Germline precision & sensitivity
Type | Config | Algo | TP | FP | FN | Prec | Sens | Δ Prec | Δ Sens |
---|---|---|---|---|---|---|---|---|---|
SNP | KG | GATK | 3115316 | 10425 | 38943 | 99,7% | 98,8% | 0,0% | 0,0% |
Somatic precision & sensitivity
Type | Algo | TP | FP | FN | Prec | Sens | Δ Prec | Δ Sens |
---|---|---|---|---|---|---|---|---|
INDEL | Freebayes | 66002 | 494 | 30614 | 99,3% | 68,3% | 0,4% | 1,7% |
INDEL | Strelka | 67300 | 275 | 29316 | 99,6% | 69,7% | 0,2% | 19,6% |
INDEL | Varscan | 63576 | 624 | 33040 | 99,0% | 65,8% | 0,0% | 0,0% |
SNP | Freebayes | 936784 | 1014 | 71060 | 99,9% | 93,0% | 0,0% | 0,2% |
SNP | Mutect | 931973 | 5948 | 75871 | 99,4% | 92,5% | 0,0% | 0,0% |
SNP | Strelka | 969316 | 2068 | 38528 | 99,8% | 96,2% | 0,2% | 2,8% |
SNP | Varscan | 899598 | 832 | 108246 | 99,9% | 89,3% | 0,0% | 0,0% |
Pipeline version 1.11
Summary
- Release with final fixes and validation for KG
Changes for KG
- Add CallableLoci functionality, enable in KG.ini
- Enable ExonCov in KG.ini (disabled in v1.10)
- Recalibrated BAM should not be the final BAM for BQSR runs due to size/lack of FASTQ recoverability: do not link it and delete it on success
- Ability to link some pipeline artefacts as "extras" that are themselves linked (in links.json) as a single archive file, and use this for a selection of extra files for KG
Technical Changes
- Code has been rewritten to adhere to standard Perl structure and automated tests have been added; build status is viewable via Travis CI
Validation
- Regression for our somatic pipeline succeeded with no change in BAM and no regression in precision/sensitivity for germline or somatic VCF
- For validation of single sample pipeline we use data for NA12878, internally known as VAL-S00025. This sample has a truthset of 3154259 variants, and we achieve the following results:
Type | Algo | TP | FP | FN | Prec | Sens | Δ Prec | Δ Sens |
---|---|---|---|---|---|---|---|---|
ANY | GATK | 3115315 | 10425 | 38944 | 99,7% | 98,8% | - | - |