Taking MarkDuplicatesSpark out of beta (#5603)
* Removed the beta tag and added full tool docs
jamesemery authored and droazen committed Jan 25, 2019
1 parent e4c90aa commit ec2a6f7
Showing 1 changed file with 76 additions and 1 deletion.
@@ -34,12 +34,87 @@

import java.util.*;

/**
* <p>This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
* parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
* the output of the single-core Picard version. Since the tool requires holding all of the read names in memory
* while it groups the read information, it is recommended to run this tool on a machine/configuration
* with at least 8 GB of memory overall for a typical 30x BAM.</p>
*
* <p>This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are
* defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation, e.g. library
* construction using PCR. See also "<a href='https://broadinstitute.github.io/picard/command-line-overview.html#EstimateLibraryComplexity'>EstimateLibraryComplexity</a>"
* for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster,
* incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are
* referred to as optical duplicates.</p>
*
* <p>The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.
* After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
* reads by the sums of their base-quality scores (default method).</p>
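The default ranking described above can be sketched in plain Java. This is an illustrative sketch only: the `Read` record and `selectPrimary` helper are hypothetical names, not GATK's actual classes; the point is that duplicates are scored by the sum of their base qualities and the highest-scoring read is kept as primary.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PrimaryReadSelection {

    // Hypothetical read record: a name plus its per-base quality scores.
    record Read(String name, int[] baseQualities) {
        // Default ranking criterion: sum of base-quality scores.
        int score() {
            return Arrays.stream(baseQualities).sum();
        }
    }

    // Among a group of duplicates, keep the highest-scoring read as primary.
    static Read selectPrimary(List<Read> duplicates) {
        return duplicates.stream()
                .max(Comparator.comparingInt(Read::score))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Read primary = selectPrimary(List.of(
                new Read("readA", new int[]{30, 30, 30}),
                new Read("readB", new int[]{40, 40, 40})));
        System.out.println(primary.name()); // readB
    }
}
```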
*
* <p>The tool's main output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field for each
* read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024.
* If you are not familiar with this type of annotation, please see the following <a href='https://www.broadinstitute.org/gatk/blog?id=7019'>blog post</a> for additional information.</p>
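As a quick standalone illustration of the flag check described above (not GATK code), the duplicate bit 0x400 of a read's SAM FLAG value can be tested with a bitwise AND:

```java
public class DuplicateFlagCheck {

    // SAM FLAG bit for "PCR or optical duplicate": 0x400 (decimal 1024).
    private static final int DUPLICATE_FLAG = 0x400;

    static boolean isDuplicate(int samFlags) {
        return (samFlags & DUPLICATE_FLAG) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isDuplicate(1024)); // true: only the duplicate bit is set
        System.out.println(isDuplicate(99));   // false: paired/proper-pair/mate-reverse/first-in-pair bits
    }
}
```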
*
* <p>Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of
* duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in
* the 'optional field' section of a SAM/BAM file. Invoking the 'duplicate-tagging-policy' option,
* you can instruct the program to mark all the duplicates (All), only the optical duplicates (OpticalOnly), or no
* duplicates (DontTag). The records within the output of a SAM/BAM file will have values for the 'DT' tag (depending on the invoked
* 'duplicate-tagging-policy'), as either library/PCR-generated duplicates (LB), or sequencing-platform artifact duplicates (SQ).
* This tool uses the 'read-name-regex' and the 'optical-duplicate-pixel-distance' options as the primary methods to identify
* and differentiate duplicate types. Set 'read-name-regex' to null to skip optical duplicate detection, e.g. for RNA-seq
* or other data where duplicate sets are extremely large and estimating library complexity is not an aim.
* Note that without optical duplicate counts, library size estimation will be inaccurate.</p>
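The optical-duplicate criterion described above can be approximated as follows. The read-name layout assumed here (a name ending in tile:x:y, as matched by a Picard-style 'read-name-regex') and the helper names are illustrative assumptions, not the tool's actual implementation: two reads on the same tile whose cluster coordinates fall within the configured pixel distance are candidate optical duplicates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OpticalDuplicateSketch {

    // Illustrative pattern for a read name ending in tile:x:y; the tool's
    // actual default is configurable via 'read-name-regex'.
    private static final Pattern READ_NAME =
            Pattern.compile(".*:(\\d+):(\\d+):(\\d+)$");

    // Candidate optical duplicates: same tile, and Euclidean distance between
    // cluster coordinates no greater than maxDist pixels.
    static boolean withinPixelDistance(String nameA, String nameB, int maxDist) {
        Matcher a = READ_NAME.matcher(nameA);
        Matcher b = READ_NAME.matcher(nameB);
        if (!a.matches() || !b.matches()) return false;
        if (!a.group(1).equals(b.group(1))) return false; // different tile
        long dx = Long.parseLong(a.group(2)) - Long.parseLong(b.group(2));
        long dy = Long.parseLong(a.group(3)) - Long.parseLong(b.group(3));
        return dx * dx + dy * dy <= (long) maxDist * maxDist;
    }

    public static void main(String[] args) {
        System.out.println(withinPixelDistance(
                "M01:1:FC:1:1101:100:200", "M01:1:FC:1:1101:150:220", 100)); // true
    }
}
```

The threshold corresponds to the tool's 'optical-duplicate-pixel-distance' option; larger values are typically needed for patterned flowcells.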
*
* <p>MarkDuplicates also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.</p>
*
* <p>The program can take either coordinate-sorted or query-sorted inputs; however, it is recommended that the input be
* query-sorted or query-grouped, as the tool will otherwise have to perform an extra sort operation on the data in order to associate
* reads from the input BAM with their mates.</p>
*
* <p>If desired, duplicates can be removed using the 'remove-all-duplicates' and 'remove-sequencing-duplicates' options.</p>
*
* <h4>Usage example:</h4>
* <pre>
* gatk MarkDuplicatesSpark \\<br />
* -I input.bam \\<br />
* -O marked_duplicates.bam \\<br />
* -M marked_dup_metrics.txt
* </pre>
*
* <h4>MarkDuplicatesSpark run locally, specifying the number of executor cores (if 'spark.executor.cores' is unset, Spark will use all available cores on the machine)</h4>
* <pre>
* gatk MarkDuplicatesSpark \\<br />
* -I input.bam \\<br />
* -O marked_duplicates.bam \\<br />
* -M marked_dup_metrics.txt \\<br />
* --conf 'spark.executor.cores=5'
* </pre>
*
* <h4>MarkDuplicatesSpark run on a Spark cluster with 5 machines</h4>
* <pre>
* gatk MarkDuplicatesSpark \\<br />
* -I input.bam \\<br />
* -O marked_duplicates.bam \\<br />
* -M marked_dup_metrics.txt \\<br />
* -- \\<br />
* --spark-runner SPARK \\<br />
* --spark-master <master_url> \\<br />
* --num-executors 5 \\<br />
* --executor-cores 8 <br />
* </pre>
*
* Please see
* <a href='http://broadinstitute.github.io/picard/picard-metric-definitions.html#DuplicationMetrics'>MarkDuplicates</a>
* for detailed explanations of the output metrics.
* <hr />
*/
@DocumentedFeature
@CommandLineProgramProperties(
summary = "Marks duplicates on Spark",
oneLineSummary = "MarkDuplicates on Spark",
programGroup = ReadDataManipulationProgramGroup.class)
@BetaFeature
public final class MarkDuplicatesSpark extends GATKSparkTool {
private static final long serialVersionUID = 1L;

