
Subcommand: accumulate

Lucas Czech edited this page Jan 4, 2022 · 8 revisions

Accumulate the masses of each query in jplace files into basal branches so that they exceed a given mass threshold.

Usage: gappa edit accumulate [options]

Options

Input
--jplace-path Required. TEXT:PATH(existing)=[] ...
List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed.
Settings
--threshold FLOAT:FLOAT in [0.5 - 1]=0.95
Threshold of how much mass needs to be accumulated into a basal branch.
Output
--out-dir TEXT=.
Directory to write output files to.
--file-prefix TEXT
File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--file-suffix TEXT
File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Global Options
--allow-file-overwriting FLAG
Allow overwriting existing output files instead of aborting the command.
--verbose FLAG
Produce more verbose output.
--threads UINT
Number of threads to use for calculations.
--log-file TEXT
Write all output to a log file, in addition to standard output to the terminal.

Description

The command is useful to assess placements that are distributed across (nearby) branches of the reference tree, for example if the reference tree contains multiple representatives of the same species. For each pquery, it accumulates the placement mass (likelihood weight ratio) of its placements up the tree (towards the root), until the accumulated mass at a basal branch reaches the given --threshold:

[Figure: Accumulation of placement mass towards a basal branch.]

That is, each pquery is treated separately. Its mass is first normalized to a total of 1.0. Then, the command looks for the basal branch whose underlying clade accumulates more than the threshold mass. This can be understood as finding the clade that contains most of the placement mass. All placements of the pquery are then removed, and only one placement at the basal branch is added, with a mass of 1.0, which hence represents the accumulated original masses. The pendant length of the resulting pquery is set to the weighted average of the pendant lengths that have been accumulated in the clade, using the masses (likelihood weight ratios) as weights.
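
The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not gappa's actual implementation: the tree representation (a `parent` map from each branch to the branch above it), the function name `accumulate`, and the choice to pick the deepest branch whose clade reaches the threshold are all assumptions made for this example. It also assumes that the weighted average of the pendant lengths is taken over the placements inside the chosen clade only.

```python
# Hypothetical toy tree: each branch maps to the branch above it;
# branches attached directly to the root map to None.
parent = {
    "A": None, "B": None,
    "A1": "A", "A2": "A",
    "A1a": "A1", "A1b": "A1",
}

def accumulate(parent, placements, threshold=0.95):
    """placements maps a branch to (likelihood weight ratio, pendant length).

    Returns (basal branch, mass, pendant length), or None if no clade
    reaches the threshold.
    """
    total = sum(lwr for lwr, _ in placements.values())

    # Clade mass and mass-weighted pendant sum for every branch.
    mass = {edge: 0.0 for edge in parent}
    pendant_sum = {edge: 0.0 for edge in parent}
    for edge, (lwr, pendant) in placements.items():
        weight = lwr / total  # normalize the pquery to a total mass of 1.0
        current = edge
        while current is not None:  # push the mass up towards the root
            mass[current] += weight
            pendant_sum[current] += weight * pendant
            current = parent[current]

    def depth(edge):
        steps = 0
        while parent[edge] is not None:
            edge = parent[edge]
            steps += 1
        return steps

    candidates = [edge for edge in parent if mass[edge] >= threshold]
    if not candidates:
        return None  # mass is split across the root
    basal = max(candidates, key=depth)  # smallest clade that is large enough
    return basal, 1.0, pendant_sum[basal] / mass[basal]

pquery = {"A1a": (0.6, 0.1), "A1b": (0.3, 0.2), "A2": (0.1, 0.4)}
print(accumulate(parent, pquery, threshold=0.85))  # basal branch is "A1"
print(accumulate(parent, pquery, threshold=0.95))  # basal branch is "A"
```

Raising the threshold moves the resulting placement closer to the root, because a larger clade is needed to collect enough mass.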

It can happen that a pquery has placement mass on different sides of the root. If no side contains more than the given --threshold mass, there is no basal branch or clade that satisfies the above description. In that case, the whole pquery is removed from the output, and its name(s) are printed to report this. This can happen, for example, with chimeric sequences that fit in multiple places of the tree, and hence should be treated as a warning sign. Another reason can be that the root of the reference tree is not chosen properly. In that case, it can help to reroot the tree first.
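
To make the root case concrete, the following sketch sums the normalized mass of a pquery on each side of the root and reports whether any side reaches the threshold. As before, the `parent` map and the function names `root_side_masses` and `is_removed` are hypothetical helpers for illustration, not part of gappa.

```python
# Hypothetical toy tree: branches "A" and "B" are the two sides of the root.
parent = {"A": None, "B": None, "A1": "A", "A2": "A"}

def root_side_masses(parent, lwrs):
    """Sum the normalized mass of a pquery per branch adjacent to the root."""
    total = sum(lwrs.values())
    sides = {}
    for edge, lwr in lwrs.items():
        top = edge
        while parent[top] is not None:  # climb to the branch below the root
            top = parent[top]
        sides[top] = sides.get(top, 0.0) + lwr / total
    return sides

def is_removed(parent, lwrs, threshold=0.95):
    """True if no side of the root holds at least the threshold mass."""
    return max(root_side_masses(parent, lwrs).values()) < threshold

# A chimera-like pquery with its mass split evenly across the root.
print(is_removed(parent, {"A1": 0.5, "B": 0.5}))
# A pquery with nearly all of its mass on one side.
print(is_removed(parent, {"A1": 0.97, "B": 0.03}))
```

Only the first pquery would be removed under these assumptions, since neither side of the root holds enough of its mass.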

The output of the command is a file called accumulated.jplace, with the base name amended by --file-prefix and --file-suffix if these are given. The file can then be visualized, for example via the heat-tree command, or examined in greater detail with the graft command.
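
As a quick sanity check, the output can also be read with Python's standard library, since jplace files are JSON. The sketch below counts the placements of each pquery, which should be one per pquery after accumulation. It relies on the `placements`, `p`, `n`, and `nm` keys of the jplace format; the function name `placements_per_pquery` is made up for this example, and the file path in the usage note is an assumption that depends on --out-dir, --file-prefix, and --file-suffix.

```python
import gzip
import json

def placements_per_pquery(path):
    """Map the name(s) of each pquery in a jplace file to its placement count."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        data = json.load(handle)

    counts = {}
    for pquery in data["placements"]:
        # Names are stored either as "n" (names) or "nm" (name, multiplicity).
        names = pquery.get("n") or [entry[0] for entry in pquery.get("nm", [])]
        if isinstance(names, str):
            names = [names]
        counts[",".join(names)] = len(pquery["p"])
    return counts
```

For example, `placements_per_pquery("accumulated.jplace")` could be used to confirm that every remaining pquery has exactly one placement.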

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
