Subcommand: taxonomy tree

Turn a taxonomy into a tree that can be used as a constraint for tree inference.

Usage: gappa prepare taxonomy-tree [options]

Options

Input
`--taxon-list-file`	`TEXT:FILE` File that maps taxon names to taxonomic paths.
`--taxonomy-file`	`TEXT:FILE` File that lists the taxa of the taxonomy as taxonomic paths.
Settings
`--keep-singleton-inner-nodes`	`FLAG` Taxonomic paths can go down several levels without any furcation. Use this option to keep such paths, instead of collapsing them into a single level.
`--keep-inner-node-names`	`FLAG` Taxonomies contain names at every level, while trees usually do not. Use this option to also set taxonomic names for the inner nodes of the tree.
`--max-level`	`INT=-1` Maximum taxonomic level to process (0-based). Taxa below this level are not added to the tree.
`--replace-invalid-chars`	`FLAG` Replace characters in node labels that are invalid in the Newick format ( `,:;"()[]`) by underscores. Such characters can end up in the labels if the input taxonomic paths contain them. The Newick format requires node labels to be wrapped in double quotation marks if they contain these characters, but many parsers cannot handle this. For such cases, replacing the characters can help.
Output
`--out-dir`	`TEXT=.` Directory to write output files to.
`--file-prefix`	`TEXT` File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--file-suffix`	`TEXT` File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Newick Tree Output
`--newick-tree-quote-invalid-chars`	`FLAG` If set, node labels that contain characters that are invalid in the Newick format (i.e., spaces and `:;()[],{}`) are put into quotation marks. If not set (default), these characters are instead replaced by underscores, which changes the names, but works better with most downstream tools.
Global Options
`--allow-file-overwriting`	`FLAG` Allow overwriting existing output files instead of aborting the command.
`--verbose`	`FLAG` Produce more verbose output.
`--threads`	`UINT` Number of threads to use for calculations.
`--log-file`	`TEXT` Write all output to a log file, in addition to standard output to the terminal.

Description

The command takes a taxonomy as input, and converts it into a tree in Newick format, which can for example be used as a (taxonomic) constraint for tree inference. This is particularly useful for a set of sequences that each have a taxonomic assignment.

The required taxonomy can be given in two ways:

--taxon-list-file: A map from taxon names to their taxonomic path.
--taxonomy-file: A list of taxonomic paths.

Both of them can be given at the same time, in which case the taxonomic information is combined. See below for details on the expected formats.

The output of the command is a tree in Newick format, written to a file named taxonomy-tree.newick (additionally using the --file-prefix and --file-suffix if provided).

Details

A taxonomy is a hierarchy that can be interpreted as a rooted tree. In general, the command can be used to convert any taxonomy into a (multifurcating) tree. The most typical use case, however, is to create a taxonomically constrained tree for tree inference, using the taxonomy of the sequences as the constraint.

`--taxon-list-file`

This is useful if the taxonomy is used for a set of sequences that have taxonomic assignments: One might wish to build a tree where tips correspond to sequences, and the tree topology reflects the taxonomy of these sequences. For such a use case, the input is the taxonomy of the sequences: each line maps a sequence name to its taxonomic path, with the two separated by a tab.

Example:

AY842031	Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete_comata
JQ031957	Eukaryota;Amoebozoa;Myxogastria;Brefeldia;Brefeldia_maxima
...

This will create a tree with tips labeled AY842031, JQ031957, etc., and a topology that reflects the taxonomic paths.
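To make the mapping from this file format to a tree more concrete, here is a minimal Python sketch. It is not gappa's implementation: the function names `parse_taxon_list`, `build_tree`, and `tip_names` are hypothetical, and the sketch assumes that each sequence name is attached as a tip below the last level of its taxonomic path.

```python
def parse_taxon_list(lines):
    """Map each sequence name to its taxonomic path (a list of levels)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        # The name and the path are separated by the first tab.
        name, path = line.split("\t", 1)
        # Drop empty entries, for example from a trailing semicolon.
        mapping[name] = [level for level in path.split(";") if level.strip()]
    return mapping


def build_tree(mapping):
    """Build nested dicts; a tip (sequence name) maps to None."""
    root = {}
    for name, path in mapping.items():
        node = root
        for level in path:
            node = node.setdefault(level, {})
        # Assumption: the sequence hangs below the last level of its path.
        node[name] = None
    return root


def tip_names(node):
    """Collect tip labels in insertion order."""
    tips = []
    for label, child in node.items():
        if child is None:
            tips.append(label)
        else:
            tips.extend(tip_names(child))
    return tips


lines = [
    "AY842031\tEukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete_comata",
    "JQ031957\tEukaryota;Amoebozoa;Myxogastria;Brefeldia;Brefeldia_maxima",
]
tree = build_tree(parse_taxon_list(lines))
print(tip_names(tree))
```

The two sequences share the path down to Myxogastria and split below it, which is the topology the taxonomic paths describe.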

`--taxonomy-file`

A more general case is to simply use the full taxonomy to create a tree. This input file needs to contain a list of semicolon-separated taxonomic paths. Everything after the first tab is ignored.

Example:

Eukaryota;	4	domain
Eukaryota;Amoebozoa;	4052	kingdom		119
Eukaryota;Amoebozoa;Myxogastria;	4094	phylum		119
Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;	4095	genus		119
Eukaryota;Amoebozoa;Myxogastria;Badhamia;	4096	genus		119
Eukaryota;Amoebozoa;Myxogastria;Brefeldia;	4097	genus		119
Eukaryota;Amoebozoa;Myxogastria;Comatricha;	4098	genus		119
...
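As an illustration of how such lines can be read, the following Python sketch keeps only the text before the first tab and splits it at semicolons. The helper names `parse_taxonomy_line` and `build_taxonomy` are hypothetical, and treating the trailing semicolon as an empty entry to be dropped is an assumption of this sketch, not a statement about gappa's parser.

```python
def parse_taxonomy_line(line):
    """Return the taxonomic path of one line as a list of levels."""
    # Everything after the first tab is ignored.
    path_field = line.rstrip("\n").split("\t", 1)[0]
    # Assumption: empty entries (trailing semicolon) carry no information.
    return [level for level in path_field.split(";") if level]


def build_taxonomy(lines):
    """Merge all paths into one hierarchy of nested dicts."""
    root = {}
    for line in lines:
        node = root
        for level in parse_taxonomy_line(line):
            node = node.setdefault(level, {})
    return root


lines = [
    "Eukaryota;\t4\tdomain",
    "Eukaryota;Amoebozoa;\t4052\tkingdom\t\t119",
    "Eukaryota;Amoebozoa;Myxogastria;\t4094\tphylum\t\t119",
    "Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;\t4095\tgenus\t\t119",
    "Eukaryota;Amoebozoa;Myxogastria;Badhamia;\t4096\tgenus\t\t119",
]
taxonomy = build_taxonomy(lines)
print(sorted(taxonomy["Eukaryota"]["Amoebozoa"]["Myxogastria"]))
```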

Both inputs --taxon-list-file and --taxonomy-file can be used at the same time, in which case the resulting tree contains the taxonomic information from both files.

Settings

It might happen that a taxonomic path goes down several levels with just one taxon at each level. This would create inner nodes in the tree that just connect two other nodes, that is, nodes that do not furcate at all. Many downstream programs might have problems with such trees. Hence, by default, such nodes are collapsed. Use --keep-singleton-inner-nodes to include these inner nodes in the tree.
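The effect of collapsing can be sketched in a few lines of Python. This is an illustration only: the `(label, children)` tuple representation and the names `collapse` and `to_newick` are hypothetical, and which name survives a collapsed chain (here, the deepest one) is an assumption of the sketch rather than documented gappa behavior.

```python
def collapse(node):
    """Contract every inner node that has exactly one child."""
    label, children = node
    children = [collapse(child) for child in children]
    if len(children) == 1:
        # A non-furcating node is replaced by its only child.
        return children[0]
    return (label, children)


def to_newick(node):
    """Serialize to Newick without inner node names."""
    label, children = node
    if not children:
        return label
    return "(" + ",".join(to_newick(child) for child in children) + ")"


tree = ("Eukaryota", [
    ("Amoebozoa", [
        ("Myxogastria", [
            ("Amaurochaete", [("Amaurochaete_comata", [])]),
            ("Brefeldia", [("Brefeldia_maxima", [])]),
        ]),
    ]),
])

print(to_newick(tree) + ";")            # ((((Amaurochaete_comata),(Brefeldia_maxima))));
print(to_newick(collapse(tree)) + ";")  # (Amaurochaete_comata,Brefeldia_maxima);
```

The first line of output contains several nested parentheses that enclose a single subtree each; these are the non-furcating nodes that downstream programs may reject.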

Furthermore, a taxonomy contains names at every level, while a tree usually does not contain inner node names. Thus, by default, inner nodes are not named. Use --keep-inner-node-names to also name the inner nodes of the tree.
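In the Newick format, an inner node name is written directly after the closing parenthesis of its subtree. The following Python sketch shows the difference between the two modes on a small example; the tuple representation and the function `to_newick` are hypothetical and only meant to illustrate the output, not how gappa produces it.

```python
def to_newick(node, inner_names=False):
    """Serialize a (label, children) tuple, optionally naming inner nodes."""
    label, children = node
    if not children:
        return label  # tips are always named
    subtree = "(" + ",".join(to_newick(c, inner_names) for c in children) + ")"
    # The inner node label, if requested, follows the closing parenthesis.
    return subtree + label if inner_names else subtree


tree = ("Myxogastria", [("Amaurochaete", []), ("Badhamia", []), ("Brefeldia", [])])

print(to_newick(tree) + ";")                    # (Amaurochaete,Badhamia,Brefeldia);
print(to_newick(tree, inner_names=True) + ";")  # (Amaurochaete,Badhamia,Brefeldia)Myxogastria;
```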

Lastly, --max-level can be used to construct the tree from only the first few levels (starting at 0) of the taxonomy; taxa below that level are not added to the tree. By default, the whole taxonomy is used.
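The 0-based counting can be illustrated with a short Python sketch. The function `truncate_path` is hypothetical, and the sketch assumes that the level given by --max-level is itself still included, with a negative value (the default of -1) meaning no limit.

```python
def truncate_path(path, max_level=-1):
    """Keep levels 0..max_level of a taxonomic path; negative means no limit."""
    if max_level < 0:
        return list(path)
    # Assumption: the level with index max_level is still part of the tree.
    return list(path[: max_level + 1])


path = ["Eukaryota", "Amoebozoa", "Myxogastria", "Amaurochaete"]

print(truncate_path(path))      # all four levels
print(truncate_path(path, 0))   # ['Eukaryota']
print(truncate_path(path, 2))   # ['Eukaryota', 'Amoebozoa', 'Myxogastria']
```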

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070