Subcommand: taxonomy tree
Turn a taxonomy into a tree that can be used as a constraint for tree inference.
Usage: gappa prepare taxonomy-tree [options]
Input | |
---|---|
--taxon-list-file | TEXT:FILE File that maps taxon names to taxonomic paths. |
--taxonomy-file | TEXT:FILE File that lists the taxa of the taxonomy as taxonomic paths. |
Settings | |
--keep-singleton-inner-nodes | FLAG Taxonomic paths can go down several levels without any furcation. Use this option to keep such paths, instead of collapsing them into a single level. |
--keep-inner-node-names | FLAG Taxonomies contain names at every level, while trees usually do not. Use this option to also set taxonomic names for the inner nodes of the tree. |
--max-level | INT=-1 Maximum taxonomic level to process (0-based). Taxa below this level are not added to the tree. |
--replace-invalid-chars | FLAG Replace invalid characters in node labels ( ,:;"()[] ) by underscores, which can occur if the input taxonomic paths contain such characters. The Newick format requires node labels to be wrapped in double quotation marks if they contain these characters, but many parsers cannot handle this. For such cases, replacing the characters can help. |
Output | |
--out-dir | TEXT=. Directory to write output files to. |
--file-prefix | TEXT File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--file-suffix | TEXT File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
Newick Tree Output | |
--newick-tree-quote-invalid-chars | FLAG If set, node labels that contain characters that are invalid in the Newick format (i.e., spaces and :;()[],{} ) are put into quotation marks. If not set (default), these characters are instead replaced by underscores, which changes the names, but works better with most downstream tools. |
Global Options | |
--allow-file-overwriting | FLAG Allow to overwrite existing output files instead of aborting the command. |
--verbose | FLAG Produce more verbose output. |
--threads | UINT Number of threads to use for calculations. |
--log-file | TEXT Write all output to a log file, in addition to standard output to the terminal. |
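The label replacement described for --replace-invalid-chars in the table above can be illustrated with a short Python sketch. It is not gappa's implementation: the function name `sanitize_label` and the example label are hypothetical, and the character set is simply the one listed in the option description.

```python
import re

# Characters listed for --replace-invalid-chars: space , : ; " ( ) [ ]
INVALID_CHARS = re.compile(r'[ ,:;"()\[\]]')


def sanitize_label(label):
    """Replace each invalid character by an underscore (hypothetical helper)."""
    return INVALID_CHARS.sub("_", label)


# Hypothetical label containing a space and a colon.
clean = sanitize_label("Incertae Sedis:X")
```

Replacing characters changes the labels, so any downstream mapping back to the original names has to apply the same substitution.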
The command takes a taxonomy as input and converts it into a tree in Newick format, which can, for example, be used as a (taxonomic) constraint for tree inference.
This is particularly useful for a set of sequences that each have a taxonomic assignment.
The required taxonomy can be given in two ways:
- --taxon-list-file: A map from taxon names to their taxonomic path.
- --taxonomy-file: A list of taxonomic paths.
Both of them can be given at the same time, in which case the taxonomic information is combined. See below for details on the expected formats.
The output of the command is a tree in Newick format, written to a file named taxonomy-tree.newick (additionally using the --file-prefix and --file-suffix if provided).
A taxonomy is a hierarchy that can be interpreted as a rooted tree. The command can be used in general to convert a taxonomy into a (multifurcating) tree. The most typical use case, however, is to create a taxonomically constrained tree for tree inference, using the taxonomy of the sequences as the constraint.
This is useful for a set of sequences that have taxonomic assignments: one might wish to build a tree where tips correspond to sequences and the tree topology reflects the taxonomy of these sequences. For this use case, provide the taxonomy of the sequences via --taxon-list-file, with each line mapping a sequence name to its taxonomic path, separated by a tab.
Example:
AY842031 Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete_comata
JQ031957 Eukaryota;Amoebozoa;Myxogastria;Brefeldia;Brefeldia_maxima
...
This will create a tree with tips labeled AY842031, JQ031957, etc., and a topology that reflects the taxonomic paths.
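To make the taxon list format concrete, here is a minimal Python sketch that reads such lines and nests them into a hierarchy. It only illustrates the input format and is not gappa's implementation: the names `parse_taxon_list` and `build_hierarchy` are hypothetical, and the sketch assumes that each tip is attached below the last taxon of its path.

```python
def parse_taxon_list(text):
    """Map each taxon name to its taxonomic path, given as a list of levels."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # The name and the path are separated by a tab.
        name, path = line.split("\t", 1)
        mapping[name.strip()] = [p.strip() for p in path.split(";") if p.strip()]
    return mapping


def build_hierarchy(mapping):
    """Nest the paths into dicts; each taxon name becomes a tip below its path."""
    root = {}
    for name, path in mapping.items():
        node = root
        for level in path:
            node = node.setdefault(level, {})
        node[name] = {}  # assumption: the tip hangs below the last level
    return root


example = (
    "AY842031\tEukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete_comata\n"
    "JQ031957\tEukaryota;Amoebozoa;Myxogastria;Brefeldia;Brefeldia_maxima\n"
)
taxa = parse_taxon_list(example)
hierarchy = build_hierarchy(taxa)
```

In this sketch, both sequences end up below the shared Myxogastria node, which is the topology that the taxonomic paths imply.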
A more general case is to simply use the full taxonomy to create a tree. The file given via --taxonomy-file needs to contain a list of semicolon-separated taxonomic paths, one per line. Everything after the first tab on a line is ignored.
Example:
Eukaryota; 4 domain
Eukaryota;Amoebozoa; 4052 kingdom 119
Eukaryota;Amoebozoa;Myxogastria; 4094 phylum 119
Eukaryota;Amoebozoa;Myxogastria;Amaurochaete; 4095 genus 119
Eukaryota;Amoebozoa;Myxogastria;Badhamia; 4096 genus 119
Eukaryota;Amoebozoa;Myxogastria;Brefeldia; 4097 genus 119
Eukaryota;Amoebozoa;Myxogastria;Comatricha; 4098 genus 119
...
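The parsing rule for this format (keep the path, ignore everything after the first tab) can be sketched in Python as follows. This is an illustration under stated assumptions, not gappa's parser: the function name `parse_taxonomy` is hypothetical, the extra columns in the example are assumed to be tab-separated, and the empty field after a trailing semicolon is assumed to be dropped.

```python
def parse_taxonomy(text):
    """Return one list of taxonomic levels per non-empty line."""
    paths = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # Everything after the first tab is ignored.
        path_field = line.split("\t", 1)[0]
        # Split at semicolons; a trailing semicolon yields an empty field, dropped here.
        levels = [p.strip() for p in path_field.split(";") if p.strip()]
        if levels:
            paths.append(levels)
    return paths


example = (
    "Eukaryota;\t4\tdomain\n"
    "Eukaryota;Amoebozoa;\t4052\tkingdom\t\t119\n"
    "Eukaryota;Amoebozoa;Myxogastria;\t4094\tphylum\t\t119\n"
)
paths = parse_taxonomy(example)
```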
Both inputs, --taxon-list-file and --taxonomy-file, can be used at the same time, in which case the resulting tree contains the taxonomic information from both files.
It might happen that a taxonomic path goes down several levels with just one taxon at each level.
This would create inner nodes in the tree that just connect two other nodes, that is, nodes that
do not furcate at all. Many downstream programs might have problems with such trees.
Hence, by default, such nodes are collapsed. Use --keep-singleton-inner-nodes to include these inner nodes in the tree.
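The effect of collapsing can be sketched in Python on a small nested-dict tree. This is not gappa's algorithm: the function `collapse_singletons` is hypothetical, and which label survives when a chain is collapsed (here, the deepest one) is an assumption made for the illustration.

```python
def collapse_singletons(label, children):
    """Remove nodes that have exactly one child by replacing them with that child."""
    # Walk down a chain of single-child nodes; the deepest label is kept (assumption).
    while len(children) == 1:
        ((label, children),) = children.items()
    # Recurse into the remaining (furcating) children.
    collapsed = dict(collapse_singletons(l, c) for l, c in children.items())
    return label, collapsed


# Hypothetical input: the two example sequences hanging below their full paths.
tree = {
    "Eukaryota": {
        "Amoebozoa": {
            "Myxogastria": {
                "Amaurochaete": {"Amaurochaete_comata": {"AY842031": {}}},
                "Brefeldia": {"Brefeldia_maxima": {"JQ031957": {}}},
            }
        }
    }
}
root_label, root_children = collapse_singletons("root", tree)
```

In this sketch, the only furcating node is Myxogastria, so the collapsed tree consists of that node with the two tips directly below it.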
Furthermore, a taxonomy contains names at every level, while a tree usually does not contain inner node names. Thus, by default, inner nodes are not named. Use --keep-inner-node-names to also name the inner nodes of the tree.
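The difference between named and unnamed inner nodes shows up directly in the Newick string. The following Python sketch serializes a small hypothetical tree both ways; the function `to_newick` and its `keep_inner_names` parameter are illustrative names, not part of gappa.

```python
def to_newick(label, children, keep_inner_names=False):
    """Serialize a nested-dict tree; inner labels are written only on request."""
    if not children:
        return label  # tips are always labeled
    inner = ",".join(
        to_newick(child, sub, keep_inner_names) for child, sub in children.items()
    )
    return "(" + inner + ")" + (label if keep_inner_names else "")


# Hypothetical small tree with two genera, each holding one sequence.
children = {
    "Amaurochaete": {"AY842031": {}},
    "Brefeldia": {"JQ031957": {}},
}
unnamed = to_newick("Myxogastria", children) + ";"
named = to_newick("Myxogastria", children, keep_inner_names=True) + ";"
# unnamed is "((AY842031),(JQ031957));"
# named is "((AY842031)Amaurochaete,(JQ031957)Brefeldia)Myxogastria;"
```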
Lastly, --max-level can be used to construct the tree from only the first few levels of the taxonomy (counting from 0), ignoring all deeper levels. By default, the whole taxonomy is used.
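As a rough illustration of the level limit, the following Python sketch truncates a taxonomic path. It assumes that a limit of n keeps levels 0 through n inclusive and that the default of -1 means no limit; the function name `truncate_path` is hypothetical and not part of gappa.

```python
def truncate_path(path, max_level=-1):
    """Keep levels 0..max_level of a path; a negative value keeps everything."""
    if max_level < 0:
        return list(path)
    # Levels are 0-based, so level n is the (n + 1)-th entry (assumption).
    return list(path[: max_level + 1])


path = ["Eukaryota", "Amoebozoa", "Myxogastria", "Amaurochaete"]
top_two = truncate_path(path, max_level=1)
everything = truncate_path(path)
```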
When using this method, please do not forget to cite
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070