Subcommand: assign
Taxonomically assign placed query sequences and output tabulated summarization.
Usage: gappa examine assign [options]
Input | |
---|---|
--jplace-path | Required. TEXT:PATH(existing)=[] ... List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed. |
--taxon-file | Required. TEXT:FILE File containing a tab-separated list of reference taxon to taxonomic string assignments. |
--root-outgroup | TEXT:FILE Root the tree by the outgroup taxa defined in the specified file. |
--taxonomy | TEXT:FILE EXPERIMENTAL: File containing a tab-separated list defining the taxonomy. If the mapping is incomplete (for example, if the output taxonomy shall be NCBI, but SILVA was used as the basis in the --taxon-file), a best-effort mapping is attempted. |
--ranks-string | TEXT=superkingdom\|phylum\|class\|order\|family\|genus\|species String specifying the rank names, in order, to which the taxonomy adheres. Required when using the CAMI output format. Assignments not adhering to this constraint will be collapsed to the last valid mapping. EXAMPLE: superkingdom\|phylum\|class\|order\|family\|genus\|species |

Settings | |
---|---|
--sub-taxopath | TEXT Taxopath (example: Eukaryota;Animalia;Chordata) by which the high-level summary should be filtered. Doesn't affect intermediate results, and an unfiltered version will be printed as well. |
--max-level | UINT=0 Maximal level of the taxonomy to be printed. Default is 0, that is, the whole taxonomy is printed. If set to a value above 0, only this many levels are printed. That is, taxonomic levels below the specified one are omitted. |
--distribution-ratio | FLOAT:FLOAT in [0 - 1]=-1 Ratio by which LWR is split between annotations if an edge has two possible annotations. Specifies the amount going to the proximal annotation. If not set, the program determines the ratio automatically from the 'distal length' specified per placement. |
--consensus-thresh | FLOAT:FLOAT in [0 - 1]=1 For assignment of taxonomic labels to the reference tree, require this consensus threshold. Example: if set to 0.6, and 60% of an inner node's descendants share a taxonomic path, set that path at the inner node. |
--resolve-missing-paths | FLAG Should the taxon file be incomplete and leave some taxa without taxopaths, fill in the missing node labels using the closest (in the tree) label. If not specified, those parts of the tree remain unlabelled, and their placements unassigned. |
--distant-label | FLAG Take into account the pendant length of the placements, assigning the LWR to a new label called 'DISTANT' in proportion to the pendant length. Assigns no LWR to 'DISTANT' if the pendant length is below the insertion branch length, and assigns all LWR to 'DISTANT' if the pendant length exceeds the radius of the reference tree. |

Output | |
---|---|
--out-dir | TEXT=. Directory to write output files to. |
--file-prefix | TEXT File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--file-suffix | TEXT File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--cami | FLAG Needs: --taxonomy EXPERIMENTAL: Print result in the CAMI Taxonomic Profiling Output Format. |
--sample-id | TEXT Needs: --cami Sample-ID string to be used in the CAMI output file. |
--krona | FLAG Print result in the Krona text format. |
--sativa | FLAG Print result as SATIVA would. |
--per-query-results | FLAG Print intermediate / per-query results (per_query.tsv). |
--best-hit | FLAG In the per-query results, only print the taxonomic path with the highest LWR. |

Global Options | |
---|---|
--allow-file-overwriting | FLAG Allow overwriting existing output files instead of aborting the command. |
--verbose | FLAG Produce more verbose output. |
--threads | UINT Number of threads to use for calculations. |
--log-file | TEXT Write all output to a log file, in addition to standard output to the terminal. |
The command takes one or more `jplace` files as input and assigns the likelihood weights of each placement to a taxonomic rank, then prints a high-level profile of how the total likelihood weight is distributed across the taxonomy (specified by the `--taxon-file`). To achieve this, the command operates in three phases.
First, the tree found in the `jplace` input is labelled according to the information found in the `--taxon-file`, beginning at the tip nodes.
Inner nodes of the tree are labelled by a consensus of the tips that descend from that inner node.
The resulting labelled tree is written to file as the first intermediate result (filename `<prefix>labelled_tree<suffix>.newick`).
Second, the algorithm goes through each placement in the `jplace` input and assigns its likelihood weight to one or more taxonomic ranks, according to the specified strategy.
If the option `--per-query-results` is specified, a second intermediate result is written: a file containing the per-query results of this assignment (filename `<prefix>per_query<suffix>.tsv`; off by default, as the volume of data can be high).
Third, the command summarizes these assignments by collapsing them into one tabulation, showing the total distribution of likelihood weight across the taxonomy (filename `<prefix>profile<suffix>.tsv`).
The `--consensus-thresh` argument regulates the threshold by which a majority taxonomic path is chosen when assigning labels to the inner nodes of the tree.
For example, if the consensus threshold is set to `0.5` and four descendants of an inner node are labelled "A;B;C" while three are labelled "A;B;D", the inner node gets the label "A;B;C".
In the same scenario, if the threshold is set to `0.6`, the third taxonomic level does not reach a sufficient consensus (4 of 7 is about 0.57), and the inner node is labelled "A;B".
The default value is `1.0`, which is equivalent to a strict intersection of the taxopaths of the inner node's direct children (the default behaviour before this option was introduced).
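The thresholding described above can be illustrated with a minimal Python sketch. It is not gappa's implementation: the function name `consensus_taxopath` is hypothetical, and the sketch assumes that the consensus is taken over a flat list of descendant taxopaths, extending the label one rank at a time as long as the most common prefix is shared by at least the given fraction of descendants.

```python
from collections import Counter


def consensus_taxopath(descendant_paths, threshold=1.0):
    """Longest taxopath prefix shared by at least `threshold` of the descendants.

    Hypothetical helper: assumes a flat list of semicolon-separated taxopaths.
    """
    split_paths = [path.split(";") for path in descendant_paths]
    total = len(split_paths)
    consensus = []
    level = 0
    while True:
        # Count candidate extensions of the current consensus by one rank.
        candidates = Counter(
            tuple(path[: level + 1])
            for path in split_paths
            if len(path) > level and path[:level] == consensus
        )
        if not candidates:
            break
        best, count = candidates.most_common(1)[0]
        if count / total < threshold:
            break
        consensus = list(best)
        level += 1
    return ";".join(consensus)


paths = ["A;B;C"] * 4 + ["A;B;D"] * 3
print(consensus_taxopath(paths, 0.5))  # A;B;C
print(consensus_taxopath(paths, 0.6))  # A;B
```

With the default threshold of 1.0, the sketch reduces to the common prefix of all given taxopaths.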
The `--jplace-path` option takes a list of jplace files or directories to process. For directories, only files with the extension `.jplace` are processed. When multiple files are specified, the command treats them all as one collective input (as opposed to processing each independently).
All files must have compatible trees, i.e. the same topology and tip labels that are congruent with the assignment in the `--taxon-file`.
We further strongly recommend that the branch lengths also be identical, for better comparability; differences in branch lengths will, however, not cause the command to fail.
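How a mixed list of files and directories might be expanded into one collective input can be sketched as follows. This is only an illustration under assumptions: `collect_jplace_files` is a hypothetical helper, directories are scanned non-recursively, both `.jplace` and `.jplace.gz` are accepted (following the option table), and explicitly named files are passed through regardless of extension.

```python
from pathlib import Path


def collect_jplace_files(paths):
    """Expand files and directories into one flat list of input files.

    Hypothetical helper; directory scanning is assumed to be non-recursive.
    """
    files = []
    for entry in map(Path, paths):
        if entry.is_dir():
            # For directories, keep only files with a jplace extension.
            files.extend(
                sorted(
                    f
                    for f in entry.iterdir()
                    if f.is_file() and f.name.endswith((".jplace", ".jplace.gz"))
                )
            )
        else:
            # Explicitly named files are taken as given.
            files.append(entry)
    return files
```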
The `--taxon-file` is used to assign machine-readable taxonomic paths to taxa (tips) of the reference tree (which is taken from the `jplace` input).
The format is as follows. Each line assigns a taxonomic path to one taxon, and contains two columns: the taxon label as it appears in the tree, followed by the semicolon-separated taxonomic path. The two columns are separated by a tab character.
Seal	Eukaryota;Animalia;Chordata;Mammalia;Carnivora;Phocoiae
Whale	Eukaryota;Animalia;Chordata;Mammalia;Cetartiodactyla;
Mouse	Eukaryota;Animalia;Chordata;Mammalia;Rodentia;Muridae
Human	Eukaryota;Animalia;Chordata;Mammalia;Primates;Homonidae
Chicken	Eukaryota;Animalia;Chordata;Amphibia;Galliformes;Phasianidae
Frog	Eukaryota;Animalia;Chordata;Amphibia;Anura;Dendrobatidae
Loach	Eukaryota;Animalia;Chordata;Amphibia;Anura;Rhacophoridae
Cow	Eukaryota;Animalia;Chordata;Mammalia;Artiodactyla;Bovidae
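A file in this two-column format can be read with a few lines of Python. The sketch below is an illustration, not gappa's parser: `parse_taxon_file` is a hypothetical name, and dropping empty ranks (such as the one produced by a trailing semicolon) is an assumption about how such entries should be treated.

```python
def parse_taxon_file(lines):
    """Map each taxon label to its list of ranks.

    Hypothetical helper; raises ValueError if a non-empty line has no tab.
    """
    taxon_to_ranks = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        label, taxopath = line.split("\t", 1)
        # Assumption: empty ranks (e.g. after a trailing ';') are dropped.
        ranks = [rank.strip() for rank in taxopath.split(";") if rank.strip()]
        taxon_to_ranks[label.strip()] = ranks
    return taxon_to_ranks
```

Any iterable of lines works as input, for example an open file handle: `parse_taxon_file(open("taxon_file.tsv"))`, where the file name is a placeholder.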
The taxon file described above may be incomplete, leaving some taxa (tips) of the reference tree without a label. The direct consequence is an incompletely labelled reference tree, which leaves queries on those branches without a taxonomic assignment.
When the `--resolve-missing-paths` flag is specified, the `assign` algorithm tries to resolve the missing labels. It does so by identifying unlabelled branches and traveling "up" the tree (in the direction of the root) until it finds a branch that is labelled. It then labels all branches on that path using this closest label.
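The upward walk can be sketched on a toy tree representation. This is a simplified illustration and not gappa's code: the tree is assumed to be given as a hypothetical `parent` dictionary (each node mapped to its parent, the root mapped to `None`), and `labels` maps nodes to a taxopath or `None`.

```python
def resolve_missing_labels(parent, labels):
    """Fill unlabelled nodes with the label of their closest labelled ancestor.

    Hypothetical helper: `parent` maps node -> parent (root -> None),
    `labels` maps node -> taxopath string or None.
    """
    resolved = dict(labels)
    for node in parent:
        if resolved.get(node) is not None:
            continue
        # Walk towards the root, remembering the unlabelled nodes on the way.
        path = []
        current = node
        while current is not None and resolved.get(current) is None:
            path.append(current)
            current = parent.get(current)
        if current is not None:
            # Label every node on the path with the closest label found.
            for unlabelled in path:
                resolved[unlabelled] = resolved[current]
    return resolved
```

Nodes for which no labelled ancestor exists stay unlabelled in this sketch.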
The `--distribution-ratio` option controls the strategy by which the likelihood weight of a placement is assigned to the taxonomic labels associated with the placement branch.
If the option is omitted, an automatic ratio is used: the ratio is calculated from the attachment point of the placement on the placement branch, as specified by the `distal` (or `proximal`) length field of the placement.
In this example, the edge drawn as the line between `p` and `d` represents the placement branch in the reference tree, while `q` represents the attached query sequence. The point where the line from `q` meets the reference branch is the attachment point, defined by the `distal length`. As the distal length in this case is about one third of the total branch length, the automatic distribution ratio assigns two thirds of the likelihood weight of this placement to the taxonomic label of node `d`, while the label associated with node `p` receives one third of the likelihood weight.
           q
           |
           |
p---------------d
           <--->
             |
       distal length
When `--distribution-ratio` is used with a fixed value (between 0.0 and 1.0), that value defines how the likelihood weight is split between the two labels.
It defines the fraction of the likelihood weight that contributes to the label at node `p` in the example below, where `p` is the proximal node, meaning the node that is closer to the (virtual) root of the tree.
--distribution-ratio 0.25
        q
        |
        |
p-----------d
▲ 0.25      ▲ 0.75
 \         /
  \       /
   \     /
  LWR: 1.0
This is useful, for example, to produce assignments similar to SATIVA, which uses a fixed ratio of 0.49 (in order to break ties).
We usually recommend the default automatic ratio, however (that is, do not specify the `--distribution-ratio` option), as it takes more phylogenetic information into account.
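Both strategies can be summarized in a short Python sketch. It follows the description above (the automatic ratio gives the proximal label a share equal to the distal length divided by the branch length), but it is not gappa's implementation: `split_lwr` is a hypothetical name, and falling back to an even split on a zero-length branch is an assumption.

```python
def split_lwr(lwr, distal_length, branch_length, ratio=None):
    """Return (proximal_share, distal_share) of a placement's LWR.

    Hypothetical helper. `ratio` is the fraction going to the proximal label;
    None selects the automatic ratio derived from the attachment point.
    """
    if ratio is None:
        if branch_length > 0:
            # An attachment close to the distal node favours the distal label.
            ratio = distal_length / branch_length
        else:
            ratio = 0.5  # assumption: even split on a zero-length branch
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return lwr * ratio, lwr * (1.0 - ratio)


# Fixed ratio, as in the second diagram above.
print(split_lwr(1.0, 1.0, 3.0, ratio=0.25))  # (0.25, 0.75)
```

Called without `ratio`, for example `split_lwr(1.0, 1.0, 3.0)`, the sketch gives about one third of the weight to the proximal label and two thirds to the distal one, as in the first diagram.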
The final output of the command is a tabulation of the distribution of the total likelihood weight across the taxonomy, which is written to the file `<prefix>profile<suffix>.tsv`.
The column headers have the following meanings:
- `LWR`: likelihood weight that was assigned to this exact taxonomic path
- `fract`: `LWR` divided by the global total likelihood weight
- `aLWR`: accumulated likelihood weights that were assigned either to this taxonomic path or to any taxonomic path below it
- `afract`: `aLWR` divided by the global total likelihood weight
- `taxopath`: the taxonomic path
LWR fract aLWR afract taxopath
0 0 2 1 Eukaryota
0 0 2 1 Eukaryota;Animalia
0.49 0.245 2 1 Eukaryota;Animalia;Chordata
0 0 1 0.5 Eukaryota;Animalia;Chordata;Amphibia
0.49 0.245 1 0.5 Eukaryota;Animalia;Chordata;Amphibia;Anura
0.51 0.255 0.51 0.255 Eukaryota;Animalia;Chordata;Amphibia;Anura;Rhacophoridae
0 0 0.51 0.255 Eukaryota;Animalia;Chordata;Mammalia
0 0 0.51 0.255 Eukaryota;Animalia;Chordata;Mammalia;Rodentia
0.51 0.255 0.51 0.255 Eukaryota;Animalia;Chordata;Mammalia;Rodentia;Muridae
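The relation between the four numeric columns can be reproduced with a small Python sketch. It is an illustration of the column definitions above, not gappa's code: `build_profile` is a hypothetical name, and the input is assumed to be a list of `(taxopath, weight)` pairs, one for each share of likelihood weight that was assigned to an exact taxopath.

```python
from collections import defaultdict


def build_profile(assignments):
    """Return rows (LWR, fract, aLWR, afract, taxopath), sorted by taxopath.

    Hypothetical helper; `assignments` is an iterable of (taxopath, weight).
    """
    lwr = defaultdict(float)
    alwr = defaultdict(float)
    total = 0.0
    for taxopath, weight in assignments:
        ranks = taxopath.split(";")
        lwr[taxopath] += weight
        total += weight
        # Every prefix of the path accumulates the weight as well.
        for depth in range(1, len(ranks) + 1):
            alwr[";".join(ranks[:depth])] += weight
    if total == 0:
        return []
    rows = []
    for path in sorted(alwr):
        own = lwr.get(path, 0.0)
        rows.append((own, own / total, alwr[path], alwr[path] / total, path))
    return rows


# The four non-zero LWR entries of the example profile above.
example = [
    ("Eukaryota;Animalia;Chordata", 0.49),
    ("Eukaryota;Animalia;Chordata;Amphibia;Anura", 0.49),
    ("Eukaryota;Animalia;Chordata;Amphibia;Anura;Rhacophoridae", 0.51),
    ("Eukaryota;Animalia;Chordata;Mammalia;Rodentia;Muridae", 0.51),
]
for row in build_profile(example):
    print(*row, sep="\t")
```

Up to floating-point rounding, the printed rows correspond to the nine rows of the example table, with a global total likelihood weight of 2.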
In addition, if the option `--per-query-results` is passed, the command writes a file called `<prefix>per_query<suffix>.tsv` containing one assignment profile per input query.
name LWR fract aLWR afract taxopath
Carp 0 0 1 1 Eukaryota
Carp 0 0 1 1 Eukaryota;Animalia
Carp 0 0 1 1 Eukaryota;Animalia;Chordata
Carp 0 0 1 1 Eukaryota;Animalia;Chordata;Amphibia
Carp 0.4489 0.4489 1 1 Eukaryota;Animalia;Chordata;Amphibia;Anura
Carp 0.5511 0.5511 0.5511 0.5511 Eukaryota;Animalia;Chordata;Amphibia;Anura;Rhacophoridae
Rat 0 0 1 1 Eukaryota
Rat 0 0 1 1 Eukaryota;Animalia
Rat 0.5002 0.5002 1 1 Eukaryota;Animalia;Chordata
Rat 0 0 0.4998 0.4998 Eukaryota;Animalia;Chordata;Mammalia
Rat 0 0 0.4998 0.4998 Eukaryota;Animalia;Chordata;Mammalia;Rodentia
Rat 0.4998 0.4998 0.4998 0.4998 Eukaryota;Animalia;Chordata;Mammalia;Rodentia;Muridae
This can be combined with the `--best-hit` option to only print the taxopath that itself has the highest LWR:
name LWR fract aLWR afract taxopath
Carp 0.5511 0.5511 0.5511 0.5511 Eukaryota;Animalia;Chordata;Amphibia;Anura;Rhacophoridae
Rat 0.5002 0.5002 1 1 Eukaryota;Animalia;Chordata
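The selection can be expressed as a reduction over the per-query rows. The following Python sketch is an illustration only: `best_hits` is a hypothetical name, rows are assumed to be `(name, LWR, taxopath)` tuples, and ties are resolved in favour of the first row encountered, which is an assumption.

```python
def best_hits(rows):
    """Keep, for each query name, the row with the highest own LWR.

    Hypothetical helper; `rows` is an iterable of (name, lwr, taxopath).
    """
    best = {}
    for name, lwr, taxopath in rows:
        # Strict comparison: on ties, the first row seen is kept.
        if name not in best or lwr > best[name][0]:
            best[name] = (lwr, taxopath)
    return best


rows = [
    ("Carp", 0.4489, "Eukaryota;Animalia;Chordata;Amphibia;Anura"),
    ("Carp", 0.5511, "Eukaryota;Animalia;Chordata;Amphibia;Anura;Rhacophoridae"),
    ("Rat", 0.5002, "Eukaryota;Animalia;Chordata"),
    ("Rat", 0.4998, "Eukaryota;Animalia;Chordata;Mammalia;Rodentia;Muridae"),
]
for name, (lwr, taxopath) in best_hits(rows).items():
    print(name, lwr, taxopath, sep="\t")
```

Note that the comparison uses the `LWR` column, not `aLWR`, which is why `Rat` is reported at `Eukaryota;Animalia;Chordata`.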
Furthermore, additional output formats are available:
- When using `--cami`, an additional file "cami.profile" is written, using the CAMI (Taxonomic) Profiling Output Format. This format can be used to compare the taxonomic assignment to other tools that participated in the CAMI challenge. If using this format, do not forget to cite their paper.
- When using `--krona`, an additional file "krona.profile" is written, which can be visualized with Krona, using the ktImportText command. If using this format and visualization, do not forget to cite their paper.
- When using `--sativa`, an additional file "sativa.tsv" is written, which emulates the output of SATIVA.
Additionally, the tabulated output may be filtered (constrained) to include only a part of the taxonomy.
This behavior is regulated via the `--sub-taxopath` option, and the result is written to a file called `<prefix>profile_filtered<suffix>.tsv` in the specified output directory.
For this output, the normalization of the accumulated LWR and fraction columns only takes the specified subtaxonomy into account. This option is hence useful to get a more detailed view of a specific part of the taxonomy.
LWR fract aLWR afract taxopath
0.49 0.49 1 1 Anura
0.51 0.51 0.51 0.51 Anura;Rhacophoridae
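The renormalization can be sketched in Python as follows. This is a guess at the behaviour based on the example output above, not gappa's implementation: `filter_profile` is a hypothetical name, the input is assumed to map every exact taxopath to its `LWR` (including paths with an LWR of 0 if they should appear in the output), and taxopaths are assumed to be re-rooted at the last rank of the sub-taxopath.

```python
def filter_profile(lwr_by_path, sub_taxopath):
    """Return rows (LWR, fract, aLWR, afract, taxopath) within a sub-taxonomy.

    Hypothetical helper; fractions are normalized by the weight inside the
    sub-taxonomy only.
    """
    prefix = sub_taxopath.split(";")
    inside = {}
    for path, lwr in lwr_by_path.items():
        ranks = path.split(";")
        if ranks[: len(prefix)] == prefix:
            # Assumption: re-root the path at the last rank of the sub-taxopath.
            inside[";".join(ranks[len(prefix) - 1:])] = lwr
    total = sum(inside.values())
    if total == 0:
        return []
    rows = []
    for path in sorted(inside):
        accumulated = sum(
            weight
            for other, weight in inside.items()
            if other == path or other.startswith(path + ";")
        )
        rows.append((inside[path], inside[path] / total, accumulated, accumulated / total, path))
    return rows
```

Under these assumptions, filtering the example profile by `Eukaryota;Animalia;Chordata;Amphibia;Anura` yields two rows, for `Anura` and `Anura;Rhacophoridae`, normalized by the weight of 1 that falls inside that sub-taxonomy.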
When using this method, please do not forget to cite:
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070