Subcommand: edpl
Calculate the Expected Distance between Placement Locations (EDPL) for all pqueries.
Usage: gappa examine edpl [options]
| Input | |
|---|---|
| --jplace-path | Required. TEXT:PATH(existing)=[] ... List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed. |
| Settings | |
| --histogram-bins | UINT=25 Number of histogram bins for binning the EDPL values. |
| --histogram-max | FLOAT=-1 Maximum value to use in the histogram for binning the EDPL values. To use the maximal EDPL found in the samples, use a negative value (default). |
| --no-list-file | FLAG If set, do not write out the EDPL per pquery, but just the histogram file. As the list needs to keep all pquery names in memory (to get the correct order), the memory requirements might be too large. In that case, this option can help. |
| Output | |
| --out-dir | TEXT=. Directory to write output files to. |
| --file-prefix | TEXT File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
| --file-suffix | TEXT File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
| Global Options | |
| --allow-file-overwriting | FLAG Allow to overwrite existing output files instead of aborting the command. |
| --verbose | FLAG Produce more verbose output. |
| --threads | UINT Number of threads to use for calculations. |
| --log-file | TEXT Write all output to a log file, in addition to standard output to the terminal. |
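As an illustration of how the options above combine, the following sketch assembles a command line from Python. It is only a sketch: the helper name `build_edpl_command`, the input path `samples/`, and the output directory `edpl_out` are hypothetical, and it assumes that a `gappa` binary is available on the `PATH`. Only flags listed in the table are used.

```python
from typing import List, Optional, Sequence


def build_edpl_command(
    jplace_paths: Sequence[str],
    out_dir: str = ".",
    histogram_bins: Optional[int] = None,
    histogram_max: Optional[float] = None,
    no_list_file: bool = False,
    threads: Optional[int] = None,
    gappa_binary: str = "gappa",  # assumption: gappa is on the PATH
) -> List[str]:
    """Assemble an argument list for `gappa examine edpl` (hypothetical helper)."""
    if not jplace_paths:
        raise ValueError("--jplace-path is required and needs at least one path")

    cmd = [gappa_binary, "examine", "edpl", "--jplace-path", *jplace_paths]
    cmd += ["--out-dir", out_dir]

    # Optional settings are only passed when set, so gappa's defaults apply otherwise.
    if histogram_bins is not None:
        cmd += ["--histogram-bins", str(histogram_bins)]
    if histogram_max is not None:
        cmd += ["--histogram-max", str(histogram_max)]
    if no_list_file:
        cmd.append("--no-list-file")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


# Hypothetical paths, for illustration only.
command = build_edpl_command(
    ["samples/"], out_dir="edpl_out", histogram_bins=50, no_list_file=True
)
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` to run the command; that step is left out here so that the sketch does not depend on gappa being installed.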
Calculates the expected distance between placement locations (EDPL) for all pqueries in the given samples.
The command is a re-implementation of guppy edpl; see there for more details.
The EDPL is a measure of uncertainty of how far the placements of a pquery (query sequence) are spread across the branches of the reference tree. In a reference tree with similar sequences, a query sequence might be placed on several nearby branches with relatively high likelihood (LWR). This still constitutes a high confidence in the placement, as the spreading is due to the similar reference sequences, and not due to inherent uncertainty in the placement itself. This is opposed to a query sequence whose placements are spread all across the tree, which might indicate that a fitting reference sequence is missing from the tree, and hence yields uncertain placements.
This can be assessed with the EDPL, which calculates the distances between different placements, weighted by their respective LWRs:
The p values in the figure represent the likelihood weight ratios (LWRs) of the placements at these locations. The distances d are calculated using the branch lengths of the tree on the path between the placement locations. Hence, a low EDPL indicates that the placements of a pquery (query sequence) are focused in a narrow region of the tree, whereas a high EDPL indicates that the placements are spread across the tree.
See http://matsen.github.io/pplacer/generated_rst/guppy_edpl.html for more information.
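The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not gappa's implementation: it assumes the EDPL is the sum of `p_i * p_j * d_ij` over all ordered pairs of placements (summing over unordered pairs instead would halve the value), and it takes the pairwise distances as given rather than deriving them from a tree. The function name `edpl` and the example numbers are made up for illustration.

```python
from typing import Sequence


def edpl(lwrs: Sequence[float], distances: Sequence[Sequence[float]]) -> float:
    """Sum of p_i * p_j * d_ij over all ordered pairs (i, j) of placements.

    Assumption: ordered pairs are used; distances[i][j] is the path length
    between placement locations i and j, precomputed elsewhere.
    """
    n = len(lwrs)
    if len(distances) != n or any(len(row) != n for row in distances):
        raise ValueError("distances must be an n x n matrix matching lwrs")

    total = 0.0
    for i in range(n):
        for j in range(n):
            total += lwrs[i] * lwrs[j] * distances[i][j]
    return total


# Two equally likely placements on nearby branches: low EDPL.
focused = edpl([0.5, 0.5], [[0.0, 0.02], [0.02, 0.0]])

# The same likelihood weights, but the placements are far apart: high EDPL.
spread = edpl([0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])

print(focused, spread)
```

With identical LWRs, only the distance between the placement locations changes the result, which is why the value separates placements focused in one region from those spread across the tree.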
The command produces two tables:
- list.csv: A list of the EDPL for each pquery of each sample. The list contains four columns: Sample name (using the input file name), pquery name (one line for each name for pqueries with multiple names), the weight (multiplicity) of the pquery, and the EDPL value of that pquery. As this list needs quite some memory (about as much as the input jplace files), it can also be deactivated with --no-list-file.
- histogram.csv: A summary histogram of the EDPL values. This can be used in spreadsheet tools to produce a graph that allows an overview of the values for easy assessment. Using the settings --histogram-bins and --histogram-max, the histogram output can be refined.
The histogram can for example be visualized as follows:
The histogram shows the accumulated EDPL values: the x-axis shows the EDPL values, and the y-axis shows how many of the query sequences have an EDPL at or below the respective value. For example, the lowest bin indicates that more than 60% of the query sequences have an EDPL between 0.0 and 0.02.
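The binning and accumulation can be sketched as follows. This is an illustrative approximation and does not reproduce the layout of histogram.csv: the function name `accumulated_histogram` is hypothetical, pquery weights (multiplicities) are ignored, and values above the chosen maximum are assumed to fall into the last bin. The parameters mirror --histogram-bins and --histogram-max, including the convention that a negative maximum means "use the largest EDPL found".

```python
from typing import List, Sequence, Tuple


def accumulated_histogram(
    values: Sequence[float], bins: int = 25, hist_max: float = -1.0
) -> Tuple[List[int], List[float]]:
    """Return per-bin counts and the accumulated fraction up to each bin."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    if not values:
        raise ValueError("values must not be empty")

    # A negative maximum means: use the largest value in the data.
    upper = max(values) if hist_max < 0.0 else hist_max
    width = upper / bins

    counts = [0] * bins
    for value in values:
        # Assumption: values at or above the maximum go into the last bin.
        index = int(value / width) if width > 0.0 else 0
        counts[min(index, bins - 1)] += 1

    accumulated = []
    running = 0
    for count in counts:
        running += count
        accumulated.append(running / len(values))
    return counts, accumulated


# Made-up EDPL values, binned into 4 bins of width 0.05.
counts, accumulated = accumulated_histogram(
    [0.0, 0.01, 0.01, 0.06, 0.2], bins=4, hist_max=0.2
)
print(counts)
print(accumulated)
```

In this made-up example, three of the five values fall into the lowest bin, so the accumulated fraction starts at 0.6 and reaches 1.0 in the last bin.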
When using this method, please do not forget to cite
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
Frederick Matsen, Steven Evans. Edge Principal Components and Squash Clustering: Using the Special Structure of Phylogenetic Placement Data for Sample Comparison. PLOS ONE, 2013. doi:10.1371/journal.pone.0056859