TreeCluster is a tool that, given a tree T (Newick format) and a distance threshold t, finds the minimum number of clusters of the leaves of T such that some user-specified constraint is met in each cluster. The user can also specify a branch support threshold s such that no pair of leaves in any cluster can be connected by branches with support less than or equal to s. Note that all leaves given a cluster of -1 are singletons, meaning they did not cluster with any other leaves (i.e., each leaf with a cluster of -1 is in its own cluster).
TreeCluster was motivated by Cluster Picker.
The default method is "Max Clade" (see Clustering Methods). There is no explicit default distance threshold, but because Cluster Picker recommends a distance threshold of 0.045 and because the same objective function is optimized by both Cluster Picker and TreeCluster "Max Clade", we currently recommend 0.045 as well.
Note that TreeCluster can run within seconds even on ultra-large datasets, so it may make sense to use a range of thresholds and determine the appropriate choice based on the results. We intend to develop non-parametric methods of TreeCluster clustering in the future.
TreeCluster can be installed using pip
:
sudo pip install treecluster
If you are using a machine on which you lack administrative powers, TreeCluster can be installed locally using pip
:
pip install --user treecluster
usage: TreeCluster.py [-h] [-i INPUT] -t THRESHOLD [-s SUPPORT] [-m METHOD] [-tf THRESHOLD_FREE]
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input Tree File (default: stdin)
-o OUTPUT, --output OUTPUT
Output File (default: stdout)
-t THRESHOLD, --threshold THRESHOLD
Length Threshold (default: None)
-s SUPPORT, --support SUPPORT
Branch Support Threshold (default: -inf)
-m METHOD, --method METHOD
Clustering Method (options: avg_clade, leaf_dist_avg, leaf_dist_max, leaf_dist_min, length, length_clade, max, max_clade,
med_clade, root_dist, single_linkage, single_linkage_cut, single_linkage_union, sum_branch, sum_branch_clade) (default: max_clade)
-tf THRESHOLD_FREE, --threshold_free THRESHOLD_FREE
Threshold-Free Approach (options: argmax_clusters) (default: None)
-v, --verbose Verbose Mode (default: False)
--version Display Version (default: False)
To help users, we have provided example files in the example
directory, and we have provided some helper scripts that implement common clustering-related tasks in the helper_scripts
directory:
helper_scripts/score_clusters.py
: Given two clustering files, calculate a comparison metric between them- See scikit-learn's Clustering Metrics documentation for details
-
Avg Clade: Cluster the leaves such that the following conditions hold for each cluster:
- The average pairwise distance between leaves in the cluster is at most t
- Leaves cannot be connected by branches with support less than or equal to s
- The leaves in the cluster must define a clade in T
- For a tree with n leaves, this algorithm is O(n) in the worst case
- If verbose mode is enabled (
-v
), the clades defined by the clusters will be printed to standard error
-
Leaf Dist Avg: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where "bottom" is defined as the average root-to-tip distance
- Branches with support less than or equal to s are simply treated as infinitely long
- For a tree with n leaves, this algorithm is O(n) in the worst case
-
Leaf Dist Max: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where "bottom" is defined as the furthest leaf from the root
- Branches with support less than or equal to s are simply treated as infinitely long
- For a tree with n leaves, this algorithm is O(n) in the worst case
-
Leaf Dist Min: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where "bottom" is defined as the closest leaf to the root
- Branches with support less than or equal to s are simply treated as infinitely long
- For a tree with n leaves, this algorithm is O(n) in the worst case
-
Length: Cluster the leaves such that the following conditions hold for each cluster:
- The cluster does not contain any edges above length t
- Leaves cannot be connected by branches with support less than or equal to s
- For a tree with n leaves, this algorithm is O(n) in the worst case
-
Length Clade: Cluster the leaves such that the following conditions hold for each cluster:
- The cluster does not contain any edges above length t
- Leaves cannot be connected by branches with support less than or equal to s
- The leaves in the cluster must define a clade in T
- For a tree with n leaves, this algorithm is O(n) in the worst case
- If verbose mode is enabled (
-v
), the clades defined by the clusters will be printed to standard error
-
Max: Cluster the leaves such that the following conditions hold for each cluster:
- The maximum pairwise distance between leaves in the cluster is at most t
- Leaves cannot be connected by branches with support less than or equal to s
- For a tree with n leaves, this algorithm is O(n) in the worst case
-
Max Clade: Cluster the leaves such that the following conditions hold for each cluster:
- The maximum pairwise distance between leaves in the cluster is at most t
- Leaves cannot be connected by branches with support less than or equal to s
- The leaves in the cluster must define a clade in T
- For a tree with n leaves, this algorithm is O(n) in the worst case
- If verbose mode is enabled (
-v
), the clades defined by the clusters will be printed to standard error
-
Med Clade: Cluster the leaves such that the following conditions hold for each cluster:
- The median pairwise distance between leaves in the cluster is at most t
- Leaves cannot be connected by branches with support less than or equal to s
- The leaves in the cluster must define a clade in T
- For a tree with n leaves, this algorithm is O(n² log n) in the worst case
- If verbose mode is enabled (
-v
), the clades defined by the clusters will be printed to standard error
-
Root Dist: Cluster the leaves by cutting the tree at t distance away from the root
- Branches with support less than or equal to s are simply treated as infinitely long
- For a tree with n leaves, this algorithm is O(n) in the worst case
-
Single Linkage: Cluster the leaves such that the following conditions hold:
- For any two leaves u and v, if the distance between u and v is at most t, they must be in the same cluster
- Leaves cannot be connected by branches with support less than or equal to s
- The number of clusters is maximized
- For a tree with n leaves, this algorithm is O(n) in the worst case
-
Sum Branch: Cluster the leaves such that the following conditions hold for each cluster:
- The total branch length of the spanning tree connecting the leaves of the cluster is at most t
- Leaves cannot be connected by branches with support less than or equal to s
- For a tree with n leaves, this algorithm is O(n) in the worst case
-
Sum Branch Clade: Cluster the leaves such that the following conditions hold for each cluster:
- The total branch length of the spanning tree connecting the leaves of the cluster is at most t
- Leaves cannot be connected by branches with support less than or equal to s
- The leaves in the cluster must define a clade in T
- For a tree with n leaves, this algorithm is O(n) in the worst case
- If verbose mode is enabled (
-v
), the clades defined by the clusters will be printed to standard error
- Argmax Clusters: Choose the threshold that maximizes the number of non-singleton clusters over all thresholds from 0 to t
- Currently, for the sake of speed, only every 0.001 threshold is tested (i.e., 0, 0.001, 0.002, ..., t)
If you use TreeCluster in your work, please cite:
Balaban M, Moshiri N, Mai U, Jia X, Mirarab S (2019). "TreeCluster: Clustering biological sequences using phylogenetic trees." PLoS ONE. 14(8):e0221068. doi:10.1371/journal.pone.0221068