-
Notifications
You must be signed in to change notification settings - Fork 26
phylogenetic graphs
See supertrees using tree ranking page for some broader discussion of the role of a phylogenetic network. This page discusses how to construct a "phylogenetic graph" or "phylogenetic network" used in step 5 "build a phylogenetic graph for each subproblem" of a pipeline for producing a supertree is here..
The input is a ranked set of trees with a full tree from the taxonomy as the lowest ranked input.
There is a proposal to remap tips to the deepest uncontested taxon that they examplify. So tips of the tree may be associated with higher taxa.
This page assumes that a this proposal or something similar is adpoted to prune any tip that was mapped by a curator to a higher taxon which is contested.
Thus we assume that, at this point in the pipeline, the inputs that are higher taxa are uncontested higher taxa.
Because we know that each of the tips mapped to a higher taxon is a case of a taxon that is going to be in the final answer, we can unproblematically create a set of inputs with no complexity of mapping to taxonomic groups by replacing such labels with polytomies include the terminal taxa. In fact, (because we know that the last input will bring in all of the leaves), it is safe to remap each leaf of the higher ranked trees as follows.
- create a list of all of the tips in the phylo inputs that are non-terminal taxa.
- partially sort the list such that, if any of these higher taxa are nested (based on the taxonomy) the shallower taxa come first.
- walk through this list remapping by: * A. find the union of terminal taxa used in the phylo (non taxonomy) trees. * B. if the set from A is empty, then choose one terminal taxon (arbitrarily) to stand in for the higher taxon * C. if the set from A is not empty, replace the tip with a polytomy of the set of terminals, but make an annotation on the branch that will indicate that this tree is not providing support for this grouping. Later when support for different branches are tabulated these expansions should not be included because thes are cases in which we use the taxonomy to assume that the group is monophyletic, even though the phylo input just has one tip.
Imagine that the score for a supertree, s, is a tuple:
( |c(s, N)| , |c(s, N-1)|, |c(s, N-2)|, ... |c(s, 0)| )
where c(s,N) is the set of groups in the highest ranked input tree N, that are displayed on the supertree.
|c(s,x)|
indicates the size of the sets.
We would like to take these inputs and find the set of trees that has the highest score attainable if scores are computed lexicographically. So it is better to display "4 groupings from tree 4, and none from trees 0-3" than it is to display "3 groupings from tree 4, and all of the groupings from the lower ranked inputs".
The set of trees may be very large, so we would like to have a compact structure that represents the entire set.
As discussed here, MTH conjectured that a phylogenetic graph will be able to accomplish this. He later found that this conjecture was not correct. The counter example is simple:
Tree 1: ((((A,B),C),D),E)
Tree 2: (((A,B),F),E)
Tree 3: ((C,G),F)
The set of trees compatible with all of these splits cannot be described in single phylogenetic graph of the type described below.
The attachment points for G
depend on the resolution of the loop that has C
and D
on one side and F
on the other.
The watered down conjecture is that the set of trees can be described as a set of phylogenetic graphs. This is trivially true, because the set of trees is a set of phylogenetic graphs.
MTH feels that these graphs are still worth pursuing as compact representation of the set of optimal trees; but the algorithms to produce the set of graphs will have to be able to detect cases in which multiple graphs are required.
This document will try to describe the rules for such graphs.
The summary of the supertree strategy is: walk through ranked list of trees. For each tree, walk through each grouping. For each grouping decide if it is compatible with the current phylogenetic graph or not. If it is, update the graph. If the information from the grouping is not redundant with previously constructed information, then this will change the structure of the graph. At minimum it will entail making note of the association between the graph edge and the input edge.
If the grouping is not compatible, make note of that split as an undisplayed split, but do not alter the graph.
- The edges are directed.
- there are no directed cycles.
- There are 2 sorts of adjectives types of edges: looped vs unlooped. dangling vs rooted. So there are four types of edges. All edges to terminals are unlooped. * unlooped edge type: These are the edges that are most like a directed edge in a phylogenetic tree. If the set of trees can be displayed in a consensus tree, then each edge makes a group of phylogenetic statements on the full set of leaves. These edges make similar statements about all of the leaves in their connected component of the graph. * looped edges: these edges create "bubbles" in the graph. They can be used to make phylogenetic statements about a subset of the full set of leaves. * dangling edges: if a connected component of the graph has multiple roots then each edge that whose ancestors include only a proper subset of all of roots is considered to be a dangling edge. In some sense, a dangling edge is a type of looped edge.
Every node in the graph has a unique set of taxa descended from it with the exception of the nodes created to close the ends of a loops. These nodes may have a set of descendants that is identical to their child node.
- because of the tree ranking, the graph will be start as a phylogenetic tree identical to the highest ranked tree.
Consider an example with 3 ranked inputs. The highest priority tree is:
The second ranked tree is:
and the tree with the lowest priority is:
These inputs are all pairwise compatible and jointly compatible. After adding the first input, the phylogenetic graph would be identical to that input tree. After the second input is added, the graph would be:
with the looped edges shown in red and the red nodes being the degenerate nodes at the end the loop.
The loop is a statement of partial relative placement of some of the taxa. If we were trying to simply add E
(from tree 2) to the top tree, it would be unclear where on the path from the MRCA of A+F
to the MRCA of A+B
we should attach E
. In fact E
could be on a "side-branch" - such as sister to D
and the resulting tree would still satisfy both trees.
Each "side" of the loop make some statements (e.g. C
is closer to A
than D
is), but these statements can be mixed and matched with the statements on the other "side" of the loop to generate final trees that satisfy the phylogenetic statement made on both branches.
Note that a polytomy of E
, C
, and D
would not correctly capture the statement: "C
is closer to A
than D
is".
Also note that this graph makes some statements that are not in either input, but must be true of any solution that satisfies both trees. For example, A+E
form a group that excludes G
. Thus we see that the structure can detect some incompatibilities with subsequent inputs that would not be found if we just scanned all input trees for clades that conflict with a grouping.
Note that for the nodes inside the loop (but not the degenerate nodes at the ends), the complete set of descendant taxa in the final graph is not known. For example the node that is the parent of E
must have E
, B
, and A
as descendants (or the node could later be merged with another node in the graph, but the resulting node will still have these descendants).
Similarly the "outgroup" taxa (F
and G
) must be outside of the loop must not be descendants of that node.
But subsequently the loop may be close by another tree; at that point the node may be merged with the parent of C
, or the parent of D
; or the node may be placed along any of the edges in the "C
+ D
side" of the loop.
So we do not know whether the node that is currently the parent of E
will have C
or D
as descendants.
The graph structure (and rules to be described later) will guarantee that, in the final graph, if the MRCA of E
and A
is an ancestor of D
, then it will also be an ancestor of C
.
If we add the third tree, we obtain a graph shown here:
dangling edges are blue. The edges that are both in a loop and dangling, are purple.
The dangling components of the graph are treated just like loops with the common ancestral node simply being deeper in the tree than any of the roots of the dangling components.
If there were another input that says A+D
are closer to each other than E
, then the loop formed earlier would be closed.
This information forces the parent of E
to be on the edge that connects the start of the loop to the parent of D
.
There is no other location for E
to go (and satify all of the inputs). So the looped part of the graph would just be come pectinate with E
, then D
, then C
, branching off the lineage that leads to A
.
The hybrid node for the dangling portion would remain.
If the 4th input said that C
was closer to A
than E
is, then the loop would not close in an obvious way.
The parent of C
would unambiguously map to the daughter edge of the parent of E
.
But the parent of E
could still map to 3 edges in the loop.
So the loop would only shrink to:
Actually, this represents a special case. Whenver both sides of a loop just have on internal edge, the loop expressed the same ambiguity as a polytomy. So more processing could get rid of this loop with this graph:
If the fourth tree just provided redundant information, the loop would not close.
Note that if the new tree just said A+C+E
form a clade that excludes F
, then it might seem that we have learned that the parent of E
should be the same as the parent of C
.
However, if that fourth tree lacks D
, then the set of edges represented by that polytomy (as possible attachments points for E
include all of the edges in the loop).
The new tree is not clarifying where in the loop the parent of E
will attach, so the uncertainty expressed by the loop is still relevant.