doc/summarizing-taxonomy-plus-trees.tex

\documentclass[11pt]{article}
\input{preamble.tex}
\usepackage{hyperref}
\hypersetup{backref,  linkcolor=blue, citecolor=black, colorlinks=true, hyperindex=true}
\begin{document}
The source for this in the doc subdirectory of the otcetera
    repo \url{https://github.com/OpenTreeOfLife/otcetera/tree/master/doc}.
\begin{center}
    {\bf Summarizing a taxonomy and multiple estimates of phylogenetic trees} \\
{Mark T.~Holder$^{1,2,\ast}$. feel free to contribute and add your name}
\end{center}
\tableofcontents
\section{Background}
The \otol project is attempting to build a platform for summarizing what is known
    about phylogenetic relationships across all of Life.
Presenting an easy-to-interpret summary of trees that have been ``curated''
    is one component of that effort.
The project has decided that this summary should include a tree which
\begin{compactenum}
    \item can be served and browsed;
    \item contains annotations indicating which input trees support a particular grouping;
    \item has all of the tips of the taxonomy;
    \item displays as many of the groups in the input trees as is feasible;
    \item may utilize ranking of trees;
    \item (tentative - not sure if everyone is on board with this one) has no unsupported groups;\label{noUnsupportedReq}
    \item (tentative) does not ``prefer'' lack of resolution (defined more thoroughly below)
\end{compactenum}
This results in a new form of a supertree problem described below as the ``taxonomy-based tree summary'' problem.
\subsection{Taxonomy-based supertree}
This is a novel name for a special form of the supertree problem.
A taxonomy-based supertree has at least one input (the taxonomic tree) which is complete.
\subsubsection{Taxonomy-based summary tree}
Many supertree approaches seek to maximize accuracy according to some notion
    of distance between the true tree and estimated true.
The summary that we seek has requirements (e.g. \ref{noUnsupportedReq})
    which lead to less resolved trees.
Such trees may fail to display all of the well-supported (or even uncontested)
    rooted triples, but such trees also make it easier for users to see the
    connection between the summary tree and the input trees.
Thus, the phrase ``Taxonomy-based summary tree'' is used here to describe a
    taxonomy-based supertree which tries to maximize some notion of
    explaining the set of input trees (rather than a supertree designed to
    display the highest number of groups in the inputs, or some other criterion).

\subsubsection{Taxonomy-based summary of ranked input trees}
We have been pursuing a strategy that uses a ranking of trees.
In cases of conflict, the grouping that is compatible with the 
    higher ranked tree is shown in the tree (if it is not contradicted by 
    another grouping from trees of even higher rank).
 The taxonomic tree is considered to be the lowest ranked input.

Applying a ranking to the input trees is biologically questionable (because
    the importance of an input \ps should be based on the support for that
    grouping -- it is rarely the case that tree-wide rank would adequately
    describe the degree of statistical support for that group).
Using tree-based ranks also introduces subjectivity into the summary building process.

Nevertheless, tree-rank based summaries represent a reasonable starting point
    because they are easy to explain and the ranking permits some algorithmic
    simplifications.

\subsection{The Sum of Weighted Input \PSs Displayed Score}
Let $\mathcal{P}$ be a multiset of \pss.
If each member, $i$, in this set is assigned a weight, $w_i$, then the 
sum of weighted input \pss displayed score, $S_w$, for a tree $T$ is:
\begin{equation}
    S_w(T, \mathcal{P}) = \sum_{i\in \mathcal{P}} w_i \displaysPred{T}{i}
\end{equation}
where {\displaysPred{T}{x}} is an identity function that evaluates to 1 if tree $T$
        displays $x$, and to 0 otherwise.

\subsubsection{\MSWIPSD problem}
Trying to find the set of trees that maximize $S_w$ is one natural
    goal for a summarization procedure.
We can call this the ``\MSWIPSD''
problem,
    and the set of trees that maximize the score are denoted:
\begin{equation}
    \mathcal{S}_{w}(\mathcal{P}) = \argmax_T S_w(T,\mathcal{P}) 
\end{equation}

Tree-based ranking can be viewed as a means of providing weights to \pss.
Each tree is converted to a set of \pss, and the tree's weight is 
    assigned to each \ps in the set.
The union of the tree's sets becomes the multiset, $\mathcal{P}$, referred to
    in the definition of the score above.

\href{https://github.com/OpenTreeOfLife/treemachine/wiki/MaxWeightOfInputTreeEdgesDisplayed}{This link}
    sketches out a proof of how an extreme form of tree-based ranks
    can lead to greedy tree addition strategy can be guaranteed to 
    find the set of trees that solve the \MSWIPSD problem.
If the difference in tree ranks are sufficiently large from one tree to the next,
    then there is no need to consider skipping a \ps from a high ranking tree
    even if not considering that grouping would result in all of the \ps
    from the lower ranking trees being displayed.

Unfortunately, that proof only applies to an exact algorithm which returns
    all possible trees that maximize the score.
We do not know of a polynomial-time algorithm for solving that problem.

\subsection{Trees without unsupported groups}\label{unsupportedTheory}
It turns out that our ``no unsupported groups'' rule corresponds to:
\begin{compactenum}
    \item Finding a tree, $T$, that maximizes the score by displaying as many high-ranking input \pss as possible.
    \item Let \pssInOptimalTree denote the set of input \pss displayed by this tree.
    \item Let \tripleSetInOptimal denote a set of rooted triples that is sufficient to encode the information in \pssInOptimalTree.
    \item Collapse edges in $T$ to create $T^{\ast}$ which is a minor-minimal tree with respect to \tripleSetInOptimal
\end{compactenum}
\citet{JanssonLL2012} define ``minor-minimal'' from \citet{Semple2003}:
\begin{quote}
``If $T$ is a phylogenetic tree consistent with $\mathcal{R}$ and it is not possible to obtain a tree consistent with $\mathcal{R}$ by contracting an internal edge of $T$, then $T$ is called minor-minimal with respect to $\mathcal{R}$.''
\end{quote}
A minor-minimal tree will display no unsupported groups.


Consider an edge connecting parent $\parent{V}$ to its child $V$ in a complete tree $T$.
This edge is supported (or ``the node $V$ is supported'') in the sense of the \MSWIPSD score if
    collapsing the edge leads to a lower (worse) score.
For an edge to be supported in this sense, it must display at least 1 input \pss, and
    the tree that would be created by collapsing the edge to a polytomy must
    {\em not} display those \ps.

Equivalently, we can state the conditions for node $V$ (or its subtending edge) being supported
    by an input \ps derived from node $x$ of tree $t$ as:
\begin{eqnarray}
    \leafLabels{V} \cap \leafLabels{t} & = & \leafLabels{x}\\
    \left(\leafLabels{\parent{V}} \cap \leafLabels{t}\right) - \leafLabels{x} & \neq &\emptyset\hskip 5em \mbox{and}\\
    \leafLabels{c} \cap \leafLabels{t} & \neq & \leafLabels{x} \hskip 2em \forall c \in \children{V}
\end{eqnarray}
where $\children{V}$ is the set of nodes that are children of $V$.
The first condition guarantees that the node $V$ displays the \ps made by $x$.
The second condition assures that if the edge leading to $V$ were collapsed, the resulting polytomy would not display the \ps.
The final condition assures that none of the children of $V$ also display this \ps from $x$;
    if any child displayed the \ps, then collapsing the edge leading to $V$ would still
    yield a tree that still displays the \ps derived from $x$.

If we demand that a summary tree contains no unsupported groups (in the previous sense of support), we 
    are constraining the set of summary trees such that, for every internal node in the summary tree
    there is at least one \ps in the input set that supports it.
This restricts the number of edges in the tree: as the sum of the number of internal nodes in the inputs becomes an upper bound
    on the number of internal nodes in the summary tree.
However, restricting the summary to only contain supported groups does not decrease the 
    number of \pss from the inputs that are displayed.
Furthermore, any maximal scoring solution can be converted to a maximal scoring
    solution by collapsing edges one at a time and checking for the existence of further
    unsupported groups.
If $\mathcal{S}_s$ is the subset of $\mathcal{S}_w$ which do not contain any unsupported nodes,
    then $\mathcal{S}_s$ is never empty, but may be much smaller than $\mathcal{S}_w$ because
    every resolution of $\mathcal{S}_s$ is a member of $\mathcal{S}_w$.

\subsubsection{Minimally resolved phylogenetic supertrees} \label{minrs}
\citet{JanssonLL2012} define a minimally resolved supertree in the context of 
    a supertree that is consistent with (displays) every member of a set of 
    rooted triplets.
They define a minimally resolved supertree (\textsc{MinRS} tree): for ``a set $\mathcal{R}$ of rooted 
    triples with the leaf label set $L$ $\ldots$[the \textsc{MinRS} tree is] a 
    rooted, unordered tree whose leaves are distinctly labeled by $L$ which has as few internal
    nodes as possible an which is consistent with every rooted triple in $\mathcal{R}$.''
They present a polynomial time algorithm for finding the \textsc{MinRS} tree for pectinate tree shapes.
Their general algorithm for finding \textsc{MinRS} in $2^{O(n\log p)}$ time where $n$ is the number of 
    leaves and $p$ is the largest outdegree of any internal nodes in the output.

\subsubsection{trees that are minor-minimal with respect to set of triples}

Page 277 of \citet{JanssonLL2012} points out that the concept of minor-minimal trees from \citet{Semple2003} is not the same as minimally resolved trees.  It also points out that the BUILD tree is minor-minimal.  \textsc{Note: I should have read \citet{JanssonLL2012} further before writing the next paragraph.}

Note that, every internal node of a \textsc{MinRS} tree will be ``supported'', but the
    ``no unsupported nodes'' rule is not the same as the \textsc{MinRS} rule.
The ``no unsupported nodes'' rule is more lenient in the sense of admitting more supertrees
Consider the inputs (in terms of rooted triples): $A,B\mid C$ and $D,E\mid F$.
The supertree $(((A,B),C,D,E),F)$ is one of several trees which has no unsupported nodes, but 
    only the tree $((A,B,D,E),C,F)$ is the only \textsc{MinRS} tree.

\subsubsection{\textsc{MinRS} behaving badly}

Suppose we have taxa $h$uman, $g$orilla, $d$og, $c$at, $f$ugu, $t$una,
$s$hark, and $r$ay.  Let us suppose that we have two studies.  One focuses on mammals and
contributes $\{gh|d,cd|h\}$.  The other focuses on fish, and contributes
$\{ft|r,sr|f\}$.  The \textsc{BUILD} tree is then
$((h,g),(c,d),(f,t),(s,r))$.  It is possible to merge groups to obtain
the \textsc{MinRS} tree $((h,g,f,t),(c,d,s,r))$.  This tree has fewer
nodes, but seems worse. %to me: Ben

This example illustrates the problem of what to do when triplets
fall into multiple non-overlapping groups.  \citet{JanssonLL2012}
mentions the possibility of merging groups in the \textsc{BUILD} tree
to obtain a tree with fewer internal nodes.  However, it might be
preferable to leave groups unmerged if there is no triplet in the
input tree to support the merger.  Perhaps it would be possible to
indicate visually which sister branches can be merged without
contradicting an input tree, since such groups could be merge by the
addition of a new triplet to the input set.


\subsection{Evaluating how well a single summary tree summarizes a set of summary trees}\label{treeAdmissibility}
As discussed above, we might try to seek the set of trees that contain no unsupported nodes and that
    maximize the \SWIPSD score.
However, one of the requirements is that we return a single tree.

\subsubsection{Number of fully resolved trees that maximize/fail-to-maximize the \SWIPSD score}
One can interpret an unresolved tree as a set of trees - specifically the set of trees that 
    can be produced by resolving the tree.
We may be able to formalize a score for a single summary tree, $T$, by considering
    the set of trees that can be produced by resolving it, calling this set $\mathcal{R}(T)$.
Specifically, if our primary summary is a set of trees $\mathcal{S}$, we may want
    to evaluate $T$ by the number of trees in $\mathcal{S}$ which are not found
    in $\mathcal{R}(T)$ 
    and the number (or proportion) of trees in $\mathcal{R}(T)$ which
    are not in $\mathcal{S}$.
Both of these statistics would be small if $T$ is a good summary of $\mathcal{S}$; they 
    would be zero if the set of trees to be summarized is identical to the resolutions
    of the tree.
Note this pair of sets (the ``false negative'' and ``false positive'' sets) are the
    sets that are calculated when calculating the symmetric difference between 
    sets (and related statistics such as the Robinson-Foulds distance in phylogenetics).

Given the difficulty enumerating either $\mathcal{S}$ or $\mathcal{R}(T)$, we may not be able to easily 
    apply this method of scoring trees often.
It would also be difficult to figure out an appropriate weighting of false positives vs false negatives.

However, there may be cases in which we compare two very similar trees, $A$ and $B$, it is obvious
    that $A$ has a lower value than $B$ for one statistic and an equal-or-lower value for the other.
Using an analogy to statistics, we would say that $A$ dominates $B$ and that $B$ is an inadmissible summary.

\subsubsection{The number of unsupported triples implied}

Consider the inputs (in terms of rooted triples): $A,B\mid C$ and $D,E\mid F$.
All valid solutions in $\mathcal{S}$ will display both of these triples (because they are compatible).
These are the only 2 ``supported'' triplets.

One supertree without unsupported nodes $(((A,B,D),C,E),F)$ 
implies 16 rooted triplets (14 unsupported triples).

One supertree without unsupported nodes $(((A,B),C,D,E),F)$ 
implies 13 rooted triplets (11 unsupported triples).

The \textsc{MinRS} tree $((A,B,D,E),C,F)$ displays 12 triplets (10 unsupported triplets).

The \textsc{BUILD} tree $((A,B),C,(D,E),F)$, which has no unsupported nodes, displays only 8 triplets (6 unsupported)

Thus, the final tree might be preferred on the basis of implying the lowest number of unsupported triplets, even
though it has a higher number of internal nodes than the \textsc{MinRS} tree.

\subsection{Relationship between a series of trees and a series \pss}\label{orderPSsTheory}
In terms of the \MSWIPSD problem, the set of input trees can converted to a multiset of input \pss without 
    altering the solution because no aspect of the scoring system depends on whether the input \ps
    which are displayed were derived from the same tree.

As mentioned above, using a ranking system that very strongly favors the more highly ranked trees
    simplifies the search for a solution to the \MSWIPSD problem.
Thus a ranked list of trees (highest priority to lowest) can be mapped to a list of sets of \pss.
A greedy approach that tries builds up a solution by adding one \ps at a time that could generate the optimal
    set of summary trees could be guaranteed to work if the input order is correct.
The greedy solver would have to accept as many splits as possible, and avoid rejecting a \ps unnecessarily,
    but it would be greedy in the sense that it does not have to ``look ahead'' or reconsider a split
    that it has accepted or rejected.

Unfortunately, it is not clear how to convert the ranked list of trees to a ordering of the splits.
The ranked list of sets of \pss that can be naturally derived from the ranked list of trees only provides
    a partial order.

I need to dig through my notes, but I think that there are cases for which the order of adding subtrees within
    a tree affects the output.

Checking all possible input orders would be one solution. 

Currently, otcetera and peyotl-based supertree steps just use a postorder traversal (which is arbitrary with
    respect to the order of sister groups). \NeedsAlgorithmicWork

\subsection{The interpretation of input trees with tips mapped to non-terminal taxa}
Some of the input phylogenetic estimates may have leaves that are not mapped to terminal taxa.
The correct biological interpretation of such labels is not clear.

Some possible meanings of a leaf in a tree being mapped to a non-terminal taxon, $A$:
\begin{compactenum}
    \item $A$ should be a terminal taxon - the reference taxonomy is incorrect.
    \item the taxon $A$ is asserted or assumed by the authors of the study to be monophyletic.\label{itmMonophyleticTip}
    \item at least one descendant taxon of $A$ occurs at this point in the tree, but it is not known which descendant.\label{itmUnknownTip}
    \item the phylogenetic analysis was conducted using a ``chimeric'' set of character data drawn from multiple
        members of the taxon $A$.
    \item a non-extant lineage was sampled and included in the phylogenetic analysis. That tip is thought to be:
    \begin{compactenum}
        \item the most recent common ancestor of taxon $A$,
        \item an ancestor of taxa that are descendants of $A$ (but we don't know which one), OR
        \item an extinct taxon that is a member of $A$.
    \end{compactenum}
\end{compactenum}
Presumably case \ref{itmUnknownTip} is the most common case in our corpus.
Even if case \ref{itmMonophyleticTip} is the case for some trees+leaf combinations, I presume
    that we would want the source of such phylogenetic claims to be more transparent.
Thus, I assume that we do {\em not} want such tips to be interpreted as providing evidence
    for monophyly of the non-terminal tip.

\subsubsection{expanding non-terminals to the contained terminals}
If the taxon is monophyletic based on other trees in the corpus, these cases are not too not problematic.
One could simply transform the tip to a polytomy containing all of the terminal taxa that are
    descendants of the mapped taxon as children of the polytomy.
This input representation would imply that the input tree supported the monophyly of the taxon, but 
    that could be rectified by making note of the fact that the polytomy was an expansion of a tip
    and later suppressing any annotation that claims that the tree supports monophyly.

One could also follow this expansion procedure in the case of contested taxa.
However, that would presumably entail more care to ensure that the \ps that corresponds
    to the polytomy does not contribute to the topological decisions during the 
    tree construction.

\subsubsection{expanding non-terminals to the contained terminals attached to the parent of the leaf}\label{expandNonTermPar}
As described above, expanding the non-terminals to their contained terminals requires 
    some bookkeeping to note that the internal node produced should not generate a \ps that
    is taken to be an input.
If we are calculated ``supported by'' statements about the summary tree as a post-processing step (rather
    than propagating that information at every step of the pipeline), it is sufficient
    to transform the input tree with a non-terminal taxon mapping to a tree that does not
    claim monophyly for the non-terminal taxon.
This can be done by creating the polytomy of terminal leaves at the parent node of the node that is
    mapped to a non-terminal taxon (and pruning the ``barren'' leaf that is the remnant of the tip
    mapped to the non-terminal taxon).

    \textbf{BEN: while this doesn't impose monophyly of the non-terminal taxon, it does seem to impose monophyly of non-terminal taxa + sisters.  I'm not sure why this is correct.  The ``optimizing assignment'' below seems clearly correct.}
    \textbf{BEN: after thinking about this, I suppose we could interpret the clade (A,B) in an input tree as stipulating that
      then groups A and B are jointly monophyletic (so $A \cup B$ is monophyletic).  Not that this is bulletproof, but instead clarifying that we could phrase this as an interpretation rule for input trees.  By interpreting input trees in this way, we assign the blame for cases where this is wrong to the input trees that contain such statements.}
    \textbf{BEN: when the input trees are in fact created under the interpration that (A,B) means ``there exists an A and a B'' that form a monophyletic group, then this is naturally still problematic. So the problem doesn't go away.}

\subsubsection{optimizing the assignment to a terminal taxon}
Under the unknown tip case (case \ref{itmUnknownTip} above), we could treat the
    correct assignment of the non-terminal tip to a terminal tip as an
    unknown, latent variable to be optimized.
In other words, we would try to make the assignments in such a way as to
    maximize the \SWIPSD score.
This sounds like it would lead to a combinatorial explosion in complexity when 
    we have multiple trees that use the same non-terminal taxon in the corpus.
So, as far as I know, we have not seriously considered this.

\subsubsection{pruning non-terminal taxon tips}
We have many tips that are pruned from the input trees because they are 
    not correctly mapped to a taxon in the reference taxonomy.
We could adopt the unknown tip (case \ref{itmUnknownTip} above) interpretation,
    and prune these tips.
This seems a bit draconian and wasteful - particularly given the the study
    curation tool does not warn about non-terminal mapping.

\subsubsection{pruning non-terminal taxon tips if the terminal taxon is not monophyletic}
We might view leaves mapped to non-terminal taxa as hopelessly ambiguous whenever the non-terminal
    taxon is not monophyletic (based on other trees).
Thus, we could prune these cases when other trees reject monophyly.

This seems more reasonable that unconditionally pruning them, but more difficult to implement.
A higher ranked \ps might contest the monophyly of the taxon, but that \ps might be 
    in conflict with even more highly ranked \pss.
There may be some clever trick for determining whether a taxon will be monophyletic in the
    final tree without performing synthesis iteratively.

\subsubsection{pruning non-terminal taxon tips if the terminal taxon is contested}
This is a proposal that is intermediate between the previous 2 proposals.
It is easy to test for whether or not a taxon is contested.
However, if there are high ranking \pss that support the monophyly of the taxon,
    then this procedure may prune tips that are not really ambiguous given 
    the full data.

\subsection{Proposed formalization of the goal}
It would be great if the summary draft tree would be:
    an admissible summary tree ({\em sensu} section \ref{treeAdmissibility}) of the set of trees
    that maximize the ranked-tree \SWIPSD score.

Unfortunately that is probably infeasible.  I think that a reasonable back up would be to 
    produce a summary tree which:
\begin{compactenum}
    \item displays every uncontested taxon, and
    \item shows an admissible summary of the \MSWIPSD set for each subproblem that is
       created by tiling the tree into the contested subproblems.
\end{compactenum}
\subsection{Decomposition into uncontested taxon subproblems}
One can efficiently:
\begin{compactenum}
    \item determine whether a taxon context is contested,
    \item resolve any polytomy in an input tree which can be resolved by constraining
        the set of uncontested taxa, and
    \item produce a subproblem for each uncontested taxon. Each subproblem just contains
        a subset of each input tree that overlaps with this subproblem
\end{compactenum}

\subsubsection{Constraining uncontested taxa may force the \SWIPSD score to decrease.}
See footnote\footnote{there used to be a conjecture by MTH to the contrary of this section header.
A little thought revealed the case that is described here.}

Note that in general, enforcing the presence of a contested \pss may increase the 
    \SWIPSD score.
For example, consider the ranked inputs:
\begin{compactenum}
\item \newick{((A,B),C)}
\item \newick{((A,C),D)}
\item \newick{((A,D),B)}
\end{compactenum}
The third tree displays a grouping that is not contested by either of the other 2 groups, but
    if we force that grouping to be present in the full tree, that full tree cannot 
    display both of the higher ranked \pss.
In essence, the set of \pss implied by the first 2 tree is larger than the union of each tree's set
    of \pss; in this case, a novel \pss: \vvps{A,B,C}{D} must be true of every 
    tree that displays the \pss from the first 2 trees.
These examples of ``implied'' or ``emergent'' \pss seem to require that the certain
    patterns of overlap and omission of leaves in the 2 statements that are being combined.

Enforcing an uncontested taxon into the final tree can decrease the score in cases like this:
\begin{compactenum}
\item \newick{((A,B1),C)}
\item \newick{((C,B2),A)}
\item taxonomy: \newick{(A,(B1,B2)B,C)}
\end{compactenum}
Neither input contests the monophyly of \texttt{B}, but the first 2 statements cannot 
    both be true if \texttt{B} is monophyletic.

Thus the decomposition into uncontested groups is not a trick to speed up
    the identification of an optimal summary tree.
Rather, it is a way to make the summary transparent and easy to fix:
    if a biologist sees that a taxon which they know to be non-monophyletic
    is uncontested, then he/she only needs to upload a tree demonstrating
    non-monophyly to cause this group to be treated differently in the next 
    construction of the summary tree.

\subsubsection{implementation notes: otc-uncontested-decompose}
This operation is implemented in the {\tt otc-uncontested-decompose}.
It takes a taxonomic tree and each of the input trees (in ranked order), and a flag
    that specifies which output directory should hold the subproblems.
It uses an embedding approach outlined in algorithm \ref{embedTree}.
When that procedure is over, every non-leaf node in the taxonomic tree
    has data structures that store information about every edge in 
    an input tree that connects a child which aligns to this node or one of its descendants
    to its parent.
If the input tree has more structure about the taxon, then the pairing of paths will be in the 
    LoopPaths lookup table, and if the pairing just passes through a node it will be listed in
    the ExitPaths lookup table.

We can detect whether or not a node is contested by counting the number of nodes deeper in the 
    taxonomy that serve as ancestral nodes in the ExitPaths for a particular input tree.
If there is only one such node, then the input tree does not contest the taxon.
If there are no such nodes, then the input tree's root maps to this taxon.
If there are multiple nodes, then the input tree contests this taxon.
\ProofWriteupNeeded

The traversal to decompose the tree will alter the taxonomic tree, so the taxonomic tree 
     is cloned and embedded into the original taxonomic tree.
This assures that none of the taxonomic information is lost.

After producing the embedding for all of the trees, the taxonomy is traversed in post-order
    fashion.
If a node is contested, then the branch that leads to its parent is collapsed and
    all of its path pairing are moved deeper in the tree.
This may cause them to go from the category of ExitPaths to LoopPaths (if the parent of the
    taxon is also the ancestral taxonomic node of the path pairing, then it will become
    a loop of that parental taxon).
All of the LoopPaths of the collapsed taxon become LoopPaths of the parent.

If the taxon is not contested, then all of the trees that are embedded in the node
    are written out (in their ranked order) as subproblems.
The path pairings that exit the node are then assigned the OTT ID of the uncontested taxon.
When the subproblem trees are written out, any edge of an input tree that ends is an uncontested
    taxon is treated as a terminal edge.
The OTT ID associated with the path pairing (the ID of the uncontested taxon) is what is
    written as the leaf label for the tree.
In this way the slices of the input trees are tiled into different subproblems and the tips
    of the subproblems refer either to a true tip in the taxonomy, or two an uncontested taxon.
This will enable grafting of the solved subproblems back together by simple ID matching.

\subsection{Subproblem simplifications}\label{simplificationTheory}
Consider the case of having a series of \pss to add in ranked order (where the
    rank-based are extreme enough to allow a greedy addition strategy).
This elides the problem of ordering statements discussed in section \ref{orderPSsTheory}. 
Here we assume that an order has been established.
They are designed assuming a greedy solver that is given a list of \pss and must decide
    for each whether to add it to the solution or reject it as incompatible with the 
    current solution.

There are some simplification steps which should be applicable in a stack-based
    approach to produce a smaller subproblem.
By stack-based we mean:
\begin{compactenum}
    \item apply simplification 1 to reduce the problem size.
    \item apply simplification 2 to reduce the problem size.
    \item[$\ldots$]
    \item[$n$] apply simplification $n$ to reduce the problem size.
    \item[$n + 1$] solve the reduced subproblem
    \item[$n + 2$]``undo'' simplification $n$ to augment the solution
    \item[$n + 3$] ``undo'' simplification $n-1$ to augment the solution
    \item[$\ldots$]
    \item[$2n + 1$] ``undo'' simplification $1$ to augment the solution
    \item[$2n +2$] assure that all mentioned tips are present in the solution.
\end{compactenum}

These simplifications should not result in a worse \SWIPSD score for the set trees.
However, we lack any guarantees about how they interact with heuristic solutions
    of the reduced subproblem.
Furthermore, we lack guarantees about how the simplifications affect a strategy that always
    returns an optimal summary tree, but which does not guarantee that it will
    find the full set of optimal summary trees.

The simplifications can be applied iteratively until no more simplifications are possible.
They are designed assuming a greedy solver that is given a list of \pss and must decide
    for each whether to add it to the solution or reject it as incompatible with the 
    current solution.

The input set of \ps subproblems is simple in that all of the labelled tips are treated
    as terminal taxa for the purpose of the summary.
The inputs do contain trivial statements to assure that all leaves are include (e.g. 
     the taxonomic tree is often a polytomy of all tips).

\subsubsection{Prune tips that only occur in trivial \ps or exclude groups}
No \ps has any support for these tips attaching anywhere above the root of the tree.
So attaching them all a tree the attaches them all at the  root of the subproblem will
    be among the optimal solutions.
\simplification (1) find the set of tips that only occur in trivial \ps or in the exclude
    groups of \pss;
    (2) record these taxon labels;
    (3) record a copy of all \pss that are affected by pruning of these IDs and a mapping
        of the \ps that will result from the pruning
    (4) prune all such tips;
\undoActions (1) restore the original statements

\subsubsection{Remove any trivial \ps}
\simplification (1) record any trivial split
\undoActions (1) restore the original statements

\subsubsection{Remove any redundant \ps}
\simplification if a \ps occurs twice in the list (1) record the second and subsequent positions. (2) remove the lower ranked \ps
\undoActions (1) restore the original statements

\subsubsection{Conditional addition of any ``dominated'' \pss}
Consider a pair of \pss: $a=\vvps{a_i}{a_e}$ and $b=\vvps{b_i}{b_e}$.
We say that $a$ is dominated by $b$ iff: 
    $a_i \subset b_i$ and $a_e\subset b_e$.
If $a$ is dominated by $b$ and $b$ is higher ranked than $a$, then we can 
    note that $a$ need not be attempted if $b$ is accepted into the solution.
$a$ contains less information, so adding it will not alter the solution.
If $b$ is rejected, then it is possible that $a$ will be add-able, however.


\simplification if a \ps occurs twice in the list (1) record the second and subsequent positions. (2) remove the lower ranked \ps
\undoActions (1) restore the original statements

\subsubsection{Prune ``dominated'' tips}
Note that all of the compatibility/conflict decisions rely on tests for whether or not
    set of labels is empty -- the presence of multiple tips rather than 1 tip in a
    required or prohibited set will not affect a decision about whether or not any
    \ps will be accepted into the solution or rejected.

Consider a pair of tip labels $a$ and $b$ and a set of \pss $\mathcal{P}$.
We say that $a$ is dominated by $b$ iff, $\forall (\vvps{i}{e}) \in \mathcal{P}$ one of the 
    following applies:
    $a\notin \leafLabels{p}$
    or $(a \in i \mbox{ and } b \in i)$
    or $(a \in e \mbox{ and } b \in e)$.

In other words, if $b$ is on the same ``side'' as $a$ for every \ps that $a$ occurs in, then
 $a$ is dominated by $b$.
 \simplification  If there exists a tip label $a$ that is dominated by $b$:
    (1) copy every \ps that $a$ occurs in and record how it will map to a \ps with $a$ pruned.
    (2) prune $a$ from all of the \ps
    (3) set up bookkeeping to note whether any affected \ps is accepted or rejected for the solution.
\undoActions for every affected \ps that was accepted. Add the stored \ps. This should place $a$ correctly on the solution.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Pipeline}
\stepInput reference taxonomy and a ranked list of trees (study ID + tree ID pairs).
\stepOutput a summary tree and annotations about which input trees supports each branch.

\comment{the step numbering here does {\em not} agree with the sub dir numbering in
    the \otc supertree subdir. That dir structure needs to be updated.}

\subsection{Prune the reference taxonomy to remove some flags}
\stepExplanation Not all taxa in OTT are reliable enough to belong in the 
    summary tree.
The reference taxonomy producing software flags taxa in several ways 
(\href{https://github.com/OpenTreeOfLife/reference-taxonomy/wiki/taxon-flags}{which are listed here}
\TODO{we still need real documentation of the flagging system}).

\stepInput ``raw'' reference taxonomy with flags produced by \url{https://github.com/OpenTreeOfLife/reference-taxonomy}
\stepOutput $\taxonomy$, the complete taxonomy for synthesis with some taxa pruned.
This will determine the leaf label set of the final summary tree.
\currImpl has been performed by treemachine. Note that there appear to be
    a couple of issues with that impl. See 
\href{https://github.com/OpenTreeOfLife/treemachine/commit/48211803f137ad0b7c096c28d1c10d32f671194f}{this comment} and
\href{https://github.com/OpenTreeOfLife/treemachine/issues/168}{issue 168}.
(2) Documentation needed on which flags are pruned and why.
(3) We should serve this tree somewhere as it is a crucial input for the rest of the pipeline.
\currURL \TODO{Temp url} \url{http://phylo.bio.ku.edu/ot/taxonomy.tre} holds the tree (obtained
    from either Joseph Brown or Ruchi Chaudhary via email) which MTH has been using as \taxonomy.

\subsection{Snapshot input studies}
\stepExplanation For a make-based system it would be useful to copy the incoming \nexson files
    to a snapshot location if they differ from the version of that study that is already found
    in that staging location.
\stepInput local copy of \texttt{phylesystem} git repo, list of trees (study+tree ID + optional git SHA) to be used
\stepOutput (1) copies of the \nexson files from the specified SHA. (2) record of tree identifiers
\currImpl None - similar operation done by \gcmdr.
\implTODO Flat file implementation needed
\currURL None
 
\subsection{Snapshot of input trees}
\stepExplanation The study files may contain multiple trees, for a make-base system it would
    be good to have a timestamped file for each tree
\stepInput snapshot of \nexson from previous step. list of tree identifiers.
\stepOutput (1) one \nexson for each tree. The current naming convention of studyID\_treeID.nexson could
    be used - there is no need to support multiple git-SHAs per tree.
\currImpl None - similar operation done by \gcmdr.
\implTODO Flat file implementation needed
\currURL None
 
\subsection{Pruning of input trees}
\stepExplanation To improve the chance of having a correct rooting, we prune
    the trees to just the ingroup.
    We also prune the tree down such that they contain no more than 1 exemplar of any
    terminal taxon and there are no cases of the taxon for one tip containing
    the taxon mapped to another tip.
\stepInput snapshot of \nexson trees from previous step.
\stepOutput (1) \phyloInputs, the input set of tree represented as
    one newick tree for each input tree with internal node labels that 
    correspond to the node ID in the \nexson of the MRCA node.
    (2) a record of the pruning edits performed.
\currImpl None - similar operation done by \gcmdr.
\implTODO (1) Flat file implementation needed. 
    (2) record of edits needed. 
    (3) identifiers for the internal nodes would be nice for reporting the provenance of edges in the 
    summary tree. 
    (4) We should serve these trees somewhere as they are crucial inputs for the rest of the pipeline.
\currURL \TODO{Temp url} \url{http://phylo.bio.ku.edu/ot/pruned-input-trees.tar.gz} is an
    archive of a set of these tree - without the node identifiers and with nodes that have 
    out-degree=1 (obtained from either Joseph Brown or Ruchi Chaudhary via email) which MTH has been
    using as \phyloInputs.

    \subsection{Expand tips mapped to non-terminal taxa}\label{expandedPhyloStep}
\stepExplanation As explained in section \ref{expandNonTermPar}, expanding tips that are mapped
    to non-terminal taxa to the full set of their terminal descendants and attaching these
    tips to the parent of the taxon should generate a tree that correctly represents
    what the input tree says (without erroneously claiming that the tree supports monophyly).
    
    A clever implementation would note whether a descendant terminal taxon occurs in other
        trees in the $\phyloInputs$ corpus. 
    If there are multiple terminal descendant taxa in the expansion that only occur in the 
        taxonomy, then it should be fine to let the expansions just contain 1 of these tips, $x$.
    This would mean that the others are pruned in the next step, but will be placed in the 
    correct spot in the final summary tree because they should attach at the same parent node
    as the single exemplar, $x$. Failing to take this optimization will only mean that the 
    pruned taxonomy is too large.
\stepInput \taxonomy and \phyloInputs
\stepOutput \expandedPhylo -- the set of phylogenetic inputs expanded such that no leaf is mapped 
    to a non-terminal taxon.
    \currImpl None
    \implTODO \TODO{write this}
\currURL We should probably post this set of trees, as many tools don't deal with tips that 
are mapped to non-terminal taxa. So these trees may be the most accessible set of inputs
for most interested parties.
 
\subsection{Prune taxonomy down to tips represented in \expandedPhylo}\label{prunedTaxonomyStep}
\stepExplanation This is just an optimization step.
Each terminal taxon that is only found in \taxonomy, can be placed on the 
    final summary tree by creating a tree for the overlapping taxonomic
    inputs and then grafting on the ``taxonomy only'' lineages.
This pruning makes the inputs for the subsequent steps smaller
\stepInput \taxonomy and \expandedPhylo
\stepOutput \prunedTaxonomy the pruned taxonomy
\currImpl \otcprune can perform this
\implTODO
\currURL may want to post this somewhere. 

\subsection{Decompose the inputs into subproblems of uncontested taxa}\label{decomposeStep}
\stepExplanation The decision to force uncontested taxa in the final
    summary means that we can separate the problems into non-overlapping
    subproblems.
\stepInput \prunedTaxonomy and \expandedPhylo
\stepOutput subproblems. Currently expressed (1) as one newick tree file  per subproblem
    with the name \texttt{SUBPROBLEMID.tre}, and
    (2) a file called \texttt{SUBPROBLEMID-tree-names.txt} with a treefile name 
    on each line or ``TAXONOMY'' indicating the source of each tree. 
    the \texttt{SUBPROBLEMID} is `ott' followed by the OTT ID.
\currImpl \otcdecompose
\implTODO 
\currURL \url{http://phylo.bio.ku.edu/ot/export-sub-temp.tar.gz} has a snapshot, but
those subproblems were not produced with the non-terminal tips expanded to terminals
so there is some wonkiness - such tips are pruned if the taxon is contested, but their
ID sets still affect the embedding of the deeper nodes in the tree. This needs 
to be rerun after step \ref{expandedPhyloStep} is completed.
 
\subsection{Simplify subproblems}\label{simplifyStep}
\stepExplanation As (to be) described in section \ref{simplificationTheory} there
    are several operations that can be performed that will reduce the size of the
    subproblems but which should not compromise our ability to obtain the same
    subproblem solution.
    Many subproblems are trivially solvable, so the tool that does this
    will also be a crude solver.
\stepInput The set of subproblems, a simplified-problems directory, and a solutions directory
\stepOutput When possible, subproblem solutions and simplifications will be written to the 2 output directories.
\currImpl an implementation is started, but far from complete \href{https://github.com/OpenTreeOfLife/peyotl/blob/supertree/scripts/supertree/simplify_subproblems.py}{in the supertree branch of peyotl}
\implTODO \TODO {finish}
\currURL 
 
\subsection{Solve subproblems}
\stepExplanation Attempt to find an admissible summary tree for the set of 
    summaries that are the \MSWIPSD set.
    We probably want (1) a brute force implementation that we can use for small
    subproblems so we do not have to worry about errors from finding local 
    optima, and (2) one or more heuristics.
\stepInput a ``raw'' subproblem from step \ref{decomposeStep} or a simplified 
    subproblem from step \ref{simplifyStep}.
\stepOutput a tree for each subproblem - stored in a solutions dir under the name
    \texttt{SUBPROBLEMID.tre}
\currImpl treemachine may provide one solver.
\implTODO exact impl and, perhaps we need another approximate solver
\currURL 
 
\subsection{Collapse unsupported nodes}
\stepExplanation If the solver does not guarantee that no unsupported nodes will
    be introduced, then we can collapse them at this point.
    As noted in section \ref{unsupportedTheory}, this should be done
    iteratively rather than by identifying all unsupported edges and 
    collapsing all of them.
    The latter approach would collapse too many edges.
\stepInput the solutions directory holding all of the subproblem solution trees.
\stepOutput a supported solutions directory holding all of the subproblem solution trees
    which contain no unsupported nodes.
\currImpl None
\implTODO \TODO{write this}

\subsection{Assemble pruned summary tree from subproblem solutions}\label{assemblyStep}
\stepExplanation Because the problems do not overlap, and the file names 
    match the tip labels (when a tip of one subproblem is actually an uncontested
    non-terminal taxon in \prunedTaxonomy), this is a simple grafting procedure.\\
    Note that each subproblem is supported by the taxonomy (at a minimum), so this
    step cannot introduce unsupported groups.
\stepInput the solutions directory holding all of the subproblem solution trees.
\stepOutput \prunedSummary -- the summary tree pruned down to the leaf set of \prunedTaxonomy.
\currImpl None
\implTODO \TODO{write this}
 
\subsection{Graft the pruned taxonomy-only taxa back onto the tree}
\stepExplanation ``phylo-referencing'' style logic can be used to place the
    taxa that were pruned in \ref{prunedTaxonomyStep}
\stepInput \prunedSummary and \taxonomy
\stepOutput \summaryTree -- the final summary tree
\currImpl None
\implTODO \TODO{write this}

\subsection{Create annotations for nodes in $\summaryTree$}\label{annotationsStep}
\stepExplanation At minimum, we would want statements of which
    input nodes support which nodes in $\summaryTree$.
    But we could also noted nodes displayed, nodes in conflict, 
    and whether or not the node was constrained because it was a contested taxon.
\stepInput \summaryTree and \expandedPhylo
\stepOutput some as yet undefined format for expressing these annotations.
\currImpl None
\implTODO \TODO{write this}

\subsection{Serve \summaryTree and the annotations}
\stepExplanation we may be able to compile the annotations into a set of
    static files to be served up to the current tree browser.
    Or we may wish have a full database-driven web service
    \stepInput \summaryTree and annotations from \ref{annotationsStep}
\stepOutput a web services API comparable to the
    \href{https://github.com/OpenTreeOfLife/opentree/wiki/Open-Tree-of-Life-APIs#tree-of-life}{tree-of-life part of the API}.//
    some of the services in that API would definitely require a db rather than just flat-files (e.g the MRCA and induced\_subtree)
\currImpl None unless we load the tree into treemachine and add methods for
    serving up the annotations that are not coming from the graph-of-life
\implTODO \TODO{write this}


\newpage
\section{Algorithms}
\begin{algorithm} \caption{EmbedPhyloIntoTaxonomicScaffold}\label{embedTree} \begin{algorithmic}
\REQUIRE the taxonomic tree $\taxonomy$.
\REQUIRE an input tree, $T$ with a unique identifier $\mbox{id}(T)$
\FOR{each node $n_i$ in $\nodes{T}$}
    \STATE{$z(n_i) \leftarrow \mbox{AlignNodes}(\taxonomy, n_i)$}
\ENDFOR
\FOR{each node $n_i$ in $\nodes{T}$}
    \IF{$n_i \neq \treeRoot{T}$}
    \STATE $y_i \leftarrow \parent{n_i}$
    \STATE$\mbox{EmbedEdge}(\taxonomy, {y_i}, z(y_i), n_i, z(n_i), \mbox{id}(T))$
    \ENDIF
\ENDFOR
\end{algorithmic}\end{algorithm}
\begin{algorithm} \caption{AlignNodes}\label{alignNodes} \begin{algorithmic}
\REQUIRE the taxonomic tree $\taxonomy$.
\REQUIRE a node from input tree, $n$.
\IF{isLeaf($n$)}
    \RETURN the node in $\taxonomy$ that is mapped to the same taxonomic identifier that $n$ is mapped to.
\ELSE
    \RETURN the node in $\taxonomy$ that is the least inclusive taxon that is an ancestor of all of 
    taxonomic labels in $\leafLabels{n}$.
\ENDIF
\end{algorithmic}\end{algorithm}
\begin{algorithm} \caption{EmbedEdge}\label{embedEdge} \begin{algorithmic}
\REQUIRE the taxonomic tree $\taxonomy$.
\REQUIRE a node from input tree, $n$, and it pair node in $\taxonomy$, $z(n)$
\REQUIRE the parent node of $n$, called $y$ and its pair node in $\taxonomy$, $z(y)$
\REQUIRE an identifier, $t$, that uniquely identifies the tree that contains $n$ and $y$.
\REQUIRE Each non-leaf node in $\taxonomy$ has 2 lookup tables: \textsc{LoopPaths} and \textsc{ExitPaths}
\STATE $p \leftarrow \left[n, z(n), y, z(y)\right]$ \COMMENT{$p$ is called the ``path pairing'' information}
\IF{$z(n) = z(y)$}
    \STATE $z(y).\textsc{LoopPaths}[t] \leftarrow p$
\ELSE
    \STATE $c\leftarrow z(n)$
    \WHILE{$c \neq z(y)$}
        \STATE $c.\textsc{ExitPaths}[t] \leftarrow p$
        \STATE $c \leftarrow \parent{c}$
    \ENDWHILE
\ENDIF
\end{algorithmic}
\end{algorithm}


\section{Subproblem solver approaches}\label{subproblemSolver}
MTH had some email conversation with David Bryant -- some of the ideas here came from 
    that conversation.

If we had a complete ordering of splits, we could use a variant of \textsc{BUILD} \citep{AhoSSU1981}
to generate a set of consistent splits, $\mathcal{C}$.
The procedure for doing that is outlined in algorithm \ref{consistentSplitsFromRankedList}.
Note that the original BUILD has scaling $O(MN)$ where $N$ is the number of leaves in the leaf set
    and $M$ is the number of input splits.
\citet{JanssonLL2012} discuss how \citet{HenzingerKW1999} provide a DP approach that 
    reduces the runtime to $\min\{O(MN^{1/2}), O(M + N^2 \log N)\}$, and 
    \citet{HolmLT2001} improve this to $\min\{O(N + M\log^2 N), O(M + N^2 \log N)\}$

\begin{algorithm}
    \caption{ConsistentSplitsFromRankedList}\label{consistentSplitsFromRankedList}
\begin{algorithmic}
\REQUIRE An ordered list of $M$ splits, $\mathcal{R} = [R_1, R_2, R_3, \ldots, R_M]$
\STATE $\mathcal{C} = [R_1]$
\FOR{each split $i$ in $[2, 3 \ldots M]$}
    \STATE $\mathcal{T} \leftarrow \mathcal{C} + R_i$ \COMMENT{where `+' means concatenating 2 lists}
    \IF{\textsc{BUILD}$(\mathcal{T})$ does not return null}
        \STATE $\mathcal{C} \leftarrow \mathcal{T}$
    \ENDIF
\ENDFOR
\RETURN $\mathcal{C}$
\end{algorithmic}
\end{algorithm}


If the number of splits in $\mathcal{C}$ is not too large, then we could use the (exponential) algorithm
    of \citet{JanssonLL2012} to find a \textsc{MinRS} tree to find a solution to the subproblem.

\textsc{Note: it looks like \citet{ByrkaGJ2010} have a method for finding a set of consistent splits}

\subsection{Partial rankings}
(this section has some crude thoughts and no good solution.)

We have a complete ordering of tree, but this only generates a partial ordering on splits. 
If two splits are both first encountered (during traversal through all groups in all trees) 
    in the same tree, then the ordering of the splits is undetermined.

A clumsy way to deal with this would be to use branch and bound:  Add splits in a greedy fashion for a postorder traversal, and then add splits in a preorder traversal.
Whichever order yields the largest number of splits accepted, treat that as a bound.
Then start investigating constraints that force the inclusion of some of the excluded splits.

The set of splits in a tree which are incompatible with a previously excluded split can be discarded as a preprocessing step.

\section{Subproblem simplifications}
\subsection{Using the intersection leaf set}
Shortcut: prune the next tree to the intersection of its leaf set and the
leaf set of the current batch of consistent phylogenetic statements.
Any statement in the pruned tree that is inconsistent, need not be considered.
(since the conflict detection uses a smaller graph, it might be faster than using the full leaf set).

This is for the context of building up a set of consistent phylogenetic statements by adding
    phylogenetic statements from a new tree.
Let $T_i$ be the new tree.
Let $\mathcal{A}_{i-1}$ denote a set of phylogenetic statements (from the consistent trees 1 to $i-1$).

Let $T_i^{\ast}$ and $\mathcal{A}_{i-1}^{\ast}$ denote $T_i$ and $\mathcal{A}_{i-1}$ pruned down by removing any leaves that are not in $\leafLabels{\mathcal{A}_{i-1}} \cap \leafLabels{T_i}$ and (in the case of $T_i$) any nodes that have an out-degree $<2$ as a result of this pruning.


\begin{theorem}
If a phylogenetic statement from $T_i^{\ast}$ is not consistent
    with $\mathcal{A}_{i-1}^{\ast}$,
    then the corresponding statement from $T_i$ will not be consistent with $\mathcal{A}_{i-1}$.
\end{theorem}
Proof: By the definition of compatability, adding new leaves cannot make an tree that is incompatible with a statement in $\mathcal{A}_{i-1}^{\ast}$ compatible with the fuller version of that statement.


\subsection{Testing consistency by pruning off new leaves}
Shortcut: prune off any ``new'' leaves from the next tree 
and then test for consistency.
(since the conflict detection uses a smaller graph, it might be faster than using the full leaf set).


Let $T_i^{\dag}$ denote $T_i$ pruned down by removing any leaves that are not in $\leafLabels{\mathcal{A}_{i-1}}$  any nodes that have an out-degree $<2$ as a result of this pruning.


Each internal node in $T_i$ that is not a common ancestor of all of the labels in $\leafLabels{T_i^{\dag}}$ can be partitioned into one of 3 categories:
\begin{compactenum}
    \item $\mathcal{N}^{\ddag}(T_i)$ is the set of nodes that have more than one child subtree with an contains a leaf in $\leafLabels{T_i^{\dag}}$.
    \item $\mathcal{N}^{\dag}(T_i)$ is the set of nodes that have exactly one child subtree with an contains a leaf in $\leafLabels{T_i^{\dag}}$.
    For every member of $\mathcal{N}^{\dag}(T_i)$ called $n$ let $D^{\dag}(n)$ denote the first descendant node that is either a leaf of $T_i$ or in $\mathcal{N}^{\ddag}(T_i)$.
    \item $\mathcal{N}^{\circ}(T_i)$ is the set of nodes that are not the ancestor of any label in $\leafLabels{T_i^{\dag}}$.
\end{compactenum}


\begin{theorem}
If a phylogenetic statement derived from a member of $\mathcal{N}^{\ddag}(T_i)$ is consistent with  $\mathcal{A}$ if and only if the corresponding member of $\mathcal{N}^{\ddag}(T_i^{\dag})$ is consistent with  $\mathcal{A}$.
\end{theorem}
Proof: To be written


\begin{theorem}
Every phylogenetic statement derived from a member of $\mathcal{N}^{\circ}(T_i)$ is consistent with  $\mathcal{A}_{i-1}$.
\end{theorem}
Proof: To be written. True because these statents are only making statements about previously unmentioned taxa.

\begin{theorem}
If $D^{\dag}(n)$ is a leaf, then the phylogenetic statement that corresponds
    to $n$ is consistent with $\mathcal{A}_{i-1}$.
\end{theorem}
Proof: To be written. True becaus we are only adding sister groups to a previously trivial group.


\begin{theorem}
If $D^{\dag}(n)$ is an internal node, then the phylogenetic statement
    that corresponds to $n$ will conflict with $\mathcal{A}_{i-1}$ if
    and only if the phylogenetic statement that corresponds to 
    $D^{\dag}(n)$ conflicts with $\mathcal{A}_{i-1}$.
\end{theorem}
Proof: To be written. If $D^{\dag}(n)$ is consistent with $\mathcal{A}_{i-1}$
then there must be at least one tree that has an edge that separates all of the 
descendants of $D^{\dag}(n)$ from all of the (mentioned) ancestors
of $n$.
Attaching the ``new'' subtree along this edge produces a tree that proves that 
  $D^{\dag}(n)$ is also consistent with $\mathcal{A}_{i-1}$.

In the case of $D^{\dag}(n)$ conflicting with $\mathcal{A}_{i-1}$, we know
that an edge separating a subset of the descendants from the mentioned
ancestors cannot be found. Adding more leaves will not alter that.


%\begin{theorem}
%One can identify the most resolved collapsed form of $T_i$ which is 
%    consistent with $\mathcal{A}_{i-1}$ by pruning both $\mathcal{A}_{i-1}$ and
%    $T_i$ down to the intersection of the leaf labels set: $\leafLabels{\mathcal{A}_{i-1}} \cap \leafLabels{T_i}$
%\end{theorem}


\section{Acknowledgements}
Thanks to David Bryant for suggestions on the subproblem solver and
    for pointing me to the work of J.~Jansson in the context of the 
    the section \ref{subproblemSolver} and \ref{minrs}.

\input{glossary.tex}

\bibliography{otcetera}
\end{document}