-
Notifications
You must be signed in to change notification settings - Fork 3
/
summarizing-taxonomy-plus-trees.tex
950 lines (818 loc) · 53.9 KB
/
summarizing-taxonomy-plus-trees.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
\documentclass[11pt]{article}
\input{preamble.tex}
\usepackage{hyperref}
\hypersetup{backref, linkcolor=blue, citecolor=black, colorlinks=true, hyperindex=true}
\begin{document}
The source for this in the doc subdirectory of the otcetera
repo \url{https://github.com/OpenTreeOfLife/otcetera/tree/master/doc}.
\begin{center}
{\bf Summarizing a taxonomy and multiple estimates of phylogenetic trees} \\
{Mark T.~Holder$^{1,2,\ast}$. feel free to contribute and add your name}
\end{center}
\tableofcontents
\section{Background}
The \otol project is attempting to build a platform for summarizing what is known
about phylogenetic relationships across all of Life.
Presenting an easy-to-interpret summary of trees that have been ``curated''
is one component of that effort.
The project has decided that this summary should include a tree which
\begin{compactenum}
\item can be served and browsed;
\item contains annotations indicating which input trees support a particular grouping;
\item has all of the tips of the taxonomy;
\item displays as many of the groups in the input trees as is feasible;
\item may utilize ranking of trees;
\item (tentative - not sure if everyone is on board with this one) has no unsupported groups;\label{noUnsupportedReq}
\item (tentative) does not ``prefer'' lack of resolution (defined more thoroughly below)
\end{compactenum}
This results in a new form of a supertree problem described below as the ``taxonomy-based tree summary'' problem.
\subsection{Taxonomy-based supertree}
This is a novel name for a special form of the supertree problem.
A taxonomy-based supertree has at least one input (the taxonomic tree) which is complete.
\subsubsection{Taxonomy-based summary tree}
Many supertree approaches seek to maximize accuracy according to some notion
of distance between the true tree and estimated true.
The summary that we seek has requirements (e.g. \ref{noUnsupportedReq})
which lead to less resolved trees.
Such trees may fail to display all of the well-supported (or even uncontested)
rooted triples, but such trees also make it easier for users to see the
connection between the summary tree and the input trees.
Thus, the phrase ``Taxonomy-based summary tree'' is used here to describe a
taxonomy-based supertree which tries to maximize some notion of
explaining the set of input trees (rather than a supertree designed to
display the highest number of groups in the inputs, or some other criterion).
\subsubsection{Taxonomy-based summary of ranked input trees}
We have been pursuing a strategy that uses a ranking of trees.
In cases of conflict, the grouping that is compatible with the
higher ranked tree is shown in the tree (if it is not contradicted by
another grouping from trees of even higher rank).
The taxonomic tree is considered to be the lowest ranked input.
Applying a ranking to the input trees is biologically questionable (because
the importance of an input \ps should be based on the support for that
grouping -- it is rarely the case that tree-wide rank would adequately
describe the degree of statistical support for that group).
Using tree-based ranks also introduces subjectivity into the summary building process.
Nevertheless, tree-rank based summaries represent a reasonable starting point
because they are easy to explain and the ranking permits some algorithmic
simplifications.
\subsection{The Sum of Weighted Input \PSs Displayed Score}
Let $\mathcal{P}$ be a multiset of \pss.
If each member, $i$, in this set is assigned a weight, $w_i$, then the
sum of weighted input \pss displayed score, $S_w$, for a tree $T$ is:
\begin{equation}
S_w(T, \mathcal{P}) = \sum_{i\in \mathcal{P}} w_i \displaysPred{T}{i}
\end{equation}
where {\displaysPred{T}{x}} is an identity function that evaluates to 1 if tree $T$
displays $x$, and to 0 otherwise.
\subsubsection{\MSWIPSD problem}
Trying to find the set of trees that maximize $S_w$ is one natural
goal for a summarization procedure.
We can call this the ``\MSWIPSD''
problem,
and the set of trees that maximize the score are denoted:
\begin{equation}
\mathcal{S}_{w}(\mathcal{P}) = \argmax_T S_w(T,\mathcal{P})
\end{equation}
Tree-based ranking can be viewed as a means of providing weights to \pss.
Each tree is converted to a set of \pss, and the tree's weight is
assigned to each \ps in the set.
The union of the tree's sets becomes the multiset, $\mathcal{P}$, referred to
in the definition of the score above.
\href{https://github.com/OpenTreeOfLife/treemachine/wiki/MaxWeightOfInputTreeEdgesDisplayed}{This link}
sketches out a proof of how an extreme form of tree-based ranks
can lead to greedy tree addition strategy can be guaranteed to
find the set of trees that solve the \MSWIPSD problem.
If the difference in tree ranks are sufficiently large from one tree to the next,
then there is no need to consider skipping a \ps from a high ranking tree
even if not considering that grouping would result in all of the \ps
from the lower ranking trees being displayed.
Unfortunately, that proof only applies to an exact algorithm which returns
all possible trees that maximize the score.
We do not know of a polynomial-time algorithm for solving that problem.
\subsection{Trees without unsupported groups}\label{unsupportedTheory}
It turns out that our ``no unsupported groups'' rule corresponds to:
\begin{compactenum}
\item Finding a tree, $T$, that maximizes the score by displaying as many high-ranking input \pss as possible.
\item Let \pssInOptimalTree denote the set of input \pss displayed by this tree.
\item Let \tripleSetInOptimal denote a set of rooted triples that is sufficient to encode the information in \pssInOptimalTree.
\item Collapse edges in $T$ to create $T^{\ast}$ which is a minor-minimal tree with respect to \tripleSetInOptimal
\end{compactenum}
\citet{JanssonLL2012} define ``minor-minimal'' from \citet{Semple2003}:
\begin{quote}
``If $T$ is a phylogenetic tree consistent with $\mathcal{R}$ and it is not possible to obtain a tree consistent with $\mathcal{R}$ by contracting an internal edge of $T$, then $T$ is called minor-minimal with respect to $\mathcal{R}$.''
\end{quote}
A minor-minimal tree will display no unsupported groups.
Consider an edge connecting parent $\parent{V}$ to its child $V$ in a complete tree $T$.
This edge is supported (or ``the node $V$ is supported'') in the sense of the \MSWIPSD score if
collapsing the edge leads to a lower (worse) score.
For an edge to be supported in this sense, it must display at least 1 input \pss, and
the tree that would be created by collapsing the edge to a polytomy must
{\em not} display those \ps.
Equivalently, we can state the conditions for node $V$ (or its subtending edge) being supported
by an input \ps derived from node $x$ of tree $t$ as:
\begin{eqnarray}
\leafLabels{V} \cap \leafLabels{t} & = & \leafLabels{x}\\
\left(\leafLabels{\parent{V}} \cap \leafLabels{t}\right) - \leafLabels{x} & \neq &\emptyset\hskip 5em \mbox{and}\\
\leafLabels{c} \cap \leafLabels{t} & \neq & \leafLabels{x} \hskip 2em \forall c \in \children{V}
\end{eqnarray}
where $\children{V}$ is the set of nodes that are children of $V$.
The first condition guarantees that the node $V$ displays the \ps made by $x$.
The second condition assures that if the edge leading to $V$ were collapsed, the resulting polytomy would not display the \ps.
The final condition assures that none of the children of $V$ also display this \ps from $x$;
if any child displayed the \ps, then collapsing the edge leading to $V$ would still
yield a tree that still displays the \ps derived from $x$.
If we demand that a summary tree contains no unsupported groups (in the previous sense of support), we
are constraining the set of summary trees such that, for every internal node in the summary tree
there is at least one \ps in the input set that supports it.
This restricts the number of edges in the tree: as the sum of the number of internal nodes in the inputs becomes an upper bound
on the number of internal nodes in the summary tree.
However, restricting the summary to only contain supported groups does not decrease the
number of \pss from the inputs that are displayed.
Furthermore, any maximal scoring solution can be converted to a maximal scoring
solution by collapsing edges one at a time and checking for the existence of further
unsupported groups.
If $\mathcal{S}_s$ is the subset of $\mathcal{S}_w$ which do not contain any unsupported nodes,
then $\mathcal{S}_s$ is never empty, but may be much smaller than $\mathcal{S}_w$ because
every resolution of $\mathcal{S}_s$ is a member of $\mathcal{S}_w$.
\subsubsection{Minimally resolved phylogenetic supertrees} \label{minrs}
\citet{JanssonLL2012} define a minimally resolved supertree in the context of
a supertree that is consistent with (displays) every member of a set of
rooted triplets.
They define a minimally resolved supertree (\textsc{MinRS} tree): for ``a set $\mathcal{R}$ of rooted
triples with the leaf label set $L$ $\ldots$[the \textsc{MinRS} tree is] a
rooted, unordered tree whose leaves are distinctly labeled by $L$ which has as few internal
nodes as possible an which is consistent with every rooted triple in $\mathcal{R}$.''
They present a polynomial time algorithm for finding the \textsc{MinRS} tree for pectinate tree shapes.
Their general algorithm for finding \textsc{MinRS} in $2^{O(n\log p)}$ time where $n$ is the number of
leaves and $p$ is the largest outdegree of any internal nodes in the output.
\subsubsection{trees that are minor-minimal with respect to set of triples}
Page 277 of \citet{JanssonLL2012} points out that the concept of minor-minimal trees from \citet{Semple2003} is not the same as minimally resolved trees. It also points out that the BUILD tree is minor-minimal. \textsc{Note: I should have read \citet{JanssonLL2012} further before writing the next paragraph.}
Note that, every internal node of a \textsc{MinRS} tree will be ``supported'', but the
``no unsupported nodes'' rule is not the same as the \textsc{MinRS} rule.
The ``no unsupported nodes'' rule is more lenient in the sense of admitting more supertrees
Consider the inputs (in terms of rooted triples): $A,B\mid C$ and $D,E\mid F$.
The supertree $(((A,B),C,D,E),F)$ is one of several trees which has no unsupported nodes, but
only the tree $((A,B,D,E),C,F)$ is the only \textsc{MinRS} tree.
\subsubsection{\textsc{MinRS} behaving badly}
Suppose we have taxa $h$uman, $g$orilla, $d$og, $c$at, $f$ugu, $t$una,
$s$hark, and $r$ay. Let us suppose that we have two studies. One focuses on mammals and
contributes $\{gh|d,cd|h\}$. The other focuses on fish, and contributes
$\{ft|r,sr|f\}$. The \textsc{BUILD} tree is then
$((h,g),(c,d),(f,t),(s,r))$. It is possible to merge groups to obtain
the \textsc{MinRS} tree $((h,g,f,t),(c,d,s,r))$. This tree has fewer
nodes, but seems worse. %to me: Ben
This example illustrates the problem of what to do when triplets
fall into multiple non-overlapping groups. \citet{JanssonLL2012}
mentions the possibility of merging groups in the \textsc{BUILD} tree
to obtain a tree with fewer internal nodes. However, it might be
preferable to leave groups unmerged if there is no triplet in the
input tree to support the merger. Perhaps it would be possible to
indicate visually which sister branches can be merged without
contradicting an input tree, since such groups could be merge by the
addition of a new triplet to the input set.
\subsection{Evaluating how well a single summary tree summarizes a set of summary trees}\label{treeAdmissibility}
As discussed above, we might try to seek the set of trees that contain no unsupported nodes and that
maximize the \SWIPSD score.
However, one of the requirements is that we return a single tree.
\subsubsection{Number of fully resolved trees that maximize/fail-to-maximize the \SWIPSD score}
One can interpret an unresolved tree as a set of trees - specifically the set of trees that
can be produced by resolving the tree.
We may be able to formalize a score for a single summary tree, $T$, by considering
the set of trees that can be produced by resolving it, calling this set $\mathcal{R}(T)$.
Specifically, if our primary summary is a set of trees $\mathcal{S}$, we may want
to evaluate $T$ by the number of trees in $\mathcal{S}$ which are not found
in $\mathcal{R}(T)$
and the number (or proportion) of trees in $\mathcal{R}(T)$ which
are not in $\mathcal{S}$.
Both of these statistics would be small if $T$ is a good summary of $\mathcal{S}$; they
would be zero if the set of trees to be summarized is identical to the resolutions
of the tree.
Note this pair of sets (the ``false negative'' and ``false positive'' sets) are the
sets that are calculated when calculating the symmetric difference between
sets (and related statistics such as the Robinson-Foulds distance in phylogenetics).
Given the difficulty enumerating either $\mathcal{S}$ or $\mathcal{R}(T)$, we may not be able to easily
apply this method of scoring trees often.
It would also be difficult to figure out an appropriate weighting of false positives vs false negatives.
However, there may be cases in which we compare two very similar trees, $A$ and $B$, it is obvious
that $A$ has a lower value than $B$ for one statistic and an equal-or-lower value for the other.
Using an analogy to statistics, we would say that $A$ dominates $B$ and that $B$ is an inadmissible summary.
\subsubsection{The number of unsupported triples implied}
Consider the inputs (in terms of rooted triples): $A,B\mid C$ and $D,E\mid F$.
All valid solutions in $\mathcal{S}$ will display both of these triples (because they are compatible).
These are the only 2 ``supported'' triplets.
One supertree without unsupported nodes $(((A,B,D),C,E),F)$
implies 16 rooted triplets (14 unsupported triples).
One supertree without unsupported nodes $(((A,B),C,D,E),F)$
implies 13 rooted triplets (11 unsupported triples).
The \textsc{MinRS} tree $((A,B,D,E),C,F)$ displays 12 triplets (10 unsupported triplets).
The \textsc{BUILD} tree $((A,B),C,(D,E),F)$, which has no unsupported nodes, displays only 8 triplets (6 unsupported)
Thus, the final tree might be preferred on the basis of implying the lowest number of unsupported triplets, even
though it has a higher number of internal nodes than the \textsc{MinRS} tree.
\subsection{Relationship between a series of trees and a series \pss}\label{orderPSsTheory}
In terms of the \MSWIPSD problem, the set of input trees can converted to a multiset of input \pss without
altering the solution because no aspect of the scoring system depends on whether the input \ps
which are displayed were derived from the same tree.
As mentioned above, using a ranking system that very strongly favors the more highly ranked trees
simplifies the search for a solution to the \MSWIPSD problem.
Thus a ranked list of trees (highest priority to lowest) can be mapped to a list of sets of \pss.
A greedy approach that tries builds up a solution by adding one \ps at a time that could generate the optimal
set of summary trees could be guaranteed to work if the input order is correct.
The greedy solver would have to accept as many splits as possible, and avoid rejecting a \ps unnecessarily,
but it would be greedy in the sense that it does not have to ``look ahead'' or reconsider a split
that it has accepted or rejected.
Unfortunately, it is not clear how to convert the ranked list of trees to a ordering of the splits.
The ranked list of sets of \pss that can be naturally derived from the ranked list of trees only provides
a partial order.
I need to dig through my notes, but I think that there are cases for which the order of adding subtrees within
a tree affects the output.
Checking all possible input orders would be one solution.
Currently, otcetera and peyotl-based supertree steps just use a postorder traversal (which is arbitrary with
respect to the order of sister groups). \NeedsAlgorithmicWork
\subsection{The interpretation of input trees with tips mapped to non-terminal taxa}
Some of the input phylogenetic estimates may have leaves that are not mapped to terminal taxa.
The correct biological interpretation of such labels is not clear.
Some possible meanings of a leaf in a tree being mapped to a non-terminal taxon, $A$:
\begin{compactenum}
\item $A$ should be a terminal taxon - the reference taxonomy is incorrect.
\item the taxon $A$ is asserted or assumed by the authors of the study to be monophyletic.\label{itmMonophyleticTip}
\item at least one descendant taxon of $A$ occurs at this point in the tree, but it is not known which descendant.\label{itmUnknownTip}
\item the phylogenetic analysis was conducted using a ``chimeric'' set of character data drawn from multiple
members of the taxon $A$.
\item a non-extant lineage was sampled and included in the phylogenetic analysis. That tip is thought to be:
\begin{compactenum}
\item the most recent common ancestor of taxon $A$,
\item an ancestor of taxa that are descendants of $A$ (but we don't know which one), OR
\item an extinct taxon that is a member of $A$.
\end{compactenum}
\end{compactenum}
Presumably case \ref{itmUnknownTip} is the most common case in our corpus.
Even if case \ref{itmMonophyleticTip} is the case for some trees+leaf combinations, I presume
that we would want the source of such phylogenetic claims to be more transparent.
Thus, I assume that we do {\em not} want such tips to be interpreted as providing evidence
for monophyly of the non-terminal tip.
\subsubsection{expanding non-terminals to the contained terminals}
If the taxon is monophyletic based on other trees in the corpus, these cases are not too not problematic.
One could simply transform the tip to a polytomy containing all of the terminal taxa that are
descendants of the mapped taxon as children of the polytomy.
This input representation would imply that the input tree supported the monophyly of the taxon, but
that could be rectified by making note of the fact that the polytomy was an expansion of a tip
and later suppressing any annotation that claims that the tree supports monophyly.
One could also follow this expansion procedure in the case of contested taxa.
However, that would presumably entail more care to ensure that the \ps that corresponds
to the polytomy does not contribute to the topological decisions during the
tree construction.
\subsubsection{expanding non-terminals to the contained terminals attached to the parent of the leaf}\label{expandNonTermPar}
As described above, expanding the non-terminals to their contained terminals requires
some bookkeeping to note that the internal node produced should not generate a \ps that
is taken to be an input.
If we are calculated ``supported by'' statements about the summary tree as a post-processing step (rather
than propagating that information at every step of the pipeline), it is sufficient
to transform the input tree with a non-terminal taxon mapping to a tree that does not
claim monophyly for the non-terminal taxon.
This can be done by creating the polytomy of terminal leaves at the parent node of the node that is
mapped to a non-terminal taxon (and pruning the ``barren'' leaf that is the remnant of the tip
mapped to the non-terminal taxon).
\textbf{BEN: while this doesn't impose monophyly of the non-terminal taxon, it does seem to impose monophyly of non-terminal taxa + sisters. I'm not sure why this is correct. The ``optimizing assignment'' below seems clearly correct.}
\textbf{BEN: after thinking about this, I suppose we could interpret the clade (A,B) in an input tree as stipulating that
then groups A and B are jointly monophyletic (so $A \cup B$ is monophyletic). Not that this is bulletproof, but instead clarifying that we could phrase this as an interpretation rule for input trees. By interpreting input trees in this way, we assign the blame for cases where this is wrong to the input trees that contain such statements.}
\textbf{BEN: when the input trees are in fact created under the interpration that (A,B) means ``there exists an A and a B'' that form a monophyletic group, then this is naturally still problematic. So the problem doesn't go away.}
\subsubsection{optimizing the assignment to a terminal taxon}
Under the unknown tip case (case \ref{itmUnknownTip} above), we could treat the
correct assignment of the non-terminal tip to a terminal tip as an
unknown, latent variable to be optimized.
In other words, we would try to make the assignments in such a way as to
maximize the \SWIPSD score.
This sounds like it would lead to a combinatorial explosion in complexity when
we have multiple trees that use the same non-terminal taxon in the corpus.
So, as far as I know, we have not seriously considered this.
\subsubsection{pruning non-terminal taxon tips}
We have many tips that are pruned from the input trees because they are
not correctly mapped to a taxon in the reference taxonomy.
We could adopt the unknown tip (case \ref{itmUnknownTip} above) interpretation,
and prune these tips.
This seems a bit draconian and wasteful - particularly given the the study
curation tool does not warn about non-terminal mapping.
\subsubsection{pruning non-terminal taxon tips if the terminal taxon is not monophyletic}
We might view leaves mapped to non-terminal taxa as hopelessly ambiguous whenever the non-terminal
taxon is not monophyletic (based on other trees).
Thus, we could prune these cases when other trees reject monophyly.
This seems more reasonable that unconditionally pruning them, but more difficult to implement.
A higher ranked \ps might contest the monophyly of the taxon, but that \ps might be
in conflict with even more highly ranked \pss.
There may be some clever trick for determining whether a taxon will be monophyletic in the
final tree without performing synthesis iteratively.
\subsubsection{pruning non-terminal taxon tips if the terminal taxon is contested}
This is a proposal that is intermediate between the previous 2 proposals.
It is easy to test for whether or not a taxon is contested.
However, if there are high ranking \pss that support the monophyly of the taxon,
then this procedure may prune tips that are not really ambiguous given
the full data.
\subsection{Proposed formalization of the goal}
It would be great if the summary draft tree would be:
an admissible summary tree ({\em sensu} section \ref{treeAdmissibility}) of the set of trees
that maximize the ranked-tree \SWIPSD score.
Unfortunately that is probably infeasible. I think that a reasonable back up would be to
produce a summary tree which:
\begin{compactenum}
\item displays every uncontested taxon, and
\item shows an admissible summary of the \MSWIPSD set for each subproblem that is
created by tiling the tree into the contested subproblems.
\end{compactenum}
\subsection{Decomposition into uncontested taxon subproblems}
One can efficiently:
\begin{compactenum}
\item determine whether a taxon context is contested,
\item resolve any polytomy in an input tree which can be resolved by constraining
the set of uncontested taxa, and
\item produce a subproblem for each uncontested taxon. Each subproblem just contains
a subset of each input tree that overlaps with this subproblem
\end{compactenum}
\subsubsection{Constraining uncontested taxa may force the \SWIPSD score to decrease.}
See footnote\footnote{there used to be a conjecture by MTH to the contrary of this section header.
A little thought revealed the case that is described here.}
Note that in general, enforcing the presence of a contested \pss may increase the
\SWIPSD score.
For example, consider the ranked inputs:
\begin{compactenum}
\item \newick{((A,B),C)}
\item \newick{((A,C),D)}
\item \newick{((A,D),B)}
\end{compactenum}
The third tree displays a grouping that is not contested by either of the other 2 groups, but
if we force that grouping to be present in the full tree, that full tree cannot
display both of the higher ranked \pss.
In essence, the set of \pss implied by the first 2 tree is larger than the union of each tree's set
of \pss; in this case, a novel \pss: \vvps{A,B,C}{D} must be true of every
tree that displays the \pss from the first 2 trees.
These examples of ``implied'' or ``emergent'' \pss seem to require that the certain
patterns of overlap and omission of leaves in the 2 statements that are being combined.
Enforcing an uncontested taxon into the final tree can decrease the score in cases like this:
\begin{compactenum}
\item \newick{((A,B1),C)}
\item \newick{((C,B2),A)}
\item taxonomy: \newick{(A,(B1,B2)B,C)}
\end{compactenum}
Neither input contests the monophyly of \texttt{B}, but the first 2 statements cannot
both be true if \texttt{B} is monophyletic.
Thus the decomposition into uncontested groups is not a trick to speed up
the identification of an optimal summary tree.
Rather, it is a way to make the summary transparent and easy to fix:
if a biologist sees that a taxon which they know to be non-monophyletic
is uncontested, then he/she only needs to upload a tree demonstrating
non-monophyly to cause this group to be treated differently in the next
construction of the summary tree.
\subsubsection{implementation notes: otc-uncontested-decompose}
This operation is implemented in the {\tt otc-uncontested-decompose}.
It takes a taxonomic tree and each of the input trees (in ranked order), and a flag
that specifies which output directory should hold the subproblems.
It uses an embedding approach outlined in algorithm \ref{embedTree}.
When that procedure is over, every non-leaf node in the taxonomic tree
has data structures that store information about every edge in
an input tree that connects a child which aligns to this node or one of its descendants
to its parent.
If the input tree has more structure about the taxon, then the pairing of paths will be in the
LoopPaths lookup table, and if the pairing just passes through a node it will be listed in
the ExitPaths lookup table.
We can detect whether or not a node is contested by counting the number of nodes deeper in the
taxonomy that serve as ancestral nodes in the ExitPaths for a particular input tree.
If there is only one such node, then the input tree does not contest the taxon.
If there are no such nodes, then the input tree's root maps to this taxon.
If there are multiple nodes, then the input tree contests this taxon.
\ProofWriteupNeeded
The traversal to decompose the tree will alter the taxonomic tree, so the taxonomic tree
is cloned and embedded into the original taxonomic tree.
This assures that none of the taxonomic information is lost.
After producing the embedding for all of the trees, the taxonomy is traversed in post-order
fashion.
If a node is contested, then the branch that leads to its parent is collapsed and
all of its path pairing are moved deeper in the tree.
This may cause them to go from the category of ExitPaths to LoopPaths (if the parent of the
taxon is also the ancestral taxonomic node of the path pairing, then it will become
a loop of that parental taxon).
All of the LoopPaths of the collapsed taxon become LoopPaths of the parent.
If the taxon is not contested, then all of the trees that are embedded in the node
are written out (in their ranked order) as subproblems.
The path pairings that exit the node are then assigned the OTT ID of the uncontested taxon.
When the subproblem trees are written out, any edge of an input tree that ends is an uncontested
taxon is treated as a terminal edge.
The OTT ID associated with the path pairing (the ID of the uncontested taxon) is what is
written as the leaf label for the tree.
In this way the slices of the input trees are tiled into different subproblems and the tips
of the subproblems refer either to a true tip in the taxonomy, or two an uncontested taxon.
This will enable grafting of the solved subproblems back together by simple ID matching.
\subsection{Subproblem simplifications}\label{simplificationTheory}
Consider the case of having a series of \pss to add in ranked order (where the
rank-based are extreme enough to allow a greedy addition strategy).
This elides the problem of ordering statements discussed in section \ref{orderPSsTheory}.
Here we assume that an order has been established.
They are designed assuming a greedy solver that is given a list of \pss and must decide
for each whether to add it to the solution or reject it as incompatible with the
current solution.
There are some simplification steps which should be applicable in a stack-based
approach to produce a smaller subproblem.
By stack-based we mean:
\begin{compactenum}
\item apply simplification 1 to reduce the problem size.
\item apply simplification 2 to reduce the problem size.
\item[$\ldots$]
\item[$n$] apply simplification $n$ to reduce the problem size.
\item[$n + 1$] solve the reduced subproblem
\item[$n + 2$]``undo'' simplification $n$ to augment the solution
\item[$n + 3$] ``undo'' simplification $n-1$ to augment the solution
\item[$\ldots$]
\item[$2n + 1$] ``undo'' simplification $1$ to augment the solution
\item[$2n +2$] assure that all mentioned tips are present in the solution.
\end{compactenum}
These simplifications should not result in a worse \SWIPSD score for the set trees.
However, we lack any guarantees about how they interact with heuristic solutions
of the reduced subproblem.
Furthermore, we lack guarantees about how the simplifications affect a strategy that always
returns an optimal summary tree, but which does not guarantee that it will
find the full set of optimal summary trees.
The simplifications can be applied iteratively until no more simplifications are possible.
They are designed assuming a greedy solver that is given a list of \pss and must decide
for each whether to add it to the solution or reject it as incompatible with the
current solution.
The input set of \ps subproblems is simple in that all of the labelled tips are treated
as terminal taxa for the purpose of the summary.
The inputs do contain trivial statements to assure that all leaves are include (e.g.
the taxonomic tree is often a polytomy of all tips).
\subsubsection{Prune tips that only occur in trivial \ps or exclude groups}
No \ps has any support for these tips attaching anywhere above the root of the tree.
So attaching them all a tree the attaches them all at the root of the subproblem will
be among the optimal solutions.
\simplification (1) find the set of tips that only occur in trivial \ps or in the exclude
groups of \pss;
(2) record these taxon labels;
(3) record a copy of all \pss that are affected by pruning of these IDs and a mapping
of the \ps that will result from the pruning
(4) prune all such tips;
\undoActions (1) restore the original statements
\subsubsection{Remove any trivial \ps}
\simplification (1) record any trivial split
\undoActions (1) restore the original statements
\subsubsection{Remove any redundant \ps}
\simplification if a \ps occurs twice in the list (1) record the second and subsequent positions. (2) remove the lower ranked \ps
\undoActions (1) restore the original statements
\subsubsection{Conditional addition of any ``dominated'' \pss}
Consider a pair of \pss: $a=\vvps{a_i}{a_e}$ and $b=\vvps{b_i}{b_e}$.
We say that $a$ is dominated by $b$ iff:
$a_i \subset b_i$ and $a_e\subset b_e$.
If $a$ is dominated by $b$ and $b$ is higher ranked than $a$, then we can
note that $a$ need not be attempted if $b$ is accepted into the solution.
$a$ contains less information, so adding it will not alter the solution.
If $b$ is rejected, then it is possible that $a$ will be add-able, however.
\simplification if a \ps occurs twice in the list (1) record the second and subsequent positions. (2) remove the lower ranked \ps
\undoActions (1) restore the original statements
\subsubsection{Prune ``dominated'' tips}
Note that all of the compatibility/conflict decisions rely on tests for whether or not
set of labels is empty -- the presence of multiple tips rather than 1 tip in a
required or prohibited set will not affect a decision about whether or not any
\ps will be accepted into the solution or rejected.
Consider a pair of tip labels $a$ and $b$ and a set of \pss $\mathcal{P}$.
We say that $a$ is dominated by $b$ iff, $\forall (\vvps{i}{e}) \in \mathcal{P}$ one of the
following applies:
$a\notin \leafLabels{p}$
or $(a \in i \mbox{ and } b \in i)$
or $(a \in e \mbox{ and } b \in e)$.
In other words, if $b$ is on the same ``side'' as $a$ for every \ps that $a$ occurs in, then
$a$ is dominated by $b$.
\simplification If there exists a tip label $a$ that is dominated by $b$:
(1) copy every \ps that $a$ occurs in and record how it will map to a \ps with $a$ pruned.
(2) prune $a$ from all of the \ps
(3) set up bookkeeping to note whether any affected \ps is accepted or rejected for the solution.
\undoActions for every affected \ps that was accepted. Add the stored \ps. This should place $a$ correctly on the solution.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Pipeline}
\stepInput reference taxonomy and a ranked list of trees (study ID + tree ID pairs).
\stepOutput a summary tree and annotations about which input trees supports each branch.
\comment{the step numbering here does {\em not} agree with the sub dir numbering in
the \otc supertree subdir. That dir structure needs to be updated.}
\subsection{Prune the reference taxonomy to remove some flags}
\stepExplanation Not all taxa in OTT are reliable enough to belong in the
summary tree.
The reference taxonomy producing software flags taxa in several ways
(\href{https://github.com/OpenTreeOfLife/reference-taxonomy/wiki/taxon-flags}{which are listed here}
\TODO{we still need real documentation of the flagging system}).
\stepInput ``raw'' reference taxonomy with flags produced by \url{https://github.com/OpenTreeOfLife/reference-taxonomy}
\stepOutput $\taxonomy$, the complete taxonomy for synthesis with some taxa pruned.
This will determine the leaf label set of the final summary tree.
\currImpl has been performed by treemachine. Note that there appear to be
a couple of issues with that impl. See
\href{https://github.com/OpenTreeOfLife/treemachine/commit/48211803f137ad0b7c096c28d1c10d32f671194f}{this comment} and
\href{https://github.com/OpenTreeOfLife/treemachine/issues/168}{issue 168}.
(2) Documentation needed on which flags are pruned and why.
(3) We should serve this tree somewhere as it is a crucial input for the rest of the pipeline.
\currURL \TODO{Temp url} \url{http://phylo.bio.ku.edu/ot/taxonomy.tre} holds the tree (obtained
from either Joseph Brown or Ruchi Chaudhary via email) which MTH has been using as \taxonomy.
\subsection{Snapshot input studies}
\stepExplanation For a make-based system it would be useful to copy the incoming \nexson files
to a snapshot location if they differ from the version of that study that is already found
in that staging location.
\stepInput local copy of \texttt{phylesystem} git repo, list of trees (study+tree ID + optional git SHA) to be used
\stepOutput (1) copies of the \nexson files from the specified SHA. (2) record of tree identifiers
\currImpl None - similar operation done by \gcmdr.
\implTODO Flat file implementation needed
\currURL None
\subsection{Snapshot of input trees}
\stepExplanation The study files may contain multiple trees, for a make-base system it would
be good to have a timestamped file for each tree
\stepInput snapshot of \nexson from previous step. list of tree identifiers.
\stepOutput (1) one \nexson for each tree. The current naming convention of studyID\_treeID.nexson could
be used - there is no need to support multiple git-SHAs per tree.
\currImpl None - similar operation done by \gcmdr.
\implTODO Flat file implementation needed
\currURL None
\subsection{Pruning of input trees}
\stepExplanation To improve the chance of having a correct rooting, we prune
the trees to just the ingroup.
We also prune the tree down such that they contain no more than 1 exemplar of any
terminal taxon and there are no cases of the taxon for one tip containing
the taxon mapped to another tip.
\stepInput snapshot of \nexson trees from previous step.
\stepOutput (1) \phyloInputs, the input set of tree represented as
one newick tree for each input tree with internal node labels that
correspond to the node ID in the \nexson of the MRCA node.
(2) a record of the pruning edits performed.
\currImpl None - similar operation done by \gcmdr.
\implTODO (1) Flat file implementation needed.
(2) record of edits needed.
(3) identifiers for the internal nodes would be nice for reporting the provenance of edges in the
summary tree.
(4) We should serve these trees somewhere as they are crucial inputs for the rest of the pipeline.
\currURL \TODO{Temp url} \url{http://phylo.bio.ku.edu/ot/pruned-input-trees.tar.gz} is an
archive of a set of these tree - without the node identifiers and with nodes that have
out-degree=1 (obtained from either Joseph Brown or Ruchi Chaudhary via email) which MTH has been
using as \phyloInputs.
\subsection{Expand tips mapped to non-terminal taxa}\label{expandedPhyloStep}
\stepExplanation As explained in section \ref{expandNonTermPar}, expanding tips that are mapped
to non-terminal taxa to the full set of their terminal descendants and attaching these
tips to the parent of the taxon should generate a tree that correctly represents
what the input tree says (without erroneously claiming that the tree supports monophyly).
A clever implementation would note whether a descendant terminal taxon occurs in other
trees in the $\phyloInputs$ corpus.
If there are multiple terminal descendant taxa in the expansion that only occur in the
taxonomy, then it should be fine to let the expansions just contain 1 of these tips, $x$.
This would mean that the others are pruned in the next step, but will be placed in the
correct spot in the final summary tree because they should attach at the same parent node
as the single exemplar, $x$. Failing to take this optimization will only mean that the
pruned taxonomy is too large.
\stepInput \taxonomy and \phyloInputs
\stepOutput \expandedPhylo -- the set of phylogenetic inputs expanded such that no leaf is mapped
to a non-terminal taxon.
\currImpl None
\implTODO \TODO{write this}
\currURL We should probably post this set of trees, as many tools don't deal with tips that
are mapped to non-terminal taxa. So these trees may be the most accessible set of inputs
for most interested parties.
\subsection{Prune taxonomy down to tips represented in \expandedPhylo}\label{prunedTaxonomyStep}
\stepExplanation This is just an optimization step.
Each terminal taxon that is only found in \taxonomy, can be placed on the
final summary tree by creating a tree for the overlapping taxonomic
inputs and then grafting on the ``taxonomy only'' lineages.
This pruning makes the inputs for the subsequent steps smaller
\stepInput \taxonomy and \expandedPhylo
\stepOutput \prunedTaxonomy the pruned taxonomy
\currImpl \otcprune can perform this
\implTODO
\currURL may want to post this somewhere.
\subsection{Decompose the inputs into subproblems of uncontested taxa}\label{decomposeStep}
\stepExplanation The decision to force uncontested taxa in the final
summary means that we can separate the problems into non-overlapping
subproblems.
\stepInput \prunedTaxonomy and \expandedPhylo
\stepOutput subproblems. Currently expressed (1) as one newick tree file per subproblem
with the name \texttt{SUBPROBLEMID.tre}, and
(2) a file called \texttt{SUBPROBLEMID-tree-names.txt} with a treefile name
on each line or ``TAXONOMY'' indicating the source of each tree.
the \texttt{SUBPROBLEMID} is `ott' followed by the OTT ID.
\currImpl \otcdecompose
\implTODO
\currURL \url{http://phylo.bio.ku.edu/ot/export-sub-temp.tar.gz} has a snapshot, but
those subproblems were not produced with the non-terminal tips expanded to terminals
so there is some wonkiness - such tips are pruned if the taxon is contested, but their
ID sets still affect the embedding of the deeper nodes in the tree. This needs
to be rerun after step \ref{expandedPhyloStep} is completed.
\subsection{Simplify subproblems}\label{simplifyStep}
\stepExplanation As (to be) described in section \ref{simplificationTheory} there
are several operations that can be performed that will reduce the size of the
subproblems but which should not compromise our ability to obtain the same
subproblem solution.
Many subproblems are trivially solvable, so the tool that does this
will also be a crude solver.
\stepInput The set of subproblems, a simplified-problems directory, and a solutions directory
\stepOutput When possible, subproblem solutions and simplifications will be written to the 2 output directories.
\currImpl an implementation is started, but far from complete \href{https://github.com/OpenTreeOfLife/peyotl/blob/supertree/scripts/supertree/simplify_subproblems.py}{in the supertree branch of peyotl}
\implTODO \TODO {finish}
\currURL
\subsection{Solve subproblems}
\stepExplanation Attempt to find an admissible summary tree for the set of
summaries that are the \MSWIPSD set.
We probably want (1) a brute force implementation that we can use for small
subproblems so we do not have to worry about errors from finding local
optima, and (2) one or more heuristics.
\stepInput a ``raw'' subproblem from step \ref{decomposeStep} or a simplified
subproblem from step \ref{simplifyStep}.
\stepOutput a tree for each subproblem - stored in a solutions dir under the name
\texttt{SUBPROBLEMID.tre}
\currImpl treemachine may provide one solver.
\implTODO exact impl and, perhaps we need another approximate solver
\currURL
\subsection{Collapse unsupported nodes}
\stepExplanation If the solver does not guarantee that no unsupported nodes will
be introduced, then we can collapse them at this point.
As noted in section \ref{unsupportedTheory}, this should be done
iteratively rather than by identifying all unsupported edges and
collapsing all of them.
The latter approach would collapse too many edges.
\stepInput the solutions directory holding all of the subproblem solution trees.
\stepOutput a supported solutions directory holding all of the subproblem solution trees
which contain no unsupported nodes.
\currImpl None
\implTODO \TODO{write this}
\subsection{Assemble pruned summary tree from subproblem solutions}\label{assemblyStep}
\stepExplanation Because the problems do not overlap, and the file names
match the tip labels (when a tip of one subproblem is actually an uncontested
non-terminal taxon in \prunedTaxonomy), this is a simple grafting procedure.\\
Note that each subproblem is supported by the taxonomy (at a minimum), so this
step cannot introduce unsupported groups.
\stepInput the solutions directory holding all of the subproblem solution trees.
\stepOutput \prunedSummary -- the summary tree pruned down to the leaf set of \prunedTaxonomy.
\currImpl None
\implTODO \TODO{write this}
\subsection{Graft the pruned taxonomy-only taxa back onto the tree}
\stepExplanation ``phylo-referencing'' style logic can be used to place the
taxa that were pruned in \ref{prunedTaxonomyStep}
\stepInput \prunedSummary and \taxonomy
\stepOutput \summaryTree -- the final summary tree
\currImpl None
\implTODO \TODO{write this}
\subsection{Create annotations for nodes in $\summaryTree$}\label{annotationsStep}
\stepExplanation At minimum, we would want statements of which
input nodes support which nodes in $\summaryTree$.
But we could also noted nodes displayed, nodes in conflict,
and whether or not the node was constrained because it was a contested taxon.
\stepInput \summaryTree and \expandedPhylo
\stepOutput some as yet undefined format for expressing these annotations.
\currImpl None
\implTODO \TODO{write this}
\subsection{Serve \summaryTree and the annotations}
\stepExplanation we may be able to compile the annotations into a set of
static files to be served up to the current tree browser.
Or we may wish have a full database-driven web service
\stepInput \summaryTree and annotations from \ref{annotationsStep}
\stepOutput a web services API comparable to the
\href{https://github.com/OpenTreeOfLife/opentree/wiki/Open-Tree-of-Life-APIs#tree-of-life}{tree-of-life part of the API}.//
some of the services in that API would definitely require a db rather than just flat-files (e.g the MRCA and induced\_subtree)
\currImpl None unless we load the tree into treemachine and add methods for
serving up the annotations that are not coming from the graph-of-life
\implTODO \TODO{write this}
\newpage
\section{Algorithms}
\begin{algorithm} \caption{EmbedPhyloIntoTaxonomicScaffold}\label{embedTree} \begin{algorithmic}
\REQUIRE the taxonomic tree $\taxonomy$.
\REQUIRE an input tree, $T$ with a unique identifier $\mbox{id}(T)$
\FOR{each node $n_i$ in $\nodes{T}$}
\STATE{$z(n_i) \leftarrow \mbox{AlignNodes}(\taxonomy, n_i)$}
\ENDFOR
\FOR{each node $n_i$ in $\nodes{T}$}
\IF{$n_i \neq \treeRoot{T}$}
\STATE $y_i \leftarrow \parent{n_i}$
\STATE$\mbox{EmbedEdge}(\taxonomy, {y_i}, z(y_i), n_i, z(n_i), \mbox{id}(T))$
\ENDIF
\ENDFOR
\end{algorithmic}\end{algorithm}
\begin{algorithm} \caption{AlignNodes}\label{alignNodes} \begin{algorithmic}
\REQUIRE the taxonomic tree $\taxonomy$.
\REQUIRE a node from input tree, $n$.
\IF{isLeaf($n$)}
\RETURN the node in $\taxonomy$ that is mapped to the same taxonomic identifier that $n$ is mapped to.
\ELSE
\RETURN the node in $\taxonomy$ that is the least inclusive taxon that is an ancestor of all of
taxonomic labels in $\leafLabels{n}$.
\ENDIF
\end{algorithmic}\end{algorithm}
\begin{algorithm} \caption{EmbedEdge}\label{embedEdge} \begin{algorithmic}
\REQUIRE the taxonomic tree $\taxonomy$.
\REQUIRE a node from input tree, $n$, and it pair node in $\taxonomy$, $z(n)$
\REQUIRE the parent node of $n$, called $y$ and its pair node in $\taxonomy$, $z(y)$
\REQUIRE an identifier, $t$, that uniquely identifies the tree that contains $n$ and $y$.
\REQUIRE Each non-leaf node in $\taxonomy$ has 2 lookup tables: \textsc{LoopPaths} and \textsc{ExitPaths}
\STATE $p \leftarrow \left[n, z(n), y, z(y)\right]$ \COMMENT{$p$ is called the ``path pairing'' information}
\IF{$z(n) = z(y)$}
\STATE $z(y).\textsc{LoopPaths}[t] \leftarrow p$
\ELSE
\STATE $c\leftarrow z(n)$
\WHILE{$c \neq z(y)$}
\STATE $c.\textsc{ExitPaths}[t] \leftarrow p$
\STATE $c \leftarrow \parent{c}$
\ENDWHILE
\ENDIF
\end{algorithmic}
\end{algorithm}
\section{Subproblem solver approaches}\label{subproblemSolver}
MTH had some email conversation with David Bryant -- some of the ideas here came from
that conversation.
If we had a complete ordering of splits, we could use a variant of \textsc{BUILD} \citep{AhoSSU1981}
to generate a set of consistent splits, $\mathcal{C}$.
The procedure for doing that is outlined in algorithm \ref{consistentSplitsFromRankedList}.
Note that the original BUILD has scaling $O(MN)$ where $N$ is the number of leaves in the leaf set
and $M$ is the number of input splits.
\citet{JanssonLL2012} discuss how \citet{HenzingerKW1999} provide a DP approach that
reduces the runtime to $\min\{O(MN^{1/2}), O(M + N^2 \log N)\}$, and
\citet{HolmLT2001} improve this to $\min\{O(N + M\log^2 N), O(M + N^2 \log N)\}$
\begin{algorithm}
\caption{ConsistentSplitsFromRankedList}\label{consistentSplitsFromRankedList}
\begin{algorithmic}
\REQUIRE An ordered list of $M$ splits, $\mathcal{R} = [R_1, R_2, R_3, \ldots, R_M]$
\STATE $\mathcal{C} = [R_1]$
\FOR{each split $i$ in $[2, 3 \ldots M]$}
\STATE $\mathcal{T} \leftarrow \mathcal{C} + R_i$ \COMMENT{where `+' means concatenating 2 lists}
\IF{\textsc{BUILD}$(\mathcal{T})$ does not return null}
\STATE $\mathcal{C} \leftarrow \mathcal{T}$
\ENDIF
\ENDFOR
\RETURN $\mathcal{C}$
\end{algorithmic}
\end{algorithm}
If the number of splits in $\mathcal{C}$ is not too large, then we could use the (exponential) algorithm
of \citet{JanssonLL2012} to find a \textsc{MinRS} tree to find a solution to the subproblem.
\textsc{Note: it looks like \citet{ByrkaGJ2010} have a method for finding a set of consistent splits}
\subsection{Partial rankings}
(this section has some crude thoughts and no good solution.)
We have a complete ordering of tree, but this only generates a partial ordering on splits.
If two splits are both first encountered (during traversal through all groups in all trees)
in the same tree, then the ordering of the splits is undetermined.
A clumsy way to deal with this would be to use branch and bound: Add splits in a greedy fashion for a postorder traversal, and then add splits in a preorder traversal.
Whichever order yields the largest number of splits accepted, treat that as a bound.
Then start investigating constraints that force the inclusion of some of the excluded splits.
The set of splits in a tree which are incompatible with a previously excluded split can be discarded as a preprocessing step.
\section{Subproblem simplifications}
\subsection{Using the intersection leaf set}
Shortcut: prune the next tree to the intersection of its leaf set and the
leaf set of the current batch of consistent phylogenetic statements.
Any statement in the pruned tree that is inconsistent, need not be considered.
(since the conflict detection uses a smaller graph, it might be faster than using the full leaf set).
This is for the context of building up a set of consistent phylogenetic statements by adding
phylogenetic statements from a new tree.
Let $T_i$ be the new tree.
Let $\mathcal{A}_{i-1}$ denote a set of phylogenetic statements (from the consistent trees 1 to $i-1$).
Let $T_i^{\ast}$ and $\mathcal{A}_{i-1}^{\ast}$ denote $T_i$ and $\mathcal{A}_{i-1}$ pruned down by removing any leaves that are not in $\leafLabels{\mathcal{A}_{i-1}} \cap \leafLabels{T_i}$ and (in the case of $T_i$) any nodes that have an out-degree $<2$ as a result of this pruning.
\begin{theorem}
If a phylogenetic statement from $T_i^{\ast}$ is not consistent
with $\mathcal{A}_{i-1}^{\ast}$,
then the corresponding statement from $T_i$ will not be consistent with $\mathcal{A}_{i-1}$.
\end{theorem}
Proof: By the definition of compatability, adding new leaves cannot make an tree that is incompatible with a statement in $\mathcal{A}_{i-1}^{\ast}$ compatible with the fuller version of that statement.
\subsection{Testing consistency by pruning off new leaves}
Shortcut: prune off any ``new'' leaves from the next tree
and then test for consistency.
(since the conflict detection uses a smaller graph, it might be faster than using the full leaf set).
Let $T_i^{\dag}$ denote $T_i$ pruned down by removing any leaves that are not in $\leafLabels{\mathcal{A}_{i-1}}$ any nodes that have an out-degree $<2$ as a result of this pruning.
Each internal node in $T_i$ that is not a common ancestor of all of the labels in $\leafLabels{T_i^{\dag}}$ can be partitioned into one of 3 categories:
\begin{compactenum}
\item $\mathcal{N}^{\ddag}(T_i)$ is the set of nodes that have more than one child subtree with an contains a leaf in $\leafLabels{T_i^{\dag}}$.
\item $\mathcal{N}^{\dag}(T_i)$ is the set of nodes that have exactly one child subtree with an contains a leaf in $\leafLabels{T_i^{\dag}}$.
For every member of $\mathcal{N}^{\dag}(T_i)$ called $n$ let $D^{\dag}(n)$ denote the first descendant node that is either a leaf of $T_i$ or in $\mathcal{N}^{\ddag}(T_i)$.
\item $\mathcal{N}^{\circ}(T_i)$ is the set of nodes that are not the ancestor of any label in $\leafLabels{T_i^{\dag}}$.
\end{compactenum}
\begin{theorem}
If a phylogenetic statement derived from a member of $\mathcal{N}^{\ddag}(T_i)$ is consistent with $\mathcal{A}$ if and only if the corresponding member of $\mathcal{N}^{\ddag}(T_i^{\dag})$ is consistent with $\mathcal{A}$.
\end{theorem}
Proof: To be written
\begin{theorem}
Every phylogenetic statement derived from a member of $\mathcal{N}^{\circ}(T_i)$ is consistent with $\mathcal{A}_{i-1}$.
\end{theorem}
Proof: To be written. True because these statents are only making statements about previously unmentioned taxa.
\begin{theorem}
If $D^{\dag}(n)$ is a leaf, then the phylogenetic statement that corresponds
to $n$ is consistent with $\mathcal{A}_{i-1}$.
\end{theorem}
Proof: To be written. True becaus we are only adding sister groups to a previously trivial group.
\begin{theorem}
If $D^{\dag}(n)$ is an internal node, then the phylogenetic statement
that corresponds to $n$ will conflict with $\mathcal{A}_{i-1}$ if
and only if the phylogenetic statement that corresponds to
$D^{\dag}(n)$ conflicts with $\mathcal{A}_{i-1}$.
\end{theorem}
Proof: To be written. If $D^{\dag}(n)$ is consistent with $\mathcal{A}_{i-1}$
then there must be at least one tree that has an edge that separates all of the
descendants of $D^{\dag}(n)$ from all of the (mentioned) ancestors
of $n$.
Attaching the ``new'' subtree along this edge produces a tree that proves that
$D^{\dag}(n)$ is also consistent with $\mathcal{A}_{i-1}$.
In the case of $D^{\dag}(n)$ conflicting with $\mathcal{A}_{i-1}$, we know
that an edge separating a subset of the descendants from the mentioned
ancestors cannot be found. Adding more leaves will not alter that.
%\begin{theorem}
%One can identify the most resolved collapsed form of $T_i$ which is
% consistent with $\mathcal{A}_{i-1}$ by pruning both $\mathcal{A}_{i-1}$ and
% $T_i$ down to the intersection of the leaf labels set: $\leafLabels{\mathcal{A}_{i-1}} \cap \leafLabels{T_i}$
%\end{theorem}
\section{Acknowledgements}
Thanks to David Bryant for suggestions on the subproblem solver and
for pointing me to the work of J.~Jansson in the context of the
the section \ref{subproblemSolver} and \ref{minrs}.
\input{glossary.tex}
\bibliography{otcetera}
\end{document}