This repository has been archived by the owner on Sep 22, 2019. It is now read-only.

111 groupings in synthetic tree that are not supported by any input source tree #156

Closed
mtholder opened this issue Jan 30, 2015 · 15 comments

Comments

@mtholder
Member

Background

Issue #78 started because @ruchiherself's code identified cases in which a grouping in the synthetic tree conflicted with every tree in the input set. The definition of conflict is discussed in the "Conflict between trees and taxonomies" section of the supplemental material.

I started pursuing this using code that uses a slightly different criterion for flagging groups that I think are indicative of bugs in treemachine (or of our failure to capture the inputs precisely enough, such that the inputs we check actually differ from what was fed into treemachine, or of bugs in the checking tools).

This issue separates discussion of the problematic cases detected by the definition that I am using from the cases that Ruchi's code flags.

"unsupported"

UPDATE: I've revised this because Ruchi pointed out that I was not being consistent. The original text is now at the bottom of this post.

When I say that a group/node/edge in the synthetic tree is "unsupported" in this thread, I mean: If we were to collapse this group into its parent, then the Robinson-Foulds symmetric distance (RF) between the synthetic tree and each of the input trees would not increase.

For each input tree t, in the set of input trees T (which includes the taxonomic tree):

  1. use S(t) to denote the synthetic tree S restricted to the leaf set of t
  2. let r(S, t) be the RF distance between S(t) and t

If Y is the synthetic tree with some edge y collapsed, then we say that y is supported if
r(Y, t) > r(S, t) for at least one t in T.
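
As a rough illustration of this definition (this is not the findunsupportededges code), here is a minimal Python sketch. It assumes trees are written as nested tuples of tip names and that RF is the rooted, clade-based symmetric difference; the function names and the toy trees are made up for the example.

```python
# Sketch only: trees are nested tuples of tip names (an assumed encoding),
# and RF is taken as the rooted, clade-based symmetric difference.

def leaf_set(tree):
    """Tips below a node; a bare string is a tip."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Tip sets of all internal nodes, including the root."""
    if isinstance(tree, str):
        return set()
    found = {leaf_set(tree)}
    for child in tree:
        found |= clades(child)
    return found


def restrict(clade_set, tips):
    """Clades of the tree induced on `tips`; clades left with < 2 tips vanish."""
    return {c & tips for c in clade_set if len(c & tips) >= 2}


def rf(synth_clades, t):
    """r(S, t): RF distance between S restricted to t's tips and t."""
    return len(restrict(synth_clades, leaf_set(t)) ^ clades(t))


def is_supported(S, y, inputs):
    """True if collapsing clade y increases r(., t) for at least one input t."""
    s_clades = clades(S)
    y_clades = s_clades - {frozenset(y)}  # collapsing an edge removes its clade
    return any(rf(y_clades, t) > rf(s_clades, t) for t in inputs)


S = ("A", ("B", ("C", "D")))
t_1 = ("A", ("B", "C"))
t_2 = ("A", ("B", "D"))
print(is_supported(S, {"C", "D"}, [t_1, t_2]))           # False
print(is_supported(S, {"C", "D"}, [("B", ("C", "D"))]))  # True
```

Working on clade sets rather than pruning the tree keeps the sketch short: restricting a tree to a tip set and collapsing an edge both reduce to simple set operations.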

Software

I have written 2 tools to help find these cases:

  • checktaxonnodes checks all named nodes in the synthetic tree against their definition in OTT.
  • findunsupportededges looks for internal nodes in the synthetic tree that:
    • do not have a name and
    • which are not supported by any non-taxonomic input

These are in the examples subtree of NCL. I forked NCL to the Open Tree group to make it easier for any of us to modify it.

I've posted the contents of the standard output stream and the standard error stream.

There are 111 groupings that findunsupportededges found which are unsupported.

checktaxonnodes found 22 problems - those are reported on issue #154.

Differences from what Ruchi's code is calculating.

Under the Wilkinson terminology (if I'm understanding it correctly) if we had the synthetic tree of:

S = (A, (B, (C, D)))

from two inputs:

t_1 = (A, (B, C))
t_2 = (A, (B, D))

then I think the clade (C, D) would be considered irrelevant on both trees. Ruchi's code is reporting conflicting cases, so this would not be reported.

Under the "unsupported" definition that I am using, this grouping would be considered unsupported because the tree with the polytomy (A,(B,C,D)) fits the inputs just as well. Intuitively, there is no information in the inputs indicating that C is closer to D than it is to B, so it seems like we should be returning the polytomy.

This difference in evaluation explains why my software classifies this group to be unsupported, while Ruchi's code considers pg_2644_6164.tre to support it. The source tree does indeed have a grouping of (Aspidocarya + Parabaena) + (Tinomiscium + Tinospora). But if we back up one node in the synthetic tree, we see that the sister group is Calycocarpum. Calycocarpum is not sampled in pg_2644_6164.tre. So, according to that source tree there is no reason that you could not have any resolution of the 3 way polytomy: (Calycocarpum, (Aspidocarya, Parabaena), (Tinomiscium, Tinospora))

I think that Ruchi's list will be a subset of my list because of cases like this one. And this does not imply a bug in either - just different classification schemes.

Why I think this is a problem

All supertree methods have some quirks, so the presence of a few groupings that are not intuitive is not a problem per se. But I think these groups indicate that there is a bug in synthesis.

It could be that I am just misunderstanding the TAG procedure. If that is the case, I would appreciate someone correcting me. I thought that a valid description of the synthesis procedure would be:

  1. Add inputs to the TAG one at a time.

  2. For each node in an input tree t_i we create a set of edges to a LICA node. These nodes may include other taxa (because of other input trees). Crucially:

    A. This is the only operation that adds edges to the graph.

    B. The parent node of the edge will always be the MRCA of a larger set of leaves than the child node, even when restricted to the leaf set of t_i.

    C. Thus, t_i will support any edge that is created by its introduction into the TAG.

    D. Thus, every edge in the TAG will be supported by at least one input.

  3. The synthesis operation only decides what edges to "trace" to make a tree. It does not create new edges.

If all of that is correct, then every edge/grouping in the synthetic tree should be supported by at least one input. So my checktaxonnodes and findunsupportededges programs should also report no problems.

updated: typo in the first word of the description fixed. Doh!

Original incorrect definition of unsupported

Just for the record, here is the text that was originally above:

When I say that a group/node/edge in the synthetic tree is "unsupported" in this thread, I mean: If we were to collapse this group into its parent, then the total Robinson-Foulds symmetric distance (RF) between the synthetic tree and the set of inputs would not change.

We can calculate the total RF distance for the synthetic tree S as follows:

For each input tree t, in the set of input trees T (which includes the taxonomic tree):

  1. use S(t) to denote the synthetic tree S restricted to the leaf set of t
  2. let r(S, t) be the RF distance between S(t) and t

Then the total RF distance R(S,T) is simply the sum of r(S,t) over all t in T.
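
For illustration, the sum can be sketched in Python as follows. This is a toy under stated assumptions: trees are nested tuples of tip names, RF is the rooted, clade-based symmetric difference, and the inputs are the four small trees used as an example later in this thread. None of the function names come from the actual tools.

```python
# Sketch only: nested-tuple trees and rooted, clade-based RF are assumptions.

def leaf_set(tree):
    """Tips below a node; a bare string is a tip."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Tip sets of all internal nodes, including the root."""
    if isinstance(tree, str):
        return set()
    found = {leaf_set(tree)}
    for child in tree:
        found |= clades(child)
    return found


def restrict(clade_set, tips):
    """Clades of the tree induced on `tips`; clades left with < 2 tips vanish."""
    return {c & tips for c in clade_set if len(c & tips) >= 2}


def total_rf(synth_clades, inputs):
    """R(S, T): the sum of r(S, t) over every input tree t."""
    return sum(
        len(restrict(synth_clades, leaf_set(t)) ^ clades(t)) for t in inputs
    )


S = ("A", ("B", ("C", "D")))
T = [
    ("A", ("B", "C")),
    ("A", ("B", "D")),
    ("A", ("C", "D")),
    ("A", ("B", "C", "D")),
]

before = total_rf(clades(S), T)
# Collapsing the (C, D) edge is the same as dropping that clade.
after = total_rf(clades(S) - {frozenset({"C", "D"})}, T)
print(before, after)  # 1 0
```

On these inputs the total drops from 1 to 0 when (C, D) is collapsed, a case that "would not change" does not cover.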

@blackrim
Member

Just to be clear, tree6165 does support that (Aspidocarya + Parabaena) +
(Tinomiscium + Tinospora) grouping with Calycocarpum though (those are all
sampled in tree6165). So that isn't one of the unsupported nodes is it?
Unless I am missing something there.

There were some others that were pointed out in the previous emails that I
will check up on. Cody may need to explain how we place the non
monophyletic taxa because that may be where this is coming up. Otherwise,
there is likely a bug.


@mtholder
Member Author

But that source tree (6165) has Calycocarpum as sister to a group containing Aspidocarya, Parabaena, Tinomiscium, Tinospora but also Orthogynium.

In the synthetic tree, Orthogynium attaches well outside (https://tree.opentreeoflife.org/opentree/argus/otol.draft.22@3573300/Orthogynium) of this group, which is why my findunsupportededges does not give that tree credit for supporting that split.

@blackrim
Member

OK. This one looks to me like it is likely a non-monophyly thing. I will
check on that. The sentence needs to be deleted from the response to
reviewers anyway, because I am pretty sure that with non-monophyly
(regardless of whether this is a case of that) there are edges added. Cody
will need to chime in on that (I will ask him to if he doesn't and I see him
in a bit).


@mtholder
Member Author

I think Orthogynium is a monotypic taxon.

@blackrim
Member

Yeah, I don't mean within that group I mean within the Menispermoideae


@blackrim
Member

Or within the Menispermaceae rather


@mtholder
Member Author

I don't understand what you mean by "it is likely a non monophyly thing."

Menispermaceae is not a tip label in any of the input trees. So I don't understand how it being non-monophyletic is different from other cases of conflict between different sources of phylogenetic information.

Could you or @chinchliff or @josephwb confirm that the 3 numbered points that I list above are a correct characterization of the synthesis procedure?

I suppose I should add another statement:
#4. When an edge is chosen in the synthesis, all of its descendant tips will remain below the edge; they won't be cherry-picked into other groups.

If that is not the case (or any of my previous 3 statements are incorrect), then the 111 groupings reported here may just be a wart of the procedure and not a bug.

Edit: markdown caused my #4 to show up as 1. Fixed.

@ruchiherself

Mark, I want to add something here. For our input (taxonomy + 484 other
trees) your definition of 'unsupported' and my definition for unsupported
are (probably) the same. The example under "Differences from what Ruchi's
code is calculating." is correct. But remember we have taxonomy in the
input. Both synthetic tree and taxonomy are of the same size (# of leaves)
and taxonomy can never compute 'irrelevant' for any node of the synthetic
tree. So the disagreement between our definitions (explained through this
example) doesn't really apply in our case. For example, for the given tree
S = (A, (B, (C, D))) we will always have an input tree of the same size. In
this input tree either C and D will make a clade or not, so support or no
support, respectively.

I have already thoroughly studied some of your identified nodes (or
groupings) that I didn't have in my list. I have this feeling that our
lists of unsupported nodes will be identical. However all my analysis is
based on the Newick strings that I received from Joseph. If they are wrong
then I can not guarantee anything.


@mtholder
Member Author

Ruchi, you are correct that we have the full taxonomy, but it is highly unresolved. You can easily expand the case that I gave earlier. Consider:

S = (A, (B, (C, D)))

from 3 inputs:

t_1 = (A, (B, C))
t_2 = (A, (B, D))
t_3 = (A, (B, C, D))

I think that your code would say that the (C, D) group is irrelevant wrt the first 2 trees, and permitted by the third. So not "in conflict".

My code would call it "unsupported".

@ruchiherself

My 87 nodes include this case too. I am counting all those nodes that have
0 support, but may have permit, conflict, or irrelevant from all the input
trees.

In your example, the (C,D) group will get irrelevant from the first two
input trees and permit from the last input tree, as you said. So for the
(C,D) group, support = 0, permit = 1, conflict = 0, and irrelevant = 2. So
the (C,D) group must be in my list, since support is 0 for it.


@mtholder
Member Author

Ah, I see. Thanks for clarifying. But I think that our codes would diverge on:

S = (A, (B, (C, D)))

from 4 inputs:

t_1 = (A, (B, C))
t_2 = (A, (B, D))
t_3 = (A, (C, D))
t_4 = (A, (B, C, D))

My code would still call the (C,D) clade "unsupported" because none of the inputs say that C is closer to D than it is to B.

@ruchiherself

Wait...but your initial definition of "unsupported" doesn't approve of
that.


Unsupported: When I say that a group/node/edge in the synthetic tree is
"unsupported" in this thread, I mean: If we were to collapse this group
into its parent, then the total Robinson-Foulds symmetric distance (RF)
between the synthetic tree and the set of inputs would not change.

So (C,D) is not an "unsupported" group by your definition, since if we
collapse (C,D) into its parent (B,C,D) in 'S' then R(S,T) becomes 0, but
it was 1 before collapsing.


@mtholder
Member Author

Good point. I should have said that the RF distance stays the same or decreases. So the unresolved form of the synthetic tree is at least as good as the resolved form when there is an "unsupported" node.

Sorry for the confusion.

@mtholder
Member Author

I hadn't been thinking of unresolved inputs clearly when I wrote this issue report.
What I should have said was:

By "unsupported" I mean that if we collapse the edge, the RF distance for the restricted synthetic tree to each of the source trees is unchanged or decreases.

My code doesn't calculate the total RF. It just tries to find (for every edge in the synthetic tree) at least 1 input tree that supports the edge. If collapsing the edge causes the RF to any of the input trees to increase, then it calls the edge supported. Sorry again for misstating this earlier.

@ruchiherself

I think I understand it now. It's different from my count. I declare
support for a node in the synthetic tree if at least one input tree has an
identical clade (after restricting, of course). But Mark's definition finds
support for a node in the synthetic tree if the RF distance from at least
one input tree goes up after collapsing this node. I think the RF distance
can go up only for those input trees that originally had an identical clade
(that is, those that support the node by my definition). In particular, the
RF distance from those trees can either stay the same or go up. For the
remaining trees, the node is either irrelevant or their RF distance goes
down (i.e., the permit or conflict cases).
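
To make the contrast concrete, here is a hedged Python sketch that applies both criteria to the four-input toy example from earlier in the thread. It is a reading of the two definitions as described in words here, not either of the real programs: trees are nested tuples of tip names, RF is the rooted, clade-based symmetric difference, and `identical_clade_support` and `rf_support` are hypothetical names.

```python
# Sketch only: both functions are assumed readings of the two definitions.

def leaf_set(tree):
    """Tips below a node; a bare string is a tip."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Tip sets of all internal nodes, including the root."""
    if isinstance(tree, str):
        return set()
    found = {leaf_set(tree)}
    for child in tree:
        found |= clades(child)
    return found


def rf(synth_clades, t):
    """RF distance between the synthetic clades restricted to t's tips and t."""
    tips = leaf_set(t)
    restricted = {c & tips for c in synth_clades if len(c & tips) >= 2}
    return len(restricted ^ clades(t))


def identical_clade_support(y, t):
    """Restricted to t's tips, y is a non-trivial clade that t also contains."""
    tips = leaf_set(t)
    restricted = frozenset(y) & tips
    return 2 <= len(restricted) < len(tips) and restricted in clades(t)


def rf_support(synth_clades, y, t):
    """Collapsing y makes the RF distance to t go up."""
    collapsed = synth_clades - {frozenset(y)}
    return rf(collapsed, t) > rf(synth_clades, t)


S = ("A", ("B", ("C", "D")))
inputs = [
    ("A", ("B", "C")),
    ("A", ("B", "D")),
    ("A", ("C", "D")),
    ("A", ("B", "C", "D")),
]
y = {"C", "D"}
by_clade = [identical_clade_support(y, t) for t in inputs]
by_rf = [rf_support(clades(S), y, t) for t in inputs]
print(by_clade)  # [False, False, True, False]
print(by_rf)     # [False, False, False, False]
```

Under these assumptions the third input, (A, (C, D)), counts as support for (C, D) by the identical-clade reading, but no input's RF distance rises when the edge is collapsed, so the RF-based reading leaves the group unsupported.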

My list should be a subset of Mark's nodes. I also think that these
extra nodes (Mark's nodes - my nodes) can be computed using Wilkinson et
al.'s strongest support (page 828 in that paper). I have computed those
numbers but never included them in the Science or PNAS paper. I can provide
them if they are useful.

