Need documentation of how we munge names in newick export #147
Comments
Other related issues: OpenTreeOfLife/ot-base#10 |
Just a follow-up example of a hard case. It is not a pretty state of affairs. OTT ID 572918 has the troublesome name "Oerskovia sp. 7(2011)". That node is a child of OTT ID 125746 in the taxonomy and the synthetic tree. We have methods that return the subtree in newick for the taxonomy and the synthetic tree, but none of them return that name correctly quoted in newick.
- Using "id" works, but does not return the name (obviously).
- Using "original_name" returns illegal newick.
- Using "name" returns a name in which, because it is quoted, the spaces are effectively changed to underscores.
- Using "name_and_id" returns a pair of tokens for the name: 'Oerskovia_sp._7(2011)'_ott572918. I think that this is illegal (it definitely is in NEXUS, and I think it is in newick, too).
- Using tree_of_life/subtree returns a munged name, 'Oerskovia_sp_7_2011_ott572918', that is quoted but has underscores and lacks the punctuation.

Details below: original_name
which is illegal newick, because names with punctuation are not quoted. I suppose that one could argue that this is the correct behavior for this argument, but the fact that some names have parentheses implies to me that we should not support this. Using name:
returns
is legal, but has some names with _ in them (because they are quoted) and others with _ being translated to spaces. Using name_and_id
returns
which is illegal (I think) because some labels are now multiple tokens. For example: 'Oerskovia_sp._7(2011)'_ott572918 And using the tree_of_life service:
returns
includes name munging to give a single quoted 'Oerskovia_sp_7_2011_ott572918' |
Yes, we were going to include a proper newick writer in the jade OT-base. It will be good to have this reference of the current issues when it comes
On Wednesday, December 10, 2014, Mark T. Holder notifications@github.com
|
It looks like (for at least the draftversion2.tre newick) the decision about whether to quote is made before the substitution of _ for punctuation. So some labels like |
mtholder: for some reason, |
This should probably be fixed, for the sake of invertibility... or is |
Submitted a PR fix for the quote error above. |
Invertibility is not a lost cause in terms of regenerating the real OTT name from the newick/nexus. We can encode any string in newick or nexus. There are often 2 legal syntaxes in those formats for any string (a quoted form and an unquoted form). One can't reliably predict which of the two forms will be used (without looking at the code). So you can't go: But you can go: And you can go: |
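As an illustration of why the two syntaxes do not have to break invertibility, here is a minimal Python sketch of a label decoder. It assumes the conventional newick rules (an unquoted underscore stands for a space; inside single quotes everything is literal and a doubled quote stands for one quote). The function name `decode_newick_label` is hypothetical and is not taken from any OpenTree code.

```python
def decode_newick_label(token: str) -> str:
    """Return the string that a single newick label token stands for.

    Assumes conventional newick rules; not taken from OpenTree code.
    """
    if len(token) >= 2 and token.startswith("'") and token.endswith("'"):
        # Quoted form: contents are literal; a doubled quote is one quote.
        return token[1:-1].replace("''", "'")
    # Unquoted form: underscores stand for spaces.
    return token.replace("_", " ")


# Both legal spellings of the same name decode to the same string.
same = decode_newick_label("'Homo sapiens'") == decode_newick_label("Homo_sapiens")
hard_case = decode_newick_label("'Oerskovia sp. 7(2011)'")
```

Under these assumptions a reader never needs to predict which form the writer chose: either form decodes to the same name, as long as the writer applied no further character replacement.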
Off the back of OpenTree v5 and the new naming scheme, I suggest that it might be useful if taxon names in the downloadable newick file did not contain braces and commas (and possibly not colons either). This makes it easy to parse the newick file using regular expressions and the like, without having to parse the actual tree structure, or parse the quoting of labels. That makes it a lot faster and less memory intensive to mess with the tree. Of course, this may make it impossible to maintain consistent labels between tree machine and taxomachine, so I can foresee objections. |
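To illustrate the kind of lightweight parsing this would allow, here is a minimal Python sketch. It assumes that no label contains parentheses, commas, colons, or semicolons, and that the newick string has no whitespace between tokens. The function name `find_labels` and the labels in the example are made up for illustration.

```python
import re

# A label is a run of characters that directly follows "(", "," or ")" and
# contains none of the structural characters. This only holds if names are
# guaranteed not to contain those characters (the assumption of this sketch).
LABEL_RE = re.compile(r"(?<=[(),])[^(),:;]+")


def find_labels(newick: str) -> list[str]:
    """Return tip and internal node labels without building a tree."""
    return LABEL_RE.findall(newick)


# Hypothetical labels; branch lengths after ":" are skipped.
example = "(A_ott1:0.1,(B_ott2,C_ott3)E_ott5)D_ott4;"
labels = find_labels(example)
```

The lookbehind keeps branch lengths out of the result, because a branch length follows a colon rather than a parenthesis or comma. A tree consisting of a single unparenthesised label would not be matched by this pattern.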
There's an issue for invertibility:
OpenTreeOfLife/germinator#76
|
OK @hyanwong, I have posted a version of the tree with simplified names at http://phylo.bio.ku.edu/ot/opentree5.0_simplified_names.tre.gz and a log of the edits at http://phylo.bio.ku.edu/ot/munging_log.txt
We'll still need to work this step into the pipeline and figure out a standard name for this output. @bredelings and I both came up with tools to do this. His |
I should have mentioned that I replaced a few other characters that other users might want to avoid |
@mtholder thanks. Yes, putting the step into the pipeline would be useful, and documenting it. @jar398 I guess it is the braces that are most likely to cause problems - . An ugly solution to retain invertibility would be to e.g. replace () with <> (gt & lt signs only appear in 10 taxa on the tree) or {} (no current taxa contain curly braces). But that seems rather hacky. In http://phylo.bio.ku.edu/ot/munging_log.txt there are 510 names that contain commas: these are nearly all where the taxon name contains the authority or year of description. I don't know if this is something that you want to be included in the taxon name or not. |
I'd still vote for not munging the names, but if we do continue this we should explain it to users.
I think that the relevant code is:
https://github.com/OpenTreeOfLife/ot-base/blob/master/src/main/java/org/opentree/utils/GeneralUtils.java
I think that the explanation now is that we create "TAXONNAME_ottOTTID" as the label, then use normal newick escaping rules except:
A. all colons are converted to _
B. all spaces go to _ before making the quoting decision.
C. _ characters are ignored in the quoting decision.
But I'm not sure whether the getNewick in https://github.com/OpenTreeOfLife/treemachine/blob/master/src/main/java/jade/tree/JadeNode.java
then does some character replacement.
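The rules A to C above can be sketched in Python as follows. This is only an illustration of the explanation as written, not a port of GeneralUtils.java: the function name `munged_label` is hypothetical, and the set of characters that trigger quoting is an assumption that may differ from the real code.

```python
# Characters assumed to force quoting. The real set lives in the Java code
# and may differ; "_" is deliberately absent (rule C).
NEEDS_QUOTING = set("()[]':;,")


def munged_label(name: str, ott_id: int) -> str:
    """Build a TAXONNAME_ottOTTID label following rules A to C (a sketch)."""
    label = f"{name}_ott{ott_id}"
    label = label.replace(":", "_")  # rule A: colons become underscores
    label = label.replace(" ", "_")  # rule B: spaces become underscores first
    # Quoting decision is made after the substitutions; underscores never
    # trigger it (rule C).
    if any(ch in NEEDS_QUOTING for ch in label):
        return "'" + label.replace("'", "''") + "'"
    return label


hard_case = munged_label("Oerskovia sp. 7(2011)", 572918)
```

Under these assumptions the hard case comes out as a single quoted token that keeps its punctuation, which is not the same as the label the tree_of_life service was reported to return. That difference would be consistent with a later character replacement step such as the one suspected in getNewick, but this sketch cannot confirm it.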
Email thread: https://groups.google.com/forum/#!topic/opentreeoflife/4_5DYH5deS0