This repository has been archived by the owner on Sep 22, 2019. It is now read-only.

Need documentation of how we munge names in newick export #147

Open
mtholder opened this issue Nov 28, 2014 · 13 comments

@mtholder
Member

I'd still vote for not munging the names, but if we do continue this we should explain it to users.

I think that the relevant code is:
https://github.com/OpenTreeOfLife/ot-base/blob/master/src/main/java/org/opentree/utils/GeneralUtils.java

I think that the explanation now is that we create "TAXONNAME_ottOTTID" as the label, then use normal newick escaping rules except:

A. all colons are converted to _
B. all spaces go to _ before making the quoting decision.
C. _ characters are ignored in the quoting decision.
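A minimal Python sketch of rules A to C as described here, in case it helps whoever writes the docs. It is not a port of GeneralUtils.java: the function name `munged_label` and the `NEEDS_QUOTES` character set are assumptions for illustration, and the real list of quote-triggering characters may differ.

```python
import re

# Assumed set of characters that force quoting (whitespace, brackets,
# single quote, comma, semicolon, colon). Underscore is deliberately
# absent, which is rule C. The real list in GeneralUtils.java may differ.
NEEDS_QUOTES = re.compile(r"[\s()\[\]',;:]")


def munged_label(name: str, ott_id: int) -> str:
    """Build TAXONNAME_ottOTTID and apply rules A-C (hypothetical helper)."""
    label = f"{name}_ott{ott_id}"
    label = label.replace(":", "_")  # rule A: colons become underscores
    label = label.replace(" ", "_")  # rule B: spaces become underscores first
    # Rule C: the quoting decision ignores underscores.
    if NEEDS_QUOTES.search(label):
        # Standard newick escaping: double any embedded single quote.
        return "'" + label.replace("'", "''") + "'"
    return label
```

With these assumptions, `munged_label("Oerskovia turbata", 5255224)` gives `Oerskovia_turbata_ott5255224`, and a name containing parentheses comes back as one quoted token.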

But I'm not sure whether the getNewick in https://github.com/OpenTreeOfLife/treemachine/blob/master/src/main/java/jade/tree/JadeNode.java
then does some additional character replacement.

Email thread: https://groups.google.com/forum/#!topic/opentreeoflife/4_5DYH5deS0

@mtholder
Member Author

Other related issues: OpenTreeOfLife/ot-base#10
#131
OpenTreeOfLife/taxomachine#74

@mtholder
Member Author

Just a follow up example of a hard case. It is not a pretty state of affairs. OTT ID 572918 has the troublesome name: "Oerskovia sp. 7(2011)"

That node is a child of OTT ID 125746 in the taxonomy and the synthetic tree.

We have methods that return the subtree in newick for the taxonomy and the synthetic tree, but none of them return that name correctly quoted in newick. taxonomy/subtree takes a label_format argument for how to label the tips.

Using "id" works - but does not return the name (obviously)

Using "original_name" returns illegal newick.

Using "name" returns a quoted name in which the spaces have been changed to underscores; because the label is quoted, those underscores are literal.

Using "name_and_id" returns a pair of tokens for the name: 'Oerskovia_sp._7(2011)'_ott572918. I think that this is illegal (it definitely is in NEXUS, and I think that it is in newick, too).

Using tree_of_life/subtree returns a munged name: 'Oerskovia_sp_7_2011_ott572918' that is quoted, but has underscores and lacks the punctuation.

Details below:

original_name

curl -X POST http://devapi.opentreeoflife.org/v2/taxonomy/subtree -H 'Content-type:application/json' -d '{"ott_id":125746, "label_format":"original_name" }'

returns:

{
"subtree" : "(Oerskovia turbata,Oerskovia sp. MP7,Oerskovia sp. 3146-i3a2,Oerskovia sp. MP4d,Oerskovia sp. 3146-i3b,Oerskovia sp. MP6d,Oerskovia sp. Tibet-YD4604-7,Oerskovia sp. Tibet-YD4604-5,Oerskovia sp. Eab19,Oerskovia sp. Bra16,Oerskovia sp. Ms17,uncultured Oerskovia sp.,Oerskovia sp. Ms38,Oerskovia sp. Ms37,Oerskovia sp. YIM 100718,Oerskovia sp. CHP-ZH25,Oerskovia sp. K2011,Oerskovia sp. K2012,Oerskovia sp. YIM 100122,Oerskovia sp. SAUK 6039,Oerskovia enterophila,Oerskovia sp. Y1,Oerskovia sp. Lgg15.9,Oerskovia sp. L1911,Oerskovia sp. YIM 100566,Oerskovia paurometabola,Oerskovia sp. SAUK6041,Oerskovia sp. SAUK6045,Oerskovia sp. VTT E-073039,Oerskovia sp. YIM 48801,Oerskovia sp. KBS0722,Oerskovia sp. 7(2011),Oerskovia sp. 463-2,Oerskovia sp. R-32754,Cellulomonas sp. UFZ-B529,Oerskovia sp. CATR-180,Oerskovia sp. B19,Oerskovia sp. B18,Oerskovia sp. B6,Oerskovia sp. B28,Oerskovia ginkgo,Oerskovia sp. 27(2011),Oerskovia sp. 26(2011),(Oerskovia turbata NBRC 15015)Oerskovia turbata,Oerskovia sp. SAUK 6042,Oerskovia sp. YIM 69644,Oerskovia sp. SAUK6219,Oerskovia sp. SAUK6230,Oerskovia jenensis,Oerskovia sp. S10(2012),Oerskovia sp. LCB39,Oerskovia sp. I_Gauze_W_12_3,Oerskovia sp. IHB B 3473,Oerskovia sp. B17,Oerskovia sp. PG1-2/67,Oerskovia sp. R-45820)Oerskovia;"
}

which is an illegal newick, because names with punctuation are not quoted. I suppose that one could argue that this is the correct behavior for this argument, but the fact that some names have parentheses implies to me that we should not support this.

Using name:

curl -X POST http://devapi.opentreeoflife.org/v2/taxonomy/subtree -H 'Content-type:application/json' -d '{"ott_id":125746, "label_format":"name"}'

returns

{
  "subtree" : "(Oerskovia_turbata,Oerskovia_sp._MP7,Oerskovia_sp._3146-i3a2,Oerskovia_sp._MP4d,Oerskovia_sp._3146-i3b,Oerskovia_sp._MP6d,Oerskovia_sp._Tibet-YD4604-7,Oerskovia_sp._Tibet-YD4604-5,Oerskovia_sp._Eab19,Oerskovia_sp._Bra16,Oerskovia_sp._Ms17,uncultured_Oerskovia_sp.,Oerskovia_sp._Ms38,Oerskovia_sp._Ms37,Oerskovia_sp._YIM_100718,Oerskovia_sp._CHP-ZH25,Oerskovia_sp._K2011,Oerskovia_sp._K2012,Oerskovia_sp._YIM_100122,Oerskovia_sp._SAUK_6039,Oerskovia_enterophila,Oerskovia_sp._Y1,Oerskovia_sp._Lgg15.9,Oerskovia_sp._L1911,Oerskovia_sp._YIM_100566,Oerskovia_paurometabola,Oerskovia_sp._SAUK6041,Oerskovia_sp._SAUK6045,Oerskovia_sp._VTT_E-073039,Oerskovia_sp._YIM_48801,Oerskovia_sp._KBS0722,'Oerskovia_sp._7(2011)',Oerskovia_sp._463-2,Oerskovia_sp._R-32754,Cellulomonas_sp._UFZ-B529,Oerskovia_sp._CATR-180,Oerskovia_sp._B19,Oerskovia_sp._B18,Oerskovia_sp._B6,Oerskovia_sp._B28,Oerskovia_ginkgo,'Oerskovia_sp._27(2011)','Oerskovia_sp._26(2011)',(Oerskovia_turbata_NBRC_15015)Oerskovia_turbata,Oerskovia_sp._SAUK_6042,Oerskovia_sp._YIM_69644,Oerskovia_sp._SAUK6219,Oerskovia_sp._SAUK6230,Oerskovia_jenensis,'Oerskovia_sp._S10(2012)',Oerskovia_sp._LCB39,Oerskovia_sp._I_Gauze_W_12_3,Oerskovia_sp._IHB_B_3473,Oerskovia_sp._B17,'Oerskovia_sp._PG1-2/67',Oerskovia_sp._R-45820)Oerskovia;"
}

which is legal, but in the quoted names the _ is a literal underscore, while in the other (unquoted) names the _ will be translated to a space.

Using name_and_id

curl -X POST http://devapi.opentreeoflife.org/v2/taxonomy/subtree -H 'Content-type:application/json' -d '{"ott_id":125746, "label_format":"name_and_id"}'

returns

{
  "subtree" : "(Oerskovia_turbata_ott5255224,Oerskovia_sp._MP7_ott5371302,Oerskovia_sp._3146-i3a2_ott5371301,Oerskovia_sp._MP4d_ott5371300,Oerskovia_sp._3146-i3b_ott5371299,Oerskovia_sp._MP6d_ott5371298,Oerskovia_sp._Tibet-YD4604-7_ott5371297,Oerskovia_sp._Tibet-YD4604-5_ott5371296,Oerskovia_sp._Eab19_ott5161638,Oerskovia_sp._Bra16_ott5161637,Oerskovia_sp._Ms17_ott5161636,uncultured_Oerskovia_sp._ott5161635,Oerskovia_sp._Ms38_ott5161633,Oerskovia_sp._Ms37_ott5161632,Oerskovia_sp._YIM_100718_ott1081896,Oerskovia_sp._CHP-ZH25_ott1007916,Oerskovia_sp._K2011_ott866688,Oerskovia_sp._K2012_ott866687,Oerskovia_sp._YIM_100122_ott864224,Oerskovia_sp._SAUK_6039_ott856992,Oerskovia_enterophila_ott816580,Oerskovia_sp._Y1_ott732905,Oerskovia_sp._Lgg15.9_ott784893,Oerskovia_sp._L1911_ott677213,Oerskovia_sp._YIM_100566_ott714355,Oerskovia_paurometabola_ott697480,Oerskovia_sp._SAUK6041_ott606294,Oerskovia_sp._SAUK6045_ott606297,Oerskovia_sp._VTT_E-073039_ott654863,Oerskovia_sp._YIM_48801_ott647002,Oerskovia_sp._KBS0722_ott565557,'Oerskovia_sp._7(2011)'_ott572918,Oerskovia_sp._463-2_ott485674,Oerskovia_sp._R-32754_ott432589,Cellulomonas_sp._UFZ-B529_ott369560,Oerskovia_sp._CATR-180_ott375337,Oerskovia_sp._B19_ott385056,Oerskovia_sp._B18_ott385057,Oerskovia_sp._B6_ott385043,Oerskovia_sp._B28_ott385055,Oerskovia_ginkgo_ott385196,'Oerskovia_sp._27(2011)'_ott351018,'Oerskovia_sp._26(2011)'_ott351017,(Oerskovia_turbata_NBRC_15015_ott4770673)Oerskovia_turbata_ott301645,Oerskovia_sp._SAUK_6042_ott282206,Oerskovia_sp._YIM_69644_ott224479,Oerskovia_sp._SAUK6219_ott190501,Oerskovia_sp._SAUK6230_ott190503,Oerskovia_jenensis_ott174409,'Oerskovia_sp._S10(2012)'_ott149450,Oerskovia_sp._LCB39_ott138899,Oerskovia_sp._I_Gauze_W_12_3_ott136674,Oerskovia_sp._IHB_B_3473_ott142547,Oerskovia_sp._B17_ott121144,'Oerskovia_sp._PG1-2/67'_ott106812,Oerskovia_sp._R-45820_ott87860)Oerskovia_ott125746;"
}

which is illegal (I think) because some labels are now multiple tokens. For example: 'Oerskovia_sp._7(2011)'_ott572918
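To make the "multiple tokens" problem concrete, here is a rough Python sketch that scans a newick string and reports any label written as two adjacent tokens (a quoted string directly followed by unquoted text, or the reverse). The tokenizer and the name `split_labels` are assumptions made for illustration; it is not a full newick parser and it takes the view that a label must be a single token.

```python
import re

# Three token kinds: a quoted string (with '' as an escaped quote),
# a run of unquoted label characters, or a single delimiter.
TOKEN = re.compile(
    r"(?P<quoted>'(?:[^']|'')*')"
    r"|(?P<bare>[^',();:\[\]\s]+)"
    r"|(?P<delim>[,();:\[\]\s])"
)


def split_labels(newick: str) -> list:
    """Return labels that appear as two label tokens with nothing between."""
    problems = []
    prev = None
    for match in TOKEN.finditer(newick):
        if (
            prev is not None
            and prev.lastgroup != "delim"
            and match.lastgroup != "delim"
            and prev.end() == match.start()
        ):
            problems.append(prev.group() + match.group())
        prev = match
    return problems
```

Run on a fragment of the output above, it flags `'Oerskovia_sp._7(2011)'_ott572918`; if the whole string `Oerskovia sp. 7(2011)_ott572918` is placed inside one pair of quotes, nothing is flagged.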

And using the tree_of_life service:

curl -X POST http://devapi.opentreeoflife.org/v2/tree_of_life/subtree -H 'Content-type:application/json' -d '{"ott_id":125746, "label_format":"name_and_id" }'

returns

{
"newick" : "((Oerskovia_turbata_NBRC_15015_ott4770673)Oerskovia_turbata_ott301645,Oerskovia_sp_MP7_ott5371302,Oerskovia_sp_B6_ott385043,'Oerskovia_sp_26_2011_ott351017',Oerskovia_sp_CHP-ZH25_ott1007916,Oerskovia_sp_CATR-180_ott375337,Oerskovia_sp_YIM_100122_ott864224,Oerskovia_sp_B18_ott385057,Oerskovia_sp_KBS0722_ott565557,Oerskovia_sp_SAUK6230_ott190503,Oerskovia_sp_3146-i3a2_ott5371301,Oerskovia_sp_YIM_69644_ott224479,Oerskovia_sp_K2012_ott866687,Oerskovia_sp_YIM_100566_ott714355,Oerskovia_sp_MP4d_ott5371300,Oerskovia_sp_Ms17_ott5161636,Oerskovia_sp_3146-i3b_ott5371299,'Oerskovia_sp_PG1-2_67_ott106812',Oerskovia_sp_MP6d_ott5371298,Oerskovia_sp_K2011_ott866688,Oerskovia_sp_B19_ott385056,Oerskovia_sp_L1911_ott677213,Oerskovia_sp_R-32754_ott432589,Oerskovia_sp_Ms38_ott5161633,'Oerskovia_sp_27_2011_ott351018',Oerskovia_sp_SAUK6045_ott606297,Oerskovia_sp_SAUK6219_ott190501,Oerskovia_sp_Ms37_ott5161632,Oerskovia_sp_R-45820_ott87860,Oerskovia_sp_Lgg15_9_ott784893,Oerskovia_sp_B17_ott121144,Oerskovia_sp_Tibet-YD4604-7_ott5371297,Oerskovia_sp_LCB39_ott138899,Oerskovia_sp_YIM_100718_ott1081896,Oerskovia_sp_Bra16_ott5161637,Oerskovia_sp_463-2_ott485674,Oerskovia_sp_Y1_ott732905,Oerskovia_sp_B28_ott385055,'Oerskovia_sp_7_2011_ott572918',Oerskovia_sp_SAUK6041_ott606294,Oerskovia_sp_SAUK_6039_ott856992,Oerskovia_sp_Tibet-YD4604-5_ott5371296,Oerskovia_sp_Eab19_ott5161638,Oerskovia_sp_VTT_E-073039_ott654863,'Oerskovia_sp_S10_2012_ott149450',Oerskovia_sp_SAUK_6042_ott282206,Oerskovia_sp_IHB_B_3473_ott142547,Oerskovia_sp_YIM_48801_ott647002,Oerskovia_turbata_ott5255224,Cellulomonas_sp_UFZ-B529_ott369560,Oerskovia_enterophila_ott816580,Oerskovia_ginkgo_ott385196,Oerskovia_jenensis_ott174409,Oerskovia_paurometabola_ott697480,Oerskovia_sp_I_Gauze_W_12_3_ott136674,uncultured_Oerskovia_sp_ott5161635)Oerskovia_ott125746;",
"tree_id" : "otol.draft.22"
}

which includes name munging to give a single quoted token: 'Oerskovia_sp_7_2011_ott572918'

@chinchliff
Member

Yes, we were going to include a proper newick writer in the jade OT-base
classes, so we could correctly process newick names, and use the same code
across treemachine and taxomachine. Joseph and I haven't wanted to mess
with the newick names until that is done, and it hasn't been done yet...

It will be good to have this reference of the current issues when it comes
time to write the corrected name writer.

@mtholder
Member Author

It looks like (for at least the draftversion2.tre newick) the decision about whether to quote is made before the substitution of _ for punctuation. So some labels like Fibrobacteres/Acidobacteria group get converted to 'Fibrobacteres_Acidobacteria_group' with single quotes (even though we don't quote tokens with _ as the only odd character in other contexts).
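A small Python sketch of why the order matters. Both function names and the `ILLEGAL` character set are assumptions for illustration (the set includes "/" only to reproduce the behavior described here); this is not the treemachine code.

```python
import re

# Assumed "newick-illegal" set, including "/" for the sake of the example.
ILLEGAL = re.compile(r"[\s()\[\]',;:/]")


def quote_then_substitute(name: str) -> str:
    """Decide on quoting from the raw name, then replace punctuation."""
    needs_quotes = ILLEGAL.search(name) is not None
    munged = ILLEGAL.sub("_", name)
    return f"'{munged}'" if needs_quotes else munged


def substitute_then_quote(name: str) -> str:
    """Replace punctuation first, then decide on quoting from the result."""
    munged = ILLEGAL.sub("_", name)
    needs_quotes = ILLEGAL.search(munged) is not None
    return f"'{munged}'" if needs_quotes else munged
```

For `"Fibrobacteres/Acidobacteria group"` the first order yields a quoted label even though only underscores remain, while the second yields the same text unquoted.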

@josephwb
Member

mtholder: for some reason, "/" got put in the "newick-illegal" list. I'll fix that.

@jar398
Member

jar398 commented Jan 26, 2015

This should probably be fixed, for the sake of invertibility... or is
invertibility a lost cause?

@josephwb
Member

Submitted a PR fix for the quote error above.

@mtholder
Member Author

invertibility is not a lost cause in terms of regenerating the real OTT name from the newick/nexus.

We can encode any string in newick or nexus.

There are often 2 legal syntaxes in those formats for any string (a quoted form and an unquoted form). One can't reliably predict which of the two forms will be used (without looking at the code). So you can't go:
newick -> internal representation -> newick
and guarantee the exact same form.

But you can go:
any string -> newick -> any string
exactly.

And you can go:
newick -> internal representation -> equivalent newick representation
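Here is a hedged Python sketch of the "any string -> newick -> any string" round trip. The helper names and the conservative choice to quote anything containing whitespace, punctuation, or an underscore are assumptions; a real writer could just as legally pick the unquoted form for some of these labels.

```python
import re

# Assumption: quote whenever the string contains any of these. Underscore is
# included because an unquoted _ is read back as a space.
NEEDS_QUOTES = re.compile(r"[\s()\[\]',;:_]")


def to_newick_label(text: str) -> str:
    """Encode an arbitrary string as one newick label token."""
    if not NEEDS_QUOTES.search(text):
        return text
    return "'" + text.replace("'", "''") + "'"


def from_newick_label(token: str) -> str:
    """Decode a single label token back to the string it represents."""
    if len(token) >= 2 and token.startswith("'") and token.endswith("'"):
        return token[1:-1].replace("''", "'")
    return token.replace("_", " ")  # unquoted underscores mean spaces
```

Decoding `Oerskovia_turbata` and `'Oerskovia turbata'` gives the same string, which is the "two legal syntaxes" point: the string survives the round trip exactly, but the written form need not.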

@hyanwong

Off the back of OpenTree v5 and the new naming scheme, I suggest that it might be useful if taxon names in the downloadable newick file did not contain braces and commas (and possibly not colons either). This makes it easy to parse the newick file using regular expressions and the like, without having to parse the actual tree structure, or parse the quoting of labels. That makes it a lot faster and less memory intensive to mess with the tree. Of course, this may make it impossible to maintain consistent labels between tree machine and taxomachine, so I can foresee objections.
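As a sketch of the kind of regex-only processing meant here (Python, illustrative only): if no label contains parentheses, commas, colons, or semicolons, and the tree has no branch lengths, labels and OTT IDs can be pulled out without building the tree. The function names are hypothetical and the `_ott<digits>` suffix is taken from the name_and_id outputs quoted earlier in this thread.

```python
import re

# Any run of characters that is not newick structure counts as a label.
# Assumes simplified names and no branch lengths or comments.
LABEL = re.compile(r"[^(),:;]+")
OTT_ID = re.compile(r"_ott(\d+)")


def labels_without_parsing(newick: str) -> list:
    """Return every label in document order, ignoring tree structure."""
    return LABEL.findall(newick)


def ott_ids_without_parsing(newick: str) -> list:
    """Return every OTT ID found in a name_and_id style newick string."""
    return [int(digits) for digits in OTT_ID.findall(newick)]


example = (
    "((Oerskovia_turbata_NBRC_15015_ott4770673)Oerskovia_turbata_ott301645,"
    "Oerskovia_sp_MP7_ott5371302)Oerskovia_ott125746;"
)
print(ott_ids_without_parsing(example))
```

This breaks as soon as a quoted label contains one of the structural characters, which is exactly the case the request is trying to avoid.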

@jar398
Member

jar398 commented Apr 12, 2016 via email

@mtholder
Member Author

OK @hyanwong I have posted a version of the tree with simplified names at http://phylo.bio.ku.edu/ot/opentree5.0_simplified_names.tre.gz and a log of the edits at http://phylo.bio.ku.edu/ot/munging_log.txt

We'll still need to work this step into the pipeline and figure out a standard name for this output. @bredelings and I both came up with tools to do this. His otc-relabel-tree ex_2_tree1.tre --replace "/[;[\]()]/ /" invocation (in the code repo in https://github.com/mtholder/otcetera ) is probably going to be the version that we end up using. But I made the tree above with https://github.com/mtholder/otcetera/blob/master/tools/mungenames.cpp
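For readers without otcetera installed, the effect of that `--replace` pattern on a single name can be approximated in Python as below. This is only a sketch: `simplify_name` is a hypothetical helper, and it assumes every occurrence of `;`, `[`, `]`, `(` or `)` is replaced by a space, which is not something stated here about otc-relabel-tree itself.

```python
import re

# Same character class as in the --replace argument, with the square
# brackets escaped for Python's re module.
STRUCTURAL = re.compile(r"[;\[\]()]")


def simplify_name(name: str) -> str:
    """Replace newick-structural punctuation in a name with spaces."""
    return STRUCTURAL.sub(" ", name)


print(repr(simplify_name("Oerskovia sp. 7(2011)")))
```

Note that a trailing parenthesis becomes a trailing space under this reading, so a real pipeline step might also want to trim or collapse whitespace.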

@mtholder
Member Author

I should have mentioned that I replaced a few other characters that other users might want to avoid.

@hyanwong

hyanwong commented Apr 13, 2016

@mtholder thanks. Yes, putting the step into the pipeline would be useful, and documenting it. @jar398 I guess it is the braces that are most likely to cause problems. An ugly solution to retain invertibility would be to e.g. replace () with <> (gt & lt signs only appear in 10 taxa on the tree) or {} (no current taxa contain curly braces). But that seems rather hacky.
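A quick Python sketch of the curly-brace variant, to show that it stays invertible as long as no original name contains `{` or `}`. The helper names are made up for illustration, and the guard raises rather than silently producing an ambiguous label.

```python
# Translation tables for the swap and its inverse.
TO_CURLY = str.maketrans("()", "{}")
FROM_CURLY = str.maketrans("{}", "()")


def hide_parens(name: str) -> str:
    """Swap () for {} so the name holds no newick-structural parentheses."""
    if "{" in name or "}" in name:
        # The swap is only invertible if curly braces are otherwise unused.
        raise ValueError(f"name already contains curly braces: {name!r}")
    return name.translate(TO_CURLY)


def restore_parens(label: str) -> str:
    """Undo hide_parens."""
    return label.translate(FROM_CURLY)
```

For example, `hide_parens("Oerskovia sp. 7(2011)")` gives `Oerskovia sp. 7{2011}`, and `restore_parens` recovers the original name.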

In http://phylo.bio.ku.edu/ot/munging_log.txt there are 510 names that contain commas: these are nearly all where the taxon name contains the authority or year of description. I don't know if this is something that you want to be included in the taxon name or not.
