Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

newick parsing: handle negative edge lengths, comments #177

Merged
merged 2 commits into from
Jun 17, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 16 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@
[![](https://img.shields.io/badge/docs-stable-blue.svg)](https://crsl4.github.io/PhyloNetworks.jl/stable)
[![](https://img.shields.io/badge/docs-dev-blue.svg)](https://crsl4.github.io/PhyloNetworks.jl/dev)
[![codecov](https://codecov.io/gh/crsl4/PhyloNetworks.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/crsl4/PhyloNetworks.jl)
[![Coverage Status](https://coveralls.io/repos/crsl4/PhyloNetworks.jl/badge.svg?branch=master&service=github)](https://coveralls.io/github/crsl4/PhyloNetworks.jl?branch=master)

## Overview

Expand Down Expand Up @@ -73,4 +72,19 @@ with or without transgressive evolution after reticulations:
SI on [dryad](http://dx.doi.org/10.5061/dryad.nt2g6)
including a [tutorial for trait evolution](https://datadryad.org/bitstream/handle/10255/dryad.177752/xiphophorus_PCM_analysis.html?sequence=1)
and a [tutorial for network calibration](https://datadryad.org/bitstream/handle/10255/dryad.177754/xiphophorus_networks_calibration.html?sequence=1).


Continuous traits, accounting for within-species variation:

- Benjamin Teo, Jeffrey P. Rose, Paul Bastide & Cécile Ané (2022).
Accounting for intraspecific variation in continuous trait evolution
on a reticulate phylogeny.
[bioRxiv](https://doi.org/10.1101/2022.05.12.490814)

For a discrete trait (influence of gene flow on the trait,
ancestral state reconstruction, rates):

- Karimi, Grover, Gallagher, Wendel, Ané & Baum (2020). Reticulate evolution
helps explain apparent homoplasy in floral biology and pollination in baobabs
(*Adansonia*; Bombacoideae; Malvaceae).
[Systematic Biology](https://academic.oup.com/sysbio/advance-article/doi/10.1093/sysbio/syz073/5613901?guestAccessKey=a32e7dd3-27fd-4a13-b171-7ff5d6da0e01),
69(3):462-478. doi: 10.1093/sysbio/syz073.
14 changes: 14 additions & 0 deletions docs/src/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,15 +19,29 @@ and their use for trait evolution.

## References

for the package:
- Claudia Solís-Lemus, Paul Bastide and Cécile Ané (2017).
PhyloNetworks: a package for phylogenetic networks.
[Molecular Biology and Evolution](https://academic.oup.com/mbe/article/doi/10.1093/molbev/msx235/4103410/PhyloNetworks-a-package-for-phylogenetic-networks?guestAccessKey=230afceb-df28-4160-832d-aa7c73f86369)
34(12):3292–3298.
[doi:10.1093/molbev/msx235](https://doi.org/10.1093/molbev/msx235)

for trait evolution:
- Teo, Rose, Bastide & Ané (2022).
Accounting for intraspecific variation in continuous trait evolution
on a reticulate phylogeny.
[bioRxiv](https://doi.org/10.1101/2022.05.12.490814)
- Karimi, Grover, Gallagher, Wendel, Ané & Baum (2020). Reticulate evolution
helps explain apparent homoplasy in floral biology and pollination in baobabs
(*Adansonia*; Bombacoideae; Malvaceae).
Systematic Biology, 69(3):462-478.
[doi:10.1093/sysbio/syz073](https://academic.oup.com/sysbio/advance-article/doi/10.1093/sysbio/syz073/5613901?guestAccessKey=a32e7dd3-27fd-4a13-b171-7ff5d6da0e01).
- Bastide, Solís-Lemus, Kriebel, Sparks, Ané (2018).
Phylogenetic Comparative Methods for Phylogenetic Networks with Reticulations.
Systematic Biology, 67(5):800–820.
[doi:10.1093/sysbio/syy033](https://doi.org/10.1093/sysbio/syy033).

for network inference:
- Claudia Solís-Lemus and Cécile Ané (2016).
Inferring Phylogenetic Networks with Maximum Pseudolikelihood under Incomplete Lineage Sorting.
[PLoS Genet](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005896)
Expand Down
89 changes: 70 additions & 19 deletions src/readwrite.jl
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,9 @@
# removes white spaces from the IOStream/IOBuffer
# see skipchars(predicate, io::IO; linecomment=nothing) in io.jl
# https://github.com/JuliaLang/julia/blob/3b02991983dd47313776091720871201f75f644a/base/io.jl#L971
ncodeunits(c::Char) = write(devnull, c) # for Julia v0.7. Remove for Julia v1.0.3+
# replace "while !eof ... read(io, Char)" by readeach(io, Char)
# when ready to require Julia v1.6
# https://docs.julialang.org/en/v1/base/io-network/#Base.skipchars
function peekskip(io::IO, linecomment=nothing)
c = missing
while !eof(io)
Expand Down Expand Up @@ -42,16 +44,44 @@ function advance!(s::IO, numLeft::Array{Int,1})
return c
end

"""
readnexuscomment(s::IO, c::Char)

# auxiliary function to read a taxon name.
# allows names with letters and numbers: treats numbers as strings
# it also reads # as part of the name and returns pound=true
# it returns the node name as string as well to check if it exists already (as hybrid)
function readnodename(s::IO, c::Char, net::HybridNetwork, numLeft::Array{Int,1})
if !isValidSymbol(c)
a = read(s, String);
error("Expected digit, alphanum or # at the start of taxon name, but received $(c). remaining: $(a).");
Read (and do nothing with) nexus-style comments: `[& ... ]`
Assumption: 'c' is the next character to be read from s.
Output: nothing.

Comments can appear after (or instead of) a node or leaf name,
before or after an edge length, and after another comment.
"""
function readnexuscomment(s::IO, c::Char)
while c == '[' # a comment could be followed by another
# [ should be followed by &, otherwise bad newick string
# read [ and next character: don't skip spaces
read(s, Char) == '[' || error("I was supposed to read '['")
read(s, Char) == '&' || error("read '[' but not followed by &")
skipchars(!isequal(']'), s)
eof(s) && error("comment without ] to end it")
read(s, Char) # to read ] and advance s
c = peekskip(s)
end
return
end

"""
readnodename(s::IO, c::Char, net, numLeft)

Auxiliary function to read a taxon name during newick parsing.
output: tuple (number, name, pound_boolean)

Names may have numbers: numbers are treated as strings.
Accepts `#` as part of the name (but excludes it from the name), in which
case `pound_boolean` is true. `#` is used in extended newick to flag
hybrid nodes.

Nexus-style comments following the node name, if any, are read and ignored.
"""
function readnodename(s::IO, c::Char, net::HybridNetwork, numLeft::Array{Int,1})
pound = 0
name = ""
while isValidSymbol(c)
Expand All @@ -76,6 +106,7 @@ function readnodename(s::IO, c::Char, net::HybridNetwork, numLeft::Array{Int,1})
a = read(s, String);
error("strange node name with $(pound) # signs: $name. remaining: $(a).")
end
readnexuscomment(s,c)
return size(net.names,1)+1, name, pound == 1
end

Expand Down Expand Up @@ -249,6 +280,9 @@ end

Helper function for `parseEdgeData!`.
Read a single floating point edge data value in a tree topology.
Ignore (and skip) nexus-style comments before & after the value
(see [`readnexuscomment`](@ref)).

Return -1.0 if no value exists before the next colon, return the value as a float otherwise.
Modifies s by advancing past the next colon character.
Only call this function to read a value when you know a numerical value exists!
Expand All @@ -258,10 +292,18 @@ Only call this function to read a value when you know a numerical value exists!
"second colon : read without any double in left parenthesis $(numLeft[1]-1), ignored.",
"third colon : without gamma value after in $(numLeft[1]-1) left parenthesis, ignored"]
c = peekskip(s)

# Value is present
if isdigit(c) || c == '.'
if c == '[' # e.g. comments only, no value, but : after
readnexuscomment(s,c)
c = peekskip(s)
end
if isdigit(c) || c == '.' || c == '-'
# value is present: read it, and any following comment(s)
val = readFloat(s, c)
if val < 0.0
@error "expecting non-negative value but read '-', left parenthesis $(numLeft[1]-1). will set to 0."
val = 0.0
end
readnexuscomment(s,peekskip(s))
return val
# No value
elseif c == ':'
Expand All @@ -275,11 +317,12 @@ end
"""
parseEdgeData!(s::IO, edge, numberOfLeftParentheses::Array{Int,1})

Helper function for readSubtree!, fixes a bug from using setGamma
Helper function for readSubtree!.
Modifies `e` according to the specified edge length and gamma values in the tree topology.
Advances the stream `s` past any existing edge data.
Edges in a topology may optionally be followed by ":edgeLen:bootstrap:gamma"
where edgeLen, bootstrap, and gamma are decimal values.
Nexus-style comments `[&...]`, if any, are ignored.
"""
@inline function parseEdgeData!(s::IO, e::Edge, numLeft::Array{Int,1})
read(s, Char); # to read the first ":"
Expand Down Expand Up @@ -396,15 +439,19 @@ function readSubtree!(s::IO, parent::Node, numLeft::Array{Int,1}, net::HybridNet
# read the rest of the subtree (perform the recursive step!)
n = parseRemainingSubtree!(s, numLeft, net, hybrids)
c = peekskip(s);
if isValidSymbol(c) # internal node has name
# read potential internal node name (and skip comments)
num, name, pound = readnodename(s, c, net, numLeft);
if name != ""
hasname = true;
num, name, pound = readnodename(s, c, net, numLeft);
n.number = num; # n was given <0 number by parseRemainingSubtree!, now >0
c = peekskip(s);
end
else # leaf, it should have a name
hasname = true;
num, name, pound = readnodename(s, c, net, numLeft)
if name == ""
a = read(s, String);
error("Expected digit, alphanum or # at the start of taxon name, but received $(c). remaining: $(a).");
end
n = Node(num, true); # positive node number to leaves in the newick-tree description
# @debug "creating node $(n.number)"
end
Expand Down Expand Up @@ -453,6 +500,7 @@ end
"""
readTopology(file name)
readTopology(parenthetical description)
readTopology(IO)

Read tree or network topology from parenthetical format (extended Newick).
If the root node has a single child: ignore (i.e. delete from the topology)
Expand All @@ -461,6 +509,8 @@ the root node and its child edge.
Input: text file or parenthetical format directly.
The file name may not start with a left parenthesis, otherwise the file
name itself would be interpreted as the parenthetical description.
Nexus-style comments (`[&...]`) are ignored, and may be placed
after (or instead) of a node name, and before/after an edge length.

A root edge, not enclosed within a pair a parentheses, is ignored.
If the root node has a single edge, this one edge is removed.
Expand Down Expand Up @@ -491,13 +541,14 @@ function readTopology(s::IO,verbose::Bool)
elseif c == ','
continue;
elseif c == ')'
# read potential root name (or comments)
c = peekskip(s);
if isValidSymbol(c) # the root has a name
num, name, pound = readnodename(s, c, net, numLeft);
num, name, pound = readnodename(s, c, net, numLeft)
if name != ""
n.name = name
# log warning or error if pound > 0?
c = peekskip(s);
end
c = peekskip(s)
if(c == ':') # skip information on the root edge, if it exists
# @warn "root edge ignored"
while c != ';'
Expand Down
14 changes: 13 additions & 1 deletion test/test_relaxed_reading.jl
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
@testset "test: newick parsing" begin
global net
@testset "readTopology White Symbol Tests" begin
@testset "readTopology: spaces and comments" begin
global n1, n2, n3
#Test newlines, spaces, tabs, and carriage returns
n1 = readTopology("(A,((B,#H1),(C,(D)#H1)));")
Expand All @@ -20,6 +20,18 @@ global net
resultant = readTopology("('H\tom\ro\nsa p \n\n\n\n\n\n\n\ni\t\t\t\t\t\t\t\ten s ', ((B,#H1),(C,(D)#H1)));")
expected = readTopology("('Homosapiens',((B,#H1),(C,(D)#H1)));")
@test writeTopology(resultant) == writeTopology(expected)

# nexus-style comments and negative edge or gamma values
n1 = (@test_logs (:error, r"expecting non-negative value") (:error, r"expecting non-negative value") readTopology("(E,((B)#H1[&com:1],((D:-1[&theta=2][&prob=0.3],C:[&length:4]1.2[&com 5])[&internalname=G],(#H1:::-0.1,A))))[&com=6];"))
@test writeTopology(n1) == "(E,((B)#H1:::1.0,((D:0.0,C:1.2),(#H1:::0.0,A))));"

# edge cases that should error
@test_throws Exception readTopology("(E,((B)#3,((D,C),(#3,A))));") # name not starting with letter
@test_logs (:warn,r"^Expected H") (:warn,r"^Expected H") readTopology("(E,((B)#P1,((D,C),(#P1,A))));")
@test_throws Exception readTopology("(E,((B)#H#h,((D,C),(#H#h,A))));") # 2 # sign in the name
@test_throws Exception readTopology("(E,((B)#H1,((D,C),((F)#H1,A))));") # both H1 internal
@test_throws Exception readTopology("(E,((B)#H1") # doesn't end with ;
@test_throws Exception readTopology(IOBuffer("E;")) # Expected beginning of tree with (
end
@testset "isMajor and gamma consistency" begin
net = readTopology("((((B)#H1)#H2,((D,C,#H2:::0.8),(#H1,A))));");
Expand Down