Likelihood calculation for subst models and pip (#5)

* Added TN93 to DNA models, fixed GTR
  Fixed DNA model row/column placement; the matrices were transposed before. Also fixed the summation on the diagonal. It doesn't seem necessary with the other fix, but this way the written array of matrix values is easier to read. Added the TN93 model.

* Initial implementation of substitution likelihood
  Initial implementation of the phylogenetic likelihood for models of DNA substitution. Tests for the computations from the CB ETH lecture and the Molecular Evolution book.

* Minor changes to make clippy happy

* Running coverage on all branches
  Running coverage on all branches on a push, and additionally on main on a pull request.

* Trying to add manual trigger to code coverage

* Fix to make tn93 tests work

* Fixed stationarity tests for proteins and HIVB pi
  Fixed the tests for protein model stationarity. Fixed the HIVB AA substitution model stationary frequencies. Added tests for the HIVB model.

* Fix missing empty fun for Alignment
  Added back a missing function for creating an empty alignment that is needed by the parsimony aligner.

* Parsimony alignment fix (#2)

* Minor fixes to make linter happy
  Minor fixes to make clippy happy: removed unnecessary vec!, into_iter() and the like.

* Fixed sequence order to match leaves
  Fixed the sequence order in the vector to match leaf indices. The code relied on that property to begin with, but it was neither tested nor enforced in any way, which led to very malformed alignments in IndelMaP. This contains the fix and a test to check that the order matches, together with the data for the test.

* Change LikelihoodCostFunction to trait
  Refactored the LikelihoodCostFunction into a trait for future different implementations. Removed duplicated EvoModel code from the SubstitutionModel module. Added notes for things that need to be fixed: the character probabilities only work correctly for ACGT- at the moment, not for ambiguous chars and not for proteins; the evolutionary model should be a trait so that it works with PIP and anything else, which will also fix the char probabilities when implemented.

* Refactoring likelihood computation for subst models
  Refactoring likelihood computation, moved all relevant functions to the substitution model module.

* Test data for subst likelihoods
  Test data files for substitution likelihood computation.

* EvolutionaryModel trait impl for DNA substitution
  Added the EvolutionaryModel trait, which should be used for all models of sequence evolution. Implemented the trait for DNA substitution models. All generic implementations remain for all substitution models, but both DNA and protein have to independently implement the EvoModel trait to avoid implementation conflicts for the template. Also made sure that the basic character probabilities are computed through the trait; there was an error where ambiguous chars were ignored. The EvolutionaryModelInfo trait now requires a model with the EvoModel trait rather than just a substitution model. Additionally, fixed likelihood computation for multiple sites: the final array was expected to hold only one value before, but it should contain len(MSA) column likelihoods.

* Removed new() from PhyloInfo
  Removed the new() method from PhyloInfo because it was misleading: it didn't check data validity (that tree tips and sequences correspond to one another, and that the sequences are stored in the right order).

* Tests for char probabilities at tips
  Added tests for computing nucleotide probabilities on sequences.

* Code cleanup: removed unnecessary vec!'s
  Code cleanup for clippy: removed unnecessary vector creation in favour of passing slices.

* Alignment likelihood tests
  Added alignment likelihood tests for sequences that are longer than 1: an example from Huelsenbeck with 50 sites; a computed example from the CB lectures with X or N characters to check how ambiguous chars are processed.

* Implemented EvolutionaryModel for protein subst
  Implemented the EvolutionaryModel trait for protein substitution models. Added tests for getting correct character probabilities.

* Added likelihood calculation to ProteinSubst
  Added likelihood calculation to the protein substitution models. Added a test example with the likelihoods being close to what phyml estimates, but not quite. The phyml_protein_nogap_example files are for that example; phyml wants .phy sequences and an unrooted tree.

* Added tests for subst likelihood reversibility
  Added 2 test cases for substitution likelihood reversibility: 1. a simple fabricated example, tn93 likelihoods are the same on two trees; 2. rerooted the huelsenbeck example tree, GTR and k80 likelihoods are the same.

* Refactor: EvoModel to own module
  Refactored to move the EvolutionaryModel trait to its own module so that evolutionary models are not just substitution models.

* Added msa field to PhyloInfo
  Added the MSA field to the PhyloInfo struct. Now, when reading data from a file, the msa will only be set if all the read sequences have the same length. Need to add tests for making sure that sequences are aligned where they need to be.

* Added tests for aligned/unaligned read sequences
  Added some tests that check that the sequences don't get used as the MSA if they are of different lengths; otherwise they get copied to the MSA field.

* Changed FreqVector and SubstMatrix to dynamic size
  Changed the FreqVector and the SubstMatrix types to dynamic sizing to make them usable when the parametrisation on N (number of chars) becomes different for a model with gaps. It also makes sure that all the data types we use are the same; there is no real point in using statically sized matrices when most will still be dynamically sized (e.g. the partial likelihood matrices).

* Protein substitution matrix transpose fix (#3)

* Added struct for defining rounding
  Added a struct that contains 2 values: whether to round some of the numbers and, if yes, to how many decimal digits. Not necessary in general, but used in testing against the values produced by the python scripts, so now rounding is optional for parsimony scores derived from the substitution models and for the branch percentiles.

* Added NodeIdx display and node id printing
  Implemented Display for NodeIdx so that it always prints what kind of node it is. Added a function that helps with logging: it generates the string "with ID xxx" if there is an ID attached to the current node, or gives back an empty string.

* Fixed protein matrices from by row to by cols
  The provided protein substitution matrices were actually given by rows, whereas the Matrix struct reads them by columns. Transposed the matrices to match the proper order of cols/rows.

* Added tests for output to appease codecov
  Added tests for the helper functions to appease codecov so it lets me merge.

* Commit to fix rebase merge issues
  Decided to rebase to main to use the Rounding and GapMultiplier classes; it was probably a bad idea to do it right now. Adding small fixes to correct my rebasing blunders.

* Added pip model definition
  Added the PIP model definition. Added HKY for DNA to use in the PIP tests.

* Fixed HIVB Q matrix
  Finally fixed the HIVB Q matrix to match the one in the python version and the PhyML one.

* Refactor to make likelihood creation uniform
  Changed the signatures of the setup_dna_likelihood and setup_protein_likelihood methods to match each other and the substitution model creation signatures: the model name is now &str and both methods need a list of parameters.

* Tests for sanity of substitution likelihood
  Added more tests to check substitution likelihood computation sanity. Checking that protein likelihood is also reversible.

* Tests for protein char probabilities + fix
  Added tests for getting protein character probabilities at the leaves and found a bug in the values, now fixed.

* Added generic impls for PIPModel
  Added generic implementations of normalise, get_rate, get_p and get_stationary_distribution for a PIP model of any size. Added a unified method that makes a PIP matrix from a generic substitution model. Added a specific implementation of the EvolutionaryModel trait for PIP with protein models.

* More tests for DNA PIP model
  More test scenarios for the DNA PIP model, checking that it is created correctly when the parameters are provided properly and doesn't get created when not enough parameters are given.

* PIP protein model tests
  Added PIP protein model tests, checking that the stationary frequencies are correct and that the rates correspond to what the underlying substitution model would define.

* get_idx_by_id function added to tree
  Added a get_idx_by_id function to the tree struct to make node lookup easier in tests.

* Added flag for normalisation in evo models
  Added a flag to normalise the model matrices to all evolutionary models.

* Initial impl of PIP likelihood
  Initial implementation of the PIP likelihood with tests based on the python implementation.

* Fixed strange protein example tree
  Fixed the phyml primate example tree to be rooted and removed the confusing duplicate tree for the nogap alignment.

* Added test for PIP reversibility on rerooted tree
  Added a test to verify PIP reversibility for a rerooted tree and DNA sequences.

* Added a test for protein PIP likelihood.

* Fixed surv probability becoming NaN, improved phi
  Fixed survival probabilities becoming NaN when a branch length is set to 0.0; in that case the survival probability is now 1.0. Rearranged the phi computation to avoid huge numbers, using ln instead.

* Protein models get normalised now
  Protein models now get normalised instead of ignoring the flag.

* Added tests to PIP methods missing from coverage
  Added tests for methods in the PIP models that were not covered before.

* Making sure the EvoModel trait is used in tests
  Making sure the method called for PIP is the method from the EvolutionaryModel trait rather than the implementation of the model, so that the trait's methods are tested properly and all are covered.

* SubstModels treating unknown chars as X
  Made sure that substitution models treat potential unknown characters (including gaps) as ambiguous chars (X).

* More tests for subst models
  Made sure that substitution model methods are called through EvolutionaryModel trait methods rather than directly. Added tests for too many parameters for different DNA models to improve coverage.

* Fixing rebase merge blunders

* Removed duplicate test
  Removed a duplicate test coming from a merge mistake during the rebase.
acg-team · Nov 15, 2023 · b7b49c2
1 parent 6b37db3
commit b7b49c2
Showing 29 changed files with 5,188 additions and 1,241 deletions.
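The commit message mentions that PIP survival probabilities became NaN when a branch length was 0.0 and are now 1.0 in that case. The following is a minimal sketch of such a guard, assuming the usual PIP survival probability (1 - e^(-mu*b)) / (mu*b) for deletion rate mu and branch length b. The function name and signature are hypothetical and are not taken from the crate.

```rust
/// Survival probability of a character along a branch under PIP,
/// assuming the form (1 - exp(-mu * b)) / (mu * b).
/// Hypothetical helper for illustration, not the crate's API.
fn survival_probability(mu: f64, branch_length: f64) -> f64 {
    let x = mu * branch_length;
    if x == 0.0 {
        // The naive formula evaluates to 0.0 / 0.0 = NaN here;
        // the limit of (1 - e^-x) / x as x -> 0 is 1.
        return 1.0;
    }
    // exp_m1 computes e^y - 1 accurately for small y, so this is
    // (1 - e^-x) / x without cancellation on very short branches.
    -(-x).exp_m1() / x
}
```

Calling `survival_probability(0.5, 0.0)` takes the guarded branch and returns 1.0 instead of NaN; positive branch lengths go through the closed-form expression.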
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -0,0 +1,3 @@
+{
+    "git.ignoreLimitWarning": true
+}
diff --git a/phylo/data/Huelsenbeck_example.newick b/phylo/data/Huelsenbeck_example.newick
@@ -0,0 +1 @@
+((Species1:0.1,Species2:0.1):0.2,((Species3:0.1,Species4:0.1):0.1,Species5:0.2):0.1):0.0;
diff --git a/phylo/data/Huelsenbeck_example_long_DNA.fasta b/phylo/data/Huelsenbeck_example_long_DNA.fasta
@@ -0,0 +1,13 @@
+>Species2
+TGACTTTAAAGGACGACCCTACCAGGGCGGACACAAACGGACAGCGCAGC
+>Species4
+CGAGTTCAGAAGACGGCACCAACACAGCGGACGTATGCAGACGACGCACC
+>Species5
+TGCCCTTAGGAGGCGGCACTAACACCGCGGACGAGTGCGGACAACGTACC
+>Species1
+TAACTGTAAAGGACAACACTAGCAGGCCAGACGCACACGCACAGCGCACC
+>Species3
+CAAGTTTAGAAAACGGCACCAACACAACAGACGTATGCAACTGACGCACC
+
+
+
diff --git a/phylo/data/Huelsenbeck_example_reroot.newick b/phylo/data/Huelsenbeck_example_reroot.newick
@@ -0,0 +1 @@
+((Species5:0.20000000000000004,(Species1:0.10000000000000003,Species2:0.10000000000000003):0.30000000000000004):0.05,(Species3:0.10000000000000003,Species4:0.10000000000000003):0.05);
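The rerooted tree above describes the same unrooted topology and path lengths as Huelsenbeck_example.newick, so a reversible model should give the same likelihood on both. The sketch below illustrates this with Felsenstein pruning on the first alignment column (Species1..Species5 = T, T, C, C, T). It assumes JC69 rather than the GTR and K80 models named in the commit message, uses rounded branch lengths, and every type and function name is hypothetical.

```rust
/// Rooted tree; each child carries the length of the branch leading to it.
enum Tree {
    Leaf(usize),            // index of the species in the site column
    Node(Vec<(Tree, f64)>), // (child subtree, branch length)
}

/// JC69 transition probability between states over branch length `t`.
fn jc69(from: usize, to: usize, t: f64) -> f64 {
    let e = (-4.0 * t / 3.0).exp();
    if from == to { 0.25 + 0.75 * e } else { 0.25 - 0.25 * e }
}

/// Felsenstein pruning: partial likelihood of the subtree for each
/// possible state (A, C, G, T) at the subtree's root.
fn partial(tree: &Tree, column: &[usize]) -> [f64; 4] {
    match tree {
        Tree::Leaf(i) => {
            let mut p = [0.0; 4];
            p[column[*i]] = 1.0;
            p
        }
        Tree::Node(children) => {
            let mut p = [1.0; 4];
            for (child, t) in children {
                let cp = partial(child, column);
                for s in 0..4 {
                    p[s] *= (0..4).map(|x| jc69(s, x, *t) * cp[x]).sum::<f64>();
                }
            }
            p
        }
    }
}

/// Site likelihood with uniform (JC69) stationary frequencies at the root.
fn site_likelihood(tree: &Tree, column: &[usize]) -> f64 {
    partial(tree, column).iter().map(|p| 0.25 * p).sum()
}

/// Two leaves joined by branches of length 0.1, as in both tree files.
fn cherry(a: usize, b: usize) -> Tree {
    Tree::Node(vec![(Tree::Leaf(a), 0.1), (Tree::Leaf(b), 0.1)])
}

/// ((Species1,Species2):0.2,((Species3,Species4):0.1,Species5:0.2):0.1)
fn original_tree() -> Tree {
    Tree::Node(vec![
        (cherry(0, 1), 0.2),
        (Tree::Node(vec![(cherry(2, 3), 0.1), (Tree::Leaf(4), 0.2)]), 0.1),
    ])
}

/// ((Species5:0.2,(Species1,Species2):0.3):0.05,(Species3,Species4):0.05)
fn rerooted_tree() -> Tree {
    Tree::Node(vec![
        (Tree::Node(vec![(Tree::Leaf(4), 0.2), (cherry(0, 1), 0.3)]), 0.05),
        (cherry(2, 3), 0.05),
    ])
}
```

With states encoded as A=0, C=1, G=2, T=3, the column is `[3, 3, 1, 1, 3]`, and `site_likelihood` should agree on `original_tree()` and `rerooted_tree()` up to floating-point error.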
diff --git a/phylo/data/ambiguous_example.fasta b/phylo/data/ambiguous_example.fasta
@@ -0,0 +1,10 @@
+>orangutan
+XCCCCTCCCCTCATGTGTAC
+>chimp
+ACCCCTCCCCTCATGTGTAC
+>human
+ACCCCTCCCCTCATGTGTAC
+>gorilla
+ACCCCTCCCCTCATGTGTAC
+>unicorn
+TGCCCTCCCCTCATGTGTAC
diff --git a/phylo/data/ambiguous_example.newick b/phylo/data/ambiguous_example.newick
@@ -0,0 +1 @@
+(unicorn:15,(orangutan:13,(gorilla:10.25,(human:5.5,chimp:5.5):4.75):2.75):2);
diff --git a/phylo/data/ambiguous_example_N.fasta b/phylo/data/ambiguous_example_N.fasta
@@ -0,0 +1,10 @@
+>orangutan
+NCCCCTCCCCTCATGTGTAC
+>chimp
+ACCCCTCCCCTCATGTGTAC
+>human
+ACCCCTCCCCTCATGTGTAC
+>gorilla
+ACCCCTCCCCTCATGTGTAC
+>unicorn
+TGCCCTCCCCTCATGTGTAC
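The two ambiguous_example alignments differ only in the first orangutan character (X versus N). A sketch of one common convention for tip vectors follows, in which an ambiguous character is compatible with every state it could stand for. This is an assumption for illustration: the crate may weight ambiguous characters differently, and `tip_vector` is a hypothetical name.

```rust
/// Tip partial-likelihood vector over (A, C, G, T) for one observed
/// character. Hypothetical helper; only a few IUPAC codes are shown.
fn tip_vector(c: char) -> [f64; 4] {
    match c.to_ascii_uppercase() {
        'A' => [1.0, 0.0, 0.0, 0.0],
        'C' => [0.0, 1.0, 0.0, 0.0],
        'G' => [0.0, 0.0, 1.0, 0.0],
        'T' => [0.0, 0.0, 0.0, 1.0],
        'R' => [1.0, 0.0, 1.0, 0.0], // purine: A or G
        'Y' => [0.0, 1.0, 0.0, 1.0], // pyrimidine: C or T
        // N, X, gaps and anything unrecognised: every state is compatible.
        _ => [1.0; 4],
    }
}
```

Under this convention X and N produce identical tip vectors, so both alignment files would yield the same likelihood on ambiguous_example.newick.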
diff --git a/phylo/data/phyml_protein_example.fasta b/phylo/data/phyml_protein_example.fasta
@@ -0,0 +1,40 @@
+>Patas
+MASGILLNVKEEVTCPICLELLTEPLSLPCGHSFCQACITANHKKSMLYKEEERSCPVCRISYQPENIQPNRHVANIVEKLREVKLSPEEGQKVDHCARHGEKLLLFCQEDRKVICWLCERSQEHRGHHTFLMEEVAQEYHVKLQTALEMLRQKQQEAEKLEADIREEKASWKIQIDYDKTNVLADFEQLREILDWEESNELQYLEKEEEDILKSLTKSETKMVRQTQYVRELISDLEHRLQGSMMELLQGVDGIIKRIENMTLKKPETFHKNQRRVFRAPALKGMLDMFRELTDVRRYWVDVTLAPNNISHVVIAEDKRQVSSRNPQIMYWAQGKLF--------------------QSLKNFNYCTGILGSQSITSGKHYWEVDVSKKSAWILGVCAGFQPDAMYDVEQNENYQPKYGYWVI-------------------------------------------------------------GLQEGVKYSVFQD-----------GSSHTPFAPFIAPLSVIFCPDRVGVFVDYEACTVSFFNITNHGFLIYKFSQCSFSKPVFPYLNPRKCTVPMTLCSPSS
+>Colobus
+MASGILVNIKEEVTCPICLELLTEPLSLHCGHSFCQACITANHKKSMLYKEGERSCPVCRISYQPENIRPNRHVANIVEKLREVKLSPEEGQKVDHCARHGEKLLLFCQEDRKVICWLCERSQEHRGHHTFLMEEVAQEYHVKLQTALEMLRQKQQEAEKLEADIREEKASWKIQIDYDKTNVLADFEQLREILDWEESNELQNLEKEEEDILKSLTKSETEMVQQTQYMRELVSDLEHRLQGSVMELLQGVDGIIKRIEDMTLKKPKTFPKNQRRVFRAPDLKGMLDMFRELTDVRRYWVDVTLAPNNISHAVIAEDKRRVSSPNPQIMYRAQGTLF--------------------QSLKNFIYCTGVLGSQSITSGKHYWEVDVSKKSAWILGVCAGFQPDAMYNIEQNENYQPKYGYWVI-------------------------------------------------------------GLQEGVKYSVFQD-----------GSSHTPFAPFIVPLSVIICPDRVGVFVDYEACTVSFFNITNHGFLIYKFSQCSFSKPVFPYLNPRKCTVPMTLCSPSS
+>DLangur
+MASGILVNIKEEVTCPICLELLTEPLSLHCGHSFCQACITANHKKSMLYKEGERSCPVCRISYQPENIRPNRHVANIVEKLREVKLSPEEGQKVDHCARHGEKLLLFCQEDRKVICWLCERSQEHRGHHTFLMEEVAQEYHVKLQTALEMLRQKQQEAEKLEADIREEKASWKIQIDCDKTNVLADFEQLREILDWEESNELQNLEKEEEDILKSLTKSETEMVQQTQYMRELISDLEHRLQGSMMELLQGVDGIIKRIENMTLKKPKTFPKNQRRVFRAPDLKGILDMFRELTDVRRYWVDVTLAPNNISHAVIAEDKRQVSSPNPQIMCRARGTLF--------------------QSLKNFIYCTGVLGSQSITSGKHYWEVDVSKKSAWILGVCAGFQPDAMYNIEQNENYQPKYGYWVI-------------------------------------------------------------GLQEGVKYNVFQD-----------GSSHTPFAPFIVPLSVIICPDRVGVFVDYEACTVSFFNITNHGFLIYKFSQCSFSKPVFPYLNPRKCTVPMTLCSPSS
+>AGM_cDNA
+MASGILVNVKEEVTCPICLELLTEPLSLPCGHSFCQACITANHKESMLYKEEERSCPVCRISYQPENIQPNRHVANIVEKLREVKLSPEEGQKVDHCARHGEKLLLFCQEDSKVICWLCERSQEHRGHHTFLMEEVAQEYHVKLQTALEMLRQKQQEAEKLEADIREEKASWKIQIDYDKTNVSADFEQLREILDWEESNELQNLEKEEEDILKSLTKSETEMVQQTQYMRELISDLEHRLQGSMMELLQGVDGIIKRVENMTLKKPKTFHKNQRRVFRAPDLKGMLDMFRELTDVRRYWVDVTLAPNNISHAVIAEDKRQVSYRNPQIMYQSPGSLFGSLTNFSYCTGVPGSQSITSGKLTNFNYCTGVLGSQSITSGKHYWEVDVSKKSAWILGVCAGFQPDATYNIEQNENYQPKYGYWVI-------------------------------------------------------------GLQEGDKYSVFQD-----------GSSHTPFAPFIVPLSVIICPDRVGVFVDYEACTVSFFNITNHGFLIYKFSQCSFSKPVFPYLNPRKCTVPMTLCSPSS
+>Tant_cDNA
+MASGILLNVKEEVTCPICLELLTEPLSLPCGHSFCQACITANHKESMLYKEEERSCPVCRISYQPENIQPNRHVANIVEKLREVKLSPEEGQKVDHCARHGEKLLLFCQEDSKVICWLCERSQEHRGHHTFLMEEVAQEYHVKLQTALEMLRQKQQEAEKLEADIREEKASWKIQIDYDKTNVSADFEQLREILDWEESNELQNLEKEEEDILKSLTKSETEMVQQTQYMRELISDLEHRLQGSMMELLQGVDGIIKRIENMTLKKPKTFHKNQRRVFRAPDLKGMLDMFRELTDVRRYWVDVTLAPNNISHAVIAEDKRQVSYQNPQIMYQAPGSSFGSLTNFNYCTGVLGSQSITSRKLTNFNYCTGVLGSQSITSGKHYWEVDVSKKSAWILGVCAGFQPDATYNIEQNENYQPKYGYWVI-------------------------------------------------------------GLQEGDKYSVFQD-----------GSSHTPFAPFIVPLSVIICPDRVGVFVDYEACTVSFFNITNHGFLIYKFSQCSFSKPVFPYLNPRKCTVPMTLCSPSS
+>Rhes_cDNA
+MASGILLNVKEEVTCPICLELLTEPLSLHCGHSFCQACITANHKKSMLYKEGERSCPVCRISYQPENIQPNRHVANIVEKLREVKLSPEEGQKVDHCARHGEKLLLFCQEDSKVICWLCERSQEHRGHHTFLMEEVAQEYHVKLQTALEMLRQKQQEAEKLEADIREEKASWKIQIDYDKTNVSADFEQLREILDWEESNELQNLEKEEEDILKSLTKSETEMVQQTQYMRELISELEHRLQGSMMDLLQGVDGIIKRIENMTLKKPKTFHKNQRRVFRAPDLKGMLDMFRELTDARRYWVDVTLATNNISHAVIAEDKRQVSSRNPQIMYQAPGTLF------------------TFPSLTNFNYCTGVLGSQSITSGKHYWEVDVSKKSAWILGVCAGFQSDAMYNIEQNENYQPKYGYWVI-------------------------------------------------------------GLQEGVKYSVFQD-----------GSSHTPFAPFIVPLSVIICPDRVGVFVDYEACTVSFFNITNHGFLIYKFSQCSFSKPVFPYLNPRKCTVPMTLCSPSS
+>Baboon
+MASGILLNVKEEVTCPICLELLTEPLSLPCGHSFCQACITANHRKSMLYKEGERSCPVCRISYQPENIQPNRHVANIVEKLREVKLSPEEGLKVDHCARHGEKLLLFCQEDSKVICWLCERSQEHRGHHTFLMEEVAQEYHVKLQTALEMLRQKQQEAEKLEADIREEKASWKIQIDYDKTNVSADFEQLREILDWEESNELQNLEKEEEDILKSLTKSETEMVQQTQYMRELISDLEHRLQGSMMELLQGVDGIIKRIENMTLKKPKTFHKNQRRVFRAPDLKGMLDMFRELTDVRRYWVDVTLAPNNISHAVIAEDKRQVSSRNPQITYQAPGTLF------------------SFPSLTNFNYCTGVLGSQSITSGKHYWEVDVSKKSAWILGVCAGFQPDAMYNIEQNENYQPKYGYWVI-------------------------------------------------------------GLQEGVKYSVFQD-----------GSSHTPFAPFIVPLSVIICPDRVGVFVDYEACTVSFFNITNHGFLIYKFSQCSFSKPVFPYLNPRKCTVPMTLCSPSS
+>Gibbon
+MASGILVNVKEKVTCPICLELLTQPLSLDCGHSFCQACLTANHKTSMPDE-GERSCPVCRISYQHKNIRPNRHVANIVEKLREVKLSPEEGQKVDHCARHGKKLLLFCQEDRKVICWLCERSQEHRGHHTFLTEEVAQEYQMKLQAALQMLRQKQQEAEELEADIREEKASWKTQIQYDKTNILADFEQLRHILDWVESNELQNLEKEEKDVLKRLMRSEIEMVQQTQSVRELISDLEHRLQGSVMELLQGVDGVIKRMKNVTLKKPETFPKNRRRVFRAADLKVMLEVLRELRDVRRYWVDVTVAPNNISYAVISEDMRQVSSPEPQIIFEAQGTIS--------------------QTFVNFNYCTGILGSQSITSGKHYWEVDVSKKSAWILGVCAGLQPDAMYNIEQNENYQPKYGYWVI-------------------------------------------------------------GLEEGVKCNAFQD-----------GSIHTPSAPFVVPLSVNICPDRVGVFLDYEACTVSFFNITDHGFLIYKFSHCSFSQPVFPYLNPRKCTVPMTLCSPSS
+>Orangutan
+MASGILVNVKEEVTCPICLELLTQPLSLDCGHSFCQACLTANHKKSTLDK-GERSCPVCRVSYQPKNIRPNRHVANIVEKLREVKLSPE-GQKVDHCARHGEKLLLFCKEDGKVICWLCERSQEHRGHHTFLTEEVAQKYQVKLQAALEMLRQKQQEAEELEADIREEKASWKTQIQYDKTSVLADFEQLRDILDWEESNELQNLEKEEEDILKSLTKSETEMVQQTQSVRELISDVEHRLQGSVMELLQGVDGIIKRMQNVTLKKPETFPKNQRRVFRAPNLKGMLEVFRELTDVRRYWVDVTVAPNDISYAVISEDMRQVSCPEPQIIYGAQGTTY--------------------QTYVNFNYCTGILGSQSITSGKHYWEVDVSKKSAWILGVCAGFQPDAMYNIEQNENYQPQYGYWVI-------------------------------------------------------------GLEEGVKCSAFQD-----------GSFHNPSAPFIVPLSVIICPDRVGVFLDYEACTVSFFNITNHGFLIYKFSHCSFSQPVFPYLNPRKCRVPMTLCSPSS
+>Human
+MASGILVNVKEEVTCPICLELLTQPLSLDCGHSFCQACLTANHKKSMLDK-GESSCPVCRISYQPENIRPNRHVANIVEKLREVKLSPE-GQKVDHCARHGEKLLLFCQEDGKVICWLCERSQEHRGHHTFLTEEVAREYQVKLQAALEMLRQKQQEAEELEADIREEKASWKTQIQYDKTNVLADFEQLRDILDWEESNELQNLEKEEEDILKSLTNSETEMVQQTQSLRELISDLEHRLQGSVMELLQGVDGVIKRTENVTLKKPETFPKNQRRVFRAPDLKGMLEVFRELTDVRRYWVDVTVAPNNISCAVISEDKRQVSSPKPQIIYGARGTRY--------------------QTFVNFNYCTGILGSQSITSGKHYWEVDVSKKTAWILGVCAGFQPDAMCNIEKNENYQPKYGYWVI-------------------------------------------------------------GLEEGVKCSAFQD-----------SSFHTPSVPFIVPLSVIICPDRVGVFLDYEACTVSFFNITNHGFLIYKFSHCSFSQPVFPYLNPRKCGVPMTLCSPSS
+>Gorilla
+MASGILVNVKEEVTCPICLELLTQPLSLDCGHSFCQACLTANHKKSMLDK-GESSCPVCRISYQPENIRPNRHVANIVEKLREVKLSPE-GQKVDHCARHGEKLLLFCQEDGKVICWLCERSQEHRGHHTFLTEEVAQEYQVKLQAALEMLRQKQQEAEELEADIREEKASWKTQIQYDKTNVLADFEQLRDILDWEESNELQNLEKEEEDILKRLTKSETEMVQQTQSVRELISDLEHRLQGSVMELLQGVDGVIKRMENVTLKKPETFPKNRRRVFRAPDLKGMLEVFRELTDVRRYWVDVTVAPNNISCAVISEDMRQVSSPKPQIIYGAQGTRY--------------------QTFMNFNYCTGILGSQSITSGKHYWEVDVSKKSAWILGVCAGFQPDATCNIEKNENYQPKYGYWVI-------------------------------------------------------------GLEEGVKCSAFQD-----------GSFHTPSAPFIVPLSVIICPDRVGVFLDYEACTVSFFNITNHGFLIYKFSHCSFSQPVFPYLNPRKCRVPMTLCSPSS
+>Chimp
+MASGILVNVKEEVTCPICLELLTQPLSLDCGHSFCQACLTANHKKSMLDK-GESSCPVCRISYQPENIRPNRHVANIVEKLREVKLSPE-GQKVDHCAHHGEKLLLFCQEDGKVICWLCERSQEHRGHHTFLTEEVAREYQVKLQAALEMLRQKQQEAEELEADIREEKASWKTQIQYDKTNVLADFEQLRDILDWEESNELQNLEKEEEDILKSLTKSETEMVQQTQSVRELISDLERRLQGSVMELLQGVDGVIKRMENVTLKKPETFPKNQRRVFRAPDLKGMLEVFRELTDVRRYWVDVTVAPNNISCAVISEDMRQVSSPKPQIIYGARGTRY--------------------QTFMNFNYCTGILGSQSITSGKHYWEVDVSKKSAWILGVCAGFQPDAMCNIEKNENYQPKYGYWVI-------------------------------------------------------------GLEEGVKCSAFQD-----------GSFHTPSAPFIVPLSVIICPDRVGVFLDYEACTVSFFNITNHGSLIYKFSHCSFSQPVFPYLNPRKCGVPMTLCSPSS
+>Squirrel
+MASRILGSIKEEVTCPICLELLTEPLSLDCGHSFCQACITANHKESMLHQ-GERSCPLCRLPYQSENLRPNRHLASIVERLREVMLRPEERQNVDHCARHGEKLLLFCEQDGNIICWLCERSQEHRGHNTFLVEEVAQKYREKLQVALETMRQKQQDAEKLEADVRQEQASWKIQIQNDKTNIMAEFKQLRDILDCEESNELQNLEKEEKNILKRLVQSENDMVLQTQSVRVLISDLERRLQGSVVELLQDVDGVIKRIEKVTLQKPKTFLNEKRRVFRAPDLKRMLQVLKELTEVQRYWAHVTLVPSHPSYTIISEDGRQVRYQKPIR-----------------------------HLLVKVQYFYGVLGSPSITSGKHYWEVDVSNKRAWTLGVCVSLKCTANQSVSGTENYQPKNGYWVI-------------------------------------------------------------GLRNAGNYRAFQSSFEFR--DFLAGSRLTLSPPLIVPLFMTICPNRVGVFLDYEARTISFFNVTSNGFLIYKFSDCHFSYPVFPYFNPMTCELPMTLCSPRS
+>Howler
+MASKILVNIKEEVTCPICLELLTEPLSLDCGHSFCQACITANHKESR-----ERSCPLCRVSYHSENLRPNRHLANIAERLREVMLSPEEGQKVDRCARHGEKLLLFCQQHGNVICWLCERSEEHRGHRTSLVEEVAQKYREKLQAALEMMRQKEQDAEMLEADVREEQASWKIQIENDKTSTLAEFKQLRDILDCEESNELQKLEKEEENLLKRLVQSENDMVLQTQSIRVLIADLERRLQGSVMELLQGVEGVIKRIKNVTLQKPETFLNEKRRVFQAPDLKGMLQVFKELKEVQCYWAHVTLIPNHPSCTVISEDKREVRYQEQIHH----------------------------HPSMEVKYFYGILGSPSITSGKHYWEVDVSNKSAWILGVCVSLKCIG--NFPGIENYQPQNGYWVIGLRNADNYSAFQDAVPETENYQPKNRN-RFTGLQNADNCSAFQNAFPGIQSYQPKKSHLFTGLQNLSNYNAFQNKVQYNYIDFQDDSLSTPSAPLIVPLFMTICPKRVGVFLDYEACTVSFFNVTSNGYLIYKFSNCQFSYPVFPYFSPMTCELPMTLCSPSS
+>Spider
+MASEILLNIKEEVTCPICLELLTEPLSLDCGHSFCQACITANHKESTLHQ-GERSCPLCRVSYQSENLRPNRHLANIAERLREVMLSPEEGQKVDRCARHGEKLLLFCQQHGNVICWLCERSQEHRGHSTFLVEEVAQKYQEKLQVALEMMRQKQQDAEKLEADVREEQASWKIQIENDKTNILAEFKQLRDILDCEESNELQNLEKEEENLLKTLAQSENDMVLQTQSMRVLIADLEHRLQGSVMELLQDVEGVIKRIKNVTLQKPKTFLNEKRRVFRAPDLKGMLQVFKELKEVQCYWAHVTLVPSHPSCTVISEDERQVRYQEQIH-----------------------------QPSVKVKYFCGVLGSPGFTSGKHYWEVDVSDKSAWILGVCVSLKCTA--NVPGIENYQPKNGYWVIGLQNANNYSAFQDAVPGTENYQPKNGNRRNKGLRNADNYSAFRDTF------QPINDSWVTGLRNVDNYNAFQDAVKYS--DFQDGSCSTPSAPLMVPLFMTICPKRVGVFLDCKACTVSFFNVTSNGCLIYKFSKCHFSYPVFPYFSPMICKLPMTLCSPSS
+>Woolly
+MASEILVNIKEEVTCPICLDLLTEPLSLDCGHSFCQACITADHKESTLHQ-GERSCPLCRVGYQSENLRPNRHLANIAERLREVMLSPEEGQKVDRCARHGEKLLLFCQQHGNVICWLCERSQEHRGHSTFLVEEVAQKYREKLQVALEMMREKQQDAEKLEADVREEQASWKIQIKNDKTNILAEFKQLRDILDCEESNELQNLEKEEENLLKILAQSENDMVLQTQSMRVLIADLEHRLQGSVMELLQGVEGIIKRTTNVTLQKPKTFLNEKRRVFRAPNLKGMLQVFKELKEVQCYWAHVTLVPSHPSCAVISEDQRQVRYQKQRH-----------------------------RPSVKAKYFYGVLGSPSFTSGKHYWEVDVSNKSAWILGVCVSLKCTA--NVPGIENYQPKNGYWVIGLQNADNYSAFQDAVPGTEDYQPKNGCWRNTGLRNADNYSAFQDVF------QPKNDYWVTGLWNADNYNAFQDAGKYS--DFQDGSCSTPFAPLIVPLFMTIRPKRVGVFLDYEACTVSFFNVTSNGCLIYKFSNCHFSCPVFPYFSPMTCKLPMTLCSPSS
+>PMarmoset
+MASRILVNIKEEVTCPICLELLTEPLSLDCGHSFCQACITANHKESTLHQ-GERSCPLCRMSYPSENLRPNRHLANIVERLKEVMLSPEEGQKVDHCARHGEKLLLFCQQDGNVICWLCERSQEHRGHHTFLVEEVAEKYQGKLQVALEMMRQKQQDAEKLEADVREEQASWKIQIQNDKTNIMAEFKQLRDILDCEESKELQNLEKEEKNILKRLVQSESDMVLQTQSIRVLISDLERRLQGSVMELLQGVDDVIKRIEKVTLQKPKTFLNEKRRVFRAPDLKGMLQAFKELTEVQRYWAHVTLVPSHPSCTVISEDERQVRYQVPIH-----------------------------QPLVKVKYFYGVLGSLSITSGKHYWEVDVSNKRGWILGVCGSWKCNAKWNVLRPENYQPKNGYWVI-------------------------------------------------------------GLRNTDNYSAFQDAVKYS--DVQDGSRSVSSGPLIVPLFMTICPNRVGVFLDYEACTISFFNVTSNGFLIYKFSNCHFSYPVFPYFSPTTCELPMTLCSPSS
+>Tamarin
+MASRILVNIKEEVTCPICLELLTEPLSLDCGHSFCQACITANHKESTPHQ-GERSCPLCRMSYPSENLRPNRHLANIVERLKEVMLSPEEGQKVGHCARHGEKLLLFCEQDGNVICWLCERSQEHRGHHTLLVEEVAEKYQEKLQVALEMMRQKQQDAEKLEADVREEQASWKIQIRNDKTNIMAEFKQLRDILDCEESKELQNLEKEEKNILKRLVQSESDMVLQTQSMRVLISDLERRLQGSVLELLQGVDDVIKRIETVTLQKPKTFLNEKRRVFRAPDLKAMLQAFKELTEVQRYWAHVTLVPSHPSYAVISEDERQVRYQFQIH-----------------------------QPSVKVNYFYGVLGSPSITSGKHYWEVDVTNKRDWILGICVSFKCNAKWNVLRPENYQPKNGYWVI-------------------------------------------------------------GLQNTNNYSAFQDAVKYS--DFQIGSRSTASVPLIVPLFMTIYPNRVGVFLDYEACTVSFFNVTNNGFLIYKFSNCHFSYPVFPYFSPMTCELPMTLCSPSS
+>Titi
+MASRILVNIKEEVTCPICLELLTEPLSLDCGHSFCQACITANHKESTLHQ-GERSCPLCRISYPSENLRPNRHLANIVERLREVVLSPEEGQKVDLCARHGEKLLLFCQQDGNVICWLCERSQEHRGHHTFLVEEVAQTYRENLQVVLEMMRQKHQDAEKLEADVREEQASWKIQIQNDKTNIMAEFKQLRDILDCEESNELQNLEKEEKNILKRLVQSENDMVLQTQSISVLISDLEHRLQGSVMELLQGVDGVIKRVKNVTLQKPKTFLNEKRRVFRVPDLKGMLQVSKELTEVQRYWAHVTLVASHPSRAVISEDERQVRYQEWIH-----------------------------QSSGRVKYFYGVLGSPSITSGKHYWEVDVSNKSAWILGVCVSLKCAANRNGPGVENYQPKNGYWVI-------------------------------------------------------------GLRNADNYSAFQDSVKYN--DFQDGSRSTTYAPLIVPLFMTICPNRVGVFLDYEACTVSFFNVTSNGFLIYKFSNCHFSYPVFPYFSPMTCELPMTLCSPRS
+>Saki
+MASRILMNIKEEVTCPICLELLTEPLSLDCGHSFCQACITANHKKSMLHQ-GERSCPLCRISYPSENLRPNRHLANIVERLREVMLSPEEGQKVDHCARHGEKLLLFCQQDGNVICWLCERSQEHRGHHTLLVEEVAQTYRENLQVALETMRQKQQDAEKLEADVREEQASWKIQIRDDKTNIMAEFKQLRDILDCEESNELQILEKEEKNILKRLTQSENDMVLQTQSMGVLISDLEHRLQGSVMELLQGVDEVIKRVKNVTLQKPKTFLNEKRRVFRAPDLKGMLQVFKELTEVQRYWVHVTLVPSHLSCAVISEDERQVRYQERIH-----------------------------QSFGKVKYFYGVLGSPSIRSGKHYWEVDVSNKSAWILGVCVSLKCTANRNGPRIENYQPKNGYWVI-------------------------------------------------------------GLWNAGNYSAFQDSVKYS--DFQDGSHSATYGPLIVPLFMTICPNRVGVFLDYEACTVSFFNVTSNGFLIYKFSNCRFSDSVFPYFSPMTCELPMTLCSPRS
diff --git a/phylo/data/phyml_protein_example.newick b/phylo/data/phyml_protein_example.newick
@@ -0,0 +1 @@
+(((((((((Spider:0.03308191,Woolly:0.03582163):0.01444277,Howler:0.06799737):0.02909882,(((PMarmoset:0.02787360,Tamarin:0.03681352):0.01865025,Squirrel:0.08629746):0.01112128,(Saki:0.03881905,Titi:0.04068062):0.01757714):0.00607569):0.20818264,(((Chimp:0.01029099,Gorilla:0.00446741):0.00330774,Human:0.01513926):0.00720972,(Gibbon:0.05851278,Orangutan:0.03164833):0.00204286):0.02732959):0.04351306,(Colobus:0.00814254,DLangur:0.00661586):0.00733489):0.00608429,Patas:0.02612272):0.00687099,(AGM_cDNA:0.00495553,Tant_cDNA:0.00344975):0.00775707):0.00140317,Baboon:0.00482829):0.0,Rhes_cDNA:0.01205729):0.0;