Likelihood calculation for subst models and pip #5

junniest · 2023-11-14T20:39:09Z

Added likelihood computations for both simple substitution models and PIP.

Fixed DNA model row/column placement; the matrices were transposed before. Also fixed the summation on the diagonal. It doesn't seem necessary with the other fix, but this way the written array of matrix values is easier to read. Added the TN93 model.
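For reference, a minimal sketch of the row/column and diagonal conventions, not the code in this PR. It assumes the state order T, C, A, G, rows as source states, columns as target states, and the usual TN93 parameterisation with two transition rates and one transversion rate. The function name `tn93_q` and its parameters are made up for illustration.

```rust
/// Builds an unnormalised TN93 rate matrix.
/// Rows are the source state and columns the target state, in the
/// (assumed) order T, C, A, G.
fn tn93_q(pi: [f64; 4], alpha1: f64, alpha2: f64, beta: f64) -> [[f64; 4]; 4] {
    let mut q = [[0.0_f64; 4]; 4];
    for i in 0..4 {
        for j in 0..4 {
            if i == j {
                continue;
            }
            let rate = match (i, j) {
                (0, 1) | (1, 0) => alpha1, // T <-> C transitions
                (2, 3) | (3, 2) => alpha2, // A <-> G transitions
                _ => beta,                 // transversions
            };
            // The rate of moving i -> j is scaled by the frequency of the target j.
            q[i][j] = rate * pi[j];
        }
        // The diagonal is still 0.0 here, so this sums only the off-diagonal entries.
        let off_diagonal: f64 = q[i].iter().sum();
        q[i][i] = -off_diagonal;
    }
    q
}
```

With this layout every row sums to zero, and a transposed matrix would show up as a failed stationarity check (`pi * Q` no longer zero) whenever the frequencies are unequal.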

Initial implementation of the phylogenetic likelihood for models of DNA substitution. Tests for the computations from the CB ETH lecture and the Molecular Evolution book.
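As an illustration of what the likelihood computation does, here is a small self-contained sketch of Felsenstein's pruning algorithm for a single site under JC69. It is not the implementation in this PR: the `Tree` enum, `jc69_p`, `partial` and `site_likelihood` are hypothetical names, and JC69 is chosen only because its transition probabilities have a closed form.

```rust
/// A rooted tree; each child is stored with the length of the branch leading to it.
enum Tree {
    /// Leaf with an observed nucleotide index (0..4).
    Leaf(usize),
    Internal(Vec<(Tree, f64)>),
}

/// JC69 transition probability matrix for a branch of length `t`.
fn jc69_p(t: f64) -> [[f64; 4]; 4] {
    let e = (-4.0 * t / 3.0).exp();
    let same = 0.25 + 0.75 * e;
    let diff = 0.25 - 0.25 * e;
    let mut p = [[diff; 4]; 4];
    for i in 0..4 {
        p[i][i] = same;
    }
    p
}

/// Partial likelihoods at a node: entry x is the probability of the data
/// below the node given that the node is in state x.
fn partial(tree: &Tree) -> [f64; 4] {
    match tree {
        Tree::Leaf(state) => {
            let mut v = [0.0; 4];
            v[*state] = 1.0;
            v
        }
        Tree::Internal(children) => {
            let mut v = [1.0; 4];
            for (child, t) in children {
                let p = jc69_p(*t);
                let child_partial = partial(child);
                for x in 0..4 {
                    let s: f64 = (0..4).map(|y| p[x][y] * child_partial[y]).sum();
                    v[x] *= s;
                }
            }
            v
        }
    }
}

/// Site likelihood: root partials weighted by the (uniform) JC69 frequencies.
fn site_likelihood(tree: &Tree) -> f64 {
    partial(tree).iter().map(|l| 0.25 * l).sum()
}
```

For two leaves in the same state joined at the root, the result equals `0.25 * jc69_p(t1 + t2)[0][0]`, which is a convenient sanity check.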

Running coverage on all branches on a push, and additionally on main on a pull request.

Fixed the tests for protein model stationarity. Fixed HIVB AA substitution model stationary frequencies. Added tests for HIVB model.

Added back a missing function for creating an empty alignment that is needed by the parsimony aligner.

* Minor fixes to make linter happy
  Minor fixes to make clippy happy, removed unnecessary vec!, into_iter() and the like.
* Fixed sequence order to match leaves
  Fixed sequence order in the vector to match leaf indices. The code relied on that property to begin with, but it was neither tested nor enforced in any way, which led to very malformed alignments in IndelMaP. This contains the fix and a test to check that the order matches, together with the data for the test.
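The ordering invariant can be sketched roughly as follows. This is only an illustration under assumed types: sequences are `(id, sequence)` string pairs and leaves are identified by plain string ids; `order_by_leaves` is a hypothetical helper, not a function from this repository.

```rust
use std::collections::HashMap;

/// Reorders `(id, sequence)` records so that position `i` holds the record
/// for `leaf_ids[i]`. Returns `None` if the counts differ or a leaf has no
/// matching record.
fn order_by_leaves(
    leaf_ids: &[&str],
    records: &[(String, String)],
) -> Option<Vec<(String, String)>> {
    if leaf_ids.len() != records.len() {
        return None;
    }
    let by_id: HashMap<&str, &String> = records
        .iter()
        .map(|(id, seq)| (id.as_str(), seq))
        .collect();
    leaf_ids
        .iter()
        .map(|id| by_id.get(id).map(|seq| (id.to_string(), seq.to_string())))
        .collect()
}
```

Returning `None` instead of silently keeping the file order is the point: downstream code that indexes sequences by leaf index then never sees a mismatched vector.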

Refactored the LikelihoodCostFunction into a trait for future different implementations. Removed duplicated EvoModel code from SubstitutionModel module. Added notes for things that need to be fixed: The character probabilities only work correctly for ACGT-, not for ambiguous chars at the moment, and not for proteins; Evolutionary model should be a trait to then work with PIP and anything else, will also fix the char probabilities when implemented.

Refactoring likelihood computation, moved all relevant functions to the substitution model module.

Test data files for substitution likelihood computation

Added the EvolutionaryModel trait, which should be used for all models of sequence evolution. Implemented the trait for DNA substitution models. All generic implementations remain for all substitution models, but both DNA and protein have to independently implement the EvoModel trait to avoid implementation conflicts for the template. Also made sure that the basic character probabilities are computed through the trait -- there was an error where ambiguous chars were ignored. The EvolutionaryModelInfo trait now requires a model with the EvoModel trait rather than just a substitution model. Additionally, fixed likelihood computation for multiple sites: the code previously expected only one value in the final array, but it should contain len(MSA) column likelihoods.
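To make the multiple-site point concrete, a minimal sketch for two aligned sequences under JC69 (function names are hypothetical, and the two-sequence setup is a simplification of the tree case): the result has one likelihood per alignment column, and the alignment log-likelihood is the sum of their logarithms.

```rust
/// JC69 probability of going from nucleotide `x` to `y` over a branch of length `t`.
fn jc69(x: u8, y: u8, t: f64) -> f64 {
    let e = (-4.0 * t / 3.0).exp();
    if x == y {
        0.25 + 0.75 * e
    } else {
        0.25 - 0.25 * e
    }
}

/// One likelihood per alignment column for two aligned, unambiguous
/// sequences separated by a total branch length `t`.
fn column_likelihoods(a: &[u8], b: &[u8], t: f64) -> Vec<f64> {
    assert_eq!(a.len(), b.len(), "sequences must be aligned");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| 0.25 * jc69(x, y, t))
        .collect()
}

/// Columns are treated as independent, so the log-likelihoods add up.
fn log_likelihood(columns: &[f64]) -> f64 {
    columns.iter().map(|l| l.ln()).sum()
}
```

Calling `column_likelihoods(b"ACGT", b"ACGA", 0.2)` gives a vector of length 4, not a single value.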

Removed the new() method from PhyloInfo because it was misleading: it didn't check for data validity (that tree tips/sequences correspond to one another, that the sequences are stored in the right order).

Added tests for computing nucleotide probabilities on sequences.

Code cleanup for clippy: removed unnecessary vector creation in favour of passing slices.

Added alignment likelihood tests for sequences that are longer than 1: an example from Huelsenbeck with 50 sites; a computed example from the CB lectures with X or N characters to check how ambiguous chars are processed.

Implemented the EvolutionaryModel trait for protein substitution models. Added tests for getting correct character probabilities.

Added likelihood calculation to the protein substitution models. Added a test example with the likelihoods being close to what phyml estimates but not quite. The phyml_protein_nogap_example files are for that example, phyml wants .phy sequences and an unrooted tree.

Added 2 test cases for substitution likelihood reversibility: 1. simple fabricated example, tn93 likelihoods are the same on two trees 2. rerooted the huelsenbeck example tree, GTR and k80 likelihoods are the same.

Refactored to move the EvolutionaryModel trait to its own module so that evolutionary models are not just substitution models.

Added the MSA field to the PhyloInfo struct, now when reading data from a file the msa will only be set if all the read sequences have the same length. Need to add tests for making sure that sequences are aligned where they need to be.

Added some tests that check that the sequences don't get used as the MSA if they have different lengths; otherwise they get copied to the MSA field.
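The rule being tested can be sketched like this, assuming sequences are plain strings and the MSA is an `Option`; `msa_if_aligned` is a hypothetical name and the real field types in PhyloInfo may differ.

```rust
/// Returns a copy of the sequences to be used as the MSA only if all of
/// them have the same length; otherwise returns `None`.
/// An empty input is treated as "no MSA" in this sketch.
fn msa_if_aligned(sequences: &[String]) -> Option<Vec<String>> {
    let first_len = sequences.first()?.len();
    if sequences.iter().all(|s| s.len() == first_len) {
        Some(sequences.to_vec())
    } else {
        None
    }
}
```

Equal length is only a necessary condition for being aligned, which is why the extra tests mentioned above are still needed.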

Changed the FreqVector and the SubstMatrix types to dynamic sizing to make them usable when the parametrisation on N (number of chars) becomes different for a model with gaps. It also makes sure that all the data types we use are the same; there is no real point in using statically sized matrices when most will still be dynamically sized (e.g. the partial likelihood matrices).

* Added struct for defining rounding
  Added a struct that contains 2 values -- whether to round some of the numbers and if yes, to how many decimal digits. Not necessary in general, but used in testing against the values produced by the python scripts, so now rounding is optional for parsimony scores derived from the substitution models and for the branch percentiles.
* Added NodeIdx display and node id printing
  Implemented Display for NodeIdx so that it always prints what kind of node it is. Added a function that helps with logging -- generates string "with ID xxx" if there is an ID attached to the current node, or gives back an empty string.
* Fixed protein matrices from by row to by cols
  The provided protein substitution matrices were actually given by rows whereas the Matrix struct reads them by columns. Transposed the matrices to match proper order of cols/rows.
* Added tests for output to appease codecov
  Added tests for the helper functions to appease codecov so it lets me merge.
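A rough sketch of what such a rounding struct could look like. The field and method names here (`round`, `digits`, `none`, `to_digits`, `apply`) are assumptions for illustration and may not match the struct added in this PR.

```rust
/// Describes whether values should be rounded and, if so, to how many
/// decimal digits.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rounding {
    round: bool,
    digits: usize,
}

impl Rounding {
    /// No rounding at all.
    fn none() -> Self {
        Rounding { round: false, digits: 0 }
    }

    /// Round to the given number of decimal digits.
    fn to_digits(digits: usize) -> Self {
        Rounding { round: true, digits }
    }

    /// Applies the rounding rule to a single value.
    fn apply(&self, value: f64) -> f64 {
        if !self.round {
            return value;
        }
        let factor = 10f64.powi(self.digits as i32);
        (value * factor).round() / factor
    }
}
```

Passing `Rounding::none()` in production code and something like `Rounding::to_digits(4)` in tests keeps the comparison against externally computed values out of the main code path.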

Fixed the phyml primate example tree to be rooted and removed the confusing duplicate tree for the nogap alignment.

Added a test to verify PIP reversibility for a rerooted tree and DNA sequences.

Fixed survival probabilities becoming NaN when a branch length is set to 0.0; the survival probability now becomes 1.0 in that case. Rearranged the phi computation to avoid huge numbers by working with ln instead.
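Both fixes can be illustrated with a short sketch. It assumes the survival probability has the form (1 - exp(-mu * b)) / (mu * b) and that phi has the form nu^m * exp(nu * (p - 1)) / m!, where p is the probability of an empty column; if the code in this PR uses different expressions, the same ideas still apply. The function names are hypothetical.

```rust
/// Survival probability on a branch of length `branch_length` with deletion
/// rate `mu`. The naive formula evaluates to 0/0 = NaN at zero length, so
/// the limit value 1.0 is returned explicitly.
fn survival_probability(mu: f64, branch_length: f64) -> f64 {
    let x = mu * branch_length;
    if x == 0.0 {
        1.0
    } else {
        // 1 - exp(-x), computed via exp_m1 for accuracy at small x.
        -(-x).exp_m1() / x
    }
}

/// Natural log of phi, computed term by term so that neither nu^m nor m!
/// is ever formed explicitly.
fn ln_phi(m: u32, nu: f64, p_empty: f64) -> f64 {
    let ln_factorial: f64 = (1..=m).map(|k| f64::from(k).ln()).sum();
    f64::from(m) * nu.ln() + nu * (p_empty - 1.0) - ln_factorial
}
```

For an alignment with 1000 columns, `nu.powi(1000)` and `1000!` both overflow an `f64`, while `ln_phi(1000, nu, p)` stays finite.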

Protein models now get normalised instead of ignoring the flag.
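For context, a minimal sketch of one common normalisation: scale Q so that the expected substitution rate at stationarity, minus the frequency-weighted sum of the diagonal, equals 1. Whether this PR normalises in exactly this way is an assumption, and `normalise` is a hypothetical helper.

```rust
/// Scales the rate matrix `q` in place so that the expected number of
/// substitutions per unit time at stationarity is 1.
/// Assumes `q` is square with a negative diagonal and `pi` has matching length.
fn normalise(q: &mut [Vec<f64>], pi: &[f64]) {
    let mean_rate: f64 = pi
        .iter()
        .enumerate()
        .map(|(i, p)| -p * q[i][i])
        .sum();
    for row in q.iter_mut() {
        for value in row.iter_mut() {
            *value /= mean_rate;
        }
    }
}
```

After normalisation, branch lengths can be read as expected substitutions per site, which makes likelihoods comparable with other tools.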

Added tests for methods in the PIP models that were not covered before.

Making sure the method called for PIP is the method from the EvolutionaryModel trait rather than the implementation of the model so that the trait's methods are tested properly and all are being covered.

Made sure that Substitution models treat potential unknown characters (including gaps) as ambiguous chars (X).
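One way to picture this behaviour for nucleotides is the sketch below. It uses the common pruning convention that a fully ambiguous character is compatible with every state (all entries 1.0); the actual character probabilities in this PR may be defined differently, for example weighted by frequencies, and `leaf_partial` and the T, C, A, G order are assumptions.

```rust
/// Leaf vector for a nucleotide character in the (assumed) order T, C, A, G.
/// Anything that is not a plain nucleotide, including gaps and unknown
/// symbols, is treated like the fully ambiguous character.
fn leaf_partial(c: char) -> [f64; 4] {
    match c.to_ascii_uppercase() {
        'T' => [1.0, 0.0, 0.0, 0.0],
        'C' => [0.0, 1.0, 0.0, 0.0],
        'A' => [0.0, 0.0, 1.0, 0.0],
        'G' => [0.0, 0.0, 0.0, 1.0],
        _ => [1.0; 4],
    }
}
```

With this mapping a gap contributes no information at a site under a pure substitution model, in contrast to PIP, where gaps are modelled explicitly.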

Made sure that substitution model methods are called through EvolutionaryModel trait methods rather than directly. Added tests for too many parameters for different DNA models to improve coverage.