Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Likelihood calculation for subst models and pip #5

Merged
merged 68 commits into from
Nov 15, 2023

Conversation

junniest
Copy link
Contributor

Added likelihood computations for both simple substitution models and PIP.

Fixed DNA model row/column placement, the matrices were transposed before.
Fixed also the summation on the diagonal, doesnt seem necessary with the other
fix, but this way the written array for matrix values is easier to read.
Added TN93 model.
Initial implementation of the phylogenetic lieklihood for models of DNA
substitution.
Tests for the computations from the CB ETH lecture and the Molecular
Evolution book.
Running coverage on all branches on a push, and additionally on main on a pull request
Fixed the tests for protein model stationarity.
Fixed HIVB AA substitution model stationary frequencies.
Added tests for HIVB model.
Added back a missing function for creating an empty alignment that is
needed by the parsimony aligner.
* Minor fixes to make linter happy

Minor fixes to make clippy happy, removed unnecessary vec!, into_iter()
and the like.

* Fixed sequence order to match leaves

Fixed sequence order in the vector to match leaf indices. The code relied
on that property to begin with, but it was neither being tested nor
was it being enforced in any way which lead to very malformed alignments
in IndelMaP.
This contains the fix and a test to check that the order matches together
with the data for the test.
Refactored the LikelihoodCostFunction into a trait for future different
implementations.
Removed duplicated EvoModel code from SubstitutionModel module.
Added notes for things that need to be fixed:
    The character probabilities only work correctly for ACGT-, not
for ambiguous chars at the moment, and not for proteins;
    Evolutionary model should be a trait to then work with PIP and
anything else, will also fix the char probabilities when implemented.
Refactoring likelihood computation, moved all relevant functions to
the substitution model module.
Test data files for substitution likelihood computation
Added the EvolutionaryModel trait which should be used for all models
of sequence evolution.
Implemented the trait for DNA substitution models.
All generic implementations remain for all substitution models, but both
DNA and protein have to independently implement EvoModel trait to avoid
implementation conflicts for the template.
Also made sure that the basic character probabilities are computed through
the trait -- there was an error where ambiguous chars were ignored.
EvolutionaryModelInfo trait now requires a model with the EvoModel
trait rather than just a substitution model.

Additionally, fixed likelihood computation for multiple sites, expected only
one value before in the final array, but it should contain len(MSA) column
likelihoods.
Removed the new() method from PhyloInfo due to it being misleading,
it didn't check for data validity (that tree tips/sequences correspond to one
another, that the sequences are stored in the right order).
Added tests for computing nucleotide probabilities on sequences.
Code cleanup for clippy: removed unnecessary vector creation in favour of passing
slices.
Added alignment likelihood tests for sequences that are longer than 1:
    An exampkle from Huelsenbeck with 50 sites;
    A computed example from the CB lectures with X or N characters to
    check how ambiguous chars are processed.
Implemented the EvolutionaryModel trait for protein substitution models.
Added tests for getting correct character probabilities.
Added likelihood calculation to the protein substitution models.
Added a test example with the likelihoods being close to what phyml estimates
but not quite.
The phyml_protein_nogap_example files are for that example, phyml wants .phy
sequences and an unrooted tree.
Added 2 test cases for substitution likelihood reversibility:
    1. simple fabricated example, tn93 likelihoods are the same on two trees
    2. rerooted the huelsenbeck example tree, GTR and k80 likelihoods are
    the same.
Refactored to move the EvolutionaryModel trait to its own module so that
evolutionary models are not just substitution models.
Added the MSA field to the PhyloInfo struct, now when reading data from a file the
msa will only be set if all the read sequences have the same length.
Need to add tests for making sure that sequences are aligned where they need to be.
Added some test that check that the sequences don't get used as the MSA if
they are different lengths, otherwise they get copied to the MSA field.
Changed the FreqVector and the SubstMatrix type to dynamic sizing to make them usable
when the parametrisation on N (number of chars) becomes different for a model with gaps.
It also makes sure that all the data types we use are the same, no real point in using
statically typed matrices when most will still be dynamically typed (e.g. the partial
likelihood matrices).
* Minor fixes to make linter happy

Minor fixes to make clippy happy, removed unnecessary vec!, into_iter()
and the like.

* Fixed sequence order to match leaves

Fixed sequence order in the vector to match leaf indices. The code relied
on that property to begin with, but it was neither being tested nor
was it being enforced in any way which lead to very malformed alignments
in IndelMaP.
This contains the fix and a test to check that the order matches together
with the data for the test.
* Added struct for defining rounding

Added a struct that contains 2 values -- whether to round some of the numbers
and if yes, to how many decimal digits. Not necessary in general, but used in
testing against the values produced by the python scripts, so now rounding is
optional for parsimony scores derived from the substitution models and for the
branch percentiles.

* Added NodeIdx display and node id printing

Implemented Display for NodeIdx so that it always prints what kind of node
it is.
Added a function that helps with logging -- generates string "with ID xxx"
if there is an ID attached to the current node, or gives back an empty string.

* Fixed protein matrices from by row to by cols

The provided protein substitution matrices were actually given by rows
whereas the Matrix struct reads them by columns.
Transposed the matrices to match proper order of cols/rows.

* Added tests for output to appease codecov

Added tests for the helper functions to appease codecov so it lets me merge.
Initial implementation of the phylogenetic lieklihood for models of DNA
substitution.
Tests for the computations from the CB ETH lecture and the Molecular
Evolution book.
Refactored the LikelihoodCostFunction into a trait for future different
implementations.
Removed duplicated EvoModel code from SubstitutionModel module.
Added notes for things that need to be fixed:
    The character probabilities only work correctly for ACGT-, not
for ambiguous chars at the moment, and not for proteins;
    Evolutionary model should be a trait to then work with PIP and
anything else, will also fix the char probabilities when implemented.
Refactoring likelihood computation, moved all relevant functions to
the substitution model module.
Fixed the phyml primate example tree to be rooted and removed the confusing duplicate
tree for the nogap alignment.
Added a test to verify PIP reversibility for a rerooted tree and DNA sequences.
Fixed survival probabilities becoming NaN when a branch length is set to 0.0,
now that means that the survival probability becomes 1.0.
Rearranged the phi computation to avoid huge numbers, using ln instead.
Protein models now get normalised instead of ignoring the flag.
Added tests for methods in in the PIP models that were not covered before.
Making sure the method called for PIP is the method from the EvolutionaryModel
trait rather than the implementation of the model so that the trait's methods
are tested properly and all are being covered.
Made sure that Substitution models treat potential unknown characters (including gaps)
as ambiguous chars (X).
Made sure that substitution model methods are called through EvolutionaryModel trait
methods rather than directly.
Added tests for too many parameters for different DNA models to improve coverage.
Refactored the LikelihoodCostFunction into a trait for future different
implementations.
Removed duplicated EvoModel code from SubstitutionModel module.
Added notes for things that need to be fixed:
    The character probabilities only work correctly for ACGT-, not
for ambiguous chars at the moment, and not for proteins;
    Evolutionary model should be a trait to then work with PIP and
anything else, will also fix the char probabilities when implemented.
Refactoring likelihood computation, moved all relevant functions to
the substitution model module.
Added the EvolutionaryModel trait which should be used for all models
of sequence evolution.
Implemented the trait for DNA substitution models.
All generic implementations remain for all substitution models, but both
DNA and protein have to independently implement EvoModel trait to avoid
implementation conflicts for the template.
Also made sure that the basic character probabilities are computed through
the trait -- there was an error where ambiguous chars were ignored.
EvolutionaryModelInfo trait now requires a model with the EvoModel
trait rather than just a substitution model.

Additionally, fixed likelihood computation for multiple sites, expected only
one value before in the final array, but it should contain len(MSA) column
likelihoods.
Implemented the EvolutionaryModel trait for protein substitution models.
Added tests for getting correct character probabilities.
Added some test that check that the sequences don't get used as the MSA if
they are different lengths, otherwise they get copied to the MSA field.
Changed the FreqVector and the SubstMatrix type to dynamic sizing to make them usable
when the parametrisation on N (number of chars) becomes different for a model with gaps.
It also makes sure that all the data types we use are the same, no real point in using
statically typed matrices when most will still be dynamically typed (e.g. the partial
likelihood matrices).
* Added struct for defining rounding

Added a struct that contains 2 values -- whether to round some of the numbers
and if yes, to how many decimal digits. Not necessary in general, but used in
testing against the values produced by the python scripts, so now rounding is
optional for parsimony scores derived from the substitution models and for the
branch percentiles.

* Added NodeIdx display and node id printing

Implemented Display for NodeIdx so that it always prints what kind of node
it is.
Added a function that helps with logging -- generates string "with ID xxx"
if there is an ID attached to the current node, or gives back an empty string.

* Fixed protein matrices from by row to by cols

The provided protein substitution matrices were actually given by rows
whereas the Matrix struct reads them by columns.
Transposed the matrices to match proper order of cols/rows.

* Added tests for output to appease codecov

Added tests for the helper functions to appease codecov so it lets me merge.
Initial implementation of the phylogenetic lieklihood for models of DNA
substitution.
Tests for the computations from the CB ETH lecture and the Molecular
Evolution book.
Refactoring likelihood computation, moved all relevant functions to
the substitution model module.
Decided to rebase to main to use the Rounding and GapMultiplier classes,
probably was a bad idea to do it right now. Adding small fixes to correct
my rebasing blunders.
Copy link
Contributor

github-actions bot commented Nov 14, 2023

Test results

169 tests  +102   169 ✔️ +102   0s ⏱️ ±0s
    2 suites ±    0       0 💤 ±    0 
    1 files   ±    0       0 ±    0 

Results for commit ac204c5. ± Comparison against base commit c19493f.

This pull request removes 1 and adds 103 tests. Note that renamed tests count towards both.
substitution_models::substitution_models_tests::protein_p_matrix ‑ case_3_hivb
likelihood::likelihood_tests ‑ dna_ambig_example_likelihood
likelihood::likelihood_tests ‑ dna_cb_example_likelihood
likelihood::likelihood_tests ‑ dna_huelsenbeck_example_likelihood
likelihood::likelihood_tests ‑ dna_mol_evo_example_likelihood
likelihood::likelihood_tests ‑ dna_simple_likelihood
likelihood::likelihood_tests::huelsenbeck_example_dna_reversibility_likelihood ‑ case_1_jc69
likelihood::likelihood_tests::huelsenbeck_example_dna_reversibility_likelihood ‑ case_2_k80
likelihood::likelihood_tests::huelsenbeck_example_dna_reversibility_likelihood ‑ case_3_hky
likelihood::likelihood_tests::huelsenbeck_example_dna_reversibility_likelihood ‑ case_4_tn93
likelihood::likelihood_tests::huelsenbeck_example_dna_reversibility_likelihood ‑ case_5_gtr
…

♻️ This comment has been updated with latest results.

Copy link

codecov bot commented Nov 14, 2023

Codecov Report

Attention: 42 lines in your changes are missing coverage. Please review.

Comparison is base (c19493f) 68.24% compared to head (ac204c5) 77.76%.

Files Patch % Lines
phylo/src/substitution_models/dna_models.rs 89.65% 18 Missing ⚠️
phylo/src/pip_model/mod.rs 96.41% 12 Missing ⚠️
phylo/src/evolutionary_models/mod.rs 0.00% 3 Missing ⚠️
phylo/src/substitution_models/mod.rs 96.90% 2 Missing and 1 partial ⚠️
phylo/src/substitution_models/protein_models.rs 96.77% 3 Missing ⚠️
phylo/src/tree/mod.rs 83.33% 2 Missing ⚠️
phylo/src/phylo_info/mod.rs 94.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main       #5      +/-   ##
==========================================
+ Coverage   68.24%   77.76%   +9.51%     
==========================================
  Files          10       13       +3     
  Lines        1228     1902     +674     
  Branches      209      286      +77     
==========================================
+ Hits          838     1479     +641     
- Misses        384      417      +33     
  Partials        6        6              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Removed a duplicate test coming from merge mistake during rebase
@junniest junniest self-assigned this Nov 14, 2023
@junniest junniest requested a review from ClaIgl November 14, 2023 20:58
@junniest junniest merged commit b7b49c2 into main Nov 15, 2023
5 checks passed
@junniest junniest deleted the likelihood_calculation branch November 15, 2023 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants