fix: explicitly specify how the root and ambiguous states are handled… #1690

rneher · 2024-11-25T19:53:05Z

… during sequence reconstruction and mutation counting.

Checklist

Automated checks pass
Check if you need to add a changelog message
Check if you need to add tests
Check if you need to update docs

jameshadfield · 2024-11-25T20:58:02Z

Running this on a mpox clade-I build which currently assigns all root mutations to the basal branch leading to clade Ib.

This PR: Branch leading to clade Ia

Augur refine divergence: 13 mutations.
Inferred mutations from ATGC to ATGC: 14 (e.g. A2963G)
Gap to base mutations: 205 (e.g. -19105T)
Base to gap mutations: 0

This PR: Branch leading to clade Ib

Augur refine divergence: 52 mutations.
Inferred mutations from ATGC to ATGC: 59
Gap to base mutations: 0
Base to gap mutations: 677

Pre-PR behaviour

Same divergence from augur refine (I didn't re-run the rule)
All mutations are on the branch leading to clade-Ib (not good, obviously)
- Inferred mutations from ATGC to ATGC: 73 (all on the branch leading to clade-Ib)
- Gap to base mutations: 0
- Base to gap mutations: 882

So it's a lot better - and perhaps this PR is correct within its remit - but something's not quite right. The total number of inferred ATGC-ATGC mutations (73) is unchanged, but this doesn't match the refined branch lengths (65). Perhaps this is due to masking being used in the refine alignment but not considered for the conversion of subs/site to number of mutations?

Secondly, and this is just for my own interest, how are the gap mutations being allocated among sister branches? I'd have thought clade IIb gets 882 * 52/(52+13) = 706 but that branch gets 677.

I've got some tests here which use small contrived (50nt-ish) genomes where we can reason with the mutations so I'll take a look at them later.

jameshadfield · 2024-11-26T22:04:13Z

Perhaps this is due to masking being used in the refine alignment but not considered for the conversion of subs/site to number of mutations?

I thought this was simply the subs/site scaled by some scalar, but it's actually doing the full inference and counting mutations, and this inference is modified by this PR (I missed this).

Rerunning mpox including the refine step makes the assigned mutations match the mutation count. The allocation of gap-mutations also follow the branch-lengths.

jameshadfield · 2024-11-26T23:36:41Z

I looked into the tests, starting with the first test here which infers ancestral mutations given the reference via --root-sequence. This is easy to reason with because it uses the toy genome I introduced here:

The results using this PR are stochastic, I observed the following three reconstructions back-to-back-to-back:

Run 1: Test succeeded (including the non-parsimonious reconstruction at pos 18 noted in the test)
Run 2: pos 14 inferred to be T at the root with 2 mutations needed (the parsimonious reconstruction with C at the root needs only 1). Similarly we now infer pos 7 to be G at the root (this is the same mutation pattern across the tree as pos 18).
Run 3: Similar to run 2 however inferring pos 7 to be G at the root but using the parsimonious reconstruction of pos 18 (i.e. it matches the root).

For the same test but looking at the case where we don't supply the root-sequence we see similar stochasticity, with differences in root-sequence reconstruction observed at pos 7, 18, 33, 43.

rneher · 2024-11-27T09:23:31Z

thanks for digging into this. This makes all sense to me and is the expected result as far as i can tell. providing a --root-sequence does not actually affect the reconstruction, it just decorates the branch to root with all mutations necessary to construct the inferred root sequence from the provided root sequence.

rneher · 2024-11-27T09:26:02Z

note that the sample C is a lot closer to the root (0.02) than Node AB (0.06). So in most cases, the root will agree with C.

and yes, we infer the actual mutations for the branch length in mutation units. this made sense from the perspective of very similar genomes, where you want branch length to correspond to the mutations in a direct discrete way.

victorlin

I didn't look into the main issue, but noticed a small Python syntax change to be addressed.

augur/refine.py

huddlej · 2024-12-17T22:48:58Z

I tried to make the augur ancestral tests deterministic by adding a --seed argument to the ancestral command that controls TreeTime's random seed. I updated the failing tests to match the new expected "random" behavior and was able to consistently get those tests passing locally. See the commit messages for the details. I expect at least one other test to remain broken and this does feel a bit like whack-a-mole when such a fundamental part of augur (TreeTime) becomes stochastic. Another approach would be to provide an interface through refine and ancestral to select the original behavior of those tools prior to this PR's change, just to enable deterministic behavior.

Do folks feel like the --seed approach is a step in the right direction or should we try something else?

victorlin · 2024-12-17T22:51:36Z

I think seeding is reasonable. It's also used in many augur filter functional tests with --subsample-seed.

jameshadfield · 2024-12-18T00:41:12Z

Do folks feel like the --seed approach is a step in the right direction or should we try something else?

Yes, this seems like a good direction, both for testing and facilitating reproducible builds. I'm also happy to skip certain tests (or parts of tests) where expedient.

codecov · 2024-12-18T17:52:43Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.82%. Comparing base (77ae31e) to head (d618f53).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1690      +/-   ##
==========================================
+ Coverage   72.79%   72.82%   +0.02%     
==========================================
  Files          79       79              
  Lines        8271     8272       +1     
  Branches     1691     1691              
==========================================
+ Hits         6021     6024       +3     
+ Misses       1961     1960       -1     
+ Partials      289      288       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

huddlej · 2024-12-18T17:54:42Z

Thanks, @victorlin and @jameshadfield! I managed to get the remaining broken tests to pass (stably?). If this looks good to you both and @rneher, then I can update the CHANGELOG to reflect the additional ancestral feature and we can merge.

… during sequence reconstruction and mutation counting

Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>

Adds a new `--seed` argument to the ancestral command, following the existing pattern in the refine command, and passes the user-provided value to TreeTime's `TreeAnc` class to ensure deterministic outcomes for random samples. Since this PR changes ancestral's root sequence inference to be stochastic, this new feature allows us to get the same "random" result with each run of our functional tests. Correspondingly, this commit updates the non-VCF functional tests of ancestral to use a fixed seed. This seed should fix the random behavior to a specific outcome, but we need to update our expected JSON outputs to match that outcome here. In this case, the change is an introduction of a "T" at position 14 in the node_root of the "simple genome" data.

As in the previous commit, this commit updates the expected VCF output to match the fixed "random" outcome for the given seed where position 14 is a T in the root node.

Updates the functional test of multiallele VCF behavior when one of the alleles is a "N" character. The multiallele VCF in the original test (setting sample B to genotype of 30G) caused the inference of alleles at sites 7 and 14 to change substantially in this PR's new implementation. This commit stabilizes the test by assigning both samples A and B to the 30G genotype such that all nodes are inferred to have a 30G and the root node gets a A30G mutation. It should maintain the originally intended behavior of the test [1]. As part of this change, I've replaced the dynamically modified VCF used by this test with a static version of the VCF in version control. This change simplified test debugging by allowing me to inspect the input file outside of the cram temporary environment and it also fixes an issue with the original dynamic VCF where an additional line for site 30 was added but the original line was not being removed. [1] #1380

The change to how augur refine can stochastically assign the root node mutations led to a change in inferred branch lengths for our functional tests. Since the test uses a random seed argument for augur refine, the results of subsequent runs should be deterministically successful after this change.

Updates data for augur translate functional tests to match the updated test data for the ancestral command.

Updates SNPs in manually-curated VCF used to test augur translate so they match the new expected genotype for the "simple genome" data at position 14. This fixes the vcf.t translate test. Also updates the expected amino acid translations JSON for the vcf with root mutation test. As part of this update, I've created a complete copy of the "truth" aa_muts.json with the root mutation and used this instead of the version that was previously created by modifying the original aa_muts file with sed. Although this change duplicates test data, I was able to debug the test more easily with a complete version of the JSON living outside of the cram tests.

Adds random seed argument to all functional tests of the ancestral command. This should hopefully resolve remaining stochastic changes in cram runs during CI.

huddlej · 2024-12-23T21:10:44Z

I tried to resolve a conflict in the change log with GitHub's web UI which ended up making a merge commit from master that I didn't want. To fix this and the change log, I rebased this branch onto master and force-pushed. Sorry for the mess! 🤦🏻

rneher requested a review from jameshadfield November 25, 2024 20:08

victorlin reviewed Dec 4, 2024

View reviewed changes

augur/refine.py Outdated Show resolved Hide resolved

genehack approved these changes Dec 18, 2024

View reviewed changes

rneher and others added 11 commits December 23, 2024 13:06

fix: explicitly specify how the root and ambiguous states are handled…

77be5a7

… during sequence reconstruction and mutation counting

add to changelog

3adf44b

fix: change string 'True' to bool

22fa17f

Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>

Use random seed in simple VCF test of ancestral

0da5c11

As in the previous commit, this commit updates the expected VCF output to match the fixed "random" outcome for the given seed where position 14 is a T in the root node.

Align data for ancestral and translate tests

28dfae9

Updates data for augur translate functional tests to match the updated test data for the ancestral command.

Use random seed in all ancestral tests

d391989

Adds random seed argument to all functional tests of the ancestral command. This should hopefully resolve remaining stochastic changes in cram runs during CI.

Update change log

d618f53

huddlej force-pushed the fix/root-state-in-sequence-reconstruction branch from f556cc9 to d618f53 Compare December 23, 2024 21:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: explicitly specify how the root and ambiguous states are handled… #1690

fix: explicitly specify how the root and ambiguous states are handled… #1690

rneher commented Nov 25, 2024 •

edited by huddlej

Loading

jameshadfield commented Nov 25, 2024 •

edited

Loading

jameshadfield commented Nov 26, 2024 •

edited

Loading

jameshadfield commented Nov 26, 2024

rneher commented Nov 27, 2024

rneher commented Nov 27, 2024

victorlin left a comment

huddlej commented Dec 17, 2024

victorlin commented Dec 17, 2024

jameshadfield commented Dec 18, 2024

codecov bot commented Dec 18, 2024 •

edited

Loading

huddlej commented Dec 18, 2024

huddlej commented Dec 23, 2024

fix: explicitly specify how the root and ambiguous states are handled… #1690

Are you sure you want to change the base?

fix: explicitly specify how the root and ambiguous states are handled… #1690

Conversation

rneher commented Nov 25, 2024 • edited by huddlej Loading

Checklist

jameshadfield commented Nov 25, 2024 • edited Loading

jameshadfield commented Nov 26, 2024 • edited Loading

jameshadfield commented Nov 26, 2024

rneher commented Nov 27, 2024

rneher commented Nov 27, 2024

victorlin left a comment

Choose a reason for hiding this comment

huddlej commented Dec 17, 2024

victorlin commented Dec 17, 2024

jameshadfield commented Dec 18, 2024

codecov bot commented Dec 18, 2024 • edited Loading

Codecov Report

huddlej commented Dec 18, 2024

huddlej commented Dec 23, 2024

rneher commented Nov 25, 2024 •

edited by huddlej

Loading

jameshadfield commented Nov 25, 2024 •

edited

Loading

jameshadfield commented Nov 26, 2024 •

edited

Loading

codecov bot commented Dec 18, 2024 •

edited

Loading