Allow same strains for all segments #11

jameshadfield · 2024-04-18T04:33:53Z

how to use (copied from README)

By default we subsample data for each segment independently. Alternatively, you can ask the pipeline to use the same strains for each segment. This modifies the pipeline in a few ways:

An additional metadata processing step is added which counts the number of segments a strain has sequence data for
Subsampling is performed as normal for the HA segment, with the additional condition that each sample has sequence data for all segments
All other segments are subsampled to contain the same strains as used for HA in step 2

To enable this set the config parameter same_strains_per_segment to a truthy value. If you are calling snakemake directly you can add

--config 'same_strains_per_segment=True'

If you are using nextstrain build then add that to the end of the command (i.e. as a parameter which will be passed through to Snakemake).

modifications to Snakefile

This required adding a few more lambda / conditional functions. Let me know if anything's confusing and I can add more comments etc.

need to rebase

I need to rebase this on master to include daf5151 but this shouldn't prevent comments on this PR

test runs on AWS

I had a few issues which (I think) are unrelated to this PR as they failed in IQ-TREE due to unexpected characters in pb1/pb2 for h5nx and h5n1 builds. I'll post this in slack. (These failures are represented by "Error!" in the tables below.) I tested via:

nextstrain build --aws-batch --cpus 16 --memory 28gib \
    --no-download --detach-on-interrupt \
    --envdir ~/envdirs/rethink/ \
    . \
    --config 'same_strains_per_segment=True' --keep-going

You can see these all on nextstrain.org/staging via the following tanglegram URLs:

subtype: h5nx

segment	na	ha	pa	np	mp	ns	pb1
na
ha	ha:na
pa	pa:na	pa:ha
np	np:na	np:ha	np:pa
mp	mp:na	mp:ha	mp:pa	mp:np
ns	ns:na	ns:ha	ns:pa	ns:np	ns:mp
pb1	Error!	Error!	Error!	Error!	Error!	Error!
pb2	Error!	Error!	Error!	Error!	Error!	Error!	Error!

subtype: h5n1

segment	na	ha	pa	np	mp	ns	pb1
na
ha	ha:na
pa	pa:na	pa:ha
np	np:na	np:ha	np:pa
mp	mp:na	mp:ha	mp:pa	mp:np
ns	ns:na	ns:ha	ns:pa	ns:np	ns:mp
pb1	Error!	Error!	Error!	Error!	Error!	Error!
pb2	Error!	Error!	Error!	Error!	Error!	Error!	Error!

subtype: h9n2

segment	na	ha	pa	np	mp	ns	pb1
na
ha	ha:na
pa	pa:na	pa:ha
np	np:na	np:ha	np:pa
mp	mp:na	mp:ha	mp:pa	mp:np
ns	ns:na	ns:ha	ns:pa	ns:np	ns:mp
pb1	pb1:na	pb1:ha	pb1:pa	pb1:np	pb1:mp	pb1:ns
pb2	pb2:na	pb2:ha	pb2:pa	pb2:np	pb2:mp	pb2:ns	pb2:pb1

subtype: h7n9

segment	na	ha	pa	np	mp	ns	pb1
na
ha	ha:na
pa	pa:na	pa:ha
np	np:na	np:ha	np:pa
mp	mp:na	mp:ha	mp:pa	mp:np
ns	ns:na	ns:ha	ns:pa	ns:np	ns:mp
pb1	pb1:na	pb1:ha	pb1:pa	pb1:np	pb1:mp	pb1:ns
pb2	pb2:na	pb2:ha	pb2:pa	pb2:np	pb2:mp	pb2:ns	pb2:pb1

jameshadfield · 2024-04-24T03:45:58Z

I've rebased this onto master and run everything on AWS.

The good: most builds worked well. You can see them at URLs nextstrain.org/staging/avian-flu/avian-flu/same-strains-per-segment/{SUBTYPE}/{SEGMENT}/{TIME}, for instance, https://nextstrain.org/staging/avian-flu/same-strains-per-segment/h5n1/ha/2y:staging/avian-flu/same-strains-per-segment/h5n1/na/2y (no nice tables this time sorry!)

The bad: I'm still getting some failures in tree building. I'll dive into this. Since no-one else has seen this and I've seen it twice it implicates the changes here. Similar to last time, it failed for segments pb1 & pb2 for h5n1 & h5nx for both time windows.

augur tree             --alignment results/aligned_h5n1_pb2_all-time.fasta             --output results/tree-raw_h5n1_pb2_all-time.nwk             --method iqtree             --nthreads 1         
ERROR: Shell exited 2 when running: iqtree -ntmax 1 -s results/aligned_h5n1_pb2_all-time-delim.fasta -m GTR -ninit 2 -n 2 -me 0.05 -nt AUTO -redo > results/aligned_h5n1_pb2_all-time-delim.iqtree.log
Command output was:
  ERROR: Sequence A_DELIM-MEVCGQVKJKEPCEQXPDCN_egret_DELIM-MEVCGQVKJKEPCEQXPDCN_Korea_DELIM-MEVCGQVKJKEPCEQXPDCN_22WC603_DELIM-MEVCGQVKJKEPCEQXPDCN_2023 has invalid character E at site 14

trvrb · 2024-04-25T01:02:43Z

I like this! If we compare tangletrees from PR to live we see a pretty big difference.

Here's live: https://nextstrain.org/avian-flu/h5n1/ha/all-time:avian-flu/h5n1/pa/all-time?m=div

And here's the PR: https://next.nextstrain.org/staging/avian-flu/same-strains-per-segment/h5n1/ha/all-time:staging/avian-flu/same-strains-per-segment/h5n1/pa/all-time?m=div

For comparing reassortment patterns the matched segments are clearly better. Looking at representativeness I didn't see an issue and it looks like the matched segments have just as good data available.

This does complicate the workflow, but identifying and working with full genomes is going to be important. And I do very much like the strategy of amending metadata with n_segments.

I don't think I have much to change, except to note that I don't foresee wanting to run with --config 'same_strains_per_segment=True'. If this is the strategy we want then this should just be default. Being able to pass in subtype, etc.. as config would be useful, but I don't see myself wanting to change this flag often.

jameshadfield · 2024-04-25T01:05:43Z

I don't foresee wanting to run with --config 'same_strains_per_segment=True'. If this is the strategy we want then this should just be default. Being able to pass in subtype, etc.. as config would be useful, but I don't see myself wanting to change this flag often.

My rational here was to not change the default behaviour, essentially hiding this feature behind a flag. There's one downside which prevented me from switching it to the default - strains without the full complement of segments will be filtered out, and this includes those in the force-include list (although now that I think about it I should allow them to be force included even if only in a subset of segments.)

trvrb · 2024-04-25T01:06:44Z

Ah! I missed this. I had thought that these were force included despite missing segments. Yes, it would be good to keep these included even when preferring complete genomes.

See added content in README.md for how to use. This is useful in its own right, but also paves the way for future work which will attempt to analyse whole genomes.

jameshadfield · 2024-04-25T22:01:13Z

Merging now based on above review + review in slack.

Yes, it would be good to keep these included even when preferring complete genomes.

Include strains are now always included regardless of how many segments they have data for. Note that (for h5n1) of the 277 include strains only 37 are currently in fauna.

lmoncla · 2024-04-26T16:02:08Z

This is amazing @jameshadfield and I think this is going to be wildly useful. Thank you so much for working on this!!

jameshadfield force-pushed the james/same-strains-per-segment branch from 71cd572 to 59e895d Compare April 24, 2024 01:07

jameshadfield mentioned this pull request Apr 24, 2024

Corrupted sequences in Fauna #16

Closed

jameshadfield force-pushed the james/same-strains-per-segment branch from 59e895d to 33a83ef Compare April 24, 2024 09:33

trvrb requested a review from lmoncla April 25, 2024 01:03

Allow same strains for all segments

504e3eb

See added content in README.md for how to use. This is useful in its own right, but also paves the way for future work which will attempt to analyse whole genomes.

jameshadfield force-pushed the james/same-strains-per-segment branch from 33a83ef to 504e3eb Compare April 25, 2024 21:32

jameshadfield merged commit 64b98ca into master Apr 25, 2024
6 checks passed

jameshadfield deleted the james/same-strains-per-segment branch April 25, 2024 22:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow same strains for all segments #11

Allow same strains for all segments #11

jameshadfield commented Apr 18, 2024

jameshadfield commented Apr 24, 2024

trvrb commented Apr 25, 2024

jameshadfield commented Apr 25, 2024

trvrb commented Apr 25, 2024

jameshadfield commented Apr 25, 2024

lmoncla commented Apr 26, 2024

Allow same strains for all segments #11

Allow same strains for all segments #11

Conversation

jameshadfield commented Apr 18, 2024

jameshadfield commented Apr 24, 2024

trvrb commented Apr 25, 2024

jameshadfield commented Apr 25, 2024

trvrb commented Apr 25, 2024

jameshadfield commented Apr 25, 2024

lmoncla commented Apr 26, 2024