Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow same strains for all segments #11

Merged
merged 1 commit into from
Apr 25, 2024

Conversation

jameshadfield
Copy link
Member

how to use (copied from README)

By default we subsample data for each segment independently. Alternatively, you can ask the pipeline to use the same strains for each segment. This modifies the pipeline in a few ways:

  1. An additional metadata processing step is added which counts the number of segments a strain has sequence data for
  2. Subsampling is performed as normal for the HA segment, with the additional condition that each sample has sequence data for all segments
  3. All other segments are subsampled to contain the same strains as used for HA in step 2

To enable this set the config parameter same_strains_per_segment to a truthy value. If you are calling snakemake directly you can add

--config 'same_strains_per_segment=True'

If you are using nextstrain build then add that to the end of the command (i.e. as a parameter which will be passed through to Snakemake).

modifications to Snakefile

This required adding a few more lambda / conditional functions. Let me know if anything's confusing and I can add more comments etc.

need to rebase

I need to rebase this on master to include daf5151 but this shouldn't prevent comments on this PR

test runs on AWS

I had a few issues which (I think) are unrelated to this PR as they failed in IQ-TREE due to unexpected characters in pb1/pb2 for h5nx and h5n1 builds. I'll post this in slack. (These failures are represented by "Error!" in the tables below.) I tested via:

nextstrain build --aws-batch --cpus 16 --memory 28gib \
    --no-download --detach-on-interrupt \
    --envdir ~/envdirs/rethink/ \
    . \
    --config 'same_strains_per_segment=True' --keep-going

You can see these all on nextstrain.org/staging via the following tanglegram URLs:

subtype: h5nx

segment na ha pa np mp ns pb1 pb2
na
ha ha:na
pa pa:na pa:ha
np np:na np:ha np:pa
mp mp:na mp:ha mp:pa mp:np
ns ns:na ns:ha ns:pa ns:np ns:mp
pb1 Error! Error! Error! Error! Error! Error!
pb2 Error! Error! Error! Error! Error! Error! Error!

subtype: h5n1

segment na ha pa np mp ns pb1 pb2
na
ha ha:na
pa pa:na pa:ha
np np:na np:ha np:pa
mp mp:na mp:ha mp:pa mp:np
ns ns:na ns:ha ns:pa ns:np ns:mp
pb1 Error! Error! Error! Error! Error! Error!
pb2 Error! Error! Error! Error! Error! Error! Error!

subtype: h9n2

segment na ha pa np mp ns pb1 pb2
na
ha ha:na
pa pa:na pa:ha
np np:na np:ha np:pa
mp mp:na mp:ha mp:pa mp:np
ns ns:na ns:ha ns:pa ns:np ns:mp
pb1 pb1:na pb1:ha pb1:pa pb1:np pb1:mp pb1:ns
pb2 pb2:na pb2:ha pb2:pa pb2:np pb2:mp pb2:ns pb2:pb1

subtype: h7n9

segment na ha pa np mp ns pb1 pb2
na
ha ha:na
pa pa:na pa:ha
np np:na np:ha np:pa
mp mp:na mp:ha mp:pa mp:np
ns ns:na ns:ha ns:pa ns:np ns:mp
pb1 pb1:na pb1:ha pb1:pa pb1:np pb1:mp pb1:ns
pb2 pb2:na pb2:ha pb2:pa pb2:np pb2:mp pb2:ns pb2:pb1

@jameshadfield
Copy link
Member Author

I've rebased this onto master and run everything on AWS.

The good: most builds worked well. You can see them at URLs nextstrain.org/staging/avian-flu/avian-flu/same-strains-per-segment/{SUBTYPE}/{SEGMENT}/{TIME}, for instance, https://nextstrain.org/staging/avian-flu/same-strains-per-segment/h5n1/ha/2y:staging/avian-flu/same-strains-per-segment/h5n1/na/2y (no nice tables this time sorry!)

image

The bad: I'm still getting some failures in tree building. I'll dive into this. Since no-one else has seen this and I've seen it twice it implicates the changes here. Similar to last time, it failed for segments pb1 & pb2 for h5n1 & h5nx for both time windows.

augur tree             --alignment results/aligned_h5n1_pb2_all-time.fasta             --output results/tree-raw_h5n1_pb2_all-time.nwk             --method iqtree             --nthreads 1         
ERROR: Shell exited 2 when running: iqtree -ntmax 1 -s results/aligned_h5n1_pb2_all-time-delim.fasta -m GTR -ninit 2 -n 2 -me 0.05 -nt AUTO -redo > results/aligned_h5n1_pb2_all-time-delim.iqtree.log
Command output was:
  ERROR: Sequence A_DELIM-MEVCGQVKJKEPCEQXPDCN_egret_DELIM-MEVCGQVKJKEPCEQXPDCN_Korea_DELIM-MEVCGQVKJKEPCEQXPDCN_22WC603_DELIM-MEVCGQVKJKEPCEQXPDCN_2023 has invalid character E at site 14

@trvrb
Copy link
Member

trvrb commented Apr 25, 2024

I like this! If we compare tangletrees from PR to live we see a pretty big difference.

Here's live: https://nextstrain.org/avian-flu/h5n1/ha/all-time:avian-flu/h5n1/pa/all-time?m=div
ha-vs-pa-live

And here's the PR: https://next.nextstrain.org/staging/avian-flu/same-strains-per-segment/h5n1/ha/all-time:staging/avian-flu/same-strains-per-segment/h5n1/pa/all-time?m=div
ha-vs-pa-pr

For comparing reassortment patterns the matched segments are clearly better. Looking at representativeness I didn't see an issue and it looks like the matched segments have just as good data available.

This does complicate the workflow, but identifying and working with full genomes is going to be important. And I do very much like the strategy of amending metadata with n_segments.

I don't think I have much to change, except to note that I don't foresee wanting to run with --config 'same_strains_per_segment=True'. If this is the strategy we want then this should just be default. Being able to pass in subtype, etc.. as config would be useful, but I don't see myself wanting to change this flag often.

@trvrb trvrb requested a review from lmoncla April 25, 2024 01:03
@jameshadfield
Copy link
Member Author

I don't foresee wanting to run with --config 'same_strains_per_segment=True'. If this is the strategy we want then this should just be default. Being able to pass in subtype, etc.. as config would be useful, but I don't see myself wanting to change this flag often.

My rational here was to not change the default behaviour, essentially hiding this feature behind a flag. There's one downside which prevented me from switching it to the default - strains without the full complement of segments will be filtered out, and this includes those in the force-include list (although now that I think about it I should allow them to be force included even if only in a subset of segments.)

@trvrb
Copy link
Member

trvrb commented Apr 25, 2024

Ah! I missed this. I had thought that these were force included despite missing segments. Yes, it would be good to keep these included even when preferring complete genomes.

See added content in README.md for how to use.

This is useful in its own right, but also paves the way for future work
which will attempt to analyse whole genomes.
@jameshadfield
Copy link
Member Author

Merging now based on above review + review in slack.

Yes, it would be good to keep these included even when preferring complete genomes.

Include strains are now always included regardless of how many segments they have data for. Note that (for h5n1) of the 277 include strains only 37 are currently in fauna.

@jameshadfield jameshadfield merged commit 64b98ca into master Apr 25, 2024
6 checks passed
@jameshadfield jameshadfield deleted the james/same-strains-per-segment branch April 25, 2024 22:01
@lmoncla
Copy link
Collaborator

lmoncla commented Apr 26, 2024

This is amazing @jameshadfield and I think this is going to be wildly useful. Thank you so much for working on this!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants