Allow same strains for all segments #11
71cd572 to 59e895d
I've rebased this onto master and run everything on AWS.
The good: most builds worked well. You can see them at URLs
The bad: I'm still getting some failures in tree building. I'll dive into this. Since no-one else has seen this and I've seen it twice it implicates the changes here. Similar to last time, it failed for segments pb1 & pb2 for h5n1 & h5nx for both time windows.
augur tree --alignment results/aligned_h5n1_pb2_all-time.fasta --output results/tree-raw_h5n1_pb2_all-time.nwk --method iqtree --nthreads 1
ERROR: Shell exited 2 when running: iqtree -ntmax 1 -s results/aligned_h5n1_pb2_all-time-delim.fasta -m GTR -ninit 2 -n 2 -me 0.05 -nt AUTO -redo > results/aligned_h5n1_pb2_all-time-delim.iqtree.log
Command output was:
ERROR: Sequence A_DELIM-MEVCGQVKJKEPCEQXPDCN_egret_DELIM-MEVCGQVKJKEPCEQXPDCN_Korea_DELIM-MEVCGQVKJKEPCEQXPDCN_22WC603_DELIM-MEVCGQVKJKEPCEQXPDCN_2023 has invalid character E at site 14
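One way to narrow down a failure like the one above is to scan the alignment for characters outside the nucleotide alphabet before handing it to the tree builder. The following is a minimal diagnostic sketch, not part of this PR or of augur. The helper names are hypothetical, and the set of accepted characters (IUPAC nucleotide codes plus common gap symbols) is an assumption about what IQ-TREE tolerates.

```python
from io import StringIO

# IUPAC nucleotide codes plus common gap/missing symbols.
# Assumption: this approximates what the tree builder accepts.
VALID_NUCLEOTIDES = set("ACGTURYSWKMBDHVN-?.")


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_invalid_sites(handle, valid=VALID_NUCLEOTIDES):
    """Return {name: [(site, char), ...]} using 1-based sites, offending sequences only."""
    problems = {}
    for name, seq in read_fasta(handle):
        bad = [(i, c) for i, c in enumerate(seq, start=1) if c.upper() not in valid]
        if bad:
            problems[name] = bad
    return problems


# Toy alignment: the second record carries an "E", which is not a nucleotide code.
example = StringIO(">good\nACGTN-ACGT\n>bad\nACGTEACGT\n")
report = find_invalid_sites(example)
print(report)
```

To check a real file, pass an open handle for the alignment in place of the toy `StringIO` object.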
59e895d to 33a83ef
I like this! If we compare tangletrees from PR to live we see a pretty big difference.
Here's live: https://nextstrain.org/avian-flu/h5n1/ha/all-time:avian-flu/h5n1/pa/all-time?m=div
And here's the PR: https://next.nextstrain.org/staging/avian-flu/same-strains-per-segment/h5n1/ha/all-time:staging/avian-flu/same-strains-per-segment/h5n1/pa/all-time?m=div
For comparing reassortment patterns the matched segments are clearly better. Looking at representativeness I didn't see an issue and it looks like the matched segments have just as good data available.
This does complicate the workflow, but identifying and working with full genomes is going to be important. And I do very much like the strategy of amending metadata with
I don't think I have much to change, except to note that I don't foresee wanting to run with
My rationale here was to not change the default behaviour, essentially hiding this feature behind a flag. There's one downside which prevented me from switching it to the default: strains without the full complement of segments will be filtered out, and this includes those in the force-include list (although now that I think about it I should allow them to be force included even if only in a subset of segments).
Ah! I missed this. I had thought that these were force included despite missing segments. Yes, it would be good to keep these included even when preferring complete genomes.
See added content in README.md for how to use. This is useful in its own right, but also paves the way for future work which will attempt to analyse whole genomes.
33a83ef to 504e3eb
Merging now based on above review + review in slack.
Include strains are now always included regardless of how many segments they have data for. Note that (for h5n1) of the 277 include strains only 37 are currently in fauna.
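The selection rule discussed in this thread can be sketched as follows: keep strains that have data for every segment, and additionally keep any force-included strain in whichever segments it has data for. This is an illustrative sketch only, not the code in this PR. The function name and data layout are hypothetical.

```python
def select_strains(strains_by_segment, force_include=frozenset()):
    """Pick the strains to keep for each segment.

    strains_by_segment: dict mapping segment name -> iterable of strain names
        with sequence data for that segment (hypothetical layout).
    force_include: strains to keep wherever they have data, even if they
        lack some segments.
    Returns a dict mapping segment name -> set of strains to keep.
    """
    if not strains_by_segment:
        return {}
    available = {seg: set(names) for seg, names in strains_by_segment.items()}
    # Strains with the full complement of segments.
    complete = set.intersection(*available.values())
    forced = set(force_include)
    return {seg: complete | (forced & names) for seg, names in available.items()}


selected = select_strains(
    {"ha": {"A", "B", "C"}, "pa": {"A", "B"}, "pb2": {"A", "C"}},
    force_include={"C", "D"},
)
print(selected)
```

In this toy input only strain A has all three segments, so B is dropped everywhere. C is force-included, so it is kept for ha and pb2 where it has data, and D is listed for inclusion but has no data at all, so it appears nowhere (as with include strains that are not yet in fauna).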
This is amazing @jameshadfield and I think this is going to be wildly useful. Thank you so much for working on this!!
how to use (copied from README)
By default we subsample data for each segment independently. Alternatively, you can ask the pipeline to use the same strains for each segment. This modifies the pipeline in a few ways:
To enable this set the config parameter same_strains_per_segment to a truthy value. If you are calling snakemake directly you can add
If you are using nextstrain build then add that to the end of the command (i.e. as a parameter which will be passed through to Snakemake).
modifications to Snakefile
This required adding a few more lambda / conditional functions. Let me know if anything's confusing and I can add more comments etc.
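For readers unfamiliar with the pattern, a conditional input function of the kind mentioned above might look roughly like this. It is a sketch under stated assumptions, not the code in this PR: the function name and both file paths are hypothetical, and only the config key same_strains_per_segment comes from the description. The wildcards object is stood in for by a SimpleNamespace so the snippet runs outside Snakemake.

```python
from types import SimpleNamespace


def metadata_for_filter(wildcards, config):
    """Choose which metadata file a rule should read (hypothetical paths).

    When same_strains_per_segment is truthy, every segment reads one shared
    per-subtype file; otherwise each segment reads its own file.
    """
    if config.get("same_strains_per_segment", False):
        return f"results/metadata-shared_{wildcards.subtype}.tsv"
    return f"results/metadata_{wildcards.subtype}_{wildcards.segment}.tsv"


# Stand-in for the wildcards object Snakemake passes to input functions.
w = SimpleNamespace(subtype="h5n1", segment="pb2")
print(metadata_for_filter(w, {"same_strains_per_segment": True}))
print(metadata_for_filter(w, {}))
```

Inside a Snakefile, a rule could wrap such a function in a lambda, for example `input: metadata = lambda w: metadata_for_filter(w, config)`, so the choice is made per job from the wildcards and the global config.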
need to rebase
I need to rebase this on master to include daf5151 but this shouldn't prevent comments on this PR
test runs on AWS
I had a few issues which (I think) are unrelated to this PR as they failed in IQ-TREE due to unexpected characters in pb1/pb2 for h5nx and h5n1 builds. I'll post this in slack. (These failures are represented by "Error!" in the tables below.) I tested via:
You can see these all on nextstrain.org/staging via the following tanglegram URLs:
subtype: h5nx
subtype: h5n1
subtype: h9n2
subtype: h7n9