-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Group segments by strains #18
Conversation
1883dbf
to
cbcc232
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't fully finished review, but wanted to provide some workflow feedback.
Maybe I'm missing it somewhere but does this new workflow output a file, or even to the terminal, the strains that are being deduped? I ask because I noticed in the builds that are staged that all the Colombia seqs are missing (which can be seen in the live site) and I can't really tell if that's because they were dropped because theyre duplicates or not. Either way, even if some are duplicated I'd expect at least one of those seqs to make it through? |
cbcc232
to
7324fd2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the multiple iterations on this James! I mostly left non-blocking comments.
I think the only things that need to be resolved
- move
augur curate apply-record-annotations
back to the curate chain (as you've noted) - fix the segment sequence filtering
if len(segments_present)>1: | ||
raise AssertionError(f"Accession '{accession}' (strain '{strain}') mapped to multiple segments: {', '.join(segments_present)}. Skipping this accession.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
Huh, could a single sequence align to multiple segments? Did you actually run into this situation in testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
Accession 'PP952119' (strain 'IRCCS-SCDC_1/2024') mapped to multiple segments: L, S. Skipping this accession.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh weird, I would have expected the alignment to fail for one of the segments. I wonder if the params for the Nextclade alignment are too lax.
Produces a single large metadata file keyed off accession with both strain names and segment-level information encoded therein. The following commit will group this "long" metadata TSV into a "wide" TSV where each row represents a strain.
Allows us to avoid another python script. Original implementation by @joverlee521 in <#18 (comment)>
Mismatched field values across segments (e.g. segments disagree on the 'date') are now resolved by choosing the most common occurrence with the intention they are resolved upstream, as implemented here. This approach was the third implementation. Initially I resolved disagreements within `group_segments.py` via a provided resolutions YAML. After discussion with @joverlee521 we decided this could be better implemented via `augur curate` and the original implementation here did this _after_ the segment grouping, however this made it impossible to distinguish disagreements which will be fixed vs those which won't¹ ¹ <#18 (comment)>
7324fd2
to
afa412f
Compare
They're provided in our other runtimes (almost by happenstance, as part of the underlying OS image) and having them available in all runtimes makes it much easier to write portable programs without having to deal with GNU vs. BSD differences. Note that typically these GNU programs would already be available in the Conda runtime on Linux (via the host system), but not the Conda runtime on macOS (unless installed separately, e.g. via Homebrew). So explicitly including the GNU-flavored programs here increases consistency, isolation, and portability of the runtime. These were ostensibly overlooked in "Provide GNU coreutils in the runtime" (a0b6609). Related-to: <nextstrain/oropouche#18 (comment)>
Mismatched field values across segments (e.g. segments disagree on the 'date') are now resolved by choosing the most common occurrence with the intention they are resolved upstream, as implemented here. This approach was the third implementation. Initially I resolved disagreements within `group_segments.py` via a provided resolutions YAML. After discussion with @joverlee521 we decided this could be better implemented via `augur curate` and the original implementation here did this _after_ the segment grouping, however this made it impossible to distinguish disagreements which will be fixed vs those which won't¹ NOTE: Here we use accession as the ID, however using strain name would be better going forward as it would reduce the duplication needed in the current format. We can't (currently) do this in oropouche because strain names are added _after_ the curate chain runs. ¹ <#18 (comment)>
afa412f
to
2ee0109
Compare
@miparedes - I hope to merge this tomorrow, please let me know if you don't want this merged! |
I just did a fresh pull and everything seems to be running okay. The trees look pretty different than whats currently on nextstrain.org though. I reran everything using this current branch and uploaded the jsons to staging. If you compare https://nextstrain.org/staging/oropouche/S to https://nextstrain.org/oropouche/S you see a different tree topology plus 400 more seqs although the nextstrain.org one was just updated yesterday it seems. same with https://nextstrain.org/staging/oropouche/M where you see the time tree adn the divergence tree looking pretty different (the lower lineage in M seems to be the less diverged but close to the present in time?) but it's probably due to this weird tip https://nextstrain.org/staging/oropouche/M?l=clock&s=BeH505764 which should be easy to exclude. There doesnt seem to be a massive upload of new seqs to explain the large difference between the two builds so I guess im not too sure where the discrepancy is coming from? Thoughts @jameshadfield ? |
Thanks for flagging! I forgot to update the phylo exclude/include lists to use strain names. Will sort that out ASAP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me. Thanks for all the hard work @jameshadfield ! The only outstanding thing is to figure out why the original build didnt include those recent brazil seqs but it isn't blocking given that it seems like the build from this PR is probably the more accurate one.
Transforms a metadata file with one row per accession into a file with one row per strain. Where a strain contains sequences for multiple segments this will group those segments together. Segment-specific field names (i.e. those not in --common-strain-fields) are modified to ensure their suffix is "_{segment}". For instance "accession → accession_HA". Rows are matched on strain name and basic sanity checking is performed when grouping. Segments with multiple matches for a given strain are dropped, and strains where all segments have either zero or multiple matches are dropped entirely. Manual resolutions may be provided via a `--resolutions` YAML which is a list of dictionaries, each with keys "strain", "accession" and "segment" informing the program which accession (of multiple) to use. The resolutions here are taken both from exploring the data and the existing phylo exclude list. Any disagreement within the "--common-strain-fields" will result in the strain being dropped, however empty values may be replaced and ambiguous dates may be replaced with specific ones (where appropriate).
Mismatched field values across segments (e.g. segments disagree on the 'date') are now resolved by choosing the most common occurrence with the intention they are resolved upstream, as implemented here. This approach was the third implementation. Initially I resolved disagreements within `group_segments.py` via a provided resolutions YAML. After discussion with @joverlee521 we decided this could be better implemented via `augur curate` and the original implementation here did this _after_ the segment grouping, however this made it impossible to distinguish disagreements which will be fixed vs those which won't¹ NOTE: Here we use accession as the ID, however using strain name would be better going forward as it would reduce the duplication needed in the current format. We can't (currently) do this in oropouche because strain names are added _after_ the curate chain runs. ¹ <#18 (comment)>
Updates the files which ingest uploads and makes the corresponding changes to the phylogenetic workflow. As metadata (and sequences) now use "strain" as the unique ID a number of simplifications can be made to the workflow. There is one regression: the "accession" column no longer exists and is thus not exported. We'll fix this in a subsequent commit.
This adds back in segment-specific metadata to the Auspice JSON. There are multiple ways this can be done, each with trade-offs. The approach employed here leaves the "_{segment}" suffix on the field names. Alternatively we could remap the metadata file for each `export` call so that (e.g.) "accession_S" becomes "accession".
To use the new metadata format where we group by strain. Steps to regenerate: 1. Populate 'data/' with metadata & sequences from an ingest run. 2. Subsample this as example data via: ``` augur filter --metadata data/metadata.tsv --group-by country --subsample-max-sequences 30 --output-metadata example_data/metadata.tsv augur filter --metadata example_data/metadata.tsv --sequences data/L/sequences.fasta --output-sequences example_data/sequences_L.fasta augur filter --metadata example_data/metadata.tsv --sequences data/M/sequences.fasta --output-sequences example_data/sequences_M.fasta augur filter --metadata example_data/metadata.tsv --sequences data/S/sequences.fasta --output-sequences example_data/sequences_S.fasta ```
6af6b61
to
091dc7f
Compare
This PR closes #15 and is the first implementation of nextstrain/pathogen-repo-guide#59. There are still some improvements to be made, but it's at a good stopping point to reflect on the approach. The main outstanding tasks/improvements are around:
Live builds
https://next.nextstrain.org/staging/oropouche/S/grouped-by-strain
https://next.nextstrain.org/staging/oropouche/M/grouped-by-strain
https://next.nextstrain.org/staging/oropouche/L/grouped-by-strain
Running locally
Trial ingest files are available at:
The phylo workflow may be run locally by adding
--config ingest_url_prefix:"https://data.nextstrain.org/files/workflows/oropouche/trials/grouped-by-strain/"
Current conflicts in the metadata
The approach is currently overly conservative and removes a number of strains due to conflicting metadata. They fall into three groups, however in total we only drop 21 strains + another 8 segments so I think providing a manually curated conflict-resolution-YAML is the best way to resolve these, where appropriate.
Multiple sequences for a segment
There are 14 of these. Example:
Strain TRVL9760 had multiple sequences for segment S. Dropping this entire strain.
Disagreement of common strain fields
Seven occurrences. One for sampling date, six for authors:
Accession mapped to no segments / multiple segments
n=8, examples: