-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split by serotype using NCBI virus_tax_id #20
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's exciting to see these serotype annotations in the dengue metadata, @j23414, so you can confidently build serotype-specific trees! The ingest workflow ran without issues and the general approach looks good.
I left a few non-blocking comments below to consider. My main requests are minor and involve whitespace (woo!) and removing bash-based conditional logic from a rule.
@@ -11,10 +11,12 @@ if not config: | |||
|
|||
send_slack_notifications = config.get("send_slack_notifications", False) | |||
|
|||
serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking: this kind of configuration detail might want to live in the config YAML, if you thought you might ever want to allow users to change the list of serotypes to ingest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be worried about mismatch with the serotype names defined in infer-dengue-serotype.py
. We should add a wildcard constraint to ensure they match the predefined serotypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Connected this to GitHub Issue #27 for additional discourse and subsequent development.
ingest/bin/infer-dengue-serotype.py
Outdated
default="virus_tax_id", | ||
help="Column name containing the NCBI taxon id of the virus serotype, default: virus_tax_id", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking: If you instantiate the parser
object like this:
parser = argparse.ArgumentParser(
description="Dengue specific processing of metadata, infer serotype from virus_tax_id",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
Then the ArgumentDefaultsHelpFormatter
will automatically print the default values for arguments like this in the help text, so you don't have to copy the default manually. It's not perfect, but it might be good enough for this script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know! Added in 5da76d9
ingest/bin/infer-dengue-serotype.py
Outdated
dengue_types = { | ||
"11053": "denv1", | ||
"11060": "denv2", | ||
"11069": "denv3", | ||
"11070": "denv4", | ||
"31634": "denv2", # Dengue virus 2 Thailand/16681/84 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking: This type of map could also live in a CSV/TSV file that gets passed to this script as an argument, allowing users (future us, etc.) to inspect and edit these mappings without digging into the script. For example, we might soon want to include all taxon ids in the taxonomy subtree for Dengue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linked this to an Issue #27 for further discussion and later development.
If we did move the map to a CSV/TSV file (or define a field_map in the config file directly), we could also refactor the infer_dengue_serotypes.py
into a generalized infer_serotypes.py
that could be housed in vendored/ingest.
It's a good suggestion, just thinking through some planning before implementation, possibly as a later PR
ingest/bin/infer-dengue-serotype.py
Outdated
for taxid, serotype in dengue_types.items(): | ||
if taxid in ids: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was surprised by this loop, at first, since I expected each record to have a single tax id and you could do a dictionary lookup, but is it possible/common for a single record to have multiple tax ids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this is left over from VirusLineage
having multiple tax ids in a comma separated list from general to specific. On the contrary, virus-tax-id
only has one and doing a dictionary lookup may be more streamlined. Thanks for flagging!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to using a dictionary lookup directly in 28cfcad Thanks!
if [[ "{params.ncbi_serotype}" == "all" ]]; then | ||
cp {input.metadata} {output.metadata} | ||
cp {input.sequences} {output.sequences} | ||
else | ||
augur filter \ | ||
--sequences {input.sequences} \ | ||
--metadata {input.metadata} \ | ||
--metadata-id-columns {params.id_field} \ | ||
--query "ncbi_serotype=='{params.ncbi_serotype}'" \ | ||
--output-sequences {output.sequences} \ | ||
--output-metadata {output.metadata} | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One common pattern that I use to avoid using shell if/else statements inside a Snakemake rule is to split out the logical blocks into their own rules and let Snakemake's own filename pattern matching handle the conditional logic more declaratively. For example, this rule can be split into two separate rules like:
rule split_by_ncbi_serotype:
"""
Split the data by serotype based on the NCBI metadata.
"""
input:
metadata = "data/metadata.tsv",
sequences = "data/sequences.fasta"
output:
metadata = "results/metadata_{serotype}.tsv",
sequences = "results/sequences_{serotype}.fasta"
params:
ncbi_serotype = lambda wildcards: wildcards.serotype,
id_field = config["transform"]["id_field"]
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--metadata-id-columns {params.id_field} \
--query "ncbi_serotype=='{params.ncbi_serotype}'" \
--output-sequences {output.sequences} \
--output-metadata {output.metadata}
"""
rule use_all_ncbi_serotypes:
"""
Use data from all serotypes.
"""
input:
metadata = "data/metadata.tsv",
sequences = "data/sequences.fasta"
output:
metadata = "results/metadata_all.tsv",
sequences = "results/sequences_all.fasta"
shell:
"""
cp {input.metadata} {output.metadata}
cp {input.sequences} {output.sequences}
"""
The benefit of this approach is that you keep a one-to-one mapping between rules and shell commands, you get clearer output from Snakemake on the terminal (see example below), and also Snakemake won't try to re-run the use_all_ncbi_serotypes
rule when the params
in the split_by_ncbi_serotype
rule change which is the kind of thing Snakemake does nowadays.
An example of the printed shell commands from a Snakemake dryrun with the current PR state is:
if [[ "denv4" == "all" ]]; then
cp data/metadata.tsv results/metadata_denv4.tsv
cp data/sequences.fasta results/sequences_denv4.fasta
else
augur filter --sequences data/sequences.fasta --metadata data/metadata.tsv --metadata-id-columns genbank_accession --query "ncbi_serotype=='denv4'" --output-sequences results/sequences_denv4.fasta --output-metadata results/metadata_denv4.tsv
fi
vs.
augur filter --sequences data/sequences.fasta --metadata data/metadata.tsv --metadata-id-columns genbank_accession --query "ncbi_serotype=='denv4'" --output-sequences results/sequences_denv4.fasta --output-metadata results/metadata_denv4.tsv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deserves an entry in the Nextstrain Snakemake style guide 🌟
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if/then statement was dropped via a refactor and simplification in 0537e23 Thanks!
""" | ||
Split the data by serotype based on the NCBI metadata. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor request: Can you use the 4-space indentation throughout this new rules file to match the indentation used elsewhere in our workflows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gah, sorry. I need to switch the default spacing in my code editor 😆 Fixed in 678fa5a
metadata="results/metadata.tsv", | ||
sequences="results/sequences.fasta", | ||
metadata="data/metadata.tsv", | ||
sequences="data/sequences.fasta", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further reflection about my rule-splitting comment above, what if you didn't change the output directory here to data/
and instead wrote out the metadata and sequences to results/metadata_all.tsv
and results/sequences_all.fasta
? Then you don't have to copy any files and you can drop the if/else logic from your split_serotypes.smk
rule above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree! Fixed and your suggestion streamlines the pipeline 0537e23
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hooray for NCBI having the virus_tax_id
🎉
+1 for all of @huddlej's comments here and I've added a couple more stylistic suggestions.
4f2591d
to
3ddefcf
Compare
18536b4
to
09170ac
Compare
1. Pull virus_tax_id and virus_name from the dengue NCBI dataset 2. Use the virus_tax_id to infer the ncbi_serotype of each record 3. Use the ncbi_serotype field to split sequences and metadata files into serotypes 4. Update target files in the Snakefile to reflect the new serotype split Co-authored-by: Jover Lee <joverlee521@gmail.com>
09170ac
to
6c206bc
Compare
Description of proposed changes
Split records by serotype by using the virus-tax-id field in NCBI datasets. Historically, we obtained each serotype individually in this code, leading to redundant fetching and processing of each sequence. Now that we're using ncbi datasets, these numeric IDs are recorded in a
virus-tax-id
field, that we can use to separate the serotypes.Changes in this PR include:
virus-tax-id
andvirus-name
from the dengue NCBI datasetvirus-tax-id
to infer thencbi_serotype
field of each recordncbi_serotype
field to split sequences and metadata files into serotypesRelated issue(s)
Checklist