ingest: Provide target for raw metadata from NCBI Datasets #38
ingest/defaults/config.yaml
Outdated
#
# The default includes all available fields to be able to easily see the
# uncurated metadata by running the workflow with the target `data/ncbi_dataset_report.tsv`
# Remove any fields that are not needed in your workflow to reduce file size and save space.
# Note: the "accession" field MUST be provided to match with the sequences
ncbi_datasets_fields:
Another solution (the one I was assuming) is to switch the `format_ncbi_dataset_report` rule to use the command from your ingest tutorial:
$ dataformat tsv virus-genome \
--package ingest/data/ncbi_dataset.zip \
> ingest/data/raw_metadata.tsv
And then use `format_ncbi_datasets_ndjson` to prune the fields down to the desired `ncbi_datasets_fields`, change their casing, etc. I presume that NCBI will add fields to their datasets over time, and it'd be a pain to miss important fields because we weren't aware of them (i.e. hadn't realised they existed because they were missing from our `ncbi_datasets_fields`).
Yeah, both the mnemonics (with hyphens) and the human-readable names (with spaces) of the NCBI fields don't work well with our usual tools for TSV manipulation (e.g. tsv-utils). So pruning the fields down and renaming them after the `dataformat` call would require a separate script, which I don't really think is worth it.
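For scale, the separate script mentioned above could look roughly like the following. This is a minimal sketch only, assuming the raw TSV carries human-readable headers with spaces; the entries in `COLUMN_MAP` and the function name `prune_and_rename` are hypothetical illustrations, not part of the workflow.

```python
import csv
import io

# Hypothetical mapping from human-readable NCBI headers to the
# snake_case names a workflow might prefer. Illustrative only.
COLUMN_MAP = {
    "Accession": "accession",
    "Isolate Collection date": "date",
    "Geographic Location": "location",
}


def prune_and_rename(raw_tsv, out_tsv, column_map=COLUMN_MAP):
    """Copy only the mapped columns from raw_tsv to out_tsv, renaming them."""
    reader = csv.DictReader(raw_tsv, delimiter="\t")
    header = reader.fieldnames or []
    # Fail loudly if an expected column is absent from the input.
    missing = [name for name in column_map if name not in header]
    if missing:
        raise ValueError(f"Columns not found in input: {missing}")
    writer = csv.writer(out_tsv, delimiter="\t", lineterminator="\n")
    writer.writerow(column_map.values())
    for row in reader:
        writer.writerow(row[name] for name in column_map)


if __name__ == "__main__":
    # Tiny in-memory example with made-up values.
    raw = io.StringIO(
        "Accession\tIsolate Collection date\tGeographic Location\tLength\n"
        "AB000001\t2020-01-01\tUSA\t15894\n"
    )
    out = io.StringIO()
    prune_and_rename(raw, out)
    print(out.getvalue(), end="")
```

Because the headers are handled as plain dictionary keys, the spaces and hyphens in the NCBI names are not a problem here, but this would still be one more script to maintain.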
Hmm. Ok. How about adding another rule, and then we can ask people to invoke it via `nextstrain build ingest dump_ncbi_dataset_report` or somesuch?
rule dump_ncbi_dataset_report:
    input:
        dataset_package="data/ncbi_dataset.zip",
    output:
        ncbi_dataset_tsv="data/ncbi_dataset_report_raw.tsv",
    shell:
        """
        dataformat tsv virus-genome \
            --package {input.dataset_package} > {output.ncbi_dataset_tsv}
        """
(I say this because I think it's inevitable that we'll miss new NCBI fields in the future: we'll think we're looking at all the data when really we're just looking at the subset our config defines. The measles strain name, for instance.)
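One way to make use of the raw dump for this purpose is to compare its header against the columns already known about. The sketch below is an assumption-laden illustration rather than part of the workflow: the function name `find_unknown_columns` and the example column names are hypothetical, and it only inspects the header row of a TSV such as `data/ncbi_dataset_report_raw.tsv`.

```python
import csv
import io


def find_unknown_columns(raw_tsv, known_columns):
    """Return header columns of raw_tsv that are not in known_columns, in file order."""
    # Read only the first row; an empty file yields an empty header.
    header = next(csv.reader(raw_tsv, delimiter="\t"), [])
    known = set(known_columns)
    return [name for name in header if name not in known]


if __name__ == "__main__":
    # In-memory stand-in for the raw report; the column names are made up.
    raw = io.StringIO("Accession\tLength\tSome New Field\nAB000001\t15894\tx\n")
    for name in find_unknown_columns(raw, ["Accession", "Length"]):
        print(f"Column not seen before: {name}")
```

For a real run, the in-memory example would be replaced by opening the dumped TSV, with the list of known columns kept wherever is convenient.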
Updated in aa67664
Provides an easy way for first time users to get the full uncurated metadata from NCBI Datasets commands by running the ingest workflow with the specified target `dump_ncbi_dataset_report`. They can then inspect and explore the raw data to determine if they want to configure the workflow to use additional fields from NCBI.

The rule is added to `fetch_from_ncbi.smk` to make it easy to run without additional configs. Note that it is not run as part of the default workflow and only intended to be used as a specified target.

Prompted by @jameshadfield in review of the tutorial¹ and resolves #30.

¹ nextstrain/docs.nextstrain.org#195 (comment)

Co-authored-by: James Hadfield <hadfield.james@gmail.com>
faeabdf to aa67664
The pathogen-repo-guide will be updated to include the target `dump_ncbi_dataset_report` to easily generate the uncurated NCBI Dataset metadata.¹ This commit updates the tutorial to use this new target so that the user does not need to manually run the extra commands to see the raw metadata. ¹ nextstrain/pathogen-repo-guide#38
Merging so that nextstrain/docs.nextstrain.org#195 can be merged.