ingest: Provide target for raw metadata from NCBI Datasets #38

joverlee521 · 2024-03-28T21:04:51Z

Provides an easy way for first-time users to get the full
uncurated metadata from NCBI Datasets commands by running the
ingest workflow with the specified target dump_ncbi_dataset_report.
They can then inspect and explore the raw data to determine whether they
want to configure the workflow to use additional fields from NCBI.
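
As a rough illustration of the "inspect and explore" step, the sketch below counts how many records have a non-empty value in each column of a tab-separated report. It is not part of this PR: the function name `summarize_columns` and the sample column names are made up for the example, and it only assumes that the raw report is a TSV with a header row.

```python
import csv
import io


def summarize_columns(tsv_handle):
    """Count non-empty values per column in a tab-separated report.

    Hypothetical helper for exploring the raw metadata; not part of the workflow.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            # Skip overflow cells (key None) and empty or missing values.
            if name in counts and value:
                counts[name] += 1
    return counts


# Invented sample data; real column names come from the NCBI report itself.
sample = "Accession\tExample Field A\tExample Field B\nX1\tfoo\t\nX2\t\t\n"
for column, filled in summarize_columns(io.StringIO(sample)).items():
    print(f"{column}: {filled} non-empty value(s)")
```

To run this against a real report, pass an open file handle for the TSV written by the target instead of the `io.StringIO` sample. Columns that are consistently populated but missing from `ncbi_datasets_fields` are candidates for adding to the config.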

The rule is added to fetch_from_ncbi.smk to make it easy to run
without additional configs. Note that it is not run as part of the
default workflow and is only intended to be used as a specified target.

Prompted by @jameshadfield in review of the tutorial¹ and
resolves #30.

¹ nextstrain/docs.nextstrain.org#195 (comment)

jameshadfield · 2024-03-28T22:35:48Z

ingest/defaults/config.yaml

+#
+# The default includes all available fields to be able to easily see the
+# uncurated metadata by running the workflow with the target `data/ncbi_dataset_report.tsv`
+# Remove any fields that are not needed in your workflow to reduce file size and save space.
 # Note: the "accession" field MUST be provided to match with the sequences
 ncbi_datasets_fields:


Another solution (the one I was assuming) is to switch the format_ncbi_dataset_report rule to use the command from your ingest tutorial:

$ dataformat tsv virus-genome \
    --package ingest/data/ncbi_dataset.zip \
    > ingest/data/raw_metadata.tsv

And then use format_ncbi_datasets_ndjson to prune the fields down to the desired ncbi_datasets_fields, change their casing, etc. I presume that NCBI will add fields to their datasets over time and it'd be a pain to miss important fields because we weren't aware of them (i.e. hadn't realised they existed because they were missing from our ncbi_datasets_fields).

Yeah, both the mnemonics (with hyphens) and the human-readable names (with spaces) of the NCBI fields don't work well with our usual tools for TSV manipulation (e.g. tsv-utils). So pruning the fields down and renaming them after the dataformat call would require a separate script, which I don't really think is worth it.
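
For a sense of what such a separate script would involve, here is a minimal sketch. It is purely illustrative and not something this PR adds: `normalize_header` and `prune_and_rename` are hypothetical names, the column names in the usage example are invented, and the snake_case convention is an assumption rather than the workflow's actual naming.

```python
import csv
import io
import re


def normalize_header(name):
    """Lowercase a column name and collapse spaces/hyphens/etc. into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def prune_and_rename(in_handle, out_handle, keep):
    """Copy a TSV, keeping only columns whose normalized name is in `keep`.

    Hypothetical helper; header names in the output are the normalized ones.
    """
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    header = [normalize_header(h) for h in next(reader, [])]
    indexes = [i for i, h in enumerate(header) if h in keep]
    writer.writerow([header[i] for i in indexes])
    for row in reader:
        # Pad short rows with empty strings so every output row has the same width.
        writer.writerow([row[i] if i < len(row) else "" for i in indexes])


# Invented sample: one name with a space, one with a hyphen.
raw = io.StringIO("Accession\tSome Field\tother-field\nX1\ta\tb\n")
pruned = io.StringIO()
prune_and_rename(raw, pruned, keep={"accession", "other_field"})
print(pruned.getvalue(), end="")
```

Even this small version needs its own decisions about name collisions and malformed rows, which is part of the maintenance cost being weighed here.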

Hmm. Ok. How about adding another rule, and then we can ask people to invoke it via nextstrain build ingest dump_ncbi_dataset_report or somesuch?

rule dump_ncbi_dataset_report:
    input:
        dataset_package="data/ncbi_dataset.zip",
    output:
        ncbi_dataset_tsv="data/ncbi_dataset_report_raw.tsv",
    shell:
        """
        dataformat tsv virus-genome \
            --package {input.dataset_package} > {output.ncbi_dataset_tsv}
        """

(I say this because I think it's inevitable that we'll miss new NCBI fields in the future because we think we're looking at all the data but really we're just looking at a subset which our config's defined. The measles strain name for instance.)

Good idea, similar to @j23414's proposal in #30!

I'll update this PR with the new rule and update the tutorial to use it!

Updated in aa67664

@jameshadfield

Provides an easy way for first time users to get the full uncurated metadata from NCBI Datasets commands by running the ingest workflow with the specified target `dump_ncbi_dataset_report`. They can then inspect and explore the raw data to determine if they want to configure the workflow to use additional fields from NCBI.

The rule is added to `fetch_from_ncbi.smk` to make it easy to run without additional configs. Note that it is not run as part of the default workflow and only intended to be used as a specified target.

Prompted by @jameshadfield in review of the tutorial¹ and resolves #30.

¹ nextstrain/docs.nextstrain.org#195 (comment)

Co-authored-by: James Hadfield <hadfield.james@gmail.com>

The pathogen-repo-guide will be updated to include the target `dump_ncbi_dataset_report` to easily generate the uncurated NCBI Dataset metadata.¹ This commit updates the tutorial to use this new target so that the user does not need to manually run the extra commands to see the raw metadata.

¹ nextstrain/pathogen-repo-guide#38

joverlee521 · 2024-04-03T20:20:36Z

Merging so that nextstrain/docs.nextstrain.org#195 can be merged.

joverlee521 mentioned this pull request Mar 28, 2024

Add ingest tutorials nextstrain/docs.nextstrain.org#195

Merged

joverlee521 requested review from corneliusroemer and a team and removed request for corneliusroemer March 28, 2024 21:07

jameshadfield reviewed Mar 28, 2024

joverlee521 force-pushed the ingest-no-curation branch from faeabdf to aa67664 Compare March 29, 2024 17:30

joverlee521 changed the title ~~ingest/config.yaml: Add all NCBI Datasets fields~~ ingest: Provide target for raw metadata from NCBI Datasets Mar 29, 2024

joverlee521 merged commit 0f799d7 into main Apr 3, 2024

joverlee521 deleted the ingest-no-curation branch April 3, 2024 20:20
