ingest: Provide target for raw metadata from NCBI Datasets #38

Merged
joverlee521 merged 1 commit into main from ingest-no-curation
Apr 3, 2024

Contributor

joverlee521 commented Mar 28, 2024

Provides an easy way for first-time users to get the full
uncurated metadata from NCBI Datasets commands by running the
ingest workflow with the specified target dump_ncbi_dataset_report.
They can then inspect and explore the raw data to determine if they
want to configure the workflow to use additional fields from NCBI.

The rule is added to fetch_from_ncbi.smk to make it easy to run
without additional configs. Note that it is not run as part of the
default workflow and is only intended to be used as a specified target.

Prompted by @jameshadfield in review of the tutorial¹ and
resolves #30.

¹ nextstrain/docs.nextstrain.org#195 (comment)

joverlee521 requested review from corneliusroemer and a team and removed request for corneliusroemer March 28, 2024 21:07
#
# The default includes all available fields to be able to easily see the
# uncurated metadata by running the workflow with the target `data/ncbi_dataset_report.tsv`
# Remove any fields that are not needed in your workflow to reduce file size and save space.
# Note: the "accession" field MUST be provided to match with the sequences
ncbi_datasets_fields:
Member

Another solution (the one I was assuming) is to switch the format_ncbi_dataset_report rule to use the command from your ingest tutorial:

    $ dataformat tsv virus-genome \
        --package ingest/data/ncbi_dataset.zip \
        > ingest/data/raw_metadata.tsv

And then use format_ncbi_datasets_ndjson to prune the fields down to the desired ncbi_datasets_fields, change their casing, etc. I presume that NCBI will add fields to their datasets over time, and it'd be a pain to miss important fields because we weren't aware of them (i.e. hadn't realised they existed because they were missing from our ncbi_datasets_fields).

Contributor Author

Yeah, both the mnemonics (with hyphens) and human-readable names (with spaces) of the NCBI fields don't work well with our usual tools for TSV manipulation (e.g. tsv-utils). So pruning the fields down and renaming them after the dataformat call would require a separate script, which I don't really think is worth it.
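
For a sense of what the "separate script" mentioned above would involve, here is a minimal sketch, not something this PR adds. It assumes the raw TSV header uses human-readable names containing spaces; the two column names in `RENAME` and the function name `prune_and_rename` are illustrative assumptions and should be checked against the actual header.

```python
import csv
import io

# Hypothetical mapping from human-readable NCBI column names (assumed, with
# spaces) to the snake_case names a workflow might prefer.
RENAME = {
    "Accession": "accession",
    "Isolate Lineage": "strain",
}


def prune_and_rename(src, dst, rename=RENAME):
    """Copy only the columns listed in `rename` from src to dst, renamed.

    `src` and `dst` are open text file objects holding tab-separated data.
    Default csv quoting rules are used, which is a simplification.
    """
    reader = csv.DictReader(src, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in rename if name not in header]
    if missing:
        raise KeyError(f"columns not found in input: {missing}")

    writer = csv.DictWriter(
        dst,
        fieldnames=list(rename.values()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # Keep only the mapped columns, written under their new names.
        writer.writerow({new: row[old] for old, new in rename.items()})
```

Even this small version needs a hand-maintained mapping per pathogen, which illustrates the maintenance cost the comment is weighing.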

Member

Hmm. Ok. How about adding another rule and then we can ask people to invoke it via nextstrain build ingest dump_ncbi_dataset_report or somesuch?

rule dump_ncbi_dataset_report:
    input:
        dataset_package="data/ncbi_dataset.zip",
    output:
        ncbi_dataset_tsv="data/ncbi_dataset_report_raw.tsv",
    shell:
        """
        dataformat tsv virus-genome \
            --package {input.dataset_package} > {output.ncbi_dataset_tsv}
        """

(I say this because I think it's inevitable that we'll miss new NCBI fields in the future because we think we're looking at all the data but really we're just looking at the subset which our config has defined. The measles strain name, for instance.)
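
Once a raw dump like the one proposed above exists, a quick way to spot fields that the config does not yet use is to list every column with the number of rows that actually have a value. The following is a minimal sketch under assumptions: the function name `summarize_columns` is hypothetical, and the file path in the usage note is taken from the rule sketched in this comment, which may differ from the merged rule's output.

```python
import csv


def summarize_columns(tsv_path):
    """Return {column name: number of rows with a non-empty value}."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        counts = {name: 0 for name in (reader.fieldnames or [])}
        for row in reader:
            for name, value in row.items():
                # A key of None holds surplus cells from over-long rows;
                # skip those, and skip empty or missing values.
                if name is not None and value:
                    counts[name] += 1
    return counts
```

For example, `summarize_columns("data/ncbi_dataset_report_raw.tsv")` would return one entry per column in the raw report, so a column that is populated but absent from `ncbi_datasets_fields` stands out.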

Contributor Author

Good idea, similar to @j23414's proposal in #30!

I'll update this PR with the new rule and update the tutorial to use it!

Contributor Author

Updated in aa67664

Co-authored-by: James Hadfield <hadfield.james@gmail.com>
joverlee521 changed the title from "ingest/config.yaml: Add all NCBI Datasets fields" to "ingest: Provide target for raw metadata from NCBI Datasets" Mar 29, 2024
joverlee521 added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Mar 29, 2024
The pathogen-repo-guide will be updated to include the target
`dump_ncbi_dataset_report` to easily generate the uncurated NCBI Dataset
metadata.¹

This commit updates the tutorial to use this new target so that the
user does not need to manually run the extra commands to see the raw
metadata.

¹ nextstrain/pathogen-repo-guide#38
@joverlee521
Contributor Author

Merging so that nextstrain/docs.nextstrain.org#195 can be merged.

joverlee521 merged commit 0f799d7 into main Apr 3, 2024
joverlee521 deleted the ingest-no-curation branch April 3, 2024 20:20
Development

Successfully merging this pull request may close these issues.

Provide a no-curation option for NCBI Ingest
2 participants