Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy ingest #13

Merged
merged 8 commits into from
Dec 5, 2023
Merged

Copy ingest #13

merged 8 commits into from
Dec 5, 2023

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Oct 11, 2023

Description of proposed changes

Splitting out dengue ingest changes PR#6 into smaller, more focused pull requests. Especially since there have been significant improvements to linking reusable pathogen ingest scripts.

The primary scope of this PR includes:

  1. Copying the Monkeypox ingest directory.
  2. Using git subtree to copy and link the reusable scripts in an ingest/vendored subdirectory.
  3. Temporarily removing Nextclade-related rules, pending the compilation of a Nextclade dengue dataset and potential v3 changes.
  4. Pulling and processing one "sequences.fasta" and "metadata.tsv" file pair.

To ease development and review, tasks that are not part of this PR but will be submitted as future PRs include:

  1. Splitting the fetch process into dengue serotypes (denv1-4).
  2. Adding dengue-specific annotations and data fixes.
  3. Integrating Nextclade-related rules and datasets.

Related issue(s)

Checklist

  • Checks pass
  • Manual check passes
git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout cp_ingest
cd ingest
nextstrain build . data/metadata.tsv data/sequences.fasta

@j23414 j23414 marked this pull request as ready for review October 12, 2023 00:21
@j23414 j23414 changed the title [wip] Copy ingest Copy ingest Oct 12, 2023
@j23414 j23414 requested a review from a team October 12, 2023 00:22
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to install ncbi-datasets-cli to run this (missing datasets command) - it's listed in our conda recipe but it's not listed in the "ambient" section of our install docs. I'd suggest it's added to those docs, or detailed in the README here; even better would be to catch the error and explain what piece of software is missing to the user.

Using ncbi-datasets-cli version 15.23.0 I encountered the following error: Error: Internal error (invalid zip archive). Please try again as part of rule fetch_ncbi_dataset_package. Perhaps I'm missing some env variables? Perhaps it's transient (although it happened multiple times)?

(I don't have a view on whether copy-and-pasting the mpx ingest is the right approach; I'll defer to @joverlee521 here.)

@j23414
Copy link
Contributor Author

j23414 commented Oct 16, 2023

I had to install ncbi-datasets-cli

Ah I see what happened. It appears we switched from NCBI Virus to NCBI Datasets on Sept 11, 2023 and the ingest README here and in Monkeypox need to be updated. I'll ensure that the ingest README for Monkeypox is updated first, followed by a rebase of this PR. Additionally, I'll submit a PR to update the ambient documentation.

Error: Internal error (invalid zip archive)

I've encountered this error during testing in the Monkeypox repository as well. It seems to be a known intermittent issue with ncbi datasets . To mitigate this, we have a retries: 5 directive in the Snakemake rule.

(For documentation:
The copy-and-pasting mpx ingest approach serves several benefits, including maintaining consistency across pathogen repositories, facilitating focused review on new changes, and providing a method to highlight changes that need to be propagated to other repositories.)

@j23414 j23414 mentioned this pull request Oct 16, 2023
1 task
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for breaking down dengue/ingest changes into smaller focused PRs @j23414, this is much easier to review!

I don't have a view on whether copy-and-pasting the mpx ingest is the right approach; I'll defer to @joverlee521 here.

I think it's fine to copy monkeypox/ingest here since it's based on previous work that Jennifer had done. I've made comments on some changes that would be small improvements upon the monkeypox workflow.

This does bring the question to mind of whether it's easier for the team to copy/paste an existing ingest workflow rather than start from scratch with the pathogen-repo-template.

ingest/bin/reverse_reversed_sequences.py Outdated Show resolved Hide resolved
ingest/config/config.yaml Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field map was added to keep the monkeypox NDJSON backwards compatible with field names that we used from NCBI Virus. I don't think we need to maintain it here and just directly use the NCBI Dataset field names.

My reasoning for using the default NCBI Dataset field names is to centralize all field/data transformations to the curation pipeline that start from the NDJSON.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I somewhat appreciate the renaming because I'm not a fan of the spaces and capitalization in the NCBI Dataset field names (e.g. "Isolate Lineage" is being renamed to "strain" and "Geographic Region" to "region").

Do you prefer "Isolate Lineage" or would find "isolate_lineage" also acceptable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'd rather keep the fields exactly as NCBI has them. This way we can point users to the NCBI Dataset docs for a description of each field. I think it would help us avoid things like the submitted/released field mix up in the NDJSON.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea of pointing users to the NCBI Dataset field documentation! However, it is worth noting that the field Geographic Region is documented as geographicRegion and Region. I guess that would be NCBI's responsibility to keep their documentation up-to-date.

Spaces in column names can require special handling (e.g. properly quoting column names) which I think our code has been made to handle. However, such spaces may post challenges in other languages, scripts, or projects. Replacing spaces with underscores in the NCBI Dataset fields, would still enable us to point to NCBI's documentation.

Nevertheless, I'm open to using dengue as a test pipeline to assess how our code handles columns with spaces. If needed, we can later reintroduce modifications to column titles.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, it is worth noting that the field Geographic Region is documented as geographicRegion and Region.

Oh right, I forgot that the NCBI reference docs have a nonintuitive set up. They have a top level "Geographic" column name that gets prepended to the VirusAssembly.CollectionLocation fields. The dataformat tsv virus-genome docs for the available fields is a better place to see the full field names.

So I guess there's no that much benefit from keeping the space matching for the column names...Welp, maybe we should transform the names back to the mnemonics so that we only have to maintain one set of column mappings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be a bit uncertain on desired names for the final metadata, so any feedback is welcome.

I've been looking around for any documentation for standard Nextstrain metadata fields and found a page in the ncov workflow docs. I think we can use these standard metadata fields across all pathogens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revised the metadata fields in 116f363 The fields do not exactly match the docs since we are not pulling in gisaid data, but I tried to adhere to the stylistic choices (e.g. underscores preferred, lowercase). I am open to discussing any of the field names on a call if that would be helpful. Just let me know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like auspice requires accession instead of genbank_accession to render the link. Unless there's a workaround that I'm not seeing.

Screenshot 2023-11-13 at 4 30 52 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ingest/source-data/annotations.tsv Outdated Show resolved Hide resolved
ingest/source-data/geolocation-rules.tsv Outdated Show resolved Hide resolved
ingest/config/optional.yaml Outdated Show resolved Hide resolved
ingest/config/optional.yaml Outdated Show resolved Hide resolved
ingest/README.md Outdated Show resolved Hide resolved
ingest/Snakefile Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/trigger_rebuild.smk Outdated Show resolved Hide resolved
@@ -0,0 +1,87 @@
from snakemake.utils import min_version

min_version(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minimum might actually be higher now that we use a default profile - something like 7.30

Below one might possibly need to pass in extra cli flags

"sra-accs",
"submitter-names",
"submitter-affiliation",
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of duplication of these mappings here - it might be nice to deduplicate

j23414 added a commit to nextstrain/mpox that referenced this pull request Oct 27, 2023
Nested configs for s3 URLs are challenging to override by the Snakemake
`--config` option, as discussed in the following comment:

nextstrain/dengue#13 (comment)

This moves the url to the top level, and propagate those changes to the relevant Snakefiles.
joverlee521 added a commit to nextstrain/mpox that referenced this pull request Oct 30, 2023
Recently learned that Snakemake no longer overwrites the entire config
when providing nested config values via the `--config` option.¹ This
allows us to remove the set-branch-ingest-config script and directly set
branch config values via the command line option in our GitHub Action
workflows.

I've opted to use `jq` to interpolate the branch name into the S3_DST
because I didn't want to fiddle with escaping quotes in the JSON string.

¹ nextstrain/dengue#13 (comment)
joverlee521 added a commit to nextstrain/mpox that referenced this pull request Oct 30, 2023
Recently learned that Snakemake no longer overwrites the entire config
when providing nested config values via the `--config` option.¹ This
allows us to remove the set-branch-ingest-config script and directly set
branch config values via the command line option in our GitHub Action
workflows.

I've opted to use `jq` to interpolate the branch name into the S3_DST
because I didn't want to fiddle with escaping quotes in the JSON string.

¹ nextstrain/dengue#13 (comment)
@joverlee521 joverlee521 mentioned this pull request Nov 3, 2023
1 task
This is a copy of the ingest directory from the mpox repo:

https://github.com/nextstrain/mpox/tree/c9b8282e2d56056c8b3c45fa860d931b320acd63

However, `ingest/vendored` subdirectory is not copied over since that folder should be
added with `git subtree`.

https://github.com/nextstrain/mpox/tree/c9b8282e2d56056c8b3c45fa860d931b320acd63/ingest/vendored#ingest

Future commits will change this to work with Dengue data.
…/vendored

subrepo:
  subdir:   "ingest/vendored"
  merged:   "a0faef5"
upstream:
  origin:   "https://github.com/nextstrain/ingest"
  branch:   "main"
  commit:   "a0faef5"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
Temporary removal of Nextclade-related rules, pending the compilation of
a Nextclade dengue dataset and potential v3 changes.

May be added back in later.
Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
j23414 added a commit that referenced this pull request Nov 9, 2023
Originally the field map was created to keep mpox NDJSON backward compatible
with field names used from NCBI Virus. However, this constraint is not
applicable to dengue.

This commit organizes field renaming into two parts.

1. Rename the NCBI output columns to match the NCBI mnemonics
   (see `source-data/ncbi-dataset-field-map.tsv`)
2. Where necessary, rename the NCBI mnemonics to match Nextstrain expected column names
   (see "transform: fieldmap:" in `config/config.yaml`)

For context and discussion, see #13 (comment)
j23414 added a commit that referenced this pull request Nov 9, 2023
Originally the field map was created to keep mpox NDJSON backward compatible
with field names used from NCBI Virus. However, this constraint is not
applicable to dengue.

This commit organizes field renaming into two parts.

1. Rename the NCBI output columns to match the NCBI mnemonics
   (see `source-data/ncbi-dataset-field-map.tsv`)
2. Where necessary, rename the NCBI mnemonics to match Nextstrain expected column names
   (see "transform: fieldmap:" in `config/config.yaml`)

For context and discussion, see #13 (comment)
j23414 added a commit that referenced this pull request Nov 9, 2023
Originally the field map was created to keep mpox NDJSON backward compatible
with field names used from NCBI Virus. However, this constraint is not
applicable to dengue.¹

This commit organizes field renaming into two parts.

1. Rename the NCBI output columns to match the NCBI mnemonics²
   (see `source-data/ncbi-dataset-field-map.tsv`)

2. Where necessary, rename the NCBI mnemonics to match Nextstrain expected column names³
   (see "transform: fieldmap:" in `config/config.yaml`)

¹ #13 (comment)
² https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
³ https://docs.nextstrain.org/projects/ncov/en/latest/reference/metadata-fields.html
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I stated on Slack, if the workflow in this PR is producing the public files on data.nextstrain.org, then I think it's fine to merge into the main branch now.

We can tackle unresolved questions in subsequent PRs.

Originally the field map was created to keep mpox NDJSON backward compatible
with field names used from NCBI Virus. However, this constraint is not
applicable to dengue.¹

This commit organizes field renaming into two parts.

1. Rename the NCBI output columns to match the NCBI mnemonics²
   (see "ncbi_field_map:" in `config/config.yaml`)
2. Where necessary, rename the NCBI mnemonics to match Nextstrain expected column names³
   (see "transform: fieldmap:" in `config/config.yaml`)

¹ #13 (comment)
² https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
³ https://docs.nextstrain.org/projects/ncov/en/latest/reference/metadata-fields.html
@j23414 j23414 merged commit f513d31 into main Dec 5, 2023
6 checks passed
@j23414 j23414 deleted the cp_ingest branch December 5, 2023 18:15
@j23414 j23414 mentioned this pull request Dec 12, 2023
2 tasks
@j23414 j23414 mentioned this pull request Feb 6, 2024
3 tasks
@j23414 j23414 mentioned this pull request May 23, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants