Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace join metadata and clades script with csvtk and tsv append #207

Merged
merged 2 commits into from
Oct 12, 2023

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Oct 4, 2023

Description of proposed changes

In our effort to centralize ingest scripts, we identified an opportunity to streamline the process by replacing the join-metadata-and-clades.py script with csvtk rename and tsv-append in cases where customized score calculations are not required.

The specific columns from a nextclade TSV file to be renamed (if desired) and appended to the final metadata.tsv file are defined in a source-data/nextclade-field-map.tsv key-value file. This implementation is similar to the approach used in ncbi-source-field-map used elsewhere.

Related issue(s)

nextstrain/ingest#23

Checklist

  • Checks pass
  • Manual check

Manual check could look like:

git clone https://github.com/nextstrain/monkeypox.git old
git clone https://github.com/nextstrain/monkeypox.git new

cd new/ingest
git checkout replace_join_metadata_and_clades
nextstrain build . data/metadata.tsv

cd ../../old/ingest
nextstrain build . data/metadata.tsv

# Compare using daff diff since order of columns were not retained
daff diff data/metadata.tsv ../../new/ingest/data/metadata.tsv > diff.txt

@j23414 j23414 force-pushed the replace_join_metadata_and_clades branch from 34087e2 to 7a704ff Compare October 4, 2023 22:29
@j23414 j23414 requested review from a team and joverlee521 October 4, 2023 22:41
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through this @j23414! This approach makes sense to me but I'd be curious to see what others think.

I do want to note that this changes the order of columns in the final metadata TSV. That should be fine for our uses of the metadata TSV. I can't think of any downstream Augur commands that depend on column order.

There's currently an error that is blocking due to the use of the results directory and I've added other comments as well.

ingest/workflow/snakemake_rules/nextclade.smk Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/nextclade.smk Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/nextclade.smk Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/nextclade.smk Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/nextclade.smk Outdated Show resolved Hide resolved
ingest/source-data/nextclade-field-map.tsv Outdated Show resolved Hide resolved
@j23414 j23414 marked this pull request as draft October 5, 2023 21:39
@j23414 j23414 marked this pull request as ready for review October 6, 2023 00:28
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing things up @j23414! I have one final suggestion to parameterize the --key-fields name but everything else looks good by inspection.

Also, not sure if you've seen this in this repo, but you can do a full test run of the ingest workflow with the Fetch and ingest (on branch) workflow. When you select to run on a branch with this workflow, the ingest pipeline will upload outputs to s3://nextstrain-data/files/workflows/monkeypox/branch/<branch-name>/.

ingest/workflow/snakemake_rules/nextclade.smk Outdated Show resolved Hide resolved
@j23414 j23414 marked this pull request as draft October 6, 2023 22:50
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Oct 6, 2023
The shell script for joining the metadata and Nextclade outputs is taken
from @j23414's work in nextstrain/mpox#207

Co-authored-by: Jennifer Chang <jennifer.chang.bioinform@gmail.com>
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Oct 7, 2023
The shell script for joining the metadata and Nextclade outputs is taken
from @j23414's work in nextstrain/mpox#207

Co-authored-by: Jennifer Chang <jennifer.chang.bioinform@gmail.com>
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Oct 10, 2023
The shell script for joining the metadata and Nextclade outputs is taken
from @j23414's work in nextstrain/mpox#207

Co-authored-by: Jennifer Chang <jennifer.chang.bioinform@gmail.com>
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Oct 10, 2023
The shell script for joining the metadata and Nextclade outputs is taken
from @j23414's work in nextstrain/mpox#207

Co-authored-by: Jennifer Chang <jennifer.chang.bioinform@gmail.com>
@j23414 j23414 force-pushed the replace_join_metadata_and_clades branch from 1a68589 to cefff3b Compare October 10, 2023 21:36
@j23414 j23414 marked this pull request as ready for review October 10, 2023 21:55
@j23414
Copy link
Contributor Author

j23414 commented Oct 10, 2023

Thanks @jover! Sorry for the delay; I've added the nextclade-key-field parameterization (cefff3b).

Thank you for highlighting the Fetch and ingest (on branch) workflow!
I'm encountering an error - ./bin/set-branch-ingest-config: No such file or directory with an exit code of 127 when I attempt to run it. Perhaps I'm overlooking something?

https://github.com/nextstrain/monkeypox/actions/runs/6475385391

Nevermind, I think that script got moved to phylogenetic/bin/set-branch-ingest-config.

@j23414 j23414 force-pushed the replace_join_metadata_and_clades branch from 21d3f96 to 7ba8d5c Compare October 10, 2023 23:38
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I think to good to merge as long as the trial run shows no difference in the output (except column order in the metadata.tsv).

ingest/workflow/snakemake_rules/nextclade.smk Outdated Show resolved Hide resolved
@j23414 j23414 force-pushed the replace_join_metadata_and_clades branch 2 times, most recently from af6211f to 9817e6e Compare October 11, 2023 18:09
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Oct 11, 2023
The shell script for joining the metadata and Nextclade outputs is taken
from @j23414's work in nextstrain/mpox#207

Co-authored-by: Jennifer Chang <jennifer.chang.bioinform@gmail.com>
j23414 and others added 2 commits October 12, 2023 15:10
As part of centralizing ingest scripts, replace the join-metadata-and-clades.py
script with csvtk and tsv append when there aren't any customized calculations.

nextstrain/ingest#23

Relatedly, this commit also adds a nextclade config section where mapping
fields from the nextclade output to be appended to the metadata can be specified.

Co-authored-by: Jover Lee <joverlee521@gmail.com>
This change fixes errors for tsv-utils downstream processing.
For example:

 [tsv-join] Error processing command line arguments: Windows/DOS line ending found for data/metadata_raw.tsv
@j23414 j23414 force-pushed the replace_join_metadata_and_clades branch from 9817e6e to 4122deb Compare October 12, 2023 22:14
@j23414 j23414 merged commit afb5513 into master Oct 12, 2023
5 checks passed
@j23414 j23414 deleted the replace_join_metadata_and_clades branch October 12, 2023 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants