Replace join metadata and clades script with csvtk and tsv append #207
Thanks for working through this @j23414! This approach makes sense to me but I'd be curious to see what others think.
I do want to note that this changes the order of columns in the final metadata TSV. That should be fine for our uses of the metadata TSV. I can't think of any downstream Augur commands that depend on column order.
There's currently a blocking error caused by the use of the results directory, and I've added other comments as well.
Thanks for fixing things up @j23414! I have one final suggestion to parameterize the --key-fields name, but everything else looks good by inspection.
Also, not sure if you've seen this in this repo, but you can do a full test run of the ingest workflow with the Fetch and ingest (on branch) workflow. When you select to run on a branch with this workflow, the ingest pipeline will upload outputs to s3://nextstrain-data/files/workflows/monkeypox/branch/<branch-name>/.
The shell script for joining the metadata and Nextclade outputs is taken from @j23414's work in nextstrain/mpox#207 Co-authored-by: Jennifer Chang <jennifer.chang.bioinform@gmail.com>
Thanks @jover! Sorry for the delay; I've added the nextclade-key-field parameterization (cefff3b). Thank you for highlighting the Fetch and ingest (on branch) workflow: https://github.com/nextstrain/monkeypox/actions/runs/6475385391 Never mind, I think that script got moved to phylogenetic/bin/set-branch-ingest-config.
Looks good to me! I think this is good to merge as long as the trial run shows no difference in the output (except column order in the metadata.tsv).
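One way to check "no difference except column order" on a trial run is to compare the two metadata files record by record rather than line by line. The sketch below is only an illustration under assumptions: the function names and the sample columns are hypothetical, and it reads the TSVs from strings so that it is self-contained.

```python
import csv
import io


def load_rows(tsv_text):
    """Parse TSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def same_ignoring_column_order(old_text, new_text):
    """Return True when both TSVs hold the same records in the same row order.

    Each record is a dict, and dict equality does not depend on key order,
    so a reordering of columns does not count as a difference.
    """
    return load_rows(old_text) == load_rows(new_text)


# Hypothetical sample data: same records, different column order.
old = "accession\tclade\tdate\nA1\tIIb\t2022-06-01\nA2\tI\t2022-07-15\n"
new = "accession\tdate\tclade\nA1\t2022-06-01\tIIb\nA2\t2022-07-15\tI\n"

print(same_ignoring_column_order(old, new))
```

For real files, the same comparison could be run on the contents of the old and new metadata.tsv after reading them with open(...).read().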
As part of centralizing ingest scripts, replace the join-metadata-and-clades.py script with csvtk and tsv append when there aren't any customized calculations (nextstrain/ingest#23). Relatedly, this commit also adds a nextclade config section that specifies which fields from the Nextclade output are mapped and appended to the metadata. Co-authored-by: Jover Lee <joverlee521@gmail.com>
This change fixes errors for tsv-utils downstream processing. For example: [tsv-join] Error processing command line arguments: Windows/DOS line ending found for data/metadata_raw.tsv
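The commit message does not say how the line endings were fixed, so the following is only a sketch of the general idea: converting Windows/DOS (CRLF) line endings to Unix (LF) before the file reaches tsv-utils. The function name and the sample bytes are hypothetical.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Convert Windows/DOS (CRLF) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical sample resembling a metadata TSV with DOS line endings.
raw = b"accession\tdate\r\nA1\t2022-06-01\r\n"
clean = normalize_line_endings(raw)

# No carriage returns should remain after normalization.
print(b"\r" in clean)
```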
Description of proposed changes
In our effort to centralize ingest scripts, we identified an opportunity to streamline the process by replacing the join-metadata-and-clades.py script with csvtk rename and tsv-append in cases where customized score calculations are not required.
The specific columns from a Nextclade TSV file to be renamed (if desired) and appended to the final metadata.tsv file are defined in a source-data/nextclade-field-map.tsv key-value file. This implementation is similar to the approach used for ncbi-source-field-map elsewhere.
Related issue(s)
nextstrain/ingest#23
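To make the rename-and-append step from the description concrete, here is a rough Python sketch of the same logic. It is not the PR's implementation, which uses csvtk and tsv-utils in the workflow. Several details are assumptions: the field map is taken to be a headerless two-column TSV (Nextclade column, then metadata column) in which lines starting with # are comments, records without a Nextclade match are kept with empty values, and the key columns (accession, seqName) and sample rows are purely illustrative.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def load_field_map(text):
    """Read a key-value TSV: Nextclade column name -> metadata column name.

    Assumed format: no header row; blank lines and '#' comments are skipped.
    """
    mapping = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        source, target = line.split("\t")
        mapping[source] = target
    return mapping


def append_nextclade_fields(metadata_rows, nextclade_rows, field_map,
                            metadata_key, nextclade_key):
    """Rename the mapped Nextclade columns and append them to each metadata row."""
    # Index the renamed Nextclade fields by the Nextclade key column.
    renamed_by_key = {
        row[nextclade_key]: {new: row.get(old, "") for old, new in field_map.items()}
        for row in nextclade_rows
    }
    joined = []
    for row in metadata_rows:
        extra = renamed_by_key.get(row[metadata_key], {})
        merged = dict(row)
        # Assumption: unmatched records keep their row, with empty new columns.
        for new in field_map.values():
            merged[new] = extra.get(new, "")
        joined.append(merged)
    return joined


# Hypothetical inputs for illustration only.
field_map = load_field_map("# nextclade column\tmetadata column\nclade\tclade\nqc.overallStatus\tqc_status\n")
metadata = read_tsv("accession\tdate\nA1\t2022-06-01\nA2\t2022-07-15\n")
nextclade = read_tsv("seqName\tclade\tqc.overallStatus\ttotalSubstitutions\nA1\tIIb\tgood\t70\n")

joined = append_nextclade_fields(metadata, nextclade, field_map,
                                 metadata_key="accession", nextclade_key="seqName")
for row in joined:
    print(row)
```

Because the new columns are appended after the existing metadata columns, the column order of the result can differ from what the old script produced, which matches the note about column order in the review above.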
Checklist
Manual check could look like: