[wip]: Add the functionality of join metadata and clades #23
Add a minor check to join-metadata-and-clades to ensure that all of the sequences in the metadata file are included in the output.
Each pathogen can have unique columns in the Nextclade output (e.g. ncov-ingest includes SC2 specific columns). This change makes the nextclade column map customizable to support these.
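The two changes above can be sketched roughly as follows. This is a minimal illustration and not the code in `join-metadata-and-clades.py`: the function name, the `column_map` argument, the default ID columns, and the sample data are all hypothetical.

```python
import csv
import io


def join_metadata_and_clades(metadata_tsv, nextclade_tsv, column_map,
                             metadata_id="accession", nextclade_id="seqName"):
    """Left-join Nextclade columns onto metadata, renaming them via column_map."""
    metadata = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))
    nextclade = {
        row[nextclade_id]: row
        for row in csv.DictReader(io.StringIO(nextclade_tsv), delimiter="\t")
    }

    joined = []
    for record in metadata:
        clade_row = nextclade.get(record[metadata_id], {})
        merged = dict(record)
        # column_map is the customizable part: Nextclade column -> output column.
        for source_column, output_column in column_map.items():
            # Sequences missing from the Nextclade output get empty values.
            merged[output_column] = clade_row.get(source_column, "")
        joined.append(merged)

    # Minor check: every sequence in the metadata must appear in the output.
    missing = {r[metadata_id] for r in metadata} - {r[metadata_id] for r in joined}
    if missing:
        raise ValueError(f"Sequences dropped by the join: {sorted(missing)}")
    return joined


# Hypothetical sample data for illustration only.
METADATA = "accession\tcountry\nA1\tUSA\nA2\tKenya\n"
NEXTCLADE = "seqName\tclade\tqc.overallScore\nA1\tIIb\t0.5\n"
COLUMN_MAP = {"clade": "clade", "qc.overallScore": "QC_overall_score"}

rows = join_metadata_and_clades(METADATA, NEXTCLADE, COLUMN_MAP)
```

With a left join the check cannot fail as written; it is there to catch a later change that silently turns the join into an inner join.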
We could combine the two files without performing complex calculations by using a combination of csvtk rename2 and tsv-join as follows:
# Rename columns in the input.nextclade file
cat {input.nextclade} \
  | csvtk -tl rename2 \
    -F \
    -f '*' \
    -p '(.+)' \
    -r '{{kv}}' \
    -k {input.nextclade_field_map} \
  > results/nextclade_renamed.tsv
# Join the renamed nextclade file with the input.metadata file
cat {input.metadata} \
  | tsv-join -H \
    --filter-file results/nextclade_renamed.tsv \
    --key-fields seqName \
    --data-fields accession \
    --append-fields `awk '{print $2}' results/nextclade_renamed.tsv | tr '\n' ','` \
    --allow-duplicate-keys \
    --write-all -1 \
  > {output.metadata}
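For readers less familiar with csvtk, the renaming step can be pictured with a small Python sketch. It assumes the field map is a two-column, tab-delimited file with the Nextclade column name first and the new name second (suggested by the `awk '{print $2}'` call, but not confirmed in this thread). The function names and sample values are hypothetical, and keeping unmapped columns unchanged is an assumption that may not match what csvtk does.

```python
import csv
import io


def load_field_map(field_map_tsv):
    """Parse a two-column, tab-delimited key-value file into a dict."""
    reader = csv.reader(io.StringIO(field_map_tsv), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def rename_header(header, field_map):
    # Assumption: columns absent from the map keep their original name.
    return [field_map.get(column, column) for column in header]


# Hypothetical field map and Nextclade header for illustration only.
FIELD_MAP = "clade\tclade\nqc.overallScore\tQC_overall_score\n"
field_map = load_field_map(FIELD_MAP)

renamed = rename_header(
    ["seqName", "clade", "qc.overallScore", "totalSubstitutions"], field_map
)

# Comma-separated list of the new column names, built without a trailing comma.
append_fields = ",".join(field_map.values())
```

Building the list with `str.join` avoids the trailing comma that `tr '\n' ','` leaves on newline-terminated input.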
@j23414 Would you be up for replacing join-metadata-and-clades in monkeypox with your csvtk/tsv-join example? It would be nice to do a full test run there to see how the outputs compare.
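One way to compare the outputs of the two approaches after such a test run is to diff the records by ID while ignoring column order. The sketch below is only a suggestion: the function names, the `accession` key, and the sample data are hypothetical.

```python
import csv
import io


def read_records(tsv_text, key):
    """Index the rows of a TSV by the given key column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[key]: row for row in reader}


def compare_outputs(old_tsv, new_tsv, key="accession"):
    """Report records that are missing, extra, or different between two outputs."""
    old, new = read_records(old_tsv, key), read_records(new_tsv, key)
    return {
        "only_in_old": sorted(old.keys() - new.keys()),
        "only_in_new": sorted(new.keys() - old.keys()),
        # Dict comparison ignores column order, so reordered columns do not count.
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Hypothetical outputs from the script and from the csvtk/tsv-join pipeline.
OLD = "accession\tclade\nA1\tIIb\nA2\t\n"
NEW = "clade\taccession\nIIb\tA1\nI\tA2\nIIb\tA3\n"

report = compare_outputs(OLD, NEW)
```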
I found myself needing to implement something similar in the flu_frequencies workflow and appreciated Independent from the tools you end up using here, you can drop the
becomes this:
As part of centralizing ingest scripts, replace the join-metadata-and-clades.py script with csvtk and tsv append when there aren't any customized calculations. nextstrain/ingest#23
As part of centralizing ingest scripts, replace the join-metadata-and-clades.py script with csvtk and tsv append when there aren't any customized calculations. nextstrain/ingest#23 Relatedly, this commit also adds a nextclade config section where mapping fields from the nextclade output to be appended to the metadata can be specified. Co-authored-by: Jover Lee <joverlee521@gmail.com>
Closed since this script has been replaced with csvtk and tsv-utils.
Description of proposed changes
After some discussion with @joverlee521, moved join-metadata-and-clades.py from PR: #20 to this draft.
Some of the functionality may be replaced by csvtk, but there are customized calculations in certain pathogen repositories.
This is a placeholder: the functionality of join-metadata-and-clades requires more discussion and thought.
Related issue(s)
Subset of scripts listed in #1
Checklist