Update Nextclade metadata merge to use augur curate rename and augur merge #52
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,8 @@ and sequences. | |
See Nextclade docs for more details on usage, inputs, and outputs if you would | ||
like to customize the rules | ||
""" | ||
import sys | ||
|
||
DATASET_NAME = config["nextclade"]["dataset_name"] | ||
|
||
|
||
|
@@ -46,35 +48,53 @@ rule run_nextclade: | |
""" | ||
|
||
|
||
rule join_metadata_and_nextclade: | ||
if isinstance(config["nextclade"]["field_map"], str): | ||
print(f"Converting config['nextclade']['field_map'] from TSV file ({config['nextclade']['field_map']}) to dictionary; " | ||
f"consider putting the field map directly in the config file.", file=sys.stderr) | ||
|
||
with open(config["nextclade"]["field_map"], "r") as f: | ||
config["nextclade"]["field_map"] = dict(line.rstrip("\n").split("\t", 1) for line in f if not line.startswith("#")) | ||
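For anyone who wants to try the TSV-to-dictionary conversion above outside of Snakemake, here is a minimal standalone sketch. The function name `read_field_map` and the renamed column `QC_overall_score` are illustrative assumptions, not part of this PR. Unlike the one-liner in the diff, the sketch also skips blank lines.

```python
import io


def read_field_map(handle):
    """Parse a two-column TSV (old name, new name) into a dict."""
    field_map = {}
    for line in handle:
        line = line.rstrip("\n")
        # Comment lines are skipped as in the diff; skipping blank lines is
        # an extra assumption of this sketch, not something the diff does.
        if not line or line.startswith("#"):
            continue
        old, new = line.split("\t", 1)
        field_map[old] = new
    return field_map


# Hypothetical field map file contents, for illustration only.
example = io.StringIO(
    "# nextclade column\tmetadata column\n"
    "seqName\tseqName\n"
    "clade\tclade\n"
    "\n"
    "qc.overallScore\tQC_overall_score\n"
)
field_map = read_field_map(example)
print(field_map)
```

The resulting dictionary has the same shape the workflow expects when the field map is written directly in the config file.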
|
||
|
||
rule nextclade_metadata: | ||
input: | ||
nextclade="results/nextclade.tsv", | ||
output: | ||
nextclade_metadata=temp("results/nextclade_metadata.tsv"), | ||
Review comment: why
Reply: It's a subset of the data that's only needed transiently and unlikely to be useful on its own. It may also be large for some pathogens, so don't want it to stick around unnecessarily.
||
params: | ||
Review comment: Missing
Reply: These were missing in the rule prior to my changes. None of the rules in this file have those stanzas. I'm going to punt on changing that here and now.
||
nextclade_id_field=config["nextclade"]["id_field"], | ||
nextclade_field_map=[f"{old}={new}" for old, new in config["nextclade"]["field_map"].items()], | ||
nextclade_fields=",".join(config["nextclade"]["field_map"].values()), | ||
shell: | ||
r""" | ||
augur curate rename \ | ||
--metadata {input.nextclade:q} \ | ||
--id-column {params.nextclade_id_field:q} \ | ||
--field-map {params.nextclade_field_map:q} \ | ||
--output-metadata - \ | ||
| tsv-select --header --fields {params.nextclade_fields:q} \ | ||
Review comment (tangent): I was thinking of revisiting nextstrain/augur#1526 to remove the need to use
Reply: Oh, hmm. Depending on the type of TSV that Nextclade outputs, we could flip it to
Reply (following up): Nextclade outputs CSV-like TSV, so we will still need the
Reply: I'm going to leave the precise TSV flavors and correct handling of them as future work for nextstrain/augur#1566. Lots of tsv-utils and csvtk usages will have to change if we want them to be correct (and I think we do).
||
> {output.nextclade_metadata:q} | ||
""" | ||
|
||
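The `params` of `nextclade_metadata` derive two things from the field map: a list of `old=new` pairs for `--field-map` and a comma-joined list of the new names for `tsv-select --fields`. A small sketch of that derivation follows. The field map contents are hypothetical, and `shlex.quote` is used only to approximate what Snakemake's `:q` formatting does to a list; treat that part as an assumption rather than Snakemake's exact behavior.

```python
import shlex

# Hypothetical stand-in for config["nextclade"]["field_map"].
field_map = {
    "seqName": "seqName",
    "clade": "clade",
    "qc.overallScore": "QC_overall_score",
}

# Same expressions as the rule's params.
rename_args = [f"{old}={new}" for old, new in field_map.items()]
select_fields = ",".join(field_map.values())

# Rough approximation of {params.nextclade_field_map:q}: quote each element
# for the shell and join with spaces (an assumption of this sketch).
quoted = " ".join(shlex.quote(arg) for arg in rename_args)

print(rename_args)
print(select_fields)
print(quoted)
```

Because Python dictionaries preserve insertion order, the selected columns come out in the order the field map lists them.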
|
||
rule join_metadata_and_nextclade: | ||
Review comment: Also missing
||
input: | ||
metadata="data/subset_metadata.tsv", | ||
nextclade_field_map=config["nextclade"]["field_map"], | ||
nextclade_metadata="results/nextclade_metadata.tsv", | ||
output: | ||
metadata="results/metadata.tsv", | ||
params: | ||
metadata_id_field=config["curate"]["output_id_field"], | ||
nextclade_id_field=config["nextclade"]["id_field"], | ||
shell: | ||
r""" | ||
augur merge \ | ||
--metadata \ | ||
metadata={input.metadata:q} \ | ||
nextclade={input.nextclade_metadata:q} \ | ||
--metadata-id-columns \ | ||
metadata={params.metadata_id_field:q} \ | ||
nextclade={params.nextclade_id_field:q} \ | ||
--output-metadata {output.metadata:q} \ | ||
--no-source-columns | ||
""" | ||
export SUBSET_FIELDS=`grep -v '^#' {input.nextclade_field_map} | awk '{{print $1}}' | tr '\n' ',' | sed 's/,$//g'` | ||
|
||
csvtk -tl cut -f $SUBSET_FIELDS \ | ||
{input.nextclade} \ | ||
| csvtk -tl rename2 \ | ||
-F \ | ||
-f '*' \ | ||
-p '(.+)' \ | ||
-r '{{kv}}' \ | ||
-k {input.nextclade_field_map} \ | ||
| tsv-join -H \ | ||
--filter-file - \ | ||
--key-fields {params.nextclade_id_field} \ | ||
--data-fields {params.metadata_id_field} \ | ||
--append-fields '*' \ | ||
--write-all ? \ | ||
{input.metadata} \ | ||
| tsv-select -H --exclude {params.nextclade_id_field} \ | ||
> {output.metadata} | ||
""" |
Yup, this is much cleaner to keep all the field mappings in the central config.yaml 🙌
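As a rough illustration of what the `join_metadata_and_nextclade` rule asks for, here is a sketch of joining two tables whose id columns have different names. This is not how `augur merge` is implemented, and its handling of unmatched rows and overlapping columns may differ; the helper names, column names, and sample rows are all made up for the example.

```python
import csv
import io


def read_tsv(text):
    """Read TSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_on_id(metadata_rows, nextclade_rows, metadata_id, nextclade_id):
    """Join rows on differently named id columns, keeping the first id name."""
    merged = {}
    for row in metadata_rows:
        merged[row[metadata_id]] = dict(row)
    for row in nextclade_rows:
        key = row[nextclade_id]
        # Rows present only in the second table are kept too (assumption).
        target = merged.setdefault(key, {metadata_id: key})
        for column, value in row.items():
            if column != nextclade_id:
                target[column] = value
    return list(merged.values())


# Hypothetical sample data.
metadata = read_tsv("accession\tcountry\nA1\tKenya\nA2\tPeru\n")
nextclade = read_tsv("seqName\tclade\nA1\t1a\n")

merged = merge_on_id(metadata, nextclade, "accession", "seqName")
for row in merged:
    print(row)
```

In this sketch the second table's id column is dropped from the output, similar in spirit to the old pipeline's final `tsv-select --exclude` step.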