Replace join metadata and clades script with csvtk and tsv append (#207)
j23414 authored Oct 12, 2023
2 parents 3e83617 + 4122deb commit afb5513
Showing 5 changed files with 47 additions and 84 deletions.
77 changes: 0 additions & 77 deletions ingest/bin/join-metadata-and-clades.py

This file was deleted.

3 changes: 2 additions & 1 deletion ingest/bin/ndjson-to-tsv-and-fasta
@@ -37,7 +37,8 @@ if __name__ == '__main__':
args.metadata_columns,
restval="",
extrasaction='ignore',
-delimiter='\t'
+delimiter='\t',
+lineterminator='\n',
)
metadata_csv.writeheader()
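
Note on the `lineterminator` change above: Python's `csv` writers default to the `excel` dialect, which ends every row with `\r\n`. The commit does not state its motivation, but a plausible reading is that the new csvtk and tsv-utils steps expect plain `\n` line endings. The sketch below only illustrates the difference; the helper name `write_tsv` and the sample rows are hypothetical and not part of the repository.

```python
import csv
import io


def write_tsv(rows, fieldnames, lineterminator):
    """Write rows as TSV into a string using the given line terminator."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames,
        restval="",
        extrasaction="ignore",  # silently drop keys that are not in fieldnames
        delimiter="\t",
        lineterminator=lineterminator,
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample record; "unused" is dropped by extrasaction="ignore".
rows = [{"accession": "A1", "country": "USA", "unused": "x"}]
fields = ["accession", "country"]

default_output = write_tsv(rows, fields, "\r\n")  # what the csv module does by default
unix_output = write_tsv(rows, fields, "\n")       # what the commit switches to

print(repr(default_output))
print(repr(unix_output))
```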

7 changes: 7 additions & 0 deletions ingest/config/config.yaml
@@ -66,3 +66,10 @@ transform:
'authors',
'institution'
]
+
+# Params for Nextclade related rules
+nextclade:
+# Field to use as the sequence ID in the Nextclade file
+id_field: 'seqName'
+# Fields from a Nextclade file to be renamed (if desired) and appended to a metadata file
+field_map: 'source-data/nextclade-field-map.tsv'
16 changes: 16 additions & 0 deletions ingest/source-data/nextclade-field-map.tsv
@@ -0,0 +1,16 @@
key value
seqName seqName
clade clade
outbreak outbreak
lineage lineage
coverage coverage
totalMissing missing_data
totalSubstitutions divergence
totalNonACGTNs nonACGTN
qc.missingData.status QC_missing_data
qc.mixedSites.status QC_mixed_sites
qc.privateMutations.status QC_rare_mutations
qc.frameShifts.status QC_frame_shifts
qc.stopCodons.status QC_stop_codons
frameShifts frame_shifts
isReverseComplement is_reverse_complement
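
Note on the field map above: the `join_metadata_clades` rule uses this file twice, once to build the comma-separated list of Nextclade columns to keep (the `awk ... | tr ... | sed ...` line) and once as the key-value file for `csvtk rename2`. A minimal Python sketch of the same two lookups follows, assuming the file is tab-separated with a `key`/`value` header as shown. The function name `load_field_map` and the three-row excerpt are illustrative only.

```python
import csv
import io

# Hypothetical three-row excerpt of nextclade-field-map.tsv (tab-separated).
FIELD_MAP_TSV = (
    "key\tvalue\n"
    "seqName\tseqName\n"
    "totalMissing\tmissing_data\n"
    "qc.missingData.status\tQC_missing_data\n"
)


def load_field_map(handle):
    """Return {nextclade_column: metadata_column}, preserving file order."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["key"]: row["value"] for row in reader}


field_map = load_field_map(io.StringIO(FIELD_MAP_TSV))

# Equivalent of SUBSET_FIELDS: first column, header skipped, joined by commas.
subset_fields = ",".join(field_map)

print(subset_fields)
print(field_map["totalMissing"])
```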
28 changes: 22 additions & 6 deletions ingest/workflow/snakemake_rules/nextclade.smk
@@ -56,15 +56,31 @@ rule join_metadata_clades:
input:
nextclade="data/nextclade.tsv",
metadata="data/metadata_raw.tsv",
+nextclade_field_map=config["nextclade"]["field_map"],
output:
-"data/metadata.tsv",
+metadata="data/metadata.tsv",
params:
id_field=config["transform"]["id_field"],
+nextclade_id_field=config["nextclade"]["id_field"],
shell:
"""
-python3 bin/join-metadata-and-clades.py \
---id-field {params.id_field} \
---metadata {input.metadata} \
---nextclade {input.nextclade} \
--o {output}
+export SUBSET_FIELDS=`awk 'NR>1 {{print $1}}' {input.nextclade_field_map} | tr '\n' ',' | sed 's/,$//g'`
+
+csvtk -tl cut -f $SUBSET_FIELDS \
+{input.nextclade} \
+| csvtk -tl rename2 \
+-F \
+-f '*' \
+-p '(.+)' \
+-r '{{kv}}' \
+-k {input.nextclade_field_map} \
+| tsv-join -H \
+--filter-file - \
+--key-fields {params.nextclade_id_field} \
+--data-fields {params.id_field} \
+--append-fields '*' \
+--write-all ? \
+{input.metadata} \
+| tsv-select -H --exclude {params.nextclade_id_field} \
+> {output.metadata}
"""
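
Note on the new shell pipeline above: it subsets the Nextclade columns, renames them through the field map, appends them to every metadata record whose ID matches, fills unmatched records with `?`, and finally drops the Nextclade ID column. The Python sketch below mirrors those steps for illustration only. It is not the deleted `join-metadata-and-clades.py`, the function name and sample rows are hypothetical, and it assumes each sequence ID appears at most once in the Nextclade file and that the ID column keeps its name after renaming (as `seqName` does in the field map).

```python
def join_metadata_and_nextclade(metadata_rows, nextclade_rows, field_map,
                                id_field, nextclade_id_field, fill="?"):
    """Append renamed Nextclade columns to every metadata record."""
    # csvtk cut + rename2: keep only mapped columns and rename them.
    renamed = [
        {new: row[old] for old, new in field_map.items()}
        for row in nextclade_rows
    ]
    # Index Nextclade rows by sequence ID (assumed unique in this sketch).
    by_id = {row[nextclade_id_field]: row for row in renamed}
    # tsv-select --exclude: the Nextclade ID column is not kept in the output.
    appended = [name for name in field_map.values() if name != nextclade_id_field]

    joined = []
    for record in metadata_rows:
        match = by_id.get(record[id_field])
        out = dict(record)
        for name in appended:
            # tsv-join --write-all: unmatched records get the fill value.
            out[name] = match[name] if match else fill
        joined.append(out)
    return joined


# Hypothetical sample data.
metadata = [
    {"accession": "A1", "country": "USA"},
    {"accession": "B2", "country": "UK"},
]
nextclade = [
    {"seqName": "A1", "clade": "IIb", "totalMissing": "12", "extra": "ignored"},
]
field_map = {"seqName": "seqName", "clade": "clade", "totalMissing": "missing_data"}

result = join_metadata_and_nextclade(
    metadata, nextclade, field_map,
    id_field="accession", nextclade_id_field="seqName",
)
for row in result:
    print(row)
```

In this sketch the record `B2` has no Nextclade row, so its appended columns carry the fill value rather than being dropped, which matches the left-join behavior the rule requests with `--write-all`.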
