ingest/nextclade: Deduplicate serotype code
The serotypes still need to be hardcoded because we only have Nextclade
datasets for these specific serotypes. Use these hardcoded serotypes
as wildcard constraints in the Nextclade rules to deduplicate the
shell commands and parallelize the final `split_metadata_by_serotype`
rule.
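
A minimal sketch of why the joined serotype list works as a wildcard constraint. It only approximates how Snakemake matches a constrained `{serotype}` wildcard with a plain `re` pattern; the helper `match_serotype` is hypothetical and not part of the workflow. Without the constraint, a pattern like `results/metadata_{serotype}.tsv` would presumably also match `results/metadata_all.tsv`, which is the input of `split_metadata_by_serotype`.

```python
import re

SUPPORTED_NEXTCLADE_SEROTYPES = ['denv1', 'denv2', 'denv3', 'denv4']
# Regex alternation: 'denv1|denv2|denv3|denv4'
SEROTYPE_CONSTRAINTS = '|'.join(SUPPORTED_NEXTCLADE_SEROTYPES)

# Approximation (an assumption, not Snakemake's internal code) of the output
# pattern "results/metadata_{serotype}.tsv" with the wildcard restricted
# to the alternation above.
PATTERN = re.compile(
    r"results/metadata_(?P<serotype>" + SEROTYPE_CONSTRAINTS + r")\.tsv"
)


def match_serotype(path):
    """Return the serotype wildcard value, or None if the path is not covered."""
    m = PATTERN.fullmatch(path)
    return m.group("serotype") if m else None


print(match_serotype("results/metadata_denv2.tsv"))  # denv2
print(match_serotype("results/metadata_all.tsv"))    # None
```

Because each serotype now resolves to its own job, Snakemake can schedule the four `split_metadata_by_serotype` jobs independently, which is the parallelization the message refers to.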
joverlee521 committed Feb 17, 2024
1 parent 6dbc1af commit 831aa0d
Showing 1 changed file with 14 additions and 25 deletions.
39 changes: 14 additions & 25 deletions ingest/workflow/snakemake_rules/nextclade.smk
@@ -1,16 +1,21 @@
+SUPPORTED_NEXTCLADE_SEROTYPES = ['denv1', 'denv2', 'denv3', 'denv4']
+SEROTYPE_CONSTRAINTS = '|'.join(SUPPORTED_NEXTCLADE_SEROTYPES)
+
 rule nextclade_denvX:
     """
     For each type, classify into the appropriate subtype
     """
     input:
-        sequences="results/sequences_denv{x}.fasta",
-        dataset="../nextclade_data/denv{x}",
+        sequences="results/sequences_{serotype}.fasta",
+        dataset="../nextclade_data/{serotype}",
     output:
-        nextclade_denvX="data/nextclade_results/nextclade_denv{x}.tsv",
+        nextclade_denvX="data/nextclade_results/nextclade_{serotype}.tsv",
     threads: 4
     params:
         min_length=config["nextclade"]["min_length"],
         min_seed_cover=config["nextclade"]["min_seed_cover"],
+    wildcard_constraints:
+        serotype=SEROTYPE_CONSTRAINTS
     shell:
         """
         nextclade run \
@@ -28,10 +33,7 @@ rule concat_nextclade_subtype_results:
     Concatenate all the nextclade results for dengue subtype classification
     """
     input:
-        nextclade_denv1="data/nextclade_results/nextclade_denv1.tsv",
-        nextclade_denv2="data/nextclade_results/nextclade_denv2.tsv",
-        nextclade_denv3="data/nextclade_results/nextclade_denv3.tsv",
-        nextclade_denv4="data/nextclade_results/nextclade_denv4.tsv",
+        expand("data/nextclade_results/nextclade_{serotype}.tsv", serotype=SUPPORTED_NEXTCLADE_SEROTYPES),
     output:
         nextclade_subtypes="results/nextclade_subtypes.tsv",
     params:
@@ -43,16 +45,7 @@ rule concat_nextclade_subtype_results:
             | tr ',' '\t' \
             > {output.nextclade_subtypes}
-        tsv-select -H -f "seqName,clade" {input.nextclade_denv1} \
-            | awk 'NR>1 {{print}}' \
-            >> {output.nextclade_subtypes}
-        tsv-select -H -f "seqName,clade" {input.nextclade_denv2} \
-            | awk 'NR>1 {{print}}' \
-            >> {output.nextclade_subtypes}
-        tsv-select -H -f "seqName,clade" {input.nextclade_denv3} \
-            | awk 'NR>1 {{print}}' \
-            >> {output.nextclade_subtypes}
-        tsv-select -H -f "seqName,clade" {input.nextclade_denv4} \
+        tsv-select -H -f "seqName,clade" {input} \
             | awk 'NR>1 {{print}}' \
             >> {output.nextclade_subtypes}
         """
@@ -87,14 +80,10 @@ rule split_metadata_by_serotype:
     input:
         metadata="results/metadata_all.tsv",
     output:
-        metadata_denv1="results/metadata_denv1.tsv",
-        metadata_denv2="results/metadata_denv2.tsv",
-        metadata_denv3="results/metadata_denv3.tsv",
-        metadata_denv4="results/metadata_denv4.tsv",
+        serotype_metadata="results/metadata_{serotype}.tsv"
+    wildcard_constraints:
+        serotype=SEROTYPE_CONSTRAINTS
     shell:
         """
-        tsv-filter -H --str-eq ncbi_serotype:denv1 {input.metadata} > {output.metadata_denv1}
-        tsv-filter -H --str-eq ncbi_serotype:denv2 {input.metadata} > {output.metadata_denv2}
-        tsv-filter -H --str-eq ncbi_serotype:denv3 {input.metadata} > {output.metadata_denv3}
-        tsv-filter -H --str-eq ncbi_serotype:denv4 {input.metadata} > {output.metadata_denv4}
+        tsv-filter -H --str-eq ncbi_serotype:{wildcards.serotype} {input.metadata} > {output.serotype_metadata}
         """
