[h5n1-cattle-outbreak] distinguish inferred vs known metadata for division #100

Merged
merged 1 commit into from
Oct 30, 2024

Conversation

jameshadfield
Member

The limited metadata available for this outbreak means we infer division for many tips, so being able to show known vs inferred metadata can be clarifying for many users.

As we add more customisations like this the snakemake pipeline becomes more and more complex. Allowing auspice-config overlays / merging would solve half of the complexity introduced here. If we find ourselves commonly duplicating metadata columns we can add a config-parameterised rule for this.
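For illustration, duplicating a metadata column could look roughly like the sketch below. This is a minimal standalone example and not the rule added in this PR: the function name `duplicate_column`, the `_metadata` suffix default, and the sample strain names are all hypothetical. The idea is that the copy keeps the as-reported value while the original column can later be filled in by inference.

```python
import csv
import io


def duplicate_column(tsv_text, source="division", suffix="_metadata"):
    """Return TSV text with `source` copied into a new `<source><suffix>` column.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    new_column = source + suffix
    fieldnames = list(reader.fieldnames) + [new_column]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Copy the as-reported value; empty stays empty, i.e. "not known".
        row[new_column] = row[source]
        writer.writerow(row)
    return out.getvalue()


# Made-up sample: the second strain has no division recorded.
metadata = "strain\tdivision\nA/cattle/1\tTexas\nA/cattle/2\t\n"
result = duplicate_column(metadata)
```

A config-parameterised version would presumably loop over a list of column names taken from the build config rather than hard-coding `division`.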

Note that the colours are generated within Auspice and so differ between the two representations of division (inferred vs metadata). They also differ between genome & segment builds. A nice improvement would be making these consistent over all h5n1-cattle-outbreak datasets.

Context: https://bedfordlab.slack.com/archives/CD84ELG0N/p1730227555767939

Contributor

joverlee521 left a comment

I only left non-blocking comments. Totally get what you mean by the snakemake pipeline getting more complex. I would move all of the h5n1-cattle-outbreak specific complexity into the cattle-flu.smk file, but will have to think this through...

Comment on lines +112 to +115
NOTE: long-term we should be consulting `traits_params()` to work out the columns to duplicate, but
that function's not visible to this .smk file so would require deeper refactoring.
Contributor

non-blocking

This .smk file would have access to traits_params() if we move the include block to the bottom of the main Snakefile.

diff --git a/Snakefile b/Snakefile
index e959762..e822899 100755
--- a/Snakefile
+++ b/Snakefile
@@ -8,9 +8,6 @@ wildcard_constraints:
 SEGMENTS = ["pb2", "pb1", "pa", "ha","np", "na", "mp", "ns"]
 
 
-for rule_file in config.get('custom_rules', []):
-    include: rule_file
-
 # The config option `same_strains_per_segment=True'` (e.g. supplied to snakemake via --config command line argument)
 # will change the behaviour of the workflow to use the same strains for each segment. This is achieved via these steps:
 # (1) Filter the HA segment as normal plus filter to those strains with 8 segments
@@ -633,3 +630,6 @@ rule clean:
         "auspice"
     shell:
         "rm -rfv {params}"
+
+for rule_file in config.get('custom_rules', []):
+    include: rule_file

Member Author

Nice! I was thinking we'd have to pull them out into a common file. Would this have any side-effects? E.g. if a main-snakefile rule targeted a filename from the custom rules file, is this ok? What about if it referenced rules.x.output.y instead of the filename?

Contributor

Using the include directive at the end would be equivalent to pasting the full contents of cattle-flu.smk at the end of the file. Access to target filenames and rules.x.output.y should still work. The only limitation is that variables and functions defined in cattle-flu.smk would not be accessible in the main Snakefile.
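
The ordering effect can be shown with a toy model in plain Python. This is only a sketch of top-to-bottom evaluation in a shared namespace, not Snakemake itself: `run_workflow`, the source strings, and the return value of `traits_params()` are made up for the example.

```python
# Toy model: each "file" is source text executed top-to-bottom in one shared
# namespace, which is roughly how an included file sees the main Snakefile.
MAIN_SNAKEFILE = "def traits_params():\n    return ['division']\n"
CUSTOM_RULES = "columns_to_duplicate = traits_params()\n"


def run_workflow(parts):
    """Execute the source strings in order, sharing one namespace."""
    namespace = {}
    for source in parts:
        exec(source, namespace)
    return namespace


# Include at the bottom: the function already exists when the custom file runs.
ns = run_workflow([MAIN_SNAKEFILE, CUSTOM_RULES])

# Include at the top: the custom file runs before the function is defined.
try:
    run_workflow([CUSTOM_RULES, MAIN_SNAKEFILE])
    include_first_ok = True
except NameError:
    include_first_ok = False
```

Here the bottom-include order succeeds and the top-include order raises `NameError`, which is the situation the proposed diff avoids.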

Member Author

(P.S. once we do this, I have some WIP commits that shift more of the cattle-flu stuff into cattle-flu.smk, but which can't currently do so because they need access to functions defined in the main snakefile)

Member Author

(Leaving this for future work)

Snakefile Outdated
Comment on lines 537 to 554
 import json
 with open(input.auspice_config) as fh:
-config = json.load(fh)
+auspice_config = json.load(fh)
 if wildcards.subtype == "h5n1-cattle-outbreak":
 if wildcards.segment == "genome":
-config['display_defaults']['distance_measure'] = "num_date"
+auspice_config['display_defaults']['distance_measure'] = "num_date"
+division_idx = next((i for i,c in enumerate(auspice_config['colorings']) if c['key']=='division'), None)
+assert division_idx!=None, "Auspice config did not have a division coloring!"
+auspice_config['colorings'].insert(division_idx+1, {
+"key": "division_metadata",
+"title": auspice_config['colorings'][division_idx]["title"] + " (metadata)",
+"type": "categorical",
+})
+auspice_config['colorings'][division_idx]["title"] += " (inferred)"
 else:
-config['display_defaults']['distance_measure'] = "div"
+auspice_config['display_defaults']['distance_measure'] = "div"
 with open(output.auspice_config, 'w') as fh:
-json.dump(config, fh, indent=2)
+json.dump(auspice_config, fh, indent=2)
Contributor

non-blocking

Just noting that this is starting to feel like it should be an external script that can be run/debugged outside of the workflow.
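
As a rough idea of what the external version might look like, here is a minimal sketch of the coloring logic as a pure function that can be tested without running the workflow. The function name `add_metadata_coloring` and the example config are hypothetical; the sketch also swaps the `assert` for a `ValueError` and returns a copy instead of mutating its input, which are choices made for the example rather than anything in this PR.

```python
import copy


def add_metadata_coloring(auspice_config, key="division"):
    """Return a copy of the config with a `<key>_metadata` coloring added.

    Hypothetical helper mirroring the inline logic under review.
    """
    config = copy.deepcopy(auspice_config)
    colorings = config["colorings"]
    idx = next((i for i, c in enumerate(colorings) if c["key"] == key), None)
    if idx is None:
        raise ValueError(f"Auspice config did not have a {key} coloring!")

    base_title = colorings[idx]["title"]
    # Insert the as-reported coloring directly after the inferred one.
    colorings.insert(idx + 1, {
        "key": f"{key}_metadata",
        "title": f"{base_title} (metadata)",
        "type": "categorical",
    })
    colorings[idx]["title"] = f"{base_title} (inferred)"
    return config


# Made-up config fragment for illustration only.
example = {
    "colorings": [
        {"key": "region", "title": "Region", "type": "categorical"},
        {"key": "division", "title": "Admin Division", "type": "categorical"},
    ]
}
patched = add_metadata_coloring(example)
```

A thin command-line wrapper that reads and writes the JSON files could then be called from the rule's `shell:` block.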

Member

trvrb commented Oct 30, 2024

This looks good from a final-output perspective. It would be nice to have consistent colors. We also have (had?) this with https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome?c=data_source where, as data sizes fluctuated, the blue vs yellow coloring would sometimes flip. With consistent colors we could also have a more sensible geographic coloring where nearby states get similar colors.

@jameshadfield jameshadfield merged commit 443e785 into master Oct 30, 2024
6 checks passed
@jameshadfield jameshadfield deleted the james/division-metadata branch October 30, 2024 22:11
@jameshadfield
Member Author

Wrote up the consistent-colours as a separate issue

jameshadfield added a commit that referenced this pull request Oct 31, 2024
Make variables and functions defined in the snakefile available to
custom rule files such as cattle-flu.smk. See
<#100 (comment)>
for more context.