[h5n1-cattle-outbreak] distinguish inferred vs known metadata for division #100

jameshadfield · 2024-10-30T00:47:38Z

The limited metadata available for this outbreak means we infer division for many tips, so being able to show known vs inferred metadata can be clarifying for many users.

As we add more customisations like this the snakemake pipeline becomes more and more complex. Allowing auspice-config overlays / merging would solve half of the complexity introduced here. If we find ourselves commonly duplicating metadata columns we can add a config-parameterised rule for this.
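For illustration, an overlay merge of this kind could look roughly like the sketch below. It is not an existing Augur or Auspice feature: `merge_overlay`, the example `title` and the overlay contents are hypothetical, and the behaviour for lists (replace rather than merge) is an assumption.

```python
import copy


def merge_overlay(base, overlay):
    """Return a new dict with `overlay` recursively merged onto `base`.

    Dicts are merged key by key; any other overlay value (including lists)
    replaces the base value outright. Neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base config and a genome-specific overlay.
base_config = {
    "title": "Example build",
    "display_defaults": {"distance_measure": "div", "color_by": "division"},
}
genome_overlay = {"display_defaults": {"distance_measure": "num_date"}}

merged_config = merge_overlay(base_config, genome_overlay)
```

A plain merge like this would cover the `distance_measure` switch, but not inserting a new coloring at a specific position in the `colorings` list, which may be why overlays only address part of the complexity.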

Note that the colours are generated within Auspice and so differ between the two representations of division (inferred vs metadata). They also differ between genome & segment builds. A nice improvement would be making these consistent over all h5n1-cattle-outbreak datasets.

Context: https://bedfordlab.slack.com/archives/CD84ELG0N/p1730227555767939

joverlee521

I only left non-blocking comments. Totally get what you mean by the snakemake pipeline getting more complex. I would move all of the h5n1-cattle-outbreak specific complexity into the cattle-flu.smk file, but will have to think this through...

joverlee521 · 2024-10-30T16:57:56Z

rules/cattle-flu.smk

+    NOTE: long-term we should be consulting `traits_params()` to work out the columns to duplicate, but
+    that function's not visible to this .smk file so would require deeper refactoring.


non-blocking

This .smk file would have access to traits_params() if we move the include block to the bottom of the main Snakefile.

diff --git a/Snakefile b/Snakefile
index e959762..e822899 100755
--- a/Snakefile
+++ b/Snakefile
@@ -8,9 +8,6 @@ wildcard_constraints:

 SEGMENTS = ["pb2", "pb1", "pa", "ha","np", "na", "mp", "ns"]

-for rule_file in config.get('custom_rules', []):
-    include: rule_file
-
 # The config option `same_strains_per_segment=True'` (e.g. supplied to snakemake via --config command line argument)
 # will change the behaviour of the workflow to use the same strains for each segment. This is achieved via these steps:
 # (1) Filter the HA segment as normal plus filter to those strains with 8 segments
@@ -633,3 +630,6 @@ rule clean:
         "auspice"
     shell:
         "rm -rfv {params}"
+
+for rule_file in config.get('custom_rules', []):
+    include: rule_file

Nice! I was thinking we'd have to pull them out into a common file. Would this have any side-effects? E.g. if a main-snakefile rule targeted a filename from the custom rules file is this ok? What about if it referenced rule.x.output.y instead of the filename?

Using the include directive at the end would be equivalent to pasting all of the contents of cattle-flu.smk at the end of the file. Access to target filenames and rules.x.output.y should still work. The only limitation is that variables and functions defined in cattle-flu.smk would not be accessible in the main Snakefile.
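The ordering effect can be sketched with a plain-Python analogy. This is not how Snakemake is implemented; it only assumes that an included file runs in the same namespace as the main Snakefile at the point where the include directive appears. The body given to `traits_params()` here is a hypothetical stand-in.

```python
# Plain-Python analogy only: each string stands in for a file's top-level code.
main_snakefile_top = "def traits_params():\n    return ['division']\n"
custom_rules = "columns_to_duplicate = traits_params()\n"


def run_in_order(*sources):
    """Execute each source string in turn within one shared namespace."""
    namespace = {}
    for source in sources:
        exec(source, namespace)
    return namespace


# Include at the bottom: the function already exists when the custom file runs.
late_include = run_in_order(main_snakefile_top, custom_rules)

# Include at the top: the custom file runs before the function is defined.
try:
    run_in_order(custom_rules, main_snakefile_top)
    early_include_error = None
except NameError as err:
    early_include_error = str(err)
```

In the first ordering the custom file can call the function. In the second it fails with a `NameError`, which mirrors the limitation described in the code comment under review.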

(P.S. once we do this, I have some WIP commits that shift more of the cattle-flu stuff into cattle-flu.smk, but which can't currently do so because they need access to functions defined in the main snakefile)

(Leaving this for future work)

joverlee521 · 2024-10-30T17:17:47Z

Snakefile

        import json
        with open(input.auspice_config) as fh:
-            config = json.load(fh)
+            auspice_config = json.load(fh)
        if wildcards.subtype == "h5n1-cattle-outbreak":
            if wildcards.segment == "genome":
-                config['display_defaults']['distance_measure'] = "num_date"
+                auspice_config['display_defaults']['distance_measure'] = "num_date"
+                division_idx = next((i for i,c in enumerate(auspice_config['colorings']) if c['key']=='division'), None)
+                assert division_idx!=None, "Auspice config did not have a division coloring!"
+                auspice_config['colorings'].insert(division_idx+1, {
+                          "key": "division_metadata",
+                          "title": auspice_config['colorings'][division_idx]["title"] + " (metadata)",
+                          "type": "categorical",
+                })
+                auspice_config['colorings'][division_idx]["title"] += " (inferred)"
            else:
-                config['display_defaults']['distance_measure'] = "div"
+                auspice_config['display_defaults']['distance_measure'] = "div"
        with open(output.auspice_config, 'w') as fh:
-            json.dump(config, fh, indent=2)
+            json.dump(auspice_config, fh, indent=2)


non-blocking

Just noting that this is starting to feel like it should be an external script that can be run/debugged outside of the workflow.
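As a rough sketch of what such an external script might contain, the logic from the diff above can be lifted into plain functions that are testable without Snakemake. The function names (`add_division_metadata_coloring`, `customise`, `customise_file`) are hypothetical, and raising `ValueError` instead of using `assert` is a choice made here, not something from the PR.

```python
import json


def add_division_metadata_coloring(auspice_config):
    """Insert a 'division_metadata' coloring right after 'division' (in place)."""
    colorings = auspice_config["colorings"]
    division_idx = next(
        (i for i, c in enumerate(colorings) if c["key"] == "division"), None
    )
    if division_idx is None:
        raise ValueError("Auspice config did not have a division coloring!")
    division = colorings[division_idx]
    colorings.insert(division_idx + 1, {
        "key": "division_metadata",
        "title": division["title"] + " (metadata)",
        "type": "categorical",
    })
    # Retitle the original coloring only after its title has been reused above.
    division["title"] += " (inferred)"
    return auspice_config


def customise(auspice_config, subtype, segment):
    """Apply the subtype/segment-specific tweaks from the rule's run block."""
    if subtype == "h5n1-cattle-outbreak":
        if segment == "genome":
            auspice_config["display_defaults"]["distance_measure"] = "num_date"
            add_division_metadata_coloring(auspice_config)
        else:
            auspice_config["display_defaults"]["distance_measure"] = "div"
    return auspice_config


def customise_file(in_path, out_path, subtype, segment):
    """Read an Auspice config JSON, customise it, and write the result."""
    with open(in_path) as fh:
        auspice_config = json.load(fh)
    customise(auspice_config, subtype, segment)
    with open(out_path, "w") as fh:
        json.dump(auspice_config, fh, indent=2)
```

The rule could then call `customise_file` with its input, output and wildcard values, while the pure functions stay debuggable on their own.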

trvrb · 2024-10-30T21:04:43Z

This looks good from a final-output perspective. It would be nice to have consistent colors. We also have (had?) this issue with https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome?c=data_source where, as data sizes fluctuated, the blue vs yellow coloring would sometimes flip. With consistent colors we could also have a more sensible geographic coloring where nearby states get similar colors.
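One possible approach, sketched below, is to derive colours from a fixed, shared list of divisions rather than from whatever values happen to be in each dataset. Everything here is an assumption for illustration: the function names are hypothetical, the ordering is alphabetical (a geographic ordering could be substituted so that nearby states get similar hues), and the tab-separated `key, value, colour` rows are assumed to match the colours file format the workflow would consume.

```python
import colorsys


def assign_colours(values):
    """Map each distinct value to a hex colour, evenly spaced in hue.

    The mapping depends only on the sorted set of values, so the same list of
    divisions always yields the same colours regardless of input order.
    """
    ordered = sorted(set(values))
    colours = {}
    for i, value in enumerate(ordered):
        r, g, b = colorsys.hls_to_rgb(i / len(ordered), 0.5, 0.65)
        colours[value] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colours


def colour_rows(values, keys=("division", "division_metadata")):
    """Build tab-separated rows that reuse one colour map for every key."""
    colours = assign_colours(values)
    return [
        f"{key}\t{value}\t{colour}"
        for key in keys
        for value, colour in colours.items()
    ]
```

The colours only stay consistent across genome and segment builds if every build is given the same full list of divisions. Computing the list per dataset would reintroduce the flipping described above.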

jameshadfield · 2024-10-31T00:20:00Z

Wrote up the consistent-colours as a separate issue

jameshadfield requested a review from joverlee521 October 30, 2024 00:47

genehack approved these changes Oct 30, 2024

jameshadfield force-pushed the james/division-metadata branch from 6c9efad to 84b71b7 Compare October 30, 2024 19:09

joverlee521 approved these changes Oct 30, 2024

jameshadfield force-pushed the james/division-metadata branch from 84b71b7 to 6583482 Compare October 30, 2024 22:02

jameshadfield merged commit 443e785 into master Oct 30, 2024
6 checks passed

jameshadfield deleted the james/division-metadata branch October 30, 2024 22:11

jameshadfield added a commit that referenced this pull request Oct 31, 2024

rearrange snakefiles

6675e56

Make variables and functions defined in the snakefile available to custom rule files such as cattle-flu.smk. See <#100 (comment)> for more context.

jameshadfield mentioned this pull request Oct 31, 2024

[cattle-flu] consistent colours #102

Merged
