-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[h5n1-cattle-outbreak] distinguish inferred vs known metadata for division #100
Conversation
6c9efad
to
84b71b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only left non-blocking comments. Totally get what you mean by the snakemake pipeline getting more complex. I would move all of the h5n1-cattle-outbreak specific complexity into the cattle-flu.smk file, but will have to think this through...
NOTE: long-term we should be consulting `traits_params()` to work out the columns to duplicate, but | ||
that function's not visible to this .smk file so would require deeper refactoring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
This .smk file would have access to traits_params()
if we move the include block to the bottom of the main Snakefile.
diff --git a/Snakefile b/Snakefile
index e959762..e822899 100755
--- a/Snakefile
+++ b/Snakefile
@@ -8,9 +8,6 @@ wildcard_constraints:
SEGMENTS = ["pb2", "pb1", "pa", "ha","np", "na", "mp", "ns"]
-for rule_file in config.get('custom_rules', []):
- include: rule_file
-
# The config option `same_strains_per_segment=True'` (e.g. supplied to snakemake via --config command line argument)
# will change the behaviour of the workflow to use the same strains for each segment. This is achieved via these steps:
# (1) Filter the HA segment as normal plus filter to those strains with 8 segments
@@ -633,3 +630,6 @@ rule clean:
"auspice"
shell:
"rm -rfv {params}"
+
+for rule_file in config.get('custom_rules', []):
+ include: rule_file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I was thinking we'd have to pull them out into a common file. Would this have any side-effects? E.g. if a main-snakefile rule targeted a filename from the custom rules file is this ok? What about if it referenced rule.x.output.y
instead of the filename?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the include
directive at the end would be equivalent to just including all of contents of cattle-flu.smk
at the end of the file. Access to target filenames and rules.x.output.y
should still work. Just any variables and/or functions defined in cattle-flu.smk
would not be accessible in the main Snakefile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(P.S. once we do this, I have some WIP commits that shift more of the cattle-flu stuff into cattle-flu.smk
, but which can't currently do so because they need access to functions defined in the main snakefile)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Leaving this for future work)
Snakefile
Outdated
import json | ||
with open(input.auspice_config) as fh: | ||
config = json.load(fh) | ||
auspice_config = json.load(fh) | ||
if wildcards.subtype == "h5n1-cattle-outbreak": | ||
if wildcards.segment == "genome": | ||
config['display_defaults']['distance_measure'] = "num_date" | ||
auspice_config['display_defaults']['distance_measure'] = "num_date" | ||
division_idx = next((i for i,c in enumerate(auspice_config['colorings']) if c['key']=='division'), None) | ||
assert division_idx!=None, "Auspice config did not have a division coloring!" | ||
auspice_config['colorings'].insert(division_idx+1, { | ||
"key": "division_metadata", | ||
"title": auspice_config['colorings'][division_idx]["title"] + " (metadata)", | ||
"type": "categorical", | ||
}) | ||
auspice_config['colorings'][division_idx]["title"] += " (inferred)" | ||
else: | ||
config['display_defaults']['distance_measure'] = "div" | ||
auspice_config['display_defaults']['distance_measure'] = "div" | ||
with open(output.auspice_config, 'w') as fh: | ||
json.dump(config, fh, indent=2) | ||
json.dump(auspice_config, fh, indent=2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
Just noting that this is starting to feel like it should be an external script that can be run/debugged outside of the workflow.
This looks good from final output perspective. It would be a nice thing to have consistent colors. We also have (had?) this with https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome?c=data_source where as data sizes would fluctuate the blue vs yellow coloring would sometimes flip. With consistent colors we could also have a more sensible geographic coloring where nearby states get similar colors. |
The limited metadata available for this outbreak means we infer division for many tips, so being able to show known vs inferred metadata can be clarifying for many users. As we add more customisations like this the snakemake pipeline becomes more and more complex. Allowing auspice-config overlays / merging would solve half of the complexity introduced here. If we find ourselves commonly duplicating metadata columns we can add a config-parameterised rule for this. Note that the colours are generated within Auspice and so differ between the two representations of division (inferred vs metadata). The also differ between genome & segment builds. A nice improvement would be making these consistent over all h5n1-cattle-outbreak datasets. Context: <https://bedfordlab.slack.com/archives/CD84ELG0N/p1730227555767939>
84b71b7
to
6583482
Compare
Wrote up the consistent-colours as a separate issue |
Make variables and functions defined in the snakefile available to custom rule files such as cattle-flu.smk. See <#100 (comment)> for more context.
The limited metadata available for this outbreak means we infer division for many tips, so being able to show known vs inferred metadata can be clarifying for many users.
As we add more customisations like this the snakemake pipeline becomes more and more complex. Allowing auspice-config overlays / merging would solve half of the complexity introduced here. If we find ourselves commonly duplicating metadata columns we can add a config-parameterised rule for this.
Note that the colours are generated within Auspice and so differ between the two representations of division (inferred vs metadata). The also differ between genome & segment builds. A nice improvement would be making these consistent over all h5n1-cattle-outbreak datasets.
Context: https://bedfordlab.slack.com/archives/CD84ELG0N/p1730227555767939