merge: Support changing the names of, or omitting entirely, the generated source columns #1625
The default behaviour here is still unchanged: source columns are generated as |
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1625      +/-   ##
==========================================
+ Coverage   71.03%   71.06%   +0.03%
==========================================
  Files          79       79
  Lines        8258     8268      +10
  Branches     2005     2010       +5
==========================================
+ Hits         5866     5876      +10
  Misses       2101     2101
  Partials      291      291
…ated source columns This lets us more easily use `augur merge` in places where it makes no sense to include the generated source columns (e.g. in the Nextclade metadata merge step of our workflows) and in places where we have existing source column names we want to match (e.g. in ncov, replacing the bespoke combine_metadata.py).
2eee345 to 3c32b99
I agree that defaulting to not adding source columns feels like the best fit with Principle of Least Surprise — to me, they feel more like a debugging/troubleshooting tool than something you'd want to see all the time. |
In past usage, source columns like these are commonly used for subsampling where you want a little of this dataset and a lot of that dataset. Sometimes you can subsample before the merge, sometimes it happens after, and in that case, you condition on the source of the row. |
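As a rough illustration of conditioning on the source after a merge, the sketch below takes already-merged rows and draws a different number of rows per source. It is not augur's implementation: the function name `subsample_by_source`, the example column names, and the 1/0 encoding of the source columns are assumptions made for this example.

```python
import random


def subsample_by_source(rows, quotas, seed=0):
    """Pick up to quotas[column] rows whose source column equals 1.

    rows:   list of dicts, one per merged metadata record
    quotas: dict mapping a source column name to the maximum rows to keep
    A row present in several sources may be picked once per source.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    picked = []
    for column, limit in quotas.items():
        # Assumption: source columns hold 1/0 (or "1"/"0" when read from TSV).
        candidates = [row for row in rows if str(row.get(column)) == "1"]
        picked.extend(rng.sample(candidates, min(limit, len(candidates))))
    return picked
```

Here "a little of this dataset and a lot of that dataset" becomes something like `subsample_by_source(rows, {"__source_metadata_a": 20, "__source_metadata_b": 10})`, with the quota keys being whatever source column names the merge produced.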
- Docs preview looks good
fsvo "good", IMO. :-) I want to improve the command usage doc rendering a lot—there's a bunch of nits that bug me that I addressed for Nextstrain CLI's docs already—but trying to leave that for another time as it's quite a bit involved. |
My vote is |
I vote default to off and require opt-in. I think most of our uses would opt-out of source columns |
I'd be almost always running That said, I agree that the least surprising behavior is to default to off and adding source columns with |
+1 for opt-in. This would encourage users to customize the template, which I think would be good for overall understanding within a workflow. Example:

    augur merge \
      --metadata a=a.tsv b=b.tsv \
      --source-columns '__source_metadata_{NAME}' \
      --output-metadata merged.tsv

    augur filter \
      --metadata merged.tsv \
      --query '__source_metadata_a' \
      --subsample-max-sequences 20

    augur filter \
      --metadata merged.tsv \
      --query '__source_metadata_b' \
      --subsample-max-sequences 10

(in contrast to |
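To make the template in the example concrete, here is a minimal sketch of how a `{NAME}` placeholder could expand into one column name per named input. The helper `expand_source_columns` is hypothetical and only illustrates the naming; how `augur merge` actually validates or expands the template is not shown in this thread.

```python
def expand_source_columns(template, names):
    """Map each input name to its source column name.

    The template must contain the literal placeholder {NAME}; without it,
    every input would map to the same column name.
    """
    if "{NAME}" not in template:
        raise ValueError("template must contain the {NAME} placeholder")
    # str.replace avoids str.format surprises if the template has other braces.
    return {name: template.replace("{NAME}", name) for name in names}
```

With the inputs named `a` and `b` as in the example, the template `'__source_metadata_{NAME}'` yields the column names `__source_metadata_a` and `__source_metadata_b` that the two `augur filter` calls query.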
Seems there's a nice consensus for defaulting off. I agree it's more self-documenting in workflows and examples to see the explicit template. I'll change the default behaviour in a new PR.
Hmm. As currently discussed, designed, and implemented, we're using multiple boolean columns, one per input source (e.g. |
Removes default naming template and requires users to explicitly provide their own template to include source columns. This makes the output from an `augur merge` invocation more self-documenting without columns "magically" appearing. In the expected context of usage within a workflow, the burden of the extra option is negligible. See also discussion on a prior PR.¹ ¹ <#1625 (comment)>
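The opt-in behaviour described in this commit message can be sketched as follows. This is a simplified model, not augur's code: the function `merge_tables`, its `source_columns` parameter, the `strain` id column, and the rule that later non-empty values win are all assumptions chosen for illustration.

```python
def merge_tables(tables, id_column="strain", source_columns=None):
    """Merge named tables by id, optionally adding one 1/0 column per source.

    tables:         dict mapping input name -> list of row dicts
    source_columns: template containing {NAME}, or None to add no columns
    """
    merged = {}
    for rows in tables.values():
        for row in rows:
            target = merged.setdefault(row[id_column], {})
            for key, value in row.items():
                # Assumption: later inputs override earlier ones,
                # but an empty value never erases an existing one.
                if value != "" or key not in target:
                    target[key] = value

    # Opt-in: without a template, no source columns appear in the output.
    if source_columns is not None:
        for name, rows in tables.items():
            column = source_columns.replace("{NAME}", name)
            ids = {row[id_column] for row in rows}
            for row_id, row in merged.items():
                row[column] = 1 if row_id in ids else 0

    return list(merged.values())
```

Calling `merge_tables(tables)` returns only the merged metadata columns, while `merge_tables(tables, source_columns="__source_metadata_{NAME}")` additionally records, per row, which inputs contained that id.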
This lets us more easily use `augur merge` in places where it makes no sense to include the generated source columns (e.g. in the Nextclade metadata merge step of our workflows) and in places where we have existing source column names we want to match (e.g. in ncov, replacing the bespoke combine_metadata.py).

Checklist