merge: Support changing the names of, or omitting entirely, the generated source columns #1625

tsibley · 2024-09-06T17:17:46Z

This lets us more easily use augur merge in places where it makes no sense to include the generated source columns (e.g. in the Nextclade metadata merge step of our workflows) and in places where we have existing source column names we want to match (e.g. in ncov, replacing the bespoke combine_metadata.py).

Checklist

Automated checks pass
Check if you need to add a changelog message
Check if you need to add tests
Check if you need to update docs

tsibley · 2024-09-06T17:19:33Z

The default behaviour here is still unchanged: source columns are generated as __source_metadata_{NAME}. With the new --source-columns option now, should we default them off instead and require opt-in? We might find ourselves opting in most of the time, but this would maybe be the least surprising behaviour.
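For readers unfamiliar with the template, here is a minimal sketch of how a `{NAME}` template could expand into one column name per named metadata input. This is only an illustration, not augur's actual implementation: the function name `source_column_names` and the requirement that the placeholder be present are assumptions made for this sketch.

```python
def source_column_names(template, names):
    """Expand a naming template once per named metadata input.

    Hypothetical helper for illustration only; not part of augur.
    """
    # Assumption of this sketch: a template without the placeholder would give
    # every input the same column name, so reject it up front.
    if "{NAME}" not in template:
        raise ValueError("template must contain the {NAME} placeholder")
    return {name: template.replace("{NAME}", name) for name in names}


columns = source_column_names("__source_metadata_{NAME}", ["a", "b"])
print(columns)
# {'a': '__source_metadata_a', 'b': '__source_metadata_b'}
```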

codecov · 2024-09-06T17:32:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.06%. Comparing base (81db604) to head (3c32b99).
Report is 3 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1625      +/-   ##
==========================================
+ Coverage   71.03%   71.06%   +0.03%     
==========================================
  Files          79       79              
  Lines        8258     8268      +10     
  Branches     2005     2010       +5     
==========================================
+ Hits         5866     5876      +10     
  Misses       2101     2101              
  Partials      291      291

genehack · 2024-09-09T17:24:15Z

> With the new --source-columns option now, should we default them off instead and require opt-in? We might find ourselves opting in most of the time, but this would maybe be the least surprising behaviour.

I agree that defaulting to not adding source columns feels like the best fit with the Principle of Least Surprise. To me, they feel more like a debugging/troubleshooting tool than something you'd want to see all the time.

tsibley · 2024-09-10T17:22:24Z

> to me, they feel more like a debugging/troubleshooting tool than something you'd want to see all the time.

In past usage, source columns like these are commonly used for subsampling where you want a little of this dataset and a lot of that dataset. Sometimes you can subsample before the merge, sometimes it happens after, and in that case, you condition on the source of the row.
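A rough sketch of what conditioning on the source of the row can look like after a merge, using plain Python rather than any augur API. The helper name `subsample_by_source`, the `strain` id column, and the "1"/"0" string encoding of the source columns are all assumptions made for this illustration.

```python
import random


def subsample_by_source(rows, quotas, seed=0):
    """Pick up to quotas[column] rows among those flagged in each source column.

    Hypothetical helper for illustration only; not part of augur.
    """
    rng = random.Random(seed)
    picked = {}
    for column, limit in quotas.items():
        # Assumed encoding: rows from this source carry "1" in its source column.
        candidates = [row for row in rows if row.get(column) == "1"]
        for row in rng.sample(candidates, min(limit, len(candidates))):
            # Keyed by strain so a row present in both sources is kept only once.
            picked[row["strain"]] = row
    return list(picked.values())


rows = (
    [{"strain": f"a{i}", "__source_metadata_a": "1", "__source_metadata_b": "0"} for i in range(50)]
    + [{"strain": f"b{i}", "__source_metadata_a": "0", "__source_metadata_b": "1"} for i in range(50)]
)

# A lot of dataset a, a little of dataset b.
subset = subsample_by_source(rows, {"__source_metadata_a": 20, "__source_metadata_b": 10})
print(len(subset))
```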

victorlin

Docs preview looks good

tsibley · 2024-09-10T17:41:39Z

> Docs preview looks good

For some value of "good", IMO. :-) I want to improve the command usage doc rendering a lot (there are a bunch of nits that bug me that I already addressed for Nextstrain CLI's docs), but I'm trying to leave that for another time as it's quite involved.

j23414 · 2024-09-10T17:59:46Z

My vote is for defaulting to not adding source columns. I'd only need them when debugging, and I'm imagining using "augur merge" every time there's new data to merge in (chained across various pull dates).

joverlee521 · 2024-09-10T18:03:04Z

> With the new --source-columns option now, should we default them off instead and require opt-in? We might find ourselves opting in most of the time, but this would maybe be the least surprising behaviour.

I vote to default to off and require opt-in. I think most of our uses would opt out of source columns.

trvrb · 2024-09-10T18:07:57Z

I'd be almost always running --source-columns for the core use case of combining data sources as the first step in preparing sequence data. This is useful for subsampling, but I'd also want this as a coloring, eg https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome?c=data_source.

That said, I agree that the least surprising behavior is to default to off and to add source columns only with the --source-columns command-line option. Also, seeing --source-columns when running augur merge --help seems like a nice bit of discoverability for a feature whose existence is non-obvious.

victorlin · 2024-09-10T18:09:56Z

+1 for opt-in. This would encourage users to customize the template, which I think would be good for overall understanding within a workflow. Example:

augur merge \
  --metadata a=a.tsv b=b.tsv \
  --source-columns '__source_metadata_{NAME}' \
  --output-metadata merged.tsv

augur filter \
  --metadata merged.tsv \
  --query '__source_metadata_a' \
  --subsample-max-sequences 20

augur filter \
  --metadata merged.tsv \
  --query '__source_metadata_b' \
  --subsample-max-sequences 10

(in contrast to __source_metadata_a "magically" appearing)

tsibley · 2024-09-10T18:18:41Z

Seems there's a nice consensus for defaulting off. I agree it's more self-documenting in workflows and examples to see the explicit template. I'll change the default behaviour in a new PR.

@trvrb

> This is useful for subsampling, but I'd also want this as a coloring, eg https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome?c=data_source.

Hmm. As currently discussed, designed, and implemented, we're using multiple boolean columns, one per input source (e.g. source_a = 0, source_b = 1), instead of a single column that identifies/names the input source (e.g. source = b). So to get to c=data_source in Auspice still requires a little bit of data wrangling. Should we make that easier?
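To make the "little bit of data wrangling" concrete, here is a minimal sketch that collapses the per-source boolean columns into a single named column. It is an illustration under stated assumptions, not something augur provides: the helper `add_data_source`, the `__source_metadata_` prefix, the "1"/"0" string values, and joining multiple sources with a comma are all choices made for this example.

```python
import csv
import io

PREFIX = "__source_metadata_"  # assumed template: __source_metadata_{NAME}


def add_data_source(rows, prefix=PREFIX, column="data_source"):
    """Yield each row with one extra column naming the input source(s).

    Hypothetical helper for illustration only; not part of augur.
    """
    for row in rows:
        # Collect the NAME part of every source column flagged "1" for this row.
        sources = [
            name[len(prefix):]
            for name, value in row.items()
            if name.startswith(prefix) and value == "1"
        ]
        # Assumption: rows present in several inputs get a comma-joined label.
        yield {**row, column: ",".join(sources)}


merged = io.StringIO(
    "strain\t__source_metadata_a\t__source_metadata_b\n"
    "s1\t1\t0\n"
    "s2\t0\t1\n"
    "s3\t1\t1\n"
)
result = list(add_data_source(csv.DictReader(merged, delimiter="\t")))
for row in result:
    print(row["strain"], row["data_source"])
# s1 a
# s2 b
# s3 a,b
```

How to label rows found in more than one input (joined names, first match, or a dedicated "multiple" value) is a design choice that would need settling before building this into the command.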

Removes default naming template and requires users to explicitly provide their own template to include source columns. This makes the output from an `augur merge` invocation more self-documenting without columns "magically" appearing. In the expected context of usage within a workflow, the burden of the extra option is negligible.

See also discussion on a prior PR.¹

¹ <#1625 (comment)>

tsibley · 2024-09-16T18:56:27Z

> I'll change the default behaviour in a new PR.

#1632

Base automatically changed from trs/merge/tests to master September 6, 2024 18:08

tsibley force-pushed the trs/merge/source-columns branch from 2eee345 to 3c32b99 Compare September 6, 2024 18:41

genehack approved these changes Sep 9, 2024

victorlin approved these changes Sep 10, 2024

tsibley merged commit db54927 into master Sep 10, 2024
28 checks passed

tsibley deleted the trs/merge/source-columns branch September 10, 2024 17:42

This was referenced Sep 10, 2024

augur merge is slow to read in metadata #1628

Open

Update Nextclade metadata merge to use augur curate rename and augur merge nextstrain/measles#52

Merged

tsibley mentioned this pull request Sep 16, 2024

merge: Omit generated source columns by default #1632

Merged

4 tasks
