Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge: Support changing the names of, or omitting entirely, the generated source columns #1625

Merged
merged 1 commit into from
Sep 10, 2024

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Sep 6, 2024

This lets us more easily use augur merge in places where it makes no sense to include the generated source columns (e.g. in the Nextclade metadata merge step of our workflows) and in places where we have existing source column names we want to match (e.g. in ncov, replacing the bespoke combine_metadata.py).

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

@tsibley
Copy link
Member Author

tsibley commented Sep 6, 2024

The default behaviour here is still unchanged: source columns are generated as __source_metadata_{NAME}. With the new --source-columns option now, should we default them off instead and require opt-in? We might find ourselves opting in most of the time, but this would maybe be the least surprising behaviour.

Copy link

codecov bot commented Sep 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.06%. Comparing base (81db604) to head (3c32b99).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1625      +/-   ##
==========================================
+ Coverage   71.03%   71.06%   +0.03%     
==========================================
  Files          79       79              
  Lines        8258     8268      +10     
  Branches     2005     2010       +5     
==========================================
+ Hits         5866     5876      +10     
  Misses       2101     2101              
  Partials      291      291              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Base automatically changed from trs/merge/tests to master September 6, 2024 18:08
…ated source columns

This lets us more easily use `augur merge` in places where it makes no
sense to include the generated source columns (e.g. in the Nextclade
metadata merge step of our workflows) and in places where we have
existing source column names we want to match (e.g. in ncov, replacing
the bespoke combine_metadata.py).
@genehack
Copy link
Contributor

genehack commented Sep 9, 2024

With the new --source-columns option now, should we default them off instead and require opt-in? We might find ourselves opting in most of the time, but this would maybe be the least surprising behaviour.

I agree that defaulting to not adding source columns feels like the best fit with Principle of Least Surprise — to me, they feel more like a debugging/troubleshooting tool than something you'd want to see all the time.

@tsibley
Copy link
Member Author

tsibley commented Sep 10, 2024

to me, they feel more like a debugging/troubleshooting tool than something you'd want to see all the time.

In past usage, source columns like these are commonly used for subsampling where you want a little of this dataset and a lot of that dataset. Sometimes you can subsample before the merge, sometimes it happens after, and in that case, you condition on the source of the row.

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsibley
Copy link
Member Author

tsibley commented Sep 10, 2024

Docs preview looks good

fsvo "good", IMO. :-) I want to improve the command usage doc rendering a lot—there's a bunch of nits that bug me that I addressed for Nextstrain CLI's docs already—but trying to leave that for another time as it's quite a bit involved.

@tsibley tsibley merged commit db54927 into master Sep 10, 2024
28 checks passed
@tsibley tsibley deleted the trs/merge/source-columns branch September 10, 2024 17:42
@j23414
Copy link
Contributor

j23414 commented Sep 10, 2024

My vote is defaulting to not adding source columns. I'd only need it when debugging and I'm imagining using "augur merge" every-time there's new data to merge in (chained across various pull dates).

@joverlee521
Copy link
Contributor

With the new --source-columns option now, should we default them off instead and require opt-in? We might find ourselves opting in most of the time, but this would maybe be the least surprising behaviour.

I vote default to off and require opt-in. I think most of our uses would opt-out of source columns

@trvrb
Copy link
Member

trvrb commented Sep 10, 2024

I'd be almost always running --source-columns for the core use case of combining data sources as the first step in preparing sequence data. This is useful for subsampling, but I'd also want this as a coloring, eg https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome?c=data_source.

That said, I agree that the least surprising behavior is to default to off and adding source columns with --source-columns command line option. Also, seeing --source-columns when running augur merge --help seems like a nice bit of discoverability for a feature whose existence is non-obvious.

@victorlin
Copy link
Member

victorlin commented Sep 10, 2024

+1 for opt-in. This would encourage users to customize the template, which I think would be good for overall understanding within a workflow. Example:

augur merge
  --metadata a=a.tsv b=b.tsv
  --source-columns '__source_metadata_{NAME}'
  --output-metadata merged.tsv

augur filter
  --metadata merged.tsv
  --query '__source_metadata_a'
  --subsample-max-sequences 20

augur filter
  --metadata merged.tsv
  --query '__source_metadata_b'
  --subsample-max-sequences 10

(in contrast to __source_metadata_a "magically" appearing)

@tsibley
Copy link
Member Author

tsibley commented Sep 10, 2024

Seems there's a nice consensus for defaulting off. I agree it's more self-documenting in workflows and examples to see the explicit template. I'll change the default behaviour in a new PR.

@trvrb

This is useful for subsampling, but I'd also want this as a coloring, eg https://nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome?c=data_source.

Hmm. As currently discussed, designed, and implemented, we're using multiple boolean columns, one per input source (e.g. source_a = 0, source_b = 1), instead of a single column that identifies/names the input source (e.g. source = b). So to get to c=data_source in Auspice still requires a little bit of data wrangling. Should we make that easier?

tsibley added a commit that referenced this pull request Sep 16, 2024
Removes default naming template and requires users to explicitly provide
their own template to include source columns.  This makes the output
from an `augur merge` invocation more self-documenting without columns
"magically" appearing.  In the expected context of usage within a
workflow, the burden of the extra option is negligible.  See also
discussion on a prior PR.¹

¹ <#1625 (comment)>
tsibley added a commit that referenced this pull request Sep 16, 2024
Removes default naming template and requires users to explicitly provide
their own template to include source columns.  This makes the output
from an `augur merge` invocation more self-documenting without columns
"magically" appearing.  In the expected context of usage within a
workflow, the burden of the extra option is negligible.  See also
discussion on a prior PR.¹

¹ <#1625 (comment)>
@tsibley
Copy link
Member Author

tsibley commented Sep 16, 2024

I'll change the default behaviour in a new PR.

#1632

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants