Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rename valuestransform to combine in unstack #3185

Merged
merged 6 commits into from
Oct 3, 2022
Merged

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Sep 26, 2022

Fixes #3184

We can still discuss the name if needed. Current proposal is valuesfunction.

@bkamins bkamins added this to the 1.4 milestone Sep 26, 2022
@bkamins
Copy link
Member Author

bkamins commented Sep 27, 2022

I will wait a bit for comments here, but in general I am pretty convinced to keep valuesfunction as it is consistent with values_fn in R.

@bkamins
Copy link
Member Author

bkamins commented Sep 28, 2022

@nalimilan + @jariji + @jkrumbiegel + @ararslan + @wolthom

Let us try to reach some conclusion and move forward with this decision. Otherwise there is a risk that we will be waiting idly.

Clearly valuesfunction seems a name that people do not find intuitive and easy to remember. Also I think we should not use anything that has transform or combine in the name to avoid confusion with existing methods (some people feel that what we do is similar to combine, but other do not).

What I have checked:

  • pandas requires aggfunc to return a scalar
  • in databases also the name aggregate is used but it is expected that scalar is returned

We are more flexible, but maybe it is ok to e.g. accept e.g. aggregate_with. The point is mostly that it will be easy for users to remember.

A more non standard name could be fuse_with. The reason is that "fuse" is defined to be (I am not a native speaker so I am not 100% sure here):

join or blend to form a single entity.

Which is what we essentially do - i.e. take potentially several values and put them in a single cell of output table.

Thinking of pros and cons I would lean to aggregate_with as it will be easier to learn and remember (even if we accept non aggregating functions like copy I think that this will be a super rare use case - mostly users will do aggregation).

@jariji
Copy link
Contributor

jariji commented Sep 28, 2022

A name based on "aggregate" is fine with me: aggregate, aggregator, aggregatewith or aggregate_with.

@ararslan
Copy link
Member

At least in my mind, "fuse" doesn't really seem like the right verb for this as it kind of implies something more like a union rather than aggregation.

I also don't think the "with" suffix provides any meaningful clarity, be it appended to "fuse" or "aggregate" or whatever.

in databases also the name aggregate is used but it is expected that scalar is returned

Actually, aggregate seems like a reasonable candidate to me. Even when you aren't returning a scalar, you're aggregating in the sense of putting some set of values into a single cell; the contents of the cell just happens to be all of the original values. 😄

@bkamins bkamins changed the title rename valuestransform to valuesfunction in unstack rename valuestransform to combine in unstack Oct 1, 2022
@bkamins
Copy link
Member Author

bkamins commented Oct 1, 2022

After discussions with @nalimilan we concluded that the intuition users have is best and recommend switching the kwarg name to combine.

Additionally allowduplicates kwarg is deprecated. The reason is that allowduplicates=false is the same as combine=only (which is now the default) and allowduplicates=true is the same as combine=last. Therefore the allowduplicates kwarg is not needed any more.

In this PR both these changes are made.

Please comment if you have any additional thoughts on the design of these unstack keyword arguments.

@wolthom
Copy link
Contributor

wolthom commented Oct 1, 2022

@bkamins Just to add a bit of feedback:
I think combine is a reasonable choice as well. In this case, I personally don't see a risk of confusion / ambiguity with the combine method.

@jkrumbiegel
Copy link
Contributor

jkrumbiegel commented Oct 1, 2022

I'm not deep into this issue, so sorry if this suggestion was covered before. But I understood that combine might be confusing because if a Vector of values is returned from the function, this Vector will not be expanded like in combine but will occupy a single cell in the resulting DataFrame. Would it maybe make sense to just completely assimilate the behavior to combine and just expand Vector results in the same way? This means that by default you can have duplicated rows if you don't combine with a function that returns a single result. But that might even be useful in some circumstances? If you do want a Vector in a cell, you'd just do the same thing as with combine and use combine = Ref.

@bkamins
Copy link
Member Author

bkamins commented Oct 1, 2022

Would it maybe make sense to just completely assimilate the behavior to combine and just expand Vector results in the same way?

The problem is that it would only make sense if in all cells of a given row you would have the same number of duplicates, which is unlikely. I think we should not add this option.

Later, if user wants, the behavior you describe can be achieved easily with flatten.

(I understand that you agree with the combine keyword argument name still?)

@jkrumbiegel
Copy link
Contributor

Ah yes I guess that does not make sense, ok then the approach with flatten makes more sense in that case.

@bkamins
Copy link
Member Author

bkamins commented Oct 1, 2022

If there will be no additional comments I will merge this PR on Monday, Oct 3, 2022.

NEWS.md Outdated Show resolved Hide resolved
src/abstractdataframe/reshape.jl Show resolved Hide resolved
bkamins and others added 2 commits October 2, 2022 11:36
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit 57b560b into main Oct 3, 2022
@bkamins bkamins deleted the bk/unstack_valuesfunction branch October 3, 2022 06:25
@bkamins
Copy link
Member Author

bkamins commented Oct 3, 2022

Merged. Thank you for the discussion.

The conclusion is:

  • we use combine keyword argument to pass function that should be applied to values in combination of rows/columns
  • allowduplicates keyword argument is deprecated as it is not needed now (combine should be used instead)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

change valuestransform in unstack
6 participants