Improve groupby sort #3251

Closed
bkamins opened this issue Dec 16, 2022 · 7 comments · Fixed by #3253

@bkamins
Member

bkamins commented Dec 16, 2022

Add an option to specify sort order in groupby.

@bkamins bkamins added this to the 1.5 milestone Dec 16, 2022
@jkrumbiegel
Contributor

In sort this is done by passing lt = some_function, which is isless by default. The name is not very descriptive, but matching it would have the benefit of being the same as Base.

@jkrumbiegel
Contributor

There's also the by keyword, which has the separate job of transforming the values before they are compared with the function passed as lt. For natural sort order, as an example, you'd only need lt and not by, but in other circumstances by might be useful as well. It's just not good to have by as a keyword in groupby when it doesn't relate to the grouping.
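
To make the distinction concrete, here is a minimal sketch using only Base's sort. The data and the shorter_first comparison are made up for illustration; they are not taken from this thread.

```julia
words = ["banana", "Cherry", "apple"]

# Defaults: lt = isless, by = identity. Uppercase letters sort before lowercase.
default_order = sort(words)

# `by` transforms each value before it is compared; the comparison itself is unchanged.
case_insensitive = sort(words; by = lowercase)

# `lt` replaces the comparison itself. Hypothetical rule: shorter strings first,
# ties broken by the usual string order.
shorter_first(a, b) = (length(a), a) < (length(b), b)
by_length = sort(words; lt = shorter_first)
```

A groupby option mirroring Base would need to forward both keywords, since neither can express the other in general.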

My suggestion would be either to add keywords sort_lt and sort_by that mirror the sort keywords, or to optionally pass a NamedTuple to the sort keyword, like sort = (; lt = func, by = ...).

@bkamins
Member Author

bkamins commented Dec 16, 2022

The challenge is in the details: whether lt and by should be "per column" or "per all grouping columns". Probably we should match what sort for AbstractDataFrame does.

@bkamins
Member Author

bkamins commented Dec 19, 2022

So, we would need to mimic the sortperm API. The changes would be:

  • the sort kwarg, apart from nothing, false, and true, would accept a named tuple with some or all of alg, lt, by, rev, and order, with the same defaults as in sortperm; if such a named tuple is passed, sorting is performed.
  • when passing grouping columns to groupby one could use the order wrapper, e.g. groupby(df, [:x, order(:y, rev=true)]); if such an order is passed then sort=true is the default (passing nothing or false would error). This is needed to allow specifying sort orders for grouping columns individually.
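
A sketch of how the two proposed forms might be used. It assumes a DataFrames.jl release that includes this change (the thread targets the 1.5 milestone); the data frame and its column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 1, 2], y = ["b", "a", "a", "b"], v = 1:4)

# Named tuple passed as `sort`: its fields are forwarded to the sorting step,
# so groups come out in descending order of x.
gd_rev = groupby(df, :x; sort = (rev = true,))

# Per-column order via the `order` wrapper: x ascending, y descending.
# Under the proposal, using `order` implies that the groups are sorted.
gd_mixed = groupby(df, [:x, order(:y, rev = true)])

rev_keys = [k.x for k in keys(gd_rev)]
mixed_keys = [(k.x, k.y) for k in keys(gd_mixed)]
```

The named tuple applies one set of sort options to all grouping columns, while order covers the case where each column needs its own direction.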

CC @nalimilan

@nalimilan
Member

Makes sense. That would just be a shorthand for groupby(sort(df, ...), cols) that avoids making a copy, right?
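
For comparison, a minimal sketch of the existing two-step pattern that the proposal would shorten. The data is made up for illustration; sort and groupby with sort=false are existing DataFrames.jl calls.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 1, 2], v = 1:4)

# Step 1: sort allocates a reordered copy of the whole data frame.
sorted = sort(df, :x; rev = true)

# Step 2: with sort = false, groups are created in order of first appearance,
# so they inherit the order established by the sort above.
gd = groupby(sorted, :x; sort = false)

group_order = [k.x for k in keys(gd)]
```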

@bkamins
Member Author

bkamins commented Dec 19, 2022

This is what @jkrumbiegel wanted. Right?

@jkrumbiegel
Contributor

I think so, yes. It just seems natural to influence the sorting directly instead of splitting it out and doing sort = false on the groupby call.
