Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sorting options to groupby #3253

Merged
merged 16 commits into from
Dec 27, 2022
Merged

Add sorting options to groupby #3253

merged 16 commits into from
Dec 27, 2022

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Dec 19, 2022

Fixes #3251

@bkamins bkamins added this to the 1.5 milestone Dec 19, 2022
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
test/grouping.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
test/grouping.jl Outdated Show resolved Hide resolved
bkamins and others added 2 commits December 27, 2022 12:32
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit f7b4769 into main Dec 27, 2022
@bkamins bkamins deleted the bk/groupby branch December 27, 2022 13:20
@bkamins
Copy link
Member Author

bkamins commented Dec 27, 2022

Thank you!

@testset "sorting API" begin
# simple tests
df = DataFrame(x=["b", "c", "b", "a", "c"])
@test getindex.(keys(groupby(df, :x)), 1) == ["b", "c", "a"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these can use only instead of getindex(_, 1).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed it could. I initially had this test implemented on groups not on keys and using getindex works on both.

@test getindex.(keys(groupby(df, :x, sort=true)), 1) == [1, 2, 100]
@test getindex.(keys(groupby(df, :x, sort=NamedTuple())), 1) == [1, 2, 100]
@test getindex.(keys(groupby(df, :x, sort=false)), 1) == [2, 100, 1]
@test getindex.(keys(groupby(df, order(:x))), 1) == [1, 2, 100]
Copy link
Contributor

@jariji jariji Dec 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having so many equivalent ways to specify sorting does seem a bit much? Not sure if it's worth doing anything about.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that sort etc. provide that many ways to specify sort order, so we cannot do anything about it.
What is the rationale behind it:

  • normally people will use the "global" settings like (rev=true,), which applies to all columns
  • however there are cases when you want to specify sorting order per column, e.g. [order(:x, rev=true), :y], where you reverse :x but sort on :y in ascending order. Therefore order "per column" is needed.

In general - this complexity is needed when one has several columns.

@bkamins
Copy link
Member Author

bkamins commented Dec 27, 2022

@jariji - if you would be willing actually improving https://dataframes.juliadata.org/stable/man/sorting/ section of the manual would be welcome. I planned to do it at some point, but maybe you would be willing to give it a shot and give a more in-depth coverage of all sorting options (this PR just inherits the complexity we allow for there).

@jariji
Copy link
Contributor

jariji commented Feb 5, 2023

Is there an advantage to sorting during the groupby versus sorting the groups afterwards?

@bkamins
Copy link
Member Author

bkamins commented Feb 5, 2023

There is no convenient way to sort the groups afterwards AFAICT. To get a desired order you would need to sort the data frame you groupby before grouping, which is not always desired.

@jariji
Copy link
Contributor

jariji commented Feb 5, 2023

  • Sorting before. I'm not sure what you mean it's not desired. Simply sort by the group columns before grouping.
  • Sorting after. Introduce sortgroups(df). This is more flexible since it doesn't require regrouping in order to resort.

@bkamins
Copy link
Member Author

bkamins commented Feb 5, 2023

I'm not sure what you mean it's not desired. Simply sort by the group columns before grouping.

It is expensive (in terms of time and memory)

Introduce sortgroups(df). This is more flexible since it doesn't require regrouping in order to resort.

Sorting while grouping will be faster and more convenient if user wants groups sorted (and most likely this is a most typical use case where user knows upfront how one wants groups to be sorted).

sortgroups(df) can be added as a separate function indeed. The question is how often re-sorting of groups is indeed needed (i.e. I would avoid adding a function that would be almost not used; the option to specify sorting order in groupby was added because users needed it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve groupby sort
3 participants