Re-do the @orderby backend #191

Merged: 20 commits, Oct 18, 2020

pdeffebach
Collaborator

@pdeffebach pdeffebach commented Oct 10, 2020

Before, there was some complicated with_anonymous machinery.

Now the backends are select and combine.

For an abstract data frame this is intuitive: just call select on all the arguments with a nolhs = true option and return the sortperm of the result.
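
A minimal sketch of that data-frame path, written with plain DataFrames.jl calls rather than the macro internals. The helper name `orderby_sketch` and the `:key` column are hypothetical, and the `nolhs` option from the PR is not reproduced; the sketch only illustrates the "compute the sort keys with `select`, then take `sortperm`" idea.

```julia
using DataFrames, Statistics

# Hypothetical helper: compute sort keys with `select`, then reorder rows.
function orderby_sketch(df::AbstractDataFrame, key_transform)
    sort_keys = select(df, key_transform)   # temporary data frame of keys
    perm = sortperm(sort_keys)              # row permutation that sorts the keys
    return df[perm, :]
end

df = DataFrame(a = [1, 1, 2, 2], b = [7, 6, 9, 8])

# Roughly what `@orderby(df, :b .- mean(:b))` is meant to do.
sorted = orderby_sketch(df, :b => (b -> b .- mean(b)) => :key)
```

The key column lives only in the temporary data frame, so the returned rows keep the original columns.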

For a grouped data frame, unfortunately @orderby currently re-orders the groups and returns a grouped dataframe.

This is annoying because if you do, say,


julia> df = DataFrame(a = [1, 1, 2, 2], b = [6, 6, 7, 7]);

julia> gd = groupby(df, :a);

julia> @orderby(gd, :b)

then the result of the anonymous function in @orderby is a vector for each group and since Julia has lexicographic ordering of vectors, you can get some surprising results.
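
To make the lexicographic comparison concrete, here is a small sketch in plain Julia. The vectors in `group_values` are hypothetical per-group values, not data from this PR.

```julia
# Julia compares vectors lexicographically: element by element, and a
# vector that is a strict prefix of another sorts first.
first_element_decides = isless([1, 100], [2, 0])   # 1 < 2, the rest is ignored
prefix_sorts_first = isless([1], [1, 100])

# Hypothetical per-group values, one vector per group.
group_values = [[2, 0], [1, 100], [1]]

# Ordering groups by these vectors therefore depends mostly on the first
# element of each group, which is rarely what the user intended.
group_order = sortperm(group_values)
```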

This was never tested and I'm not sure who uses it. I am making a breaking change here by only allowing expressions inside @orderby to return scalars, i.e. @orderby(gd, :b) will error, and @orderby(gd, mean(:b)) will work.

Update: I have found a solution that is somewhat hacky but doable. Current behavior matches 100% old behavior.

@pdeffebach
Collaborator Author

Additionally, it looks like @orderby wasn't really tested! So I have to add lots of tests.

@bkamins
Member

bkamins commented Oct 10, 2020

I would prefer if you propose the solution that "makes sense" for the future and if this needs to break in the case of GroupedDataFrame just break it.

@pdeffebach
Collaborator Author

The solution I would like is

julia> df = DataFrame(a = [1, 1, 1, 2, 2, 2], b = [1, 0, -1, 3, 0, -3]);

julia> gd = groupby(df, :a);

julia> @orderby(gd, :b .- mean(:b))

which would end up performing the operation :b .- mean(:b) the way it would work in @select and then calling

out_df = @select(gd, :_x = :b .- mean(:b))
parent(gd)[sortperm(out_df), :]
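
A runnable sketch of the same idea using only DataFrames.jl functions, under the assumption that `select` on a GroupedDataFrame returns its rows in the order of the parent data frame. It is not the macro implementation from this PR.

```julia
using DataFrames, Statistics

df = DataFrame(a = [1, 1, 1, 2, 2, 2], b = [1, 0, -1, 3, 0, -3])
gd = groupby(df, :a)

# `select` on a GroupedDataFrame applies the transformation group by group
# and returns the rows in the same order as `parent(gd)`.
out_df = select(gd, :b => (b -> b .- mean(b)) => :_x)

# `out_df` holds the grouping column :a and the key :_x; sort on both.
perm = sortperm(out_df)
result = parent(gd)[perm, :]
```

Because the grouping column is part of `out_df`, the rows end up sorted by group first and by the within-group key second.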

@bkamins
Member

bkamins commented Oct 11, 2020

Yes - this would be super cool. And a similar thing will be supported in where in DataFrames.jl once I get to implementing it.

@pdeffebach
Collaborator Author

This is ready for a review.

For posterity, I want to emphasize that this change reduces performance. @bkamins I was very surprised that gd[[2, 1]] doesn't make any copies!

But I think this is the better and more intuitive interface. I don't really understand the original authors' preoccupation with reordering and selecting different groups in a DataFrame.

@pdeffebach
Collaborator Author

Tests are broken, will fix soon.

@bkamins
Member

bkamins commented Oct 13, 2020

> @bkamins I was very surprised that gd[[2, 1]] doesn't make any copies!

I have not designed it (except for fixing bugs), but I guess this was intentional to ensure we have good performance.

> I don't really understand the original authors' preoccupation with reordering and selecting different groups in a DataFrame.

Again - I am not the author, but I guess after you order groups (which is fast), if you want to materialize a DataFrame you can always call DataFrame on the result (and if you do not need a materialized DataFrame you can leave things as they are - e.g. when you later want to run some aggregation function).
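
A short sketch of the behavior discussed here, assuming current DataFrames.jl semantics: indexing a GroupedDataFrame with a vector of group indices reorders the groups while sharing the parent, and `DataFrame` materializes the result on request.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2], b = [6, 6, 7, 7])
gd = groupby(df, :a)

# Indexing with a vector of group indices reorders the groups; the result
# still refers to the same parent data frame, so no rows are copied.
reordered = gd[[2, 1]]
same_parent = parent(reordered) === df

# Materialize the reordered groups only when a DataFrame is needed.
materialized = DataFrame(reordered)
```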

@pdeffebach
Collaborator Author

Apologies for the delay on this. It is now ready for a review.

@bkamins
Member

bkamins commented Oct 14, 2020

Just to be clear - why do you prefer @orderby to return a DataFrame not GroupedDataFrame?

@pdeffebach
Collaborator Author

  1. Returning grouped data frame is inconsistent with @select and @transform
  2. With the addition of @select and @transform, the order of groups doesn't matter as much, so I'm not sure the original use case applies
  3. Something like @orderby(groupby(df, :g), :x) is very confusing. Current behavior compares the arrays gd[1].x, gd[2].x etc. This is lexicographic ordering because that's how arrays are compared in Julia.
  4. Implementation seems hard when using the new backend.

@bkamins
Member

bkamins commented Oct 14, 2020

OK

@@ -410,6 +410,10 @@ function orderby(x::AbstractDataFrame, @nospecialize(args...))
end

function orderby(x::GroupedDataFrame, @nospecialize(args...))

@warn "orderby behavior now returns a `DataFrame` instead of a `GroupedDataFrame`. " *
"Group the returned data frame to restore old behavior" maxlog = 5
Member

probably maxlog = 1 or 2 is enough?

Collaborator Author

added.
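
For reference, a small sketch of how `maxlog` limits a repeated warning. The function `warn_once_sketch` is a hypothetical stand-in; only the logging pattern is taken from the diff above, and `maxlog` is a hint that the default logger honors per call site.

```julia
# Hypothetical stand-in for the method that emits the warning.
function warn_once_sketch(values)
    # With `maxlog = 1` the default logger shows this message only the
    # first time this call site is reached.
    @warn "orderby now returns a `DataFrame` instead of a `GroupedDataFrame`." maxlog = 1
    return sort(values)
end

# Three calls, but the warning is expected to be printed once.
results = [warn_once_sketch([3, 1, 2]) for _ in 1:3]
```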

@pdeffebach
Collaborator Author

@nalimilan Are you okay with me breaking this?

@bkamins
Member

bkamins commented Oct 14, 2020

I would recommend that you just push forward with what you think is a good design (of course @nalimilan and I will gladly consult with you). You have this freedom as DataFramesMeta.jl is not that big/mature - so I would recommend shaping it aggressively (the reason is that once it matures/becomes widely used, you will hit what we have in DataFrames.jl: every decision, even one that seems simple, takes a lot of time).

@pdeffebach
Collaborator Author

That sounds good. I will break @where in a similar way, then.

I will merge tomorrow if there are no objections

@nalimilan
Member

I'm fine with you breaking things, and I agree it makes sense to make @orderby closer to @select. Though maybe we should think carefully about the consistency of the API in general? In particular, I wonder what's the relation between @orderby and sort. For @select, transform and combine there's a one-to-one mapping with DataFrames functions. Should it be the case for other macros too?

For DataFrame, orderby does the same thing as sort AFAICT, it just allows computing a new column and sorting on it in a single operation. For GroupedDataFrame there's the issue that sort should probably sort groups rather than rows, similar to what we discussed for filter. So maybe actually the current behavior of @orderby is what a hypothetical @sort would do?

Also, is the goal that all macros return a DataFrame by default, even when passed a GroupedDataFrame?

FWIW, in dplyr arrange doesn't take into account grouping at all when sorting, but preserves grouping, and this is explicitly mentioned as an exception in the dplyr API. Not sure whether they are happy about that, but in issues mentioning that on GitHub they didn't mention any regret.

@bkamins
Member

bkamins commented Oct 15, 2020

> For GroupedDataFrame there's the issue that sort should probably sort groups rather than rows

I understand that this is what @orderby does now in this proposal. That is why I was OK with it. The only thing was that a DataFrame is returned, but I understand @pdeffebach wants GroupedDataFrame not to be sticky in DataFramesMeta.jl, but always dropped. This is actually the part of the design whose reasons it would be good to understand better. @pdeffebach - can you comment on this more?

@nalimilan
Member

Now let me play the devil's advocate. :-p

What are the use cases for having @orderby return a DataFrame rather than a GroupedDataFrame? That sounds useful only if sorting within groups was the goal of the operation and the last part of the piping sequence, right? Is the idea that people should either sort before grouping, or after ungrouping (using e.g. select)?

Also, functions like @where will keep returning a GroupedDataFrame, right?

Comment on lines 436 to 440
The second example below shows the logic of `@orderby` with a
`GroupedDataFrame`. Note that the column `:t` is arranged from
lowest to highest after the `@orderby` command. This shows that
`@orderby` is equivalent to a transformation by group followed
by ordering on the subsequent transformation.
Member

Better put this with the corresponding example.

test/grouping.jl Outdated
Comment on lines 306 to 308
@test @orderby(gd, mean(:i)).i == [1, 2, 3, 4, 5]
@test @orderby(df, std(:i) .- :i).i == [5, 4, 3, 2, 1]
@test @orderby(gd, :g, -1 .* (:i .- mean(:i))).i == [3, 2, 1, 5, 4]
Member

Maybe check against @orderby(@select(gd, ...), ...)? That will also allow checking other columns.

Also df should be gd on the second line.

Collaborator Author

Thanks!

Comment on lines 69 to 72
DataFrames.groupby(b_str) |>
orderby(-mean(cols(x_sym))) |>
groupby(:b) |>
based_on(cols("meanX") = mean(:x), meanY = mean(:y))
Member

This example is really weird. Why isn't orderby called as the last step?

Collaborator Author

Thanks, I have changed this pipe to match the other ones (it should be the same, just adding cols everywhere).

@bkamins
Member

bkamins commented Oct 16, 2020

@pdeffebach - given the discussion we have in this PR, a more general question occurred to me. Here we discuss the @orderby macro, but it would be natural to define a @sort macro (as in other cases, where the macro name is the same as the function name). So the questions are:

  • why do we want the @orderby macro at all (maybe we should remove it and just define a @sort macro)
  • does this macro have to work on GroupedDataFrame at all (maybe it is enough to allow sorting data frames) and then compose this sorting with select and transform that work on GroupedDataFrame

What do you think?

@pdeffebach
Collaborator Author

pdeffebach commented Oct 16, 2020

> What are the use cases for having @orderby return a DataFrame rather than a GroupedDataFrame? That sounds useful only if sorting within groups was the goal of the operation and the last part of the piping sequence, right? Is the idea that people should either sort before grouping, or after ungrouping (using e.g. select)?

I don't have a strong preference about returning a grouped data frame or a data frame. What I do have a preference for is not re-ordering groups, but rather performing an operation by group and re-ordering the result by rows. This seems like a more common need than sorting groups. Returning a DataFrame after this is more about consistency. It's not good to have some operations on GroupedDataFrames be sticky and others not.

> • why do we want the @orderby macro at all (maybe we should remove it and just define a @sort macro)
> • does this macro have to work on GroupedDataFrame at all (maybe it is enough to allow sorting data frames) and then compose this sorting with select and transform that work on GroupedDataFrame

I'm not super opposed to requiring a @transform and then a @sort, but no one likes having to create temporary columns, especially when dealing with data frames with many columns. In general we should avoid the kind of "calculations as columns" that Stata users have to do.

In the interest of clarity, if I could re-name everything from scratch, here is what I would do

  • @gencols: @transform (both DataFrame and GroupedDataFrame)
  • @keepcols: @select (both DataFrame and GroupedDataFrame)
  • @keeprows: My proposed implementation of @where, which always returns a DataFrame and performs transformations by group
  • @keepgroups: Current implementation of @where with a GroupedDataFrame
  • @sortrows: My proposed implementation of @orderby in this PR, which always returns a DataFrame and performs transformations by group
  • @sortgroups: Current implementation of @orderby on a GroupedDataFrame
  • @with: Works only with a DataFrame (added to this list for completeness).

@nalimilan
Member

> I'm not super opposed to requiring a @transform and then a @sort, but no one likes having to create temporary columns, especially when dealing with data frames with many columns. In general we should avoid the kind of "calculations as columns" that Stata users have to do.

I think the idea was that @sort would also allow sorting on transformations, without creating columns manually before sorting. Would that address your concern?

Following your terminology, I wonder whether the @keepgroups and @sortgroups are really needed. AFAICT dplyr doesn't support that, right?

@pdeffebach
Collaborator Author

Yes, I don't see a need for @keepgroups and @sortgroups. This is a main motivation for changing @orderby and @where, as this is their current functionality.

Given that GroupedDataFrames iterate through groups, I think there is value in the explicitness of rows vs groups.

@nalimilan
Member

How about deprecating passing GroupedDataFrame to @orderby then and see how it goes?

@pdeffebach
Collaborator Author

What if I want to

  1. Sort by groups
  2. Within a group, sort by a value's deviation from the group mean

This is the kind of functionality this PR gives

@orderby(groupby(df, :g), :g, :v .- mean(:v))

@bkamins
Member

bkamins commented Oct 16, 2020

> sort by a value's deviation from the group mean

@orderby(groupby(df, :g), :g, :v .- mean(:v))

is the same as:
sort(df, [:g, :v])

but I see the point. Still maybe (I will keep using DataFrames.jl syntax, not to fix ourselves on the solution for DataFramesMeta.jl):

combine(groupby(df, :g, sort=true), sdf -> sort(sdf, :v => v -> v .- mean(v)))

or

transform(groupby(sort(df, :g), :g), sdf -> sort(sdf, :v => v -> v .- mean(v)))

seems to be legible enough (and I have opened JuliaData/DataFrames.jl#2489 to allow transformations in DataFrames.jl in the future; of course in DataFramesMeta.jl we can allow transformations already now)
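
A sketch comparing the two approaches with functions that exist in DataFrames.jl today. Since `sort` with a column transformation is only proposed in JuliaData/DataFrames.jl#2489, the sketch uses `sortperm` inside `combine` instead. The data and the absolute-deviation key are hypothetical, chosen so that the result differs from a plain `sort(df, [:g, :v])`.

```julia
using DataFrames, Statistics

df = DataFrame(g = [2, 1, 1, 1], v = [5, 9, 4, 6])
gd = groupby(df, :g)

# Key-based ordering: sort by group, then by absolute deviation from the
# group mean. Unlike the plain deviation, this key is not monotone in :v.
sort_keys = select(gd, :v => (v -> abs.(v .- mean(v))) => :dev)
by_keys = df[sortperm(sort_keys), :]

# The same result with `combine`, reordering the rows inside each group.
by_combine = combine(groupby(df, :g, sort = true),
                     sdf -> sdf[sortperm(abs.(sdf.v .- mean(sdf.v))), :])
```

Both forms give the same rows; the difference is mainly how much the user has to type and whether a temporary key column is visible.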

@pdeffebach
Collaborator Author

After this discussion, in general, I am becoming more confident that this PR is the right move

  1. As Milan mentioned, there isn't a big need to order groups (the @sortgroups macro above)
  2. The combine solution posted by Bogumil is cumbersome and requires much more typing
  3. As a matter of consistency, it's important that all macros return DataFrames.

Though I do wish I could go ahead and implement all the re-naming macros given above...

pdeffebach and others added 3 commits October 16, 2020 15:40
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@pdeffebach
Collaborator Author

Thanks for the feedback.

I will think on this more for a few days, but I am still leaning towards keeping this. I like the consistency that every macro which either implicitly or explicitly performs a transformation allows a grouped data frame and does the transformation by group.

For instance, I think we can all agree that @where should perform operations by group. Taking the row with the highest value for each group is something people generally want to do.
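
A minimal sketch of that "highest value per group" filter written with plain DataFrames.jl, not with the proposed @where implementation. The data and the `:keep` column name are hypothetical.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2, 2, 2], v = [3, 8, 5, 2, 7])
gd = groupby(df, :g)

# Evaluate the condition group by group; rows come back in parent order.
flags = select(gd, :v => (v -> v .== maximum(v)) => :keep)

# Keep the rows where the per-group condition holds.
top_rows = df[flags.keep, :]
```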

This would leave @orderby as lonely, the only macro which does not allow for a grouped operation returning a DataFrame.

@bkamins
Member

bkamins commented Oct 16, 2020

As I have commented in the other PR - if @orderby for GroupedDataFrame is a functionality that is needed very rarely we can leave it out for now and decide how to handle it later when we have very clear use cases. Thank you for working on this.

@nalimilan
Member

To me the main advantage of not supporting GroupedDataFrame inputs would be that we wouldn't need @orderby but we could call it @sort instead, which is a natural name and consistent with Base. That's a similar decision to adding where in DataFrames corresponding to @where in DataFramesMeta, while we already have filter but with semantics that don't match our needs for GroupedDataFrame.

@pdeffebach
Collaborator Author

pdeffebach commented Oct 17, 2020

Okay, I will "reserve" @orderby with a grouped data frame.

However I still have a concern about @sort. I do think that the feature I propose for @orderby is useful overall. It's not a super high priority so I am fine not implementing it now. But once we use @sort, the feature I propose is off the table since it would break a contract with Base.

So I'm not sure what the solution is in the long run. It would be odd for DataFrames to define both sort and orderby. But then again we will have both filter and where.

Maybe sort's transformations should operate row-wise! That would be most similar to the contract with Base and open up a path for an alternative orderby. Similar to filter and where.
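
To illustrate the distinction in plain Julia: Base `sort` applies its `by` function to each element on its own, while a key such as "deviation from the mean" needs the whole vector. The data below is hypothetical.

```julia
using Statistics

v = [3, -1, -4, 10]

# Base `sort` applies `by` to each element independently (row-wise).
rowwise = sort(v, by = abs)

# A key such as "absolute deviation from the mean" needs the whole vector,
# so it is computed first and the permutation is applied afterwards.
columnwise = v[sortperm(abs.(v .- mean(v)))]
```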

@pdeffebach
Collaborator Author

Okay, ready for a review after disallowing a GroupedDataFrame.

@bkamins
Member

bkamins commented Oct 18, 2020

OK - let's get rolling with it. Nightly fails due to unrelated reasons - right?

@pdeffebach
Collaborator Author

Yes, the errors are due to MacroTools and Lazy which are only imported during the chaining test set. This is ready to merge.

@bkamins
Member

bkamins commented Oct 18, 2020

OK - go ahead and merge (if you do not have rights please let me know).

@pdeffebach pdeffebach merged commit 1e1577b into JuliaData:master Oct 18, 2020