Re-do the `@orderby` backend #191

pdeffebach · 2020-10-10T16:37:28Z

Before, there was some complicated with_anonymous machinery.

Now the backends are select and combine.

For an abstract data frame this is intuitive: just call select on all the arguments with a nolhs = true option and return the sortperm from that.

For a grouped data frame, unfortunately @orderby currently re-orders the groups and returns a grouped dataframe.

This is annoying because if you do, say,


julia> df = DataFrame(a = [1, 1, 2, 2], b = [6, 6, 7, 7]);

julia> gd = groupby(df, :a);

julia> @orderby(gd, :b)

then the result of the anonymous function in @orderby is a vector for each group and since Julia has lexicographic ordering of vectors, you can get some surprising results.
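The lexicographic comparison is easy to see with plain vectors. This is a minimal sketch in Base Julia only; the variable names are made up for illustration, and the vectors stand in for the per-group values of `:b`.

```julia
# Base Julia compares vectors element by element (lexicographically).
group1_b = [6, 6]   # column :b of the first group
group2_b = [7, 7]   # column :b of the second group
first_goes_first = isless(group1_b, group2_b)

# The first differing element decides, so later values are ignored entirely.
surprising = isless([1, 100], [2, 0])

# Ordering groups then amounts to a sortperm over these per-group vectors.
group_order = sortperm([[2, 0], [1, 100]])
```

Here `[1, 100]` sorts before `[2, 0]` even though it contains the largest value, which is the kind of result that can surprise users.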

This was never tested and I'm not sure who uses it. I am making a breaking change here by only allowing expressions inside @orderby to return scalars, i.e. @orderby(gd, :b) will error, and @orderby(gd, mean(:b)) will work.

Update: I have found a solution that is somewhat hacky but doable. Current behavior matches 100% old behavior.

pdeffebach · 2020-10-10T16:38:40Z

Additionally, it looks like @orderby wasn't really tested! So I have to add lots of tests.

bkamins · 2020-10-10T17:10:24Z

I would prefer if you propose the solution that "makes sense" for the future and if this needs to break in the case of GroupedDataFrame just break it.

pdeffebach · 2020-10-10T18:07:42Z

The solution I would like is

julia> df = DataFrame(a = [1, 1, 1, 2, 2, 2], b = [1, 0, -1, 3, 0, -3]);

julia> gd = groupby(df, :a);

julia> @orderby(gd, :b .- mean(:b))

which would end up performing the operation :b .- mean(:b) the way it would work in @select and then calling

out_df = @select(gd, :_x = :b .- mean(:b))
parent(gd)[sortperm(out_df), :]
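The plan above can be tried with plain DataFrames.jl functions standing in for the macros. This is only a sketch of the idea, not the PR's implementation: `select` with a `source => function => target` pair replaces `@select`, the column name `_x` is taken from the snippet above, and sorting on `_x` alone (rather than on the whole `out_df`) is an assumption about the intended result.

```julia
using DataFrames, Statistics

df = DataFrame(a = [1, 1, 1, 2, 2, 2], b = [1, 0, -1, 3, 0, -3])
gd = groupby(df, :a)

# Per-group transformation; select on a GroupedDataFrame returns the rows
# in the same order as the parent data frame.
out_df = select(gd, :b => (b -> b .- mean(b)) => :_x)

# Permutation that sorts the transformed values, applied to the parent rows.
perm = sortperm(out_df._x)
result = parent(gd)[perm, :]
```

The result is an ordinary `DataFrame` whose rows are ordered by each value's deviation from its own group mean.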

bkamins · 2020-10-11T05:55:14Z

Yes - this would be super cool. And a similar thing will be supported in where in DataFrames.jl once I get to implementing it.

pdeffebach · 2020-10-13T01:02:34Z

This is ready for a review.

For posterity, I want to emphasize that this change reduces performance. @bkamins I was very surprised that gd[[2, 1]] doesn't make any copies!

But I think this is the better and more intuitive interface. I don't really understand the original authors' preoccupation with reordering and selecting different groups in a DataFrame.

pdeffebach · 2020-10-13T02:00:52Z

Tests are broken, will fix soon.

bkamins · 2020-10-13T07:07:30Z

> @bkamins I was very surprised that gd[[2, 1]] doesn't make any copies!

I have not designed it (except for fixing bugs), but I guess this was intentional to ensure we have good performance.

> I don't really understand the original authors' preoccupation with reordering and selecting different groups in a DataFrame.

Again - I am not the author, but I guess after you order groups (which is fast), if you want to materialize a DataFrame you can always call DataFrame on the result (and if you do not need a materialized DataFrame you can leave things as they are - e.g. when you later want to run some aggregation function).
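A small sketch of the point about cheap group reordering, using the example data from the top of this thread. The variable names are illustrative.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2], b = [6, 6, 7, 7])
gd = groupby(df, :a)

# Indexing with a vector of group indices gives another GroupedDataFrame
# that points at the same parent data frame.
reordered = gd[[2, 1]]
same_parent = parent(reordered) === df

# Rows are only copied into the new group order when materializing.
materialized = DataFrame(reordered)
```

If only an aggregation follows, `reordered` can be passed on as it is and the `DataFrame` call can be skipped.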

pdeffebach · 2020-10-13T20:01:21Z

Apologies for the delay on this. It is now ready for a review.

bkamins · 2020-10-14T08:05:03Z

Just to be clear - why do you prefer @orderby to return a DataFrame not GroupedDataFrame?

pdeffebach · 2020-10-14T12:37:21Z

- Returning a grouped data frame is inconsistent with @select and @transform.
- With the addition of @select and @transform, the order of groups doesn't matter as much, so I'm not sure the original use-case applies.
- Something like @orderby(groupby(df, :g), :x) is very confusing. Current behavior compares the arrays gd[1].x, gd[2].x etc. This is lexicographic ordering because that's how arrays are compared in Julia.
- Implementation seems hard when using the new backend.

bkamins · 2020-10-14T14:00:37Z

OK

bkamins · 2020-10-14T19:08:06Z

src/DataFramesMeta.jl

@@ -410,6 +410,10 @@ function orderby(x::AbstractDataFrame, @nospecialize(args...))
 end

 function orderby(x::GroupedDataFrame, @nospecialize(args...))
+
+    @warn "orderby behavior now returns a `DataFrame` instead of a `GroupedDataFrame`. " *
+          "Group the returned data frame to restore old behavior" maxlog = 5


probably maxlog = 1 or 2 is enough?

pdeffebach · 2020-10-14T19:11:27Z

@nalimilan Are you okay with me breaking this?

bkamins · 2020-10-14T21:04:45Z

I would recommend that you just push forward with what you think is a good design (of course @nalimilan and I will gladly consult you). You have this freedom as DataFramesMeta.jl is not that big/mature - so I would recommend to shape it aggressively (the reason is that once it matures/becomes widely used - you will hit what we have in DataFrames.jl - that every decision, even one that seems simple, takes a lot of time).

pdeffebach · 2020-10-14T21:25:19Z

That sounds good. I will break @where in a similar way, then.

I will merge tomorrow if there are no objections.

nalimilan · 2020-10-15T09:46:22Z

I'm fine with you breaking things, and I agree it makes sense to make @orderby closer to @select. Though maybe we should think carefully about the consistency of the API in general? In particular, I wonder what's the relation between @orderby and sort. For @select, transform and combine there's a one-to-one mapping with DataFrames functions. Should it be the case for other macros too?

For DataFrame, orderby does the same thing as sort AFAICT, it just allows computing a new column and sorting on it in a single operation. For GroupedDataFrame there's the issue that sort should probably sort groups rather than rows, similar to what we discussed for filter. So maybe actually the current behavior of @orderby is what a hypothetical @sort would do?

Also, is the goal that all macros return a DataFrame by default, even when passed a GroupedDataFrame?

FWIW, in dplyr arrange doesn't take into account grouping at all when sorting, but preserves grouping, and this is explicitly mentioned as an exception in the dplyr API. Not sure whether they are happy about that, but in issues mentioning that on GitHub they didn't mention any regret.

bkamins · 2020-10-15T10:21:50Z

> For GroupedDataFrame there's the issue that sort should probably sort groups rather than rows

I understand that this is what @orderby does now in this proposal. That is why I was OK with it. The only thing was that a DataFrame is returned, but I understand @pdeffebach wants GroupedDataFrame not to be sticky in DataFramesMeta.jl, but always dropped. Actually this is the design point whose rationale would be good to understand better. @pdeffebach - can you comment on this more?

nalimilan · 2020-10-16T08:17:55Z

Now let me play the devil's advocate. :-p

What are the use cases for having @orderby return a DataFrame rather than a GroupedDataFrame? That sounds useful only if sorting within groups was the goal of the operation and the last part of the piping sequence, right? Is the idea that people should either sort before grouping, or after ungrouping (using e.g. select)?

Also, functions like @where will keep returning a GroupedDataFrame, right?

nalimilan · 2020-10-16T08:01:04Z

src/DataFramesMeta.jl

+The second example below shows the logic of `@orderby` with a
+`GroupedDataFrame`. Note that the column `:t` is arranged from
+lowest to highest after the `@orderby` command. This shows that
+`@orderby` is equivelent to a transformation by group followed
+by ordering on the subsequent transformation.


Better put this with the corresponding example.

nalimilan · 2020-10-16T08:05:54Z

test/grouping.jl

+    @test @orderby(gd, mean(:i)).i == [1, 2, 3, 4, 5]
+    @test @orderby(df, std(:i) .- :i).i == [5, 4, 3, 2, 1]
+    @test @orderby(gd, :g, -1 .* (:i .- mean(:i))).i == [3, 2, 1, 5, 4]


Maybe check against @orderby(@select(gd, ...), ...)? That will also allow checking other columns.

Also df should be gd on the second line.

nalimilan · 2020-10-16T08:09:54Z

test/linqmacro.jl

        DataFrames.groupby(b_str) |>
        orderby(-mean(cols(x_sym)))  |>
+        groupby(:b) |>
        based_on(cols("meanX") = mean(:x), meanY = mean(:y))


This example is really weird. Why isn't orderby called as the last step?

Thanks, I have changed this pipe to match the other ones (it should be the same, just adding cols everywhere).

bkamins · 2020-10-16T10:19:47Z

@pdeffebach - given the discussion we have in this PR a more general question occurred to me. Here we discuss the @orderby macro, but it would be natural to define a @sort macro (as in other cases - macro name is the same as function name). So the questions are:

- why do we want an @orderby macro at all (maybe we should remove it and just define a @sort macro)
- does this macro have to work on GroupedDataFrame at all (maybe it is enough to allow sorting data frames) and then compose this sorting with select and transform that work on GroupedDataFrame

What do you think?

pdeffebach · 2020-10-16T13:07:56Z

> What are the use cases for having @orderby return a DataFrame rather than a GroupedDataFrame? That sounds useful only if sorting within groups was the goal of the operation and the last part of the piping sequence, right? Is the idea that people should either sort before grouping, or after ungrouping (using e.g. select)?

I don't have a strong preference about returning a grouped data frame or a data frame. What I do have a preference for is not re-ordering groups, but rather performing an operation by group and re-ordering the result by rows. This seems like a more common need than sorting groups. Returning a DataFrame after this seems more about consistency. It's not good to have some operations on GroupedDataFrames be sticky and others not.

> why at all we want @orderby macro (maybe we should remove it and just define @sort macro)
>
> does this macro have to work on GroupedDataFrame at all (maybe it is enough to allow sorting data frames) and then compose this sorting with select and transform that work on GroupedDataFrame

I'm not super opposed to requiring a @transform and then a @sort, but no one likes having to create temporary columns, especially when dealing with data frames with many columns. In general we should avoid the kind of "calculations as columns" that Stata users have to do.

In the interest of clarity, if I could re-name everything from scratch, here is what I would do

@gencols: @transform (both DataFrame and GroupedDataFrame)
@keepcols: @select (both DataFrame and GroupedDataFrame)
@keeprows: My proposed implementation of @where, which always returns a DataFrame and performs transformations by group
@keepgroups: Current implementation of @where with a GroupedDataFrame
@sortrows: My proposed implementation of @orderby in this PR, which always returns a DataFrame and performs transformations by group
@sortgroups: Current implementation of @orderby on a GroupedDataFrame
@with: Works only with a DataFrame (added to this list for completeness).

nalimilan · 2020-10-16T14:07:35Z

> I'm not super opposed to requiring a @transform and then a @sort, but no one likes having to create temporary columns, especially when dealing with data frames with many columns. In general we should avoid the kind of "calculations as columns" that Stata users have to do.

I think the idea was that @sort would also allow sorting on transformations, without creating columns manually before sorting. Would that address your concern?

Following your terminology, I wonder whether the @keepgroups and @sortgroups are really needed. AFAICT dplyr doesn't support that, right?

pdeffebach · 2020-10-16T14:17:58Z

Yes, I don't see a need for @keepgroups and @sortgroups. This is a main motivation for changing @orderby and @where, as this is their current functionality.

Given that GroupedDataFrames iterate through groups, I think there is value in the explicitness of rows vs groups.

nalimilan · 2020-10-16T14:53:39Z

How about deprecating passing GroupedDataFrame to @orderby then and seeing how it goes?

pdeffebach · 2020-10-16T14:57:38Z

What if I want to

- Sort by groups
- Within a group, sort by a value's deviation from the group mean

This is the kind of functionality this PR gives:

@orderby(groupby(df, :g), :g, :v .- mean(:v))

bkamins · 2020-10-16T16:00:29Z

> sort by a value's deviation from the group mean
>
> @orderby(groupby(df, :g), :g, :v .- mean(:v))

is the same as:
sort(df, [:g, :v])

but I see the point. Still maybe (I will keep using DataFrames.jl syntax, not to fix ourselves on the solution for DataFramesMeta.jl):

combine(groupby(df, :g, sort=true), sdf -> sort(sdf, :v => (v -> v .- mean(v))))

or

transform(groupby(sort(df, :g), :g), sdf -> sort(sdf, :v => (v -> v .- mean(v))))

seems to be legible enough (and I have opened JuliaData/DataFrames.jl#2489 to allow transformations in DataFrames.jl in the future; of course in DataFramesMeta.jl we can allow transformations already now)
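The remark that this particular @orderby call gives the same result as sort(df, [:g, :v]) can be checked with functions that exist in DataFrames.jl today. This is a hedged sketch: the data and the temporary column name `dev` are made up for illustration, and it relies on a temporary column precisely because transformations inside `sort` are still only a proposal.

```julia
using DataFrames, Statistics

df = DataFrame(g = [2, 1, 2, 1, 1], v = [5, 3, 1, 9, 6])

# Deviation from the group mean, stored in a temporary column.
tmp = transform(groupby(df, :g), :v => (v -> v .- mean(v)) => :dev)

# Sort by group, then by deviation, and drop the temporary column again.
by_dev = select(sort(tmp, [:g, :dev]), :g, :v)

# Sorting by the raw values within each group.
by_v = sort(df, [:g, :v])

same_order = isequal(by_dev, by_v)
```

Subtracting a group's mean shifts every value in that group by the same amount, so the within-group order cannot change. A transformation that is not monotone within the group, such as an absolute deviation, would not have this property.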

pdeffebach · 2020-10-16T16:25:58Z

After this discussion, in general, I am becoming more confident that this PR is the right move:

- As Milan mentioned, there isn't a big need to order groups (the @sortgroups macro above).
- The combine solution posted by Bogumil is cumbersome and requires much more typing.
- As a matter of consistency, it's important that all macros return DataFrames.

Though I do wish I could go ahead and implement all the re-naming macros given above...

pdeffebach · 2020-10-16T19:58:56Z

Thanks for the feedback.

I will think on this more for a few days, but I am still leaning towards keeping this. I like the consistency that every macro which either implicitly or explicitly performs a transformation allows a grouped data frame and does the transformation by group.

For instance, I think we can all agree that @where should perform operations by group. For instance, taking the row with the highest value for each group is something people generally want to do.

This would leave @orderby as lonely, the only macro which does not allow for a grouped operation returning a DataFrame.
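The per-group filter mentioned above (keeping the row with the highest value in each group) can be sketched with plain DataFrames.jl functions. This is an illustration of the intended semantics only, not the @where implementation; the data and the temporary column name `keep` are hypothetical.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2, 2], v = [4, 7, 9, 2])
gd = groupby(df, :g)

# Evaluate the condition within each group; the flags come back in the
# row order of the parent data frame.
keep = transform(gd, :v => (v -> v .== maximum(v)) => :keep).keep

# Filter the parent rows, returning an ordinary DataFrame.
top_rows = df[keep, :]
```

The condition is computed by group, while the result is a row-wise subset of the original data frame.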

bkamins · 2020-10-16T20:24:07Z

As I have commented in the other PR - if @orderby for GroupedDataFrame is a functionality that is needed very rarely we can leave it out for now and decide how to handle it later when we have very clear use cases. Thank you for working on this.

nalimilan · 2020-10-17T12:49:13Z

To me the main advantage of not supporting GroupedDataFrame inputs would be that we wouldn't need @orderby but we could call it @sort instead, which is a natural name and consistent with Base. That's a similar decision to adding where in DataFrames corresponding to @where in DataFramesMeta, while we already have filter but with semantics that don't match our needs for GroupedDataFrame.

pdeffebach · 2020-10-17T14:19:43Z

Okay, I will "reserve" @orderby with a grouped data frame.

However I still have a concern about @sort. I do think that the feature I propose for @orderby is useful overall. It's not a super high priority so I am fine not implementing it now. But once we use @sort, the feature I propose is off the table since it would break a contract with Base.

So I'm not sure what the solution is in the long run. It would be odd for DataFrames to define both sort and orderby. But then again we will have both filter and where.

Maybe sort's transformations should operate row-wise! That would be most similar to the contract with Base and open up a path for an alternative orderby. Similar to filter and where.

pdeffebach · 2020-10-17T23:17:59Z

Okay, ready for a review after disallowing a GroupedDataFrame.

bkamins · 2020-10-18T06:34:50Z

OK - let's get rolling with it. Nightly fails due to unrelated reasons - right?

pdeffebach · 2020-10-18T13:58:35Z

Yes, the errors are due to MacroTools and Lazy which are only imported during the chaining test set. This is ready to merge.

bkamins · 2020-10-18T15:58:48Z

OK - go ahead and merge (if you do not have rights please let me know).

initial commit

e8ad1fa

hacky ordering for grouped data frame

766abb2

pdeffebach added 4 commits October 12, 2020 20:40

make breaking change

159a52f

fix readme

220ed1c

update news

dac863b

fix linq tests

a6a5a97

debugging

7a13558

pdeffebach and others added 4 commits October 13, 2020 14:37

make tests pass

ccbbec1

Merge branch 'master' into orderby_backend

716b0de

make tests pass again after merge

992ee01

fix diff

3fddded

add depwarn

156bd7e

bkamins reviewed Oct 14, 2020

change maxlog to 2

fc15fdc

better explanation in docstring

ef8687f

pdeffebach mentioned this pull request Oct 16, 2020

Re-do the backend of @where #192

Merged

nalimilan reviewed Oct 16, 2020

pdeffebach and others added 3 commits October 16, 2020 15:40

Apply suggestions from code review

3e88085

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

respond to milan

4253ba6

fix merge conflict

6e2b49f

pdeffebach mentioned this pull request Oct 17, 2020

Consider allowing transformations in sort JuliaData/DataFrames.jl#2489

Open

pdeffebach added 2 commits October 17, 2020 18:29

remove GroupedDataFrame cases

3cdc681

small changes

d383429

bkamins approved these changes Oct 18, 2020

This was referenced Oct 18, 2020

Lazy seems to fail on nightly MikeInnes/Lazy.jl#127

Open

Lazy fails on nightly MikeInnes/Lazy.jl#128

Closed

pdeffebach merged commit 1e1577b into JuliaData:master Oct 18, 2020
