-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-do the @orderby
backend
#191
Conversation
Additionally, it looks like |
I would prefer if you propose the solution that "makes sense" for the future and if this needs to break in the case of |
The solution I would like is
which would end up performing the operation
|
Yes - this would be super cool. And I a similar thing will be supported in |
This is ready for a review. For posterity, I want to emphasize that this change reduces performance. @bkamins I was very surprised that But I think this is the better and more intuitive interface. I don't really understand the original authors' preoccupation with reordering and selecting different groups in a DataFrame. |
Tests are broken, will fix soon. |
I have not designed it (expect for cleaning bugs), but I guess this was intentional to ensure we have a good performance.
Again - I am not the author, but I guess after you order groups (which is fast), if you want to materialize a |
Apologies for the delay on this. It is now ready for a review |
Just to be clear - why do you prefer |
|
OK |
src/DataFramesMeta.jl
Outdated
@@ -410,6 +410,10 @@ function orderby(x::AbstractDataFrame, @nospecialize(args...)) | |||
end | |||
|
|||
function orderby(x::GroupedDataFrame, @nospecialize(args...)) | |||
|
|||
@warn "orderby behavior now returns a `DataFrame` instead of a `GroupedDataFrame`. " * | |||
"Group the returned data frame to restore old behavior" maxlog = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably maxlog = 1
or 2
is enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
@nalimilan Are you okay with me breaking this? |
I would recommend that you just push forward with what you think it a good design (of course @nalimilan and I will gladly consult you). You have this freedom as DataFramesMeta.jl is not that big/mature - so I would recommend to shape it aggressively (the reason is that once it matures/becomes widely used - you will hit what we have in DataFrames.jl - that every decision, even ones that seem simple, are taking a lot of time). |
That sounds good. I will break I will merge tomorrow if there are no objections |
I'm fine with you breaking thinks, and I agree it makes sense to make For Also, is the goal that all macros return a FWIW, in dplyr |
I understand that this is what |
Now let me play the devil's advocate. :-p What are the use cases for having Also, functions like |
src/DataFramesMeta.jl
Outdated
The second example below shows the logic of `@orderby` with a | ||
`GroupedDataFrame`. Note that the column `:t` is arranged from | ||
lowest to highest after the `@orderby` command. This shows that | ||
`@orderby` is equivelent to a transformation by group followed | ||
by ordering on the subsequent transformation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better put this with the corresponding example.
test/grouping.jl
Outdated
@test @orderby(gd, mean(:i)).i == [1, 2, 3, 4, 5] | ||
@test @orderby(df, std(:i) .- :i).i == [5, 4, 3, 2, 1] | ||
@test @orderby(gd, :g, -1 .* (:i .- mean(:i))).i == [3, 2, 1, 5, 4] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe check against @orderby(@select(gd, ...), ...)
? That will also allow checking other columns.
Also df
should be gd
on the second line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
test/linqmacro.jl
Outdated
DataFrames.groupby(b_str) |> | ||
orderby(-mean(cols(x_sym))) |> | ||
groupby(:b) |> | ||
based_on(cols("meanX") = mean(:x), meanY = mean(:y)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example is really weird. Why isn't orderby
called as the last step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I have changed this pipe to match the other ones (it should be the same, just adding cols
everywhere).
@pdeffebach - given the discussion we have in this PR a more general question occurred to me. Here we discuss about
What do you think? |
I don't have a strong preference about returning a grouped data frame or a data frame. What I do have a preference for is not re-ordering groups, but rather performing an observation by group and re-ordering the result by rows. This seems like a more common need than sorting groups. Returning a
I'm not super opposed to requiring a In the interest of clarity, if I could re-name everything from scratch, here is what I would do
|
I think the idea was that Following your terminology, I wonder whether the |
Yes, I don't see a need for Given that |
How about deprecating passing |
What if I want to
This is the kind of functionality this PR gives
|
is the same as: but I see the point. Still maybe (I will keep using DataFrames.jl syntax, not to fix ourselves on the solution for DataFramesMeta.jl):
or
seems to be legible enough (and I have opened JuliaData/DataFrames.jl#2489 to allow transformations in DataFrames.jl in the future; of course in DataFramesMeta.jl we can allow transformations already now) |
After this discussion, in general, I am becoming more confident that this PR is the right move
Though I do wish I could go ahead and implement all the re-naming macros given above... |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thanks for the feedback. I will think on this more for a few days, but I am still leaning towards keeping this. I like the consistency that every macro which wither implicitely or explicitely performs a transformation allows a grouped data frame and does the transformation by group. For instance, I think we can all agree that This would leave |
As I have commented in other PR - if |
To me the main advantage of not supporting |
Okay, I will "reserve" However I still have a concern about So maybe I'm not sure what the solution is in the long run. It would be off for DataFrames to define both Maybe |
Okay ready for a review after disallowing a GroupedDataFrame |
OK - let's get rolling with it. Nightly fails due to unrelated reasons - right? |
Yes, the errors are due to |
OK - go ahead and merge (if you do not have rights please let me know). |
Before, there was some complicated
with_anonymous
machinery.Now the backends are
select
andcombine
.For an abstract data frame this is intuitive, just call
select
on all the arguments with anolhs = true
option and return thesortperm
from that.For a grouped data frame, unfortunately
@orderby
currently re-orders the groups and returns a grouped dataframe.This annoying because if you do, say,
then the result of the anonymous function in
@orderby
is a vector for each group and since Julia has lexicographic ordering of vectors, you can get some surprising results.This was never tested and I'm not sure who uses it. I am making a breaking change here by only allowing things inside@orderby
to return scalars, i.e.@orderby(df, :b)
will error, and@orderby(df, mean(:b))
will work.Update: I have found a solution that is somewhat hacky but doable. Current behavior matches 100% old behavior.