-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DataFrame for GroupedDataFrame #1689
Conversation
I agree that behavior doesn't make a lot of sense. Probably better change |
test/grouping.jl
Outdated
@test sort(DataFrame(gd), :B) ≅ sort(df, :B) | ||
@test eltypes(DataFrame(gd)) == eltypes(df) | ||
gd = groupby_checked(df, :A, skipmissing=true) | ||
@test sort(DataFrame(gd), :B) == sort(dropmissing(df, disallowmissing=false), :B) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disallowmissing
shouldn't make any difference here, right? Same two lines below. And just below, Missings.T.(eltypes(df))
would be slightly clearer IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to test explicitly here if we retain exactly the same types. Missings.T.(eltypes(df))
does not allow me to check if the type is exactly the same. But I can rewrite these tests if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right. Then why not just hardcode the expected values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
I moved the discussion on the combine behavior to #1460. Here I think it is better to implement a more efficient constructor anyway which I will propose in the revision. |
Co-Authored-By: bkamins <bkamins@sgh.waw.pl>
Apart from the better constructor I have patched some holes in internal API. |
src/groupeddataframe/grouping.jl
Outdated
length(gd) == 0 && return similar(parent(gd), 0) | ||
# below we assume that gd.ends[end] == length(gd.idxs) | ||
# and that gd.starts and gd.ends are increasing and cover a continuous range | ||
gd.starts[1] == 1 && return parent(gd)[gd.idx, :] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it always the case that gd.starts[1]
? AFAICT that's what we explicitly set in group_rows
.
OTC, if I'm wrong, is it really useful to distinguish this case? It should be quite cheap to create a view, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I will make a view
- I wanted to avoid it in the most common case.
gd.starts[1]
is greater than 1
if we have missing values in grouping and we set skipmissing
to true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gd.starts[1]
is greater than1
if we have missing values in grouping and we setskipmissing
totrue
.
Indeed. We should probably just do starts .-= starts[2] .+ 1 and
ends .-= ends[2] .+ 1. We already do that for groups
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I understand you mean to do it in groupby
related mechanics not here. Right? I think it would be OK, but then gd.idx
should not include the leading indices that are pointing at missing
so length(gd.idx)
may be less that nrow(parent(gd))
and the same for gd.ends[end]
(but the relationsip gd.ends[end] == length(gd.idxs)
would still hold) - I assume that we are not relying on the fact that gd.idx
has the same number of elements as rows in the parent, but that you know this part of code better 😄.
If we took this approach then we could avoid making the view here and just index by gd.idx
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I don't think we assume length(gd.idx) == nrow(parent(gd))
, but only test will tell for sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but I guess it should be a separate PR. Are willing to work on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1692. But actually when sort=true
there's no guaranty that gd.starts[1] == 1
, only that minimum(gd.starts) == 1
. And using gd.idx
directly would return groups in an unsorted order. AFAIK that's unavoidable since we only sort group indices (gd.start
and gd.stops
), which should be faster than sorting row indices (gd.idx
) when there are few groups. Not sure whether sorting group indices would be significantly slower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I have fixed it (that is why I have said that split-apply-combine is tricky 😄). We cannot avoid making a copy of gd.idx
in this case unfortunately, but I have proposed the fastest method to get what we want I could think of.
The fix depends on #1692, so I will rebase this PR, when we merge #1692 (I could resize to make it independent, but I think it is better to follow the idea of code simplification that can be achieved with #1692).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I wonder whether #1692 is a good idea. It adds 1ms (for a total of about 5ms) to the following test case:
df = DataFrame(a = categorical(repeat([missing; 1:39], outer=[20000])),
c = randn(800000))
using BenchmarkTools
@btime gd = groupby(df, :a, skipmissing=true);
Given that it only simplifies code a little bit, I'm not sure it's really worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In parallel I was thinking about the same and consider using view
instead of copyto!
or creating a correct rperm
from the start - it would probably help but would complicate the design a bit. But in general I am OK to drop #1692.
If you decide to drop #1692 I can fix this PR not to rely on #1692 and be safe.
I have added the deprecation for |
OK - I have pushed the version that does not depend on #1692 so it should be good for checking when you have time. |
src/groupeddataframe/grouping.jl
Outdated
min_start = min(min_start, s) | ||
end | ||
resize!(idx, doff - 1) | ||
@assert doff == length(gd.idx) + 2 - min_start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kind of weird to have this assertion here. Why not put something equivalent in groupby_checked
instead? That way we will check it much more systematically, and the code here will be simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An excellent point. I have included a combo of tests to groupby_checked
that I think are also useful for the future readers to understand what we expect to have in GroupedDataFrame
object.
@@ -177,9 +202,7 @@ end | |||
# groupby() without groups sorting | |||
gd = groupby_checked(df, cols) | |||
@test names(parent(gd))[gd.cols] == colssym | |||
@test sort(combine(identity, gd), colssym) == | |||
sort(combine(gd), colssym) == |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not test DataFrame(gd)
here and below? Can't hurt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a test here (although it is not very clean unfortunately, because of the combine
mechanics), but I agree it is good to have it here.
Co-Authored-By: bkamins <bkamins@sgh.waw.pl>
This follows the discussion in JuliaData/DataFramesMeta.jl#122.
@nalimilan In my opinion the current behavior of
combine(::GroupedDataFrame)
(duplicating grouping columns) is not very intuitive. Is it intentional or you plan to change it in the future?