
DataFrame for GroupedDataFrame #1689

Merged: 24 commits into JuliaData:master on Jan 24, 2019
Conversation

@bkamins (Member) commented Jan 20, 2019

This follows the discussion in JuliaData/DataFramesMeta.jl#122.

@nalimilan In my opinion the current behavior of combine(::GroupedDataFrame) (duplicating grouping columns) is not very intuitive. Is it intentional, or do you plan to change it in the future?

@nalimilan (Member):

I agree that behavior doesn't make a lot of sense. It would probably be better to change combine at the same time, or to deprecate it in favor of DataFrame(gd), which would allow reintroducing it later with a new behavior.
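For concreteness, a small sketch of the semantics under discussion (hypothetical data; whether grouping columns get duplicated is exactly what this thread is deciding):

```julia
using DataFrames

df = DataFrame(A = [1, 2, 1, 2], B = 1:4)
gd = groupby(df, :A)

# The proposed DataFrame(gd) simply reassembles the parent's rows in
# group order, with each grouping column appearing once; the pre-#1689
# combine(gd) instead duplicated the grouping columns in its output.
DataFrame(gd)
```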

Review thread on src/groupeddataframe/grouping.jl (resolved).

Review thread on test/grouping.jl:
@test sort(DataFrame(gd), :B) ≅ sort(df, :B)
@test eltypes(DataFrame(gd)) == eltypes(df)
gd = groupby_checked(df, :A, skipmissing=true)
@test sort(DataFrame(gd), :B) == sort(dropmissing(df, disallowmissing=false), :B)
@nalimilan (Member):

disallowmissing shouldn't make any difference here, right? The same goes for the two lines below. And just below that, Missings.T.(eltypes(df)) would be slightly clearer IMHO.

@bkamins (Member, Author):

I want to test explicitly here whether we retain exactly the same types. Missings.T.(eltypes(df)) does not allow me to check that the type is exactly the same. But I can rewrite these tests if you prefer.

@nalimilan (Member):

Ah right. Then why not just hardcode the expected values?

@bkamins (Member, Author):

OK.

@nalimilan (Member):

That's actually related to #1460 and #1555. I'd be inclined to overwrite the grouping keys with columns of the same name if the returned data frame contains such columns.

@bkamins (Member, Author) commented Jan 21, 2019

I moved the discussion on the combine behavior to #1460. Here I think it is better to implement a more efficient constructor anyway, which I will propose in the revision.

nalimilan and others added 2 commits January 21, 2019 11:06
Co-Authored-By: bkamins <bkamins@sgh.waw.pl>
@bkamins (Member, Author) commented Jan 21, 2019

Apart from the better constructor, I have patched some holes in the internal API.

length(gd) == 0 && return similar(parent(gd), 0)
# below we assume that gd.ends[end] == length(gd.idx)
# and that gd.starts and gd.ends are increasing and cover a continuous range
gd.starts[1] == 1 && return parent(gd)[gd.idx, :]
@nalimilan (Member):

Isn't it always the case that gd.starts[1] == 1? AFAICT that's what we explicitly set in group_rows.

OTOH, if I'm wrong, is it really useful to distinguish this case? It should be quite cheap to create a view, right?

@bkamins (Member, Author):

OK, I will make a view; I wanted to avoid it in the most common case.

gd.starts[1] is greater than 1 if we have missing values in grouping and we set skipmissing to true.
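A toy sketch in plain vectors (hypothetical values) of the layout being described: with skipmissing=true, the positions of rows whose grouping key is missing still sit at the front of gd.idx, so the first retained group need not start at 1:

```julia
# Hypothetical internals of a GroupedDataFrame built with skipmissing=true:
idx    = [3, 1, 4, 2, 5]  # permutation of parent rows; rows 3 and 1 have missing keys
starts = [3, 5]           # retained groups begin at positions 3 and 5 of idx
ends   = [4, 5]

# Parent-row numbers for each retained group:
group_rows = [idx[starts[i]:ends[i]] for i in eachindex(starts)]
@assert group_rows == [[4, 2], [5]]
@assert starts[1] > 1  # this is why the gd.starts[1] == 1 fast path can fail
```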

@nalimilan (Member):

> gd.starts[1] is greater than 1 if we have missing values in grouping and we set skipmissing to true.

Indeed. We should probably just do starts .-= starts[2] .+ 1 and ends .-= ends[2] .+ 1. We already do that for groups.

@bkamins (Member, Author):

But I understand you mean to do it in the groupby-related mechanics, not here, right? I think it would be OK, but then gd.idx should not include the leading indices that point at missing values, so length(gd.idx) may be less than nrow(parent(gd)), and the same for gd.ends[end] (though the relationship gd.ends[end] == length(gd.idx) would still hold). I assume that we are not relying on the fact that gd.idx has the same number of elements as there are rows in the parent, but you know this part of the code better 😄.

If we took this approach then we could avoid making the view here and just index by gd.idx.

@nalimilan (Member):

Right. I don't think we assume length(gd.idx) == nrow(parent(gd)), but only tests will tell for sure.

@bkamins (Member, Author):

OK, but I guess it should be a separate PR. Are you willing to work on it?

@nalimilan (Member):

See #1692. But actually when sort=true there's no guarantee that gd.starts[1] == 1, only that minimum(gd.starts) == 1. And using gd.idx directly would return groups in an unsorted order. AFAIK that's unavoidable, since we only sort the group indices (gd.starts and gd.ends), which should be faster than sorting the row indices (gd.idx) when there are few groups. Not sure whether sorting the row indices would be significantly slower.
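A plain-vector sketch (hypothetical values) of the sort=true point: only the group blocks are permuted into key order, not the row permutation itself, so materializing all rows in group order means concatenating slices of gd.idx rather than reusing it directly:

```julia
# Hypothetical internals after grouping with sort=true:
idx    = [2, 5, 1, 3, 6, 4]  # row permutation; NOT in sorted-group order
starts = [4, 1, 6]           # group blocks, permuted into sorted key order
ends   = [5, 3, 6]

# Row order for DataFrame(gd): concatenate each group's slice of idx (a copy).
rowperm = reduce(vcat, (idx[starts[i]:ends[i]] for i in eachindex(starts)))
@assert rowperm == [3, 6, 2, 5, 1, 4]
@assert minimum(starts) == 1  # guaranteed; starts[1] == 1 is not
```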

@bkamins (Member, Author):

OK, I have fixed it (that is why I said that split-apply-combine is tricky 😄). Unfortunately we cannot avoid making a copy of gd.idx in this case, but I have proposed the fastest method I could think of to get what we want.

The fix depends on #1692, so I will rebase this PR when we merge #1692 (I could revise it to make it independent, but I think it is better to follow the idea of code simplification that can be achieved with #1692).

@nalimilan (Member):

Actually, I wonder whether #1692 is a good idea. It adds 1 ms (to a total of about 5 ms) in the following test case:

df = DataFrame(a = categorical(repeat([missing; 1:39], outer=[20000])),
               c = randn(800000))

using BenchmarkTools
@btime gd = groupby(df, :a, skipmissing=true);

Given that it only simplifies code a little bit, I'm not sure it's really worth it.

@bkamins (Member, Author):

In parallel I was thinking about the same thing, and considered using a view instead of copyto!, or creating a correct rperm from the start; that would probably help, but would complicate the design a bit. In general, though, I am OK with dropping #1692.

If you decide to drop #1692, I can fix this PR not to rely on #1692, to be safe.

@bkamins (Member, Author) commented Jan 21, 2019

I have added the deprecation for combine(::GroupedDataFrame). I leave it in the grouping.jl file, as later we should change it to throw an error.

@bkamins (Member, Author) commented Jan 21, 2019

OK, I have pushed a version that does not depend on #1692, so it should be ready for review when you have time.

min_start = min(min_start, s)
end
resize!(idx, doff - 1)
@assert doff == length(gd.idx) + 2 - min_start
@nalimilan (Member):

It's kind of weird to have this assertion here. Why not put something equivalent in groupby_checked instead? That way we will check it much more systematically, and the code here will be simpler.

@bkamins (Member, Author):

An excellent point. I have added a set of tests to groupby_checked that I think will also help future readers understand what we expect a GroupedDataFrame object to contain.
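The kind of invariants such tests can assert follows from this thread (a sketch with a hypothetical helper; field names per the discussion, not the actual test code):

```julia
# Sketch: invariants a groupby_checked-style helper could verify on the
# idx/starts/ends triple (hypothetical signature and checks).
function check_group_invariants(idx, starts, ends, nparentrows)
    @assert length(starts) == length(ends)
    @assert allunique(idx) && all(r -> 1 <= r <= nparentrows, idx)
    @assert all(starts .<= ends .+ 1)   # allow empty groups
    isempty(starts) && return true
    @assert minimum(starts) >= 1        # > 1 when skipmissing drops leading rows
    @assert maximum(ends) == length(idx)
    # group blocks must tile a contiguous trailing range of idx
    @assert sum(ends .- starts .+ 1) == length(idx) - minimum(starts) + 1
    return true
end

check_group_invariants([3, 1, 4, 2, 5], [3, 5], [4, 5], 5)
```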

@@ -177,9 +202,7 @@ end
# groupby() without groups sorting
gd = groupby_checked(df, cols)
@test names(parent(gd))[gd.cols] == colssym
@test sort(combine(identity, gd), colssym) ==
sort(combine(gd), colssym) ==
@nalimilan (Member):

Why not test DataFrame(gd) here and below? Can't hurt.

@bkamins (Member, Author):

I have added a test here (although it is not very clean, unfortunately, because of the combine mechanics), but I agree it is good to have it here.

Two further review threads on test/grouping.jl (resolved).
nalimilan and others added 3 commits January 23, 2019 13:57
@bkamins bkamins merged commit 0b108a1 into JuliaData:master Jan 24, 2019
@bkamins bkamins deleted the dataframegrouped branch January 24, 2019 12:53