
DataFrame for GroupedDataFrame #1689

Merged: 24 commits into JuliaData:master on Jan 24, 2019
Conversation

@bkamins (Member) commented Jan 20, 2019

This follows the discussion in JuliaData/DataFramesMeta.jl#122.

@nalimilan In my opinion the current behavior of combine(::GroupedDataFrame) (duplicating grouping columns) is not very intuitive. Is it intentional, or do you plan to change it in the future?

@nalimilan (Member):

I agree that behavior doesn't make a lot of sense. It would probably be better to change combine at the same time, or to deprecate it in favor of DataFrame(gd), which would allow reintroducing it later with a new behavior.
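For concreteness, a small sketch of the semantics under discussion (hypothetical data; whether grouping columns get duplicated is exactly what this thread is deciding):

```julia
using DataFrames

df = DataFrame(A = [1, 2, 1, 2], B = 1:4)
gd = groupby(df, :A)

# The proposed DataFrame(gd) simply reassembles the parent's rows in
# group order, with each grouping column appearing once; the pre-#1689
# combine(gd) instead duplicated the grouping columns in its output.
DataFrame(gd)
```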

Review thread on src/groupeddataframe/grouping.jl (resolved).

Review thread on test/grouping.jl:
@test sort(DataFrame(gd), :B) ≅ sort(df, :B)
@test eltypes(DataFrame(gd)) == eltypes(df)
gd = groupby_checked(df, :A, skipmissing=true)
@test sort(DataFrame(gd), :B) == sort(dropmissing(df, disallowmissing=false), :B)
@nalimilan (Member):

disallowmissing shouldn't make any difference here, right? The same goes for the two lines below. And just below that, Missings.T.(eltypes(df)) would be slightly clearer IMHO.

@bkamins (Member, Author):

I want to test explicitly here whether we retain exactly the same types. Missings.T.(eltypes(df)) does not allow me to check that the type is exactly the same. But I can rewrite these tests if you prefer.

@nalimilan (Member):

Ah right. Then why not just hardcode the expected values?

@bkamins (Member, Author):

OK.

@nalimilan (Member):

That's actually related to #1460 and #1555. I'd be inclined to overwrite the grouping keys with columns of the same name if the returned data frame contains such columns.

@bkamins (Member, Author) commented Jan 21, 2019

I moved the discussion on the combine behavior to #1460. Here I think it is better to implement a more efficient constructor anyway, which I will propose in the revision.

nalimilan and others added 2 commits January 21, 2019 11:06
Co-Authored-By: bkamins <bkamins@sgh.waw.pl>
@bkamins (Member, Author) commented Jan 21, 2019

Apart from the better constructor, I have patched some holes in the internal API.

length(gd) == 0 && return similar(parent(gd), 0)
# below we assume that gd.ends[end] == length(gd.idx)
# and that gd.starts and gd.ends are increasing and cover a continuous range
gd.starts[1] == 1 && return parent(gd)[gd.idx, :]
@nalimilan (Member):

Isn't it always the case that gd.starts[1] == 1? AFAICT that's what we explicitly set in group_rows.

OTOH, if I'm wrong, is it really useful to distinguish this case? It should be quite cheap to create a view, right?

@bkamins (Member, Author):

OK, I will make a view; I wanted to avoid it in the most common case.

gd.starts[1] is greater than 1 if we have missing values in grouping and we set skipmissing to true.
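A toy sketch in plain vectors (hypothetical values) of the layout being described: with skipmissing=true, the positions of rows whose grouping key is missing still sit at the front of gd.idx, so the first retained group need not start at 1:

```julia
# Hypothetical internals of a GroupedDataFrame built with skipmissing=true:
idx    = [3, 1, 4, 2, 5]  # permutation of parent rows; rows 3 and 1 have missing keys
starts = [3, 5]           # retained groups begin at positions 3 and 5 of idx
ends   = [4, 5]

# Parent-row numbers for each retained group:
group_rows = [idx[starts[i]:ends[i]] for i in eachindex(starts)]
@assert group_rows == [[4, 2], [5]]
@assert starts[1] > 1  # this is why the gd.starts[1] == 1 fast path can fail
```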

@nalimilan (Member):

> gd.starts[1] is greater than 1 if we have missing values in grouping and we set skipmissing to true.

Indeed. We should probably just do starts .-= starts[2] .+ 1 and ends .-= ends[2] .+ 1. We already do that for groups.

@bkamins (Member, Author):

But I understand you mean to do it in the groupby-related mechanics, not here, right? I think it would be OK, but then gd.idx should not include the leading indices that point at missing values, so length(gd.idx) may be less than nrow(parent(gd)), and the same for gd.ends[end] (though the relationship gd.ends[end] == length(gd.idx) would still hold). I assume that we are not relying on the fact that gd.idx has the same number of elements as there are rows in the parent, but you know this part of the code better 😄.

If we took this approach then we could avoid making the view here and just index by gd.idx.

@nalimilan (Member):

Right. I don't think we assume length(gd.idx) == nrow(parent(gd)), but only tests will tell for sure.

@bkamins (Member, Author):

OK, but I guess it should be a separate PR. Are you willing to work on it?

@nalimilan (Member):

See #1692. But actually when sort=true there's no guarantee that gd.starts[1] == 1, only that minimum(gd.starts) == 1. And using gd.idx directly would return groups in an unsorted order. AFAIK that's unavoidable, since we only sort the group indices (gd.starts and gd.ends), which should be faster than sorting the row indices (gd.idx) when there are few groups. Not sure whether sorting the row indices would be significantly slower.
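A plain-vector sketch (hypothetical values) of the sort=true point: only the group blocks are permuted into key order, not the row permutation itself, so materializing all rows in group order means concatenating slices of gd.idx rather than reusing it directly:

```julia
# Hypothetical internals after grouping with sort=true:
idx    = [2, 5, 1, 3, 6, 4]  # row permutation; NOT in sorted-group order
starts = [4, 1, 6]           # group blocks, permuted into sorted key order
ends   = [5, 3, 6]

# Row order for DataFrame(gd): concatenate each group's slice of idx (a copy).
rowperm = reduce(vcat, (idx[starts[i]:ends[i]] for i in eachindex(starts)))
@assert rowperm == [3, 6, 2, 5, 1, 4]
@assert minimum(starts) == 1  # guaranteed; starts[1] == 1 is not
```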

@bkamins (Member, Author):

OK, I have fixed it (that is why I said that split-apply-combine is tricky 😄). Unfortunately we cannot avoid making a copy of gd.idx in this case, but I have proposed the fastest method I could think of to get what we want.

The fix depends on #1692, so I will rebase this PR when we merge #1692 (I could revise it to make it independent, but I think it is better to follow the idea of code simplification that can be achieved with #1692).

@nalimilan (Member):

Actually, I wonder whether #1692 is a good idea. It adds 1 ms (to a total of about 5 ms) in the following test case:

df = DataFrame(a = categorical(repeat([missing; 1:39], outer=[20000])),
               c = randn(800000))

using BenchmarkTools
@btime gd = groupby(df, :a, skipmissing=true);

Given that it only simplifies code a little bit, I'm not sure it's really worth it.

@bkamins (Member, Author):

In parallel I was thinking about the same thing, and considered using a view instead of copyto!, or creating a correct rperm from the start; that would probably help, but would complicate the design a bit. In general, though, I am OK with dropping #1692.

If you decide to drop #1692, I can fix this PR not to rely on #1692, to be safe.

@bkamins (Member, Author) commented Jan 21, 2019

I have added the deprecation for combine(::GroupedDataFrame). I leave it in the grouping.jl file, as later we should change it to throw an error.

@bkamins (Member, Author) commented Jan 21, 2019

OK, I have pushed a version that does not depend on #1692, so it should be ready for review when you have time.

min_start = min(min_start, s)
end
resize!(idx, doff - 1)
@assert doff == length(gd.idx) + 2 - min_start
@nalimilan (Member):

It's kind of weird to have this assertion here. Why not put something equivalent in groupby_checked instead? That way we will check it much more systematically, and the code here will be simpler.

@bkamins (Member, Author):

An excellent point. I have added a set of tests to groupby_checked that I think will also help future readers understand what we expect a GroupedDataFrame object to contain.
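The kind of invariants such tests can assert follows from this thread (a sketch with a hypothetical helper; field names per the discussion, not the actual test code):

```julia
# Sketch: invariants a groupby_checked-style helper could verify on the
# idx/starts/ends triple (hypothetical signature and checks).
function check_group_invariants(idx, starts, ends, nparentrows)
    @assert length(starts) == length(ends)
    @assert allunique(idx) && all(r -> 1 <= r <= nparentrows, idx)
    @assert all(starts .<= ends .+ 1)   # allow empty groups
    isempty(starts) && return true
    @assert minimum(starts) >= 1        # > 1 when skipmissing drops leading rows
    @assert maximum(ends) == length(idx)
    # group blocks must tile a contiguous trailing range of idx
    @assert sum(ends .- starts .+ 1) == length(idx) - minimum(starts) + 1
    return true
end

check_group_invariants([3, 1, 4, 2, 5], [3, 5], [4, 5], 5)
```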

@@ -177,9 +202,7 @@ end
# groupby() without groups sorting
gd = groupby_checked(df, cols)
@test names(parent(gd))[gd.cols] == colssym
@test sort(combine(identity, gd), colssym) ==
sort(combine(gd), colssym) ==
@nalimilan (Member):

Why not test DataFrame(gd) here and below? Can't hurt.

@bkamins (Member, Author):

I have added a test here (although it is not very clean, unfortunately, because of the combine mechanics), but I agree it is good to have it here.

Two further review threads on test/grouping.jl (resolved).
nalimilan and others added 3 commits January 23, 2019 13:57
@bkamins bkamins merged commit 0b108a1 into JuliaData:master Jan 24, 2019
@bkamins bkamins deleted the dataframegrouped branch January 24, 2019 12:53