filter and broadcasting for GroupedDataFrame #2194

bkamins · 2020-04-16T07:06:09Z

This can be done post 1.0, but I think we are ready to decide that GroupedDataFrame is a collection of SubDataFrames, so:

- we could add filter to it (a question about this was just asked on Slack);
- we could consider enabling broadcasting over GroupedDataFrame. The only decision here is whether the result of such a broadcasting operation should be a Vector or a GroupedDataFrame. Both make sense in different contexts; I tend to prefer a Vector, as map already gives an option that returns a GroupedDataFrame.
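
To make the Vector option concrete, here is a minimal sketch of what a per-group result collected into a Vector looks like. It does not assume that broadcasting over a GroupedDataFrame works; a comprehension stands in for the hypothetical broadcast call, and the data is made up for illustration.

```julia
using DataFrames

# Illustrative data (not from this issue): groups of size 2, 1 and 3.
df = DataFrame(g = [1, 1, 2, 3, 3, 3], x = 1:6)
gdf = groupby(df, :g)

# Stand-in for a hypothetical nrow.(gdf): iterate over the SubDataFrames
# and collect one value per group into a plain Vector.
sizes = [nrow(sdf) for sdf in gdf]
```

With this design the caller gets one element per group, in group order, and loses the grouping structure; returning a GroupedDataFrame would keep it.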


akdor1154 · 2020-04-18T16:01:12Z

I just ran into the need to filter groups on my first foray into Julia. It actually looks impossible to hack something together myself (based on my three days' experience with the language, so I'm probably wrong here), so it would be nice to see this included as a function.

bkamins · 2020-04-18T16:12:27Z

It is not that hard - just use Bool indexing into GroupedDataFrame:

julia> using DataFrames, Statistics

julia> df = DataFrame(g = 1:10, x = rand(10))
10×2 DataFrame
│ Row │ g     │ x         │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ 0.109814  │
│ 2   │ 2     │ 0.966174  │
│ 3   │ 3     │ 0.0975521 │
│ 4   │ 4     │ 0.393241  │
│ 5   │ 5     │ 0.0790575 │
│ 6   │ 6     │ 0.809334  │
│ 7   │ 7     │ 0.756541  │
│ 8   │ 8     │ 0.735663  │
│ 9   │ 9     │ 0.0790629 │
│ 10  │ 10    │ 0.244876  │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 10 groups based on key: g
First Group (1 row): g = 1
│ Row │ g     │ x        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.109814 │
⋮
Last Group (1 row): g = 10
│ Row │ g     │ x        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 10    │ 0.244876 │

julia> gdf[[mean(sdf.x) > 0.5 for sdf in gdf]]
GroupedDataFrame with 4 groups based on key: g
First Group (1 row): g = 2
│ Row │ g     │ x        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 2     │ 0.966174 │
⋮
Last Group (1 row): g = 8
│ Row │ g     │ x        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 8     │ 0.735663 │

The result is a GroupedDataFrame containing only the groups for which the mean of the :x column is greater than 0.5. Still, probably this:

filter(sdf -> mean(sdf.x) > 0.5, gdf)

or (this syntax will be available soon and will be faster):

filter(:x => x -> mean(x) > 0.5, gdf)

is more convenient.
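
For comparison, a minimal self-contained sketch of the three forms on fixed data, so the outcome is reproducible. It assumes a DataFrames.jl release in which filter accepts a GroupedDataFrame (the feature proposed in this issue); the data and the helper name kept are made up for illustration.

```julia
using DataFrames, Statistics

# Illustrative data (not from the thread): group means are 0.15, 0.85, 0.55.
df = DataFrame(g = [1, 1, 2, 2, 3, 3], x = [0.1, 0.2, 0.8, 0.9, 0.4, 0.7])
gdf = groupby(df, :g)

# Workaround: build a Bool mask with one entry per group and index with it.
mask = [mean(sdf.x) > 0.5 for sdf in gdf]
by_index = gdf[mask]

# Predicate form: the function receives each group as a SubDataFrame.
by_filter = filter(sdf -> mean(sdf.x) > 0.5, gdf)

# Column-selector form: the function receives only the :x column of each group.
by_column = filter(:x => x -> mean(x) > 0.5, gdf)

# Hypothetical helper: the key value of each group that was kept.
kept(groups) = [first(sdf.g) for sdf in groups]
```

All three should keep the same groups; calling kept on each result is a quick way to check that they agree.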

akdor1154 · 2020-04-18T16:17:43Z

ah, thanks for the workaround! no offence taken if you want to clean up my comments to this issue to keep it clean :)

bkamins · 2020-04-18T16:19:40Z

Actually thank you for commenting - we need end user feedback.

bkamins added the decision, grouping, and non-breaking (the proposed change is not breaking) labels Apr 16, 2020

bkamins added this to the 1.x milestone Apr 16, 2020

bkamins mentioned this issue Apr 25, 2020

Cleaner syntax #2206

Closed

This was referenced May 14, 2020

Clarify position on iteration API #2254

Closed

Make combine(gdf, args...) more flexible #2260

Closed
