filter and broadcasting for GroupedDataFrame #2194

Open
bkamins opened this issue Apr 16, 2020 · 4 comments
Labels
decision grouping non-breaking The proposed change is not breaking

@bkamins
Member

bkamins commented Apr 16, 2020

This can be done post 1.0, but I think we are ready to decide that GroupedDataFrame is a collection of SubDataFrames, so:

  1. we could add filter to it (a question about this was just asked on Slack)
  2. we could consider enabling broadcasting over GroupedDataFrame. The only decision here is whether the result of such a broadcasting operation should be a Vector or a GroupedDataFrame (both make sense in different contexts; I tend to prefer a Vector, as map already provides the GroupedDataFrame option).
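To make the Vector option concrete, here is a minimal sketch of the result a Vector-returning broadcast would be equivalent to, written as a comprehension over the groups. The broadcast form `f.(gdf)` is only the proposal under discussion here, not something assumed to work; the toy data and the `means` name are made up for illustration.

```julia
using DataFrames, Statistics

# Toy data (illustrative only): two groups keyed by :g
df = DataFrame(g = [1, 1, 2], x = [1.0, 3.0, 5.0])
gdf = groupby(df, :g)

# Iterating a GroupedDataFrame yields one SubDataFrame per group,
# so a comprehension collects one value per group into a plain Vector.
means = [mean(sdf.x) for sdf in gdf]
```

Here `means` holds one entry per group, in group order, which is what the "Vector" choice would return.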
@bkamins bkamins added decision grouping non-breaking The proposed change is not breaking labels Apr 16, 2020
@bkamins bkamins added this to the 1.x milestone Apr 16, 2020
@akdor1154

just ran into the need to filter groups on my first foray into Julia. It actually looks impossible to hack together something myself (based on my three days' experience with the language, so I'm probably wrong here) so it would be nice to see this included as a function.

@bkamins
Member Author

bkamins commented Apr 18, 2020

It is not that hard - just use Bool indexing into GroupedDataFrame:

julia> using DataFrames, Statistics

julia> df = DataFrame(g = 1:10, x = rand(10))
10×2 DataFrame
│ Row │ g     │ x         │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ 0.109814  │
│ 2   │ 2     │ 0.966174  │
│ 3   │ 3     │ 0.0975521 │
│ 4   │ 4     │ 0.393241  │
│ 5   │ 5     │ 0.0790575 │
│ 6   │ 6     │ 0.809334  │
│ 7   │ 7     │ 0.756541  │
│ 8   │ 8     │ 0.735663  │
│ 9   │ 9     │ 0.0790629 │
│ 10  │ 10    │ 0.244876  │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 10 groups based on key: g
First Group (1 row): g = 1
│ Row │ g     │ x        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.109814 │
⋮
Last Group (1 row): g = 10
│ Row │ g     │ x        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 10    │ 0.244876 │

julia> gdf[[mean(sdf.x) > 0.5 for sdf in gdf]]
GroupedDataFrame with 4 groups based on key: g
First Group (1 row): g = 2
│ Row │ g     │ x        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 2     │ 0.966174 │
⋮
Last Group (1 row): g = 8
│ Row │ g     │ x        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 8     │ 0.735663 │

The result is a GroupedDataFrame containing only the groups for which the mean of the :x column is greater than 0.5. Still, something like this:

filter(sdf -> mean(sdf.x) > 0.5, gdf)

or (this syntax will be available soon and will be faster):

filter(:x => x -> mean(x) > 0.5, gdf)

is more convenient.
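Until such a method is available, the Bool-indexing workaround can be wrapped in a small helper. This is only a sketch: `filter_groups` is a hypothetical name, not part of DataFrames.jl, and the data is made up for illustration.

```julia
using DataFrames, Statistics

# Hypothetical helper (not a DataFrames.jl function): keep only the groups
# for which pred(sdf) returns true, using Bool indexing into the
# GroupedDataFrame as in the workaround above.
filter_groups(pred, gdf::GroupedDataFrame) = gdf[Bool[pred(sdf) for sdf in gdf]]

df = DataFrame(g = [1, 1, 2, 2, 3], x = [0.1, 0.2, 0.8, 0.9, 0.6])
gdf = groupby(df, :g)

# Keep the groups whose mean of :x exceeds 0.5
kept = filter_groups(sdf -> mean(sdf.x) > 0.5, gdf)

# Keys of the surviving groups, one per group
kept_keys = [first(sdf.g) for sdf in kept]
```

The typed comprehension `Bool[...]` guarantees a `Vector{Bool}` index even when there are no groups, and makes a non-Bool predicate result fail early.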

@akdor1154

ah, thanks for the workaround! no offence taken if you want to clean up my comments to this issue to keep it clean :)

@bkamins
Member Author

bkamins commented Apr 18, 2020

Actually thank you for commenting - we need end user feedback.

2 participants