Product of multiple aggregation functions and columns #2419

Open
tk3369 opened this issue Sep 8, 2020 · 7 comments

@tk3369
Contributor

tk3369 commented Sep 8, 2020

While working on the pandas vs DataFrames.jl comparison doc (#2378), I encountered the use case of applying many aggregation functions over many columns.

Consider the following example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'grp': [1, 2, 1, 2, 1, 2],
                   'x': range(6, 0, -1),
                   'y': range(4, 10),
                   'z': [3, 4, 5, 6, 7, None]},
                   index = list('abcdef'))

>>> df[['x', 'y']].agg([max, min])
     x  y
max  6  9
min  1  4

With DataFrames.jl, we can achieve something similar, as previously suggested by @bkamins and @nalimilan.

julia> df = DataFrame(id = 'a':'f', grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9, z = [3:7; missing])
6×5 DataFrame
│ Row │ id   │ grp   │ x     │ y     │ z       │
│     │ Char │ Int64 │ Int64 │ Int64 │ Int64?  │
├─────┼──────┼───────┼───────┼───────┼─────────┤
│ 1   │ 'a'  │ 1     │ 6     │ 4     │ 3       │
│ 2   │ 'b'  │ 2     │ 5     │ 5     │ 4       │
│ 3   │ 'c'  │ 1     │ 4     │ 6     │ 5       │
│ 4   │ 'd'  │ 2     │ 3     │ 7     │ 6       │
│ 5   │ 'e'  │ 1     │ 2     │ 8     │ 7       │
│ 6   │ 'f'  │ 2     │ 1     │ 9     │ missing │

julia> combine(df, vec([:x, :y] .=> [maximum minimum]))
1×4 DataFrame
│ Row │ x_maximum │ y_maximum │ x_minimum │ y_minimum │
│     │ Int64     │ Int64     │ Int64     │ Int64     │
├─────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ 6         │ 9         │ 1         │ 4         │

As you can see, the results are stored in a single row with many columns. Essentially, if you have N functions and M columns, you end up with N × M columns. IMHO, pandas' output, with one row per function and one column per input column, is nicer. So I'm wondering if DataFrames.jl should be enhanced to allow multiple functions to be applied to multiple columns and return that shape.
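To make the two shapes concrete, here is a library-free sketch in plain Python. It assumes only the x and y values from the example data; the names `columns`, `funcs`, `wide`, and `table` are made up for illustration and are not part of pandas or DataFrames.jl.

```python
from itertools import product

# Same x and y columns as the example data above.
columns = {"x": [6, 5, 4, 3, 2, 1], "y": [4, 5, 6, 7, 8, 9]}
funcs = {"maximum": max, "minimum": min}

# Wide shape: a single row with one entry per (function, column) pair.
# Functions vary slowest, which mirrors the column order of the combine call.
wide = {
    f"{col}_{fname}": func(columns[col])
    for (fname, func), col in product(funcs.items(), columns)
}

# Table shape: one row per function, one entry per column.
table = {
    fname: {col: func(values) for col, values in columns.items()}
    for fname, func in funcs.items()
}

print(wide)
print(table)
```

The wide form grows by one column for every (function, column) pair, while the table form only grows by a row per function.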

Here's a small function that produces that shape:

julia> function agg(df, cols, funcs) 
           result = DataFrame()
           result.function = string.(funcs)
           for c in cols
               result[!, c] = [f(df[!, c]) for f in funcs]
           end
           return result
       end
agg (generic function with 1 method)

julia> agg(df, [:x, :y], [maximum, minimum])
2×3 DataFrame
│ Row │ function │ x     │ y     │
│     │ String   │ Int64 │ Int64 │
├─────┼──────────┼───────┼───────┤
│ 1   │ maximum  │ 6     │ 9     │
│ 2   │ minimum  │ 1     │ 4     │

Maybe this little agg function can be rolled into combine with a signature like this?

combine(df, ::Vector{Function}, ::Vector{StringOrSymbol})

Thoughts?

@bkamins
Member

bkamins commented Sep 8, 2020

Ah, now I get what you wanted. My previous suggestion ignored the shape you were after, as I did not read your question carefully enough. Sorry about that.

We currently have a transpose of this using describe:

julia> describe(df, :min, :max, cols=[:x, :y])
2×3 DataFrame
│ Row │ variable │ min   │ max   │
│     │ Symbol   │ Int64 │ Int64 │
├─────┼──────────┼───────┼───────┤
│ 1   │ x        │ 1     │ 6     │
│ 2   │ y        │ 4     │ 9     │

Maybe we should just add a transpose kwarg to describe?

@bkamins bkamins added the feature label Sep 8, 2020
@bkamins bkamins added this to the 1.x milestone Sep 8, 2020
@tk3369
Contributor Author

tk3369 commented Sep 9, 2020

That's brilliant! The describe function never crossed my mind. Somehow I cannot relate the word "describe" with performing aggregation functions.

@bkamins
Member

bkamins commented Sep 9, 2020

Given this - do you think we should add the transpose kwarg?

Also, while we are at it, do you think it would be good to add a skipmissing kwarg to describe, so that it can be told not to skip missing values (by default it skips them)?

CC @nalimilan
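As a rough illustration of what such a switch would change, here is a plain-Python sketch. It is not the describe implementation: `summarize` is a hypothetical helper, `None` stands in for Julia's `missing`, and the non-skipping branch assumes that a single missing value makes the whole statistic missing.

```python
def summarize(values, func, skipmissing=True):
    """Hypothetical helper: apply func to values, optionally dropping missing entries."""
    if skipmissing:
        # Drop missing entries before computing the statistic.
        return func(v for v in values if v is not None)
    # Without skipping, one missing entry makes the result missing.
    if any(v is None for v in values):
        return None
    return func(values)

# The z column from the example data, with None in place of missing.
z = [3, 4, 5, 6, 7, None]

print(summarize(z, max))                     # 7
print(summarize(z, max, skipmissing=False))  # None
```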

@nalimilan
Member

Yeah, why not add transpose and skipmissing. We could also implement a transpose(df, namescol) method that could be more generally useful.
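A sketch of what a transpose keyed on a names column could do, written in plain Python rather than against any DataFrames.jl API. `transpose_table` and the list-of-dicts layout are assumptions made for illustration only; the idea is that the values in the names column become the new column names.

```python
def transpose_table(rows, namescol):
    """Hypothetical helper: flip a table stored as a list of row dicts.

    The values found in `namescol` become the new column names, and the
    remaining old column names become the values of `namescol`.
    """
    other_cols = [key for key in rows[0] if key != namescol]
    return [
        {namescol: col, **{str(row[namescol]): row[col] for row in rows}}
        for col in other_cols
    ]

# Rows shaped like the describe output above.
described = [
    {"variable": "x", "min": 1, "max": 6},
    {"variable": "y", "min": 4, "max": 9},
]

for row in transpose_table(described, "variable"):
    print(row)
# {'variable': 'min', 'x': 1, 'y': 4}
# {'variable': 'max', 'x': 6, 'y': 9}
```

Applied to the describe-style rows, this gives one row per statistic and one column per variable, which is the pandas-like layout asked for at the top of the thread.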

@bkamins
Member

bkamins commented Sep 9, 2020

Agreed. Then I would not add a transpose kwarg here; I opened #2420 for that. skipmissing can still be added.

@pdeffebach
Contributor

+1 to adding transpose to describe. I like that summary statistics and aggregations are distinct in Julia.

@bkamins
Member

bkamins commented Sep 16, 2020

> I like that summary statistics and aggregations are distinct in Julia.

What do you mean by this exactly?
