Product of multiple aggregation functions and columns #2419

tk3369 · 2020-09-08T20:57:18Z

While working on the pandas vs DataFrames.jl comparison doc (#2378), I encountered the use case of applying many aggregation functions over many columns.

Consider the following example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'grp': [1, 2, 1, 2, 1, 2],
                   'x': range(6, 0, -1),
                   'y': range(4, 10),
                   'z': [3, 4, 5, 6, 7, None]},
                   index = list('abcdef'))

>>> df[['x', 'y']].agg([max, min])
     x  y
max  6  9
min  1  4

With DataFrames.jl, we can achieve something similar, as previously suggested by @bkamins and @nalimilan.

julia> df = DataFrame(id = 'a':'f', grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9, z = [3:7; missing])
6×5 DataFrame
│ Row │ id   │ grp   │ x     │ y     │ z       │
│     │ Char │ Int64 │ Int64 │ Int64 │ Int64?  │
├─────┼──────┼───────┼───────┼───────┼─────────┤
│ 1   │ 'a'  │ 1     │ 6     │ 4     │ 3       │
│ 2   │ 'b'  │ 2     │ 5     │ 5     │ 4       │
│ 3   │ 'c'  │ 1     │ 4     │ 6     │ 5       │
│ 4   │ 'd'  │ 2     │ 3     │ 7     │ 6       │
│ 5   │ 'e'  │ 1     │ 2     │ 8     │ 7       │
│ 6   │ 'f'  │ 2     │ 1     │ 9     │ missing │

julia> combine(df, vec([:x, :y] .=> [maximum minimum]))
1×4 DataFrame
│ Row │ x_maximum │ y_maximum │ x_minimum │ y_minimum │
│     │ Int64     │ Int64     │ Int64     │ Int64     │
├─────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ 6         │ 9         │ 1         │ 4         │

As you can see, the results are stored in a single row with many columns. Essentially, if you have N functions and M columns, you end up with N × M columns. IMHO, pandas' output (one row per function, one column per variable) is nicer. So I'm wondering whether DataFrames.jl should be enhanced to allow multiple functions to be applied to multiple columns.
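To see where the N × M columns come from, it may help to look at what the broadcast produces before combine ever sees it. This is a minimal sketch using only Base Julia; the names pairs_matrix and flat are illustrative and not part of the original example.

```julia
# Broadcasting a 2-element column vector against a 1×2 row matrix yields
# a 2×2 matrix holding one `column => function` pair per combination.
pairs_matrix = [:x, :y] .=> [maximum minimum]

# `vec` flattens the matrix in column-major order, which is why both
# maximum columns appear before the minimum columns in the combine output.
flat = vec(pairs_matrix)

@show size(pairs_matrix)
@show first.(flat)   # source columns, in output order
@show last.(flat)    # functions, in output order
```

Each of the four pairs becomes one output column, so the wide shape follows directly from how the pairs are built.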

Here's a small function that works:

julia> function agg(df, cols, funcs) 
           result = DataFrame()
           result.function = string.(funcs)
           for c in cols
               result[!, c] = [f(df[!, c]) for f in funcs]
           end
           return result
       end
agg (generic function with 1 method)

julia> agg(df, [:x, :y], [maximum, minimum])
2×3 DataFrame
│ Row │ function │ x     │ y     │
│     │ String   │ Int64 │ Int64 │
├─────┼──────────┼───────┼───────┤
│ 1   │ maximum  │ 6     │ 9     │
│ 2   │ minimum  │ 1     │ 4     │

Maybe this little agg function can be rolled into combine with a signature like this?

combine(df, ::Vector{Function}, ::Vector{StringOrSymbol})

Thoughts?


bkamins · 2020-09-08T21:04:19Z

Ah - now I get what you wanted (my previous suggestion was ignoring the shape you wanted as I did not read your question carefully enough - sorry for this).

We currently have a transpose of this using describe:

julia> describe(df, :min, :max, cols=[:x, :y])
2×3 DataFrame
│ Row │ variable │ min   │ max   │
│     │ Symbol   │ Int64 │ Int64 │
├─────┼──────────┼───────┼───────┤
│ 1   │ x        │ 1     │ 6     │
│ 2   │ y        │ 4     │ 9     │

Maybe we should just add a transpose kwarg to describe?

tk3369 · 2020-09-09T01:02:13Z

That's brilliant! The describe function never crossed my mind. Somehow I do not associate the word "describe" with applying aggregation functions.

bkamins · 2020-09-09T06:17:35Z

Given this - do you think we should add the transpose kwarg?

Also - while we are at it, do you think it would be good to add a skipmissing kwarg to describe, so that it can be told not to skip missing values (by default it skips them)?
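For context on why the skipping behaviour matters: plain reductions in Julia propagate missing, so a helper like the agg function above would return missing for column z unless the values are wrapped in skipmissing first. A small Base-only sketch, where the vector z simply mirrors the z column of the example data:

```julia
# Same values as column z in the example DataFrame.
z = [3:7; missing]

# Reductions propagate missing values by default.
with_missing = maximum(z)

# Wrapping the data in skipmissing drops the missing entries first.
max_skipped = maximum(skipmissing(z))
min_skipped = minimum(skipmissing(z))

@show with_missing max_skipped min_skipped
```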

CC @nalimilan

nalimilan · 2020-09-09T08:08:01Z

Yeah, why not add transpose and skipmissing. We could also implement a transpose(df, namescol) method that could be more generally useful.
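As a rough illustration of what such a transpose could look like: the sketch below assumes a DataFrames.jl version that provides permutedims(df, src_namescol), which was not available when this thread was written, so treat the call as an assumption about later releases rather than something proposed here. The names stats and flipped are illustrative.

```julia
using DataFrames

# Same x and y columns as in the example above.
df = DataFrame(x = 6:-1:1, y = 4:9)

# One row per column, one column per statistic (the layout describe returns).
stats = describe(df, :min, :max, cols = [:x, :y])

# Flip rows and columns: the values of :variable become the new column
# names, and the old column names ("min", "max") fill the first column.
flipped = permutedims(stats, :variable)
```

If this behaves as expected, flipped has one row per statistic and one column per variable, which is the pandas-style layout requested at the top of the thread.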

bkamins · 2020-09-09T08:42:37Z

Agreed. Then I would not add the transpose kwarg here; I have opened #2420 instead. skipmissing can still be added.

pdeffebach · 2020-09-15T21:50:39Z

+1 to adding transpose to describe. I like that summary statistics and aggregations are distinct in Julia.

bkamins · 2020-09-16T04:23:45Z

> I like that summary statistics and aggregations are distinct in Julia.

What do you mean by this exactly?

bkamins added the feature label Sep 8, 2020

bkamins added this to the 1.x milestone Sep 8, 2020
