Stacking multiple groups of columns #2414

tk3369 · 2020-09-05T16:28:13Z

I have a data frame that contains multiple groups of columns and I want to stack them up. If there were only one group, then the regular stack function would have worked. But, I have two groups as shown in the following MWE:

df = DataFrame(x=1:3, y1=4:6, y2=7:9, z1=10:12, z2=13:15)

So the data looks like this:

│ Row │ x     │ y1    │ y2    │ z1    │ z2    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 4     │ 7     │ 10    │ 13    │
│ 2   │ 2     │ 5     │ 8     │ 11    │ 14    │
│ 3   │ 3     │ 6     │ 9     │ 12    │ 15    │

And, I need to transform it to something like this.

│ Row │ x     │ order │ y     │ z     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 4     │ 10    │
│ 2   │ 1     │ 2     │ 7     │ 13    │
│ 3   │ 2     │ 1     │ 5     │ 11    │
│ 4   │ 2     │ 2     │ 8     │ 13    │
│ 5   │ 3     │ 1     │ 6     │ 12    │
│ 6   │ 3     │ 2     │ 9     │ 15    │

My current solution is somewhat manual:

df1 = select(df, :x, :x => (_->1) => :order, :y1 => :y, :z1 => :z)
df2 = select(df, :x, :x => (_->2) => :order, :y2 => :y, :z2 => :z)
vcat(df1, df2)

Is it possible to make it more generic since it's really a stacking function?

Just an idea:

stack(df, [[:y1, :z1] => 1, [:y2, :z2] => 2]; variable_name = :order, value_names = [:y, :z])

The values 1 and 2 above are really group names for the columns. In this case, the group is called order.

The text was updated successfully, but these errors were encountered:

bkamins · 2020-09-05T19:01:38Z

I think the proposal makes sense and thinking of the API what you proposed is roughly the simplest thing that can be proposed.

MatthewRGonzalez · 2022-08-31T17:23:38Z

Hi! I just want to comment to keep track of this issue. I often encounter instances where this feature would be useful and wonder if there's been any progress. I believe that this relates to a previous discourse question.

bkamins · 2022-08-31T19:03:34Z

It is tracked. I have now increased the priority to 1.5 release.

Before discussing the API let me first comment what would be my natural way of doing what you want without this change:

julia> @chain df begin
           select(:x, [[:y1, :z1], [:y2, :z2]] .=> ByRow(tuple) .=> ["1", "2"])
           stack(["1", "2"])
           select(:x, :variable, :value => [:y, :z])
       end
6×4 DataFrame
 Row │ x      variable  y      z
     │ Int64  String    Int64  Int64
─────┼───────────────────────────────
   1 │     1  1             4     10
   2 │     2  1             5     11
   3 │     3  1             6     12
   4 │     1  2             7     13
   5 │     2  2             8     14
   6 │     3  2             9     15

(with a different order but the order can be changed with sorting; the originally proposed order is inconsistent with stack design)

Given this let us discuss the request:

Do you and @tk3369 think that the request is needed often enough to justify complicating stack API (given the example above how it can be achieved using current functionalities)
If we were to add it I would choose something like:

stack(df, ([:y1, :z1] => 1, [:y2, :z2] => 2); variable_name = :order, value_name = [:y, :z])

(note that I use Tuple as a second argument - the reason is that it makes dispatch easier and avoids surprise conversions, see example below)
Indeed an additional benefit of this API would be that one could write:

stack(df, (:y1 => 1, :y2 => 2); variable_name = :order, value_name = :y)

to conveniently give values in variable column.
We would need to think if in this case the view option should be allowed (this is probably doable, but adds another layer of complexity)

Example of surprise conversion showing why vector is problematic:

julia> [:x => true, :y => 1]
2-element Vector{Pair{Symbol, Int64}}:
 :x => 1
 :y => 1

CC @nalimilan

MatthewRGonzalez · 2022-08-31T22:46:20Z

Thank you for the discussion!

Do you and tk3369 think that the request is needed often enough to justify complicating stack API (given the example above how it can be achieved using current functionalities)

I find these types of transformations to be useful and would certainly appreciate the convenience of this added functionality within the stack function. The approach you provide above is certainly sufficient, though perhaps less intuitive for new Julia and DataFrames.jl users. With this said, I'm unsure about the costs associated with adding complexity to the API, so it's difficult for me to provide a complete justification.

In terms of implementation, I fully agree with your proposal above (including the use of the Tuple as the second argument). It seems intuitive. I do not feel that the view would be particularly necessary for most cases.

bkamins · 2022-09-01T05:59:29Z

With this said, I'm unsure about the costs associated with adding complexity to the API

I have already thought about it - it is not that hard, so I think we can add it in 1.5.

nalimilan · 2022-09-01T09:31:42Z

Makes sense. I wonder whether the tidyverse's pivot_longer handles this. From a quick that doesn't seem to be the case.

bkamins · 2022-09-01T11:31:11Z

For me a more natural way to think about this is via nested columns (per example I have given) but if users prefer unnested columns we can add it.

bkamins · 2022-12-05T11:53:16Z

I am closing this in favor of #3237 (to have a single place to discuss all related issues)

bkamins added feature non-breaking The proposed change is not breaking labels Sep 5, 2020

bkamins added this to the 1.0 milestone Sep 5, 2020

bkamins mentioned this issue Mar 4, 2021

Release 1.0 tracking #2640

Closed

19 tasks

bkamins modified the milestones: 1.0, 1.x Mar 25, 2021

bkamins modified the milestones: 1.x, 1.5 Aug 31, 2022

bkamins added the reshaping label Dec 4, 2022

bkamins mentioned this issue Dec 5, 2022

Redesign of stack and unstack #3237

Open

bkamins closed this as completed Dec 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stacking multiple groups of columns #2414

Stacking multiple groups of columns #2414

tk3369 commented Sep 5, 2020

bkamins commented Sep 5, 2020

MatthewRGonzalez commented Aug 31, 2022

bkamins commented Aug 31, 2022

MatthewRGonzalez commented Aug 31, 2022

bkamins commented Sep 1, 2022

nalimilan commented Sep 1, 2022

bkamins commented Sep 1, 2022

bkamins commented Dec 5, 2022

Stacking multiple groups of columns #2414

Stacking multiple groups of columns #2414

Comments

tk3369 commented Sep 5, 2020

bkamins commented Sep 5, 2020

MatthewRGonzalez commented Aug 31, 2022

bkamins commented Aug 31, 2022

MatthewRGonzalez commented Aug 31, 2022

bkamins commented Sep 1, 2022

nalimilan commented Sep 1, 2022

bkamins commented Sep 1, 2022

bkamins commented Dec 5, 2022