Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stacking multiple groups of columns #2414

Closed
tk3369 opened this issue Sep 5, 2020 · 8 comments
Closed

Stacking multiple groups of columns #2414

tk3369 opened this issue Sep 5, 2020 · 8 comments
Labels
feature non-breaking The proposed change is not breaking reshaping
Milestone

Comments

@tk3369
Copy link
Contributor

tk3369 commented Sep 5, 2020

I have a data frame that contains multiple groups of columns and I want to stack them up. If there were only one group, then the regular stack function would have worked. But, I have two groups as shown in the following MWE:

df = DataFrame(x=1:3, y1=4:6, y2=7:9, z1=10:12, z2=13:15)

So the data looks like this:

│ Row │ x     │ y1    │ y2    │ z1    │ z2    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 4     │ 7     │ 10    │ 13    │
│ 2   │ 2     │ 5     │ 8     │ 11    │ 14    │
│ 3   │ 3     │ 6     │ 9     │ 12    │ 15    │

And, I need to transform it to something like this.

│ Row │ x     │ order │ y     │ z     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 4     │ 10    │
│ 2   │ 1     │ 2     │ 7     │ 13    │
│ 3   │ 2     │ 1     │ 5     │ 11    │
│ 4   │ 2     │ 2     │ 8     │ 13    │
│ 5   │ 3     │ 1     │ 6     │ 12    │
│ 6   │ 3     │ 2     │ 9     │ 15    │

My current solution is somewhat manual:

df1 = select(df, :x, :x => (_->1) => :order, :y1 => :y, :z1 => :z)
df2 = select(df, :x, :x => (_->2) => :order, :y2 => :y, :z2 => :z)
vcat(df1, df2)

Is it possible to make it more generic since it's really a stacking function?

Just an idea:

stack(df, [[:y1, :z1] => 1, [:y2, :z2] => 2]; variable_name = :order, value_names = [:y, :z])

The values 1 and 2 above are really group names for the columns. In this case, the group is called order.

@bkamins bkamins added feature non-breaking The proposed change is not breaking labels Sep 5, 2020
@bkamins bkamins added this to the 1.0 milestone Sep 5, 2020
@bkamins
Copy link
Member

bkamins commented Sep 5, 2020

I think the proposal makes sense and thinking of the API what you proposed is roughly the simplest thing that can be proposed.

@bkamins bkamins mentioned this issue Mar 4, 2021
19 tasks
@bkamins bkamins modified the milestones: 1.0, 1.x Mar 25, 2021
@MatthewRGonzalez
Copy link

Hi! I just want to comment to keep track of this issue. I often encounter instances where this feature would be useful and wonder if there's been any progress. I believe that this relates to a previous discourse question.

@bkamins bkamins modified the milestones: 1.x, 1.5 Aug 31, 2022
@bkamins
Copy link
Member

bkamins commented Aug 31, 2022

It is tracked. I have now increased the priority to 1.5 release.

Before discussing the API let me first comment what would be my natural way of doing what you want without this change:

julia> @chain df begin
           select(:x, [[:y1, :z1], [:y2, :z2]] .=> ByRow(tuple) .=> ["1", "2"])
           stack(["1", "2"])
           select(:x, :variable, :value => [:y, :z])
       end
6×4 DataFrame
 Row │ x      variable  y      z
     │ Int64  String    Int64  Int64
─────┼───────────────────────────────
   1 │     1  1             4     10
   2 │     2  1             5     11
   3 │     3  1             6     12
   4 │     1  2             7     13
   5 │     2  2             8     14
   6 │     3  2             9     15

(with a different order but the order can be changed with sorting; the originally proposed order is inconsistent with stack design)

Given this let us discuss the request:

  1. Do you and @tk3369 think that the request is needed often enough to justify complicating stack API (given the example above how it can be achieved using current functionalities)
  2. If we were to add it I would choose something like:
stack(df, ([:y1, :z1] => 1, [:y2, :z2] => 2); variable_name = :order, value_name = [:y, :z])

(note that I use Tuple as a second argument - the reason is that it makes dispatch easier and avoids surprise conversions, see example below)
Indeed an additional benefit of this API would be that one could write:

stack(df, (:y1 => 1, :y2 => 2); variable_name = :order, value_name = :y)

to conveniently give values in variable column.
We would need to think if in this case the view option should be allowed (this is probably doable, but adds another layer of complexity)

Example of surprise conversion showing why vector is problematic:

julia> [:x => true, :y => 1]
2-element Vector{Pair{Symbol, Int64}}:
 :x => 1
 :y => 1

CC @nalimilan

@MatthewRGonzalez
Copy link

Thank you for the discussion!

Do you and tk3369 think that the request is needed often enough to justify complicating stack API (given the example above how it can be achieved using current functionalities)

I find these types of transformations to be useful and would certainly appreciate the convenience of this added functionality within the stack function. The approach you provide above is certainly sufficient, though perhaps less intuitive for new Julia and DataFrames.jl users. With this said, I'm unsure about the costs associated with adding complexity to the API, so it's difficult for me to provide a complete justification.

In terms of implementation, I fully agree with your proposal above (including the use of the Tuple as the second argument). It seems intuitive. I do not feel that the view would be particularly necessary for most cases.

@bkamins
Copy link
Member

bkamins commented Sep 1, 2022

With this said, I'm unsure about the costs associated with adding complexity to the API

I have already thought about it - it is not that hard, so I think we can add it in 1.5.

@nalimilan
Copy link
Member

Makes sense. I wonder whether the tidyverse's pivot_longer handles this. From a quick that doesn't seem to be the case.

@bkamins
Copy link
Member

bkamins commented Sep 1, 2022

For me a more natural way to think about this is via nested columns (per example I have given) but if users prefer unnested columns we can add it.

@bkamins
Copy link
Member

bkamins commented Dec 5, 2022

I am closing this in favor of #3237 (to have a single place to discuss all related issues)

@bkamins bkamins closed this as completed Dec 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature non-breaking The proposed change is not breaking reshaping
Projects
None yet
Development

No branches or pull requests

4 participants