Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign of stack and unstack #3237

Open
bkamins opened this issue Dec 5, 2022 · 13 comments
Open

Redesign of stack and unstack #3237

bkamins opened this issue Dec 5, 2022 · 13 comments

Comments

@bkamins
Copy link
Member

bkamins commented Dec 5, 2022

This issue is meant to replace: #2215 #2148 #3066 #2422 #2414 #1839

The proposed improved API for stack is:

stack(df::AbstractDataFrame,
    measure_vars,
    id_vars;
    variable_name=:variable,
    name_value=identity,
    value_name=:value, # or function
    variable_eltype::Type=String,
    fill=missing,
    view::Bool=false)

Questions to discuss:

  • should we rename measure_vars and id_vars arguments (I was thinking of something better but could not come with anything better);
  • variable_name - no change here
  • name_value - I propose to allow passing a function that takes measure_vars column names as strings and produces values in the variable_name column where the name of the measure_vars will be stored (I propose to have a single column still although e.g. dplyr allows multiple - column splitting can be performed as a later step - but maybe you will find it useful to allow for splitting in stack?; also the question is what name would be best here)
  • value_name - I propose to allow passing a function that takes measure_vars column names as strings and produces name of the column where the name of the values will be stored
  • fill if variable_name/value_name combination is missing what value use to fill data
  • do we need to provide an option to specify sorting order of produced data frame? (I now assume measure_vars columns are processed left to right)
  • view - true will be disallowed if variable_name or value_name is a function
  • if variable_name/value_name combination produces duplicate I assume we throw an error (but maybe we want some other behavior also?)

Example. Input df:

2×5 DataFrame
│ Row │ ID    │ varA2018 │ varA2019 │ varB2018 │ varB2019 │
│     │ Int64 │ Int64    │ Int64    │ Int64    │ Int64    │
├─────┼───────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 1     │ 1        │ 1        │ 1        │ 1        │
│ 2   │ 2     │ 2        │ 2        │ 2        │ 2        │

Output of stack(df, r"var", :ID, variable_name= :Year, name_value=x -> parse(Int, last(x, 4)), value_name=x -> first(x, 4)):

4×4 DataFrame
│ Row │ ID    │ Year  │ varA  │ varB  │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2018  │ 1     │ 1     │
│ 2   │ 1     │ 2019  │ 1     │ 1     │
│ 3   │ 2     │ 2018  │ 2     │ 2     │
│ 4   │ 2     │ 2019  │ 2     │ 2     │

(so the general idea is to allow for dynamic generation of variable_name and value_name based on measure_vars column names)

The proposed improved API for unstack is:

unstack(df::AbstractDataFrame,
      row_keys,
      col_keys,
      values;
      renamecols::Function=(x...) -> join(x, "_"),
      allowmissing::Bool=false,
      combine=values isa ColumnIndex ? only : (x...) -> only.(x),
      fill=missing,
      threads::Bool=true)

Questions to discuss:

  • rename rowkeys to row_keys
  • rename colkey to col_keys; start allowing passing multiple columns as col_keys
  • rename value to values; start allowing passing multiple columns as values
  • renamecols: takes as many positional arguments as there are col_keys columns; by default joins them with _
  • combine: takes as many positional arguments as there are values columns; by default if a single column is passed only, and if multiple a tuple of captured values is produced (but anything can be computed and returned here). Internally it will be values => combine transformation in operation specification syntax. Note that I propose, as opposed to, e.g. dplyr, that we do not create multiple columns here, but instead a value in a cell if a function of multiple columns (by default a tuple of matching values)

Example. Input df:

8×5 DataFrame
 Row │ id     n1      n2      v1     v2
     │ Int64  String  String  Int64  Char
─────┼────────────────────────────────────
   1 │     1  a       x           1  a
   2 │     1  a       y           2  b
   3 │     1  b       x           3  c
   4 │     1  b       y           4  d
   5 │     2  a       x           5  e
   6 │     2  a       y           6  f
   7 │     2  b       x           7  g
   8 │     2  b       y           8  h

Output of unstack(df, :id, r"n", r"v") (with default renamescols and combine):

2×5 DataFrame
 Row │ id     a_x       a_y       b_x       b_y
     │ Int64  Tuple…    Tuple…    Tuple…    Tuple…
─────┼───────────────────────────────────────────────
   1 │     1  (1, 'a')  (2, 'b')  (3, 'c')  (4, 'd')
   2 │     2  (5, 'e')  (6, 'f')  (7, 'g')  (8, 'h')

Output of unstack(df, :id, r"n", r"v", renamecols=string, combine=(x,y) -> string(x[1], y[1])):

2×5 DataFrame
 Row │ id     ax      ay      bx      by
     │ Int64  String  String  String  String
─────┼───────────────────────────────────────
   1 │     1  1a      2b      3c      4d
   2 │     2  5e      6f      7g      8h
@bkamins bkamins added this to the 1.5 milestone Dec 5, 2022
@bkamins
Copy link
Member Author

bkamins commented Dec 5, 2022

@jonas-schulze
Copy link
Contributor

In your stack example,

stack(df, r"var", :ID, variable_name= x -> last(x, 4), value_name=x -> first(x, 4))
# note: first(x, 4)

how did the Year column get its name?

@bkamins
Copy link
Member Author

bkamins commented Dec 5, 2022

Ah - right. These are problems, when one designs before implementing. It then should be something like:

stack(df, r"var", :ID, variable_name= :Year, name_value=x -> last(x, 4), value_name=x -> first(x, 3))

I will update the post. Thank you for spotting.

@MatthewRGonzalez
Copy link

In the example,

stack(df, r"var", :ID, variable_name= :Year, name_value=x -> last(x, 4), value_name=x -> first(x, 3))

why do you use var_name = x -> first(x,3)) and not var_name = x -> first(x,4)) to produce column names varA and varB.

@bkamins
Copy link
Member Author

bkamins commented Dec 5, 2022

Again - typo. Fixed. It should be first(x, 4). I was writing the expression from my head (not tested). I want to first get a general agreement that what I propose is OK and sufficient because the implementation will heavily depend on the design.

The most important decisions affecting the design are:

  • for stack: are we OK to have only one :variable_name column (does anyone need multiple variable name columns in practice?)
  • for unstack: are we OK that we always store the result of multiple values columns in one cell (i.e. not creating multiple columns). The benefit of this is that we can combine them with a function (as in the example); the downside is that if someone wants multiple columns in the end then and additional select operation is needed later that will unnest the produced column (which seems easy, but maybe we feel that it is crucial to provide such functionality in unstack)

@MatthewRGonzalez
Copy link

In my opinion:

Are we OK to have only one :variable_name column (does anyone need multiple variable name columns in practice?)

Yes-- I would prefer to keep it simple. It's easy enough to split columns later on.

Are we OK that we always store the result of multiple values columns in one cell (i.e. not creating multiple columns). The benefit of this is that we can combine them with a function (as in the example);

I'd prefer this as well. I think the ability to combine the values with a function is useful.

@nalimilan
Copy link
Member

Sounds good. I don't have an opinion about supporting multiple columns. Do we have examples where it's useful in dplyr?

name_value - I propose to allow passing a function that takes measure_vars column names as strings and produces values in the variable_name column where the name of the measure_vars will be stored (I propose to have a single column still although e.g. dplyr allows multiple - column splitting can be performed as a later step - but maybe you will find it useful to allow for splitting in stack?; also the question is what name would be best here)

Regarding the argument name, having both value_name and name_value seems confusing to me. Maybe something like variable_name_transform or passing a tuple to the existing argument, like variable_name=(col, fun)?

@jariji
Copy link
Contributor

jariji commented Dec 27, 2022

I encourage interested people to look at tidyr's pivot_longer for the design and naming for inspiration.

@bkamins
Copy link
Member Author

bkamins commented Dec 27, 2022

@jariji - I know pivot_longer.
Given your comment I understand you feel that pivot_longer has a better design than the proposal above? (except names - where I agree that as usual we need a careful decision)

If this is the case can you comment on the advantages of pivot_longer design from your perspective? Thank you!

@jariji
Copy link
Contributor

jariji commented Dec 27, 2022

I'm still reviewing the above and I'm not sure what's better at this point, just wanted to make sure pivot_* was in the discussion.

@jariji
Copy link
Contributor

jariji commented Dec 27, 2022

Looking at the example above, you have name_value=x -> last(x, 4) which I expect to produce String values but then the generated Year column has eltype Int64. Is that a typo or intentional?

@bkamins
Copy link
Member Author

bkamins commented Dec 27, 2022

Is that a typo or intentional?

It was a typo (I was just sketching the intention). I updated the example with parse call.

just wanted to make sure pivot_* was in the discussion.

💯 agreed. The dplyr underwent a huge redesign. I was thinking for 3 months what to propose (and I am still not sure what is best - especially naming of arguments). This is reflected in the comment by @nalimilan - we want something flexible and composable, but at the same time to avoid complexities that users will never use.

For example privot_longer and pivot_wider were designed to be reversible i.e. so that you can always call pivot_longer and pivot_wider to go to the starting point. And I understand this desire as it is indeed clean.

However, I thought that this lead to a very complex design (if you look at the documentation there are many cases and complex rules). I tried to propose something that is simpler but still covers all standard needs.

For example:

  1. I did not propose column splitting in stack (and pivot_longer supports it) as it is I think rarely needed, and can be easily be done later by the user.
  2. I proposed in unstack to always create a single column even for many values also (as opposed to creating multiple columns as pivot_wider does) the reason is:
    • column splitting can be done later if needed
    • it allows us in combine to perform computations on multiple columns (which can be handy at times)
    • it introduced in dplyr the complexity of argument saying what should change faster (values of column names) + it is likely to produce quite wide tables, which in my experience are hard to grasp.

However, I am open for suggestions as indeed these are hard decisions.

@bkamins bkamins modified the milestones: 1.5, 1.6 Feb 5, 2023
@bkamins bkamins modified the milestones: 1.6, 1.7 Jul 10, 2023
@pdeffebach
Copy link
Contributor

I ran into the issue of not being able to have multiple value columns in unstack today. I think this would be a great feature to have.

@bkamins bkamins modified the milestones: 1.7, 1.x Sep 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants