Stack/Melt over multiple sets of variables #1839

pmarg · 2019-06-08T09:18:39Z

It would be a useful feature if stack and melt could be implemented over multiple sets of variables. For example from this:

2×5 DataFrame
│ Row │ ID    │ varA2018 │ varA2019 │ varB2018 │ varB2019 │
│     │ Int64 │ Int64    │ Int64    │ Int64    │ Int64    │
├─────┼───────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 1     │ 1        │ 1        │ 1        │ 1        │
│ 2   │ 2     │ 2        │ 2        │ 2        │ 2        │

To this:

4×4 DataFrame
│ Row │ ID    │ Year  │ varA  │ varB  │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2018  │ 1     │ 1     │
│ 2   │ 1     │ 2019  │ 1     │ 1     │
│ 3   │ 2     │ 2018  │ 2     │ 2     │
│ 4   │ 2     │ 2019  │ 2     │ 2     │

This is implemented in R and Stata via pivot_longer and reshape long respectively.

@bkamins suggested the following for a simpler name:

df = DataFrame(ID=1:2, A2018=1:2, A2019=1:2, B2018=1:2, B2019=1:2)
df2 = stack(df, 2:5)
df2.colkey = first.(String.(df2.variable), 1)
df2.Year = parse.(Int, chop.(String.(df2.variable), head=1, tail=0))
unstack(df2, [:ID, :Year], :colkey, :value)

For a more general application, users need to rename the columns with a specific pattern for the function to split the name and the value such as varA_2018.

One idea is to make use of Regex e.g. :

m=match(r"(?<Name>\w+)_(?<Value>\d+)",colname)

The function would melt the DataFrame as many times as the specified set of columns over the specified ID (in the above example 2), rename the column of the value to m[1] (in this case :varA and :varB), replace the name of the variable to :Year and the values of the column :Year to m[2] and finally merge the intermediate DataFrames over :ID and :Year.

df_longer = melt(df, :ID, [:varA2018, :varA2019], [:varB2018, :varB2019], variable_name =:Year)

The second method is to specify the name of the new variable as an argument in melt :

df_longer = metl(df, :ID, [:VarA, :VarB],variable_name =:Year)

Using a regular expression the function will melt the DataFrame using the columns that start with "varA" and "varB" respectively, rename and replace the value as above and then merge the DataFrames over :ID and :Year.

I am not very good at writing functions for packages, especially regular expressions, but if you need any help to implement this I would be glad to help the best that I can.

The text was updated successfully, but these errors were encountered:

pmarg · 2019-07-28T15:23:05Z

I have written for myself a function that melts over multiple variables but I am not very good at optimizing code.

function pivot_longer(df::AbstractDataFrame,bycol::Symbol,cols::Vector{Symbol};period = :Period)
  temp = nothing
  df_longer = nothing
  m = nothing
  @inbounds for col ∈ cols
    temp = select(df,Regex("[$(col)]"))
    @inbounds for colname ∈ names(temp)
      m = match(r"(?<Name>\w+)_(?<Value>\d+)",String(colname))
      rename!(temp,colname=>Symbol(m[2]))
    end
    temp[!,bycol] = df[!,bycol]
    temp = melt(temp,bycol)
    rename!(temp,:variable => period, :value => Symbol(m[1]))
    if df_longer == nothing
      df_longer = temp
    else
      df_longer = join(df_longer, temp, on = [bycol,period], kind = :outer)
    end
  end
  return df_longer
end

Example:

df = DataFrame(A_2018=1:4, A_2019=5:8, B_2017=9:12, B_2018=9:12, B_2019 = [missing,13,14,15], ID = [1,2,3,4])

4×6 DataFrame
│ Row │ A_2018 │ A_2019 │ B_2017 │ B_2018 │ B_2019  │ ID    │
│     │ Int64  │ Int64  │ Int64  │ Int64  │ Int64⍰  │ Int64 │
├─────┼────────┼────────┼────────┼────────┼─────────┼───────┤
│ 1   │ 1      │ 5      │ 9      │ 9      │ missing │ 1     │
│ 2   │ 2      │ 6      │ 10     │ 10     │ 13      │ 2     │
│ 3   │ 3      │ 7      │ 11     │ 11     │ 14      │ 3     │
│ 4   │ 4      │ 8      │ 12     │ 12     │ 15      │ 4     │


df_longer = pivot_longer(df,:ID,[:A,:B],period = :Year)

12×4 DataFrame
│ Row │ Year   │ A       │ ID    │ B       │
│     │ Symbol │ Int64⍰  │ Int64 │ Int64⍰  │
├─────┼────────┼─────────┼───────┼─────────┤
│ 1   │ 2018   │ 1       │ 1     │ 9       │
│ 2   │ 2018   │ 2       │ 2     │ 10      │
│ 3   │ 2018   │ 3       │ 3     │ 11      │
│ 4   │ 2018   │ 4       │ 4     │ 12      │
│ 5   │ 2019   │ 5       │ 1     │ missing │
│ 6   │ 2019   │ 6       │ 2     │ 13      │
│ 7   │ 2019   │ 7       │ 3     │ 14      │
│ 8   │ 2019   │ 8       │ 4     │ 15      │
│ 9   │ 2017   │ missing │ 1     │ 9       │
│ 10  │ 2017   │ missing │ 2     │ 10      │
│ 11  │ 2017   │ missing │ 3     │ 11      │
│ 12  │ 2017   │ missing │ 4     │ 12      │

The functionality is similar to Stata's reshape long. The columns of the original DataFrame need to be renamed to a specific pattern (in this case Variable_Period) and the function uses Regex to select the columns that match the column names by group (e.g. A or B). Then it melts over the ID variable for each group and renames the columns. Finally the melted DataFrames are merged over ID and Period.

I can try to write some test, documentation and improve the readability of the function before I submit a PR but feel free to use it as a clarification of my feature request in case you are already working on a similar function.

bkamins · 2019-07-28T17:12:23Z

@nalimilan recently mentioned an intention to work on stackdf/meltdf so maybe he has done some work in this area, so let us wait for his reaction.

Regarding your code @inbounds is not needed I think (this part of code is not performance critical anyway) and I would rather use user specified matcher function taking a column name and splitting it to two parts with some sensible default.

bkamins · 2022-12-05T11:53:24Z

I am closing this in favor of #3237 (to have a single place to discuss all related issues)

bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020

bkamins added this to the 2.0 milestone Feb 12, 2020

bkamins added the reshaping label Apr 25, 2021

bkamins mentioned this issue Apr 25, 2021

requesting new feature which covers stack, unstack and permutedims in a simpler way (at least conceptually) #2732

Closed

sl-solution mentioned this issue May 1, 2021

Transposing DataFrame #2743

Closed

bkamins mentioned this issue Jun 3, 2022

Generalization of the value parameter in the unstack function #3066

Closed

bkamins modified the milestones: 1.x, 1.5 Nov 19, 2022

bkamins mentioned this issue Dec 5, 2022

Redesign of stack and unstack #3237

Open

bkamins closed this as completed Dec 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stack/Melt over multiple sets of variables #1839

Stack/Melt over multiple sets of variables #1839

pmarg commented Jun 8, 2019 •

edited

Loading

pmarg commented Jul 28, 2019

bkamins commented Jul 28, 2019

bkamins commented Dec 5, 2022

Stack/Melt over multiple sets of variables #1839

Stack/Melt over multiple sets of variables #1839

Comments

pmarg commented Jun 8, 2019 • edited Loading

pmarg commented Jul 28, 2019

bkamins commented Jul 28, 2019

bkamins commented Dec 5, 2022

pmarg commented Jun 8, 2019 •

edited

Loading