Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack/Melt over multiple sets of variables #1839

Closed
pmarg opened this issue Jun 8, 2019 · 3 comments
Closed

Stack/Melt over multiple sets of variables #1839

pmarg opened this issue Jun 8, 2019 · 3 comments
Labels
non-breaking The proposed change is not breaking reshaping
Milestone

Comments

@pmarg
Copy link

pmarg commented Jun 8, 2019

It would be a useful feature if stack and melt could be implemented over multiple sets of variables. For example from this:

2×5 DataFrame
│ Row │ ID    │ varA2018 │ varA2019 │ varB2018 │ varB2019 │
│     │ Int64 │ Int64    │ Int64    │ Int64    │ Int64    │
├─────┼───────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 1     │ 1        │ 1        │ 1        │ 1        │
│ 2   │ 2     │ 2        │ 2        │ 2        │ 2        │

To this:

4×4 DataFrame
│ Row │ ID    │ Year  │ varA  │ varB  │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2018  │ 1     │ 1     │
│ 2   │ 1     │ 2019  │ 1     │ 1     │
│ 3   │ 2     │ 2018  │ 2     │ 2     │
│ 4   │ 2     │ 2019  │ 2     │ 2     │

This is implemented in R and Stata via pivot_longer and reshape long respectively.

@bkamins suggested the following for a simpler name:

df = DataFrame(ID=1:2, A2018=1:2, A2019=1:2, B2018=1:2, B2019=1:2)
df2 = stack(df, 2:5)
df2.colkey = first.(String.(df2.variable), 1)
df2.Year = parse.(Int, chop.(String.(df2.variable), head=1, tail=0))
unstack(df2, [:ID, :Year], :colkey, :value)

For a more general application, users need to rename the columns with a specific pattern for the function to split the name and the value such as varA_2018.

One idea is to make use of Regex e.g. :

m=match(r"(?<Name>\w+)_(?<Value>\d+)",colname)

The function would melt the DataFrame as many times as the specified set of columns over the specified ID (in the above example 2), rename the column of the value to m[1] (in this case :varA and :varB), replace the name of the variable to :Year and the values of the column :Year to m[2] and finally merge the intermediate DataFrames over :ID and :Year.

df_longer = melt(df, :ID, [:varA2018, :varA2019], [:varB2018, :varB2019], variable_name =:Year)

The second method is to specify the name of the new variable as an argument in melt :

df_longer = metl(df, :ID, [:VarA, :VarB],variable_name =:Year)

Using a regular expression the function will melt the DataFrame using the columns that start with "varA" and "varB" respectively, rename and replace the value as above and then merge the DataFrames over :ID and :Year.

I am not very good at writing functions for packages, especially regular expressions, but if you need any help to implement this I would be glad to help the best that I can.

@pmarg
Copy link
Author

pmarg commented Jul 28, 2019

I have written for myself a function that melts over multiple variables but I am not very good at optimizing code.

function pivot_longer(df::AbstractDataFrame,bycol::Symbol,cols::Vector{Symbol};period = :Period)
  temp = nothing
  df_longer = nothing
  m = nothing
  @inbounds for col ∈ cols
    temp = select(df,Regex("[$(col)]"))
    @inbounds for colname ∈ names(temp)
      m = match(r"(?<Name>\w+)_(?<Value>\d+)",String(colname))
      rename!(temp,colname=>Symbol(m[2]))
    end
    temp[!,bycol] = df[!,bycol]
    temp = melt(temp,bycol)
    rename!(temp,:variable => period, :value => Symbol(m[1]))
    if df_longer == nothing
      df_longer = temp
    else
      df_longer = join(df_longer, temp, on = [bycol,period], kind = :outer)
    end
  end
  return df_longer
end

Example:

df = DataFrame(A_2018=1:4, A_2019=5:8, B_2017=9:12, B_2018=9:12, B_2019 = [missing,13,14,15], ID = [1,2,3,4])

4×6 DataFrame
│ Row │ A_2018 │ A_2019 │ B_2017 │ B_2018 │ B_2019  │ ID    │
│     │ Int64  │ Int64  │ Int64  │ Int64  │ Int64⍰  │ Int64 │
├─────┼────────┼────────┼────────┼────────┼─────────┼───────┤
│ 1   │ 1      │ 5      │ 9      │ 9      │ missing │ 1     │
│ 2   │ 2      │ 6      │ 10     │ 10     │ 13      │ 2     │
│ 3   │ 3      │ 7      │ 11     │ 11     │ 14      │ 3     │
│ 4   │ 4      │ 8      │ 12     │ 12     │ 15      │ 4     │


df_longer = pivot_longer(df,:ID,[:A,:B],period = :Year)

12×4 DataFrame
│ Row │ Year   │ A       │ ID    │ B       │
│     │ Symbol │ Int64⍰  │ Int64 │ Int64⍰  │
├─────┼────────┼─────────┼───────┼─────────┤
│ 1   │ 2018   │ 1       │ 1     │ 9       │
│ 2   │ 2018   │ 2       │ 2     │ 10      │
│ 3   │ 2018   │ 3       │ 3     │ 11      │
│ 4   │ 2018   │ 4       │ 4     │ 12      │
│ 5   │ 2019   │ 5       │ 1     │ missing │
│ 6   │ 2019   │ 6       │ 2     │ 13      │
│ 7   │ 2019   │ 7       │ 3     │ 14      │
│ 8   │ 2019   │ 8       │ 4     │ 15      │
│ 9   │ 2017   │ missing │ 1     │ 9       │
│ 10  │ 2017   │ missing │ 2     │ 10      │
│ 11  │ 2017   │ missing │ 3     │ 11      │
│ 12  │ 2017   │ missing │ 4     │ 12      │

The functionality is similar to Stata's reshape long. The columns of the original DataFrame need to be renamed to a specific pattern (in this case Variable_Period) and the function uses Regex to select the columns that match the column names by group (e.g. A or B). Then it melts over the ID variable for each group and renames the columns. Finally the melted DataFrames are merged over ID and Period.

I can try to write some test, documentation and improve the readability of the function before I submit a PR but feel free to use it as a clarification of my feature request in case you are already working on a similar function.

@bkamins
Copy link
Member

bkamins commented Jul 28, 2019

@nalimilan recently mentioned an intention to work on stackdf/meltdf so maybe he has done some work in this area, so let us wait for his reaction.

Regarding your code @inbounds is not needed I think (this part of code is not performance critical anyway) and I would rather use user specified matcher function taking a column name and splitting it to two parts with some sensible default.

@bkamins
Copy link
Member

bkamins commented Dec 5, 2022

I am closing this in favor of #3237 (to have a single place to discuss all related issues)

@bkamins bkamins closed this as completed Dec 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking reshaping
Projects
None yet
Development

No branches or pull requests

2 participants