map on DataFrameColumns should return DataFrameColumns #2245

oschulz · 2020-05-11T14:47:08Z

It would be great if map on DataFrameColumns would return a DataFrameColumns object again, preserving column-names. This would enable constructs like this:

DataFrame(map(float, eachcol(df)))

and, in conjuction with #2244, also generic constructs like

Tables.materializer(df)(map(float, Tables.columns(df)))

Also, this way map on DataFrameColumns would behave like map on NamedTuple.

The text was updated successfully, but these errors were encountered:

bkamins · 2020-05-11T15:41:41Z

The problem is that this would be significantly breaking, as now function you pass to map does not have to return vectors (and even if it does they do not have to be of equal length).

E.g.:

julia> df = DataFrame((1:4)')
1×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │ 4     │

julia> map(n -> rand(first(n)), eachcol(df))
4-element Array{Array{Float64,1},1}:
 [0.03703956715176471]
 [0.9473764146722374, 0.26359244877423005]
 [0.42185607645767975, 0.41610264583877243, 0.19887646860466157]
 [0.6993524536753626, 0.05976450888429419, 0.31209321674196455, 0.049249218793889105]

The functionality you ask for is provided by mapcols and mapcols! functions.

Alternatively you can do map(fun, Tables.columntable(df)) to get what you want:

julia> map(n -> rand(first(n)), Tables.columntable(df))
(x1 = [0.026694417607980858], x2 = [0.21588433711303656, 0.9892101734162946], x3 = [0.6304413868617336, 0.4504408743912218, 0.6439963321706532], x4 = [0.5043942617555817, 0.09743004053251836, 0.5816787939912509, 0.21532591703133042])

So unless there is some more justification for the change I would leave the things as they are as what you propose would be breaking and we do not want to be breaking unless there is a significant benefit of this move (and in this case actually there are also downsides, as map on eachcol was thought to allow returning anything and not a homogenous collection as it would have to be under your proposal).

As a more general comment - actually I prefer map to work like in Base on objects from DataFrames.jl (so be undefined or have a default implementation), as there were questions in the past when we overridden this behaviour (that is e.g. why map on GroupedDataFrame is deprecated).

oschulz · 2020-05-12T00:54:44Z

You're right, of course, looks like Tables.columntable(df) or Tables.Colums(df) is the way to go here. Thanks!

bkamins added the decision label May 11, 2020

oschulz mentioned this issue May 12, 2020

Tables.columns should return a DataFrameColumns object #2244

Closed

oschulz closed this as completed May 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

map on DataFrameColumns should return DataFrameColumns #2245

map on DataFrameColumns should return DataFrameColumns #2245

oschulz commented May 11, 2020

bkamins commented May 11, 2020

oschulz commented May 12, 2020 •

edited

Loading

map on DataFrameColumns should return DataFrameColumns #2245

map on DataFrameColumns should return DataFrameColumns #2245

Comments

oschulz commented May 11, 2020

bkamins commented May 11, 2020

oschulz commented May 12, 2020 • edited Loading

oschulz commented May 12, 2020 •

edited

Loading