Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map on DataFrameColumns should return DataFrameColumns #2245

Closed
oschulz opened this issue May 11, 2020 · 2 comments
Closed

map on DataFrameColumns should return DataFrameColumns #2245

oschulz opened this issue May 11, 2020 · 2 comments
Labels

Comments

@oschulz
Copy link

oschulz commented May 11, 2020

It would be great if map on DataFrameColumns would return a DataFrameColumns object again, preserving column-names. This would enable constructs like this:

DataFrame(map(float, eachcol(df)))

and, in conjuction with #2244, also generic constructs like

Tables.materializer(df)(map(float, Tables.columns(df)))

Also, this way map on DataFrameColumns would behave like map on NamedTuple.

@bkamins
Copy link
Member

bkamins commented May 11, 2020

The problem is that this would be significantly breaking, as now function you pass to map does not have to return vectors (and even if it does they do not have to be of equal length).

E.g.:

julia> df = DataFrame((1:4)')
1×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │ 4     │

julia> map(n -> rand(first(n)), eachcol(df))
4-element Array{Array{Float64,1},1}:
 [0.03703956715176471]
 [0.9473764146722374, 0.26359244877423005]
 [0.42185607645767975, 0.41610264583877243, 0.19887646860466157]
 [0.6993524536753626, 0.05976450888429419, 0.31209321674196455, 0.049249218793889105]

The functionality you ask for is provided by mapcols and mapcols! functions.

Alternatively you can do map(fun, Tables.columntable(df)) to get what you want:

julia> map(n -> rand(first(n)), Tables.columntable(df))
(x1 = [0.026694417607980858], x2 = [0.21588433711303656, 0.9892101734162946], x3 = [0.6304413868617336, 0.4504408743912218, 0.6439963321706532], x4 = [0.5043942617555817, 0.09743004053251836, 0.5816787939912509, 0.21532591703133042])

So unless there is some more justification for the change I would leave the things as they are as what you propose would be breaking and we do not want to be breaking unless there is a significant benefit of this move (and in this case actually there are also downsides, as map on eachcol was thought to allow returning anything and not a homogenous collection as it would have to be under your proposal).

As a more general comment - actually I prefer map to work like in Base on objects from DataFrames.jl (so be undefined or have a default implementation), as there were questions in the past when we overridden this behaviour (that is e.g. why map on GroupedDataFrame is deprecated).

@oschulz
Copy link
Author

oschulz commented May 12, 2020

You're right, of course, looks like Tables.columntable(df) or Tables.Colums(df) is the way to go here. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants