Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broadcasted setindex not working as expected #1507

Closed
ExpandingMan opened this issue Sep 14, 2018 · 14 comments
Closed

broadcasted setindex not working as expected #1507

ExpandingMan opened this issue Sep 14, 2018 · 14 comments

Comments

@ExpandingMan
Copy link
Contributor

Currently if you do df[idx, n] .= x you will be doing a broadcasted assignment to a copy of the selected part of the dataframe. See this post on discourse.

It seems to me that this should probably assign to the original dataframe, as most users would probably expect that behavior (I've been using Julia a couple of years now, and it was not immediately obvious to me what was happening here). Apparently this can be done by extending Base.dotview.

If someone decides to take this on, it might also be worth thinking carefully about the broadcast behavior on DataFrames more generally, as I suspect there are a few other "gotchas" lurking.

@bkamins
Copy link
Member

bkamins commented Sep 14, 2018

Yes - in general whole getindex/setindex! infrastructure in DataFrames.jl should be carefully reviewed to sync it with Julia 1.0. I have started it some time ago (#1380) but waited till things stabilize. Probably this is a good time to go back to it if someone wants to make a PR (if not I might get back to it but not immediately).

@ExpandingMan
Copy link
Contributor Author

I think it migth call for some breaking changes. As I was saying in the discourse thread, really df[idx, n] .= x should work and df[idx, n] = x should throw an error unless x matches the shape of df[idx, n]. This would be consistent with the behavior of AbstractArray.

@bkamins
Copy link
Member

bkamins commented Sep 14, 2018

Right. That is why I have stopped #1380 in spring to make sure we are consistent. There are also other cases to clean up (e.g. handling of binary indexing). In general this topic is complex and tricky so I did not want to rush.

@ExpandingMan
Copy link
Contributor Author

Yes, this will definitely be a bit tricky. I think we should try to get to it soon though as we don't want to bake in weird custom behavior for DataFrames long after the release of 1.0.

@bkamins
Copy link
Member

bkamins commented Sep 14, 2018

I have started collecting up the target behavior in #1380. Next week I will try to clean it up and propose a consistent upgrade that is:

  • Consistent with Julia 1.0 indexing in Base
  • broadcast-aware

@nalimilan
Copy link
Member

nalimilan commented Sep 15, 2018

Another related thing which is annoying/inconsistent with the current behavior is that df[:x] = 1 mutates the column vector while df[:x] = 1:n replaces it. It would be much more logical to have df[:x] .= 1 mutate, and df[:x] = 1 create a new vector (or throw an error). See also JuliaData/DataFramesMeta.jl#108 (comment).

@pdeffebach
Copy link
Contributor

Another issue here:

julia> df = DataFrame(a = [1], b = [2])
julia> df[1, :] = [5, 6]
ERROR: MethodError: Cannot `convert` an object of type Array{Int64,1} to an object of type Int64
Closest candidates are:
  convert(::Type{T<:Number}, ::T<:Number) where T<:Number at number.jl:6
  convert(::Type{T<:Number}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T<:Integer}, ::Ptr) where T<:Integer at pointer.jl:23
  ...
Stacktrace:
 [1] setindex!(::Array{Int64,1}, ::Array{Int64,1}, ::Int64) at ./array.jl:769
 [2] insert_single_entry!(::DataFrame, ::Array{Int64,1}, ::Int64, ::Int64) at /Users/peterdeffebach/.julia/dev/DataFrames/src/dataframe/dataframe.jl:367
 [3] setindex! at /Users/peterdeffebach/.julia/dev/DataFrames/src/dataframe/dataframe.jl:469 [inlined]
 [4] setindex!(::DataFrame, ::Array{Int64,1}, ::Int64, ::Colon) at /Users/peterdeffebach/.julia/dev/DataFrames/src/dataframe/dataframe.jl:623
 [5] top-level scope at none:0

@bkamins
Copy link
Member

bkamins commented Sep 23, 2018

I guess this was a design decision and the assignment you want should be done like this:

df[1, :] = DataFrame([5 6])

because your assignment has the following meaning now:

julia> df = DataFrame(a = Any[1], b = Any[2])
1×2 DataFrame
│ Row │ a   │ b   │
│     │ Any │ Any │
├─────┼─────┼─────┤
│ 1   │ 1   │ 2   │

julia> df[1, :] = [5, 6]
2-element Array{Int64,1}:
 5
 6

julia> df
1×2 DataFrame
│ Row │ a      │ b      │
│     │ Any    │ Any    │
├─────┼────────┼────────┤
│ 1   │ [5, 6] │ [5, 6] │

of course the question is if we want to keep this approach.

And this change would be conflicting with:

julia> df[[1], :] = [5, 6]
ERROR: DimensionMismatch("array could not be broadcast to match destination")

which is expected to work columnwise.

In general - as we now start treating DataFrame as a collection of rows maybe this should be rethinked, but OTOH it will be breaking so I am not sure of the benefit.

@pdeffebach
Copy link
Contributor

I would vote to re-think this while we are re-thinking other set index methods. The automatic broadcasted getindex method producing

│ Row │ a      │ b      │
│     │ Any    │ Any    │
├─────┼────────┼────────┤
│ 1   │ [5, 6] │ [5, 6] │

is strange and I would be surprised if many people use it.

Maybe the rule should be that any section of a DataFrame of size (n, m) should be able to be imputed with an array of size (n, m)?

Although df[1, :] = DataFrame([5 6]) isn't too hard to remember. Maybe there are performance issues with an intermediate copy?

@nalimilan
Copy link
Member

Arguably df[1, :] = [5, 6] should be written as df[1, :] .= [5, 6] if the goal is to set all entries to [5, 6]. So we could deprecate this (at least for vectors) so that we df[1, :] = x can set the full row when x is a vector, a tuple or a named tuple.

Maybe the rule should be that any section of a DataFrame of size (n, m) should be able to be imputed with an array of size (n, m)?

Yeah, why not.

Although df[1, :] = DataFrame([5 6]) isn't too hard to remember. Maybe there are performance issues with an intermediate copy?

Yes, it should be much slower, so that really shouldn't be recommended.

@oxinabox
Copy link
Contributor

oxinabox commented Apr 2, 2019

We should probably also allow

df[1, :] = (b=6, a=5)

Which should set the right columns by name

@bkamins
Copy link
Member

bkamins commented Apr 2, 2019

See #1646. This is the intention (though now I see we could be more precise in : case that we iterate names).

In general it is a good time, if you have some to spare, to review #1646, as this will be the next big thing after "ownership" that we will implement before DataFrames.jl 1.0.

@oxinabox
Copy link
Contributor

oxinabox commented Apr 2, 2019

Will do

@bkamins
Copy link
Member

bkamins commented Jul 25, 2019

Broadcasting assignment is done in 0.19, so closing it. Some of the setindex! functionality is now not done in 0.19 due to deprecation (but this is a separate issue and now is specified so we only need to implement it now).

@bkamins bkamins closed this as completed Jul 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants