Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support adding columns to views #2794

Merged
merged 32 commits into from
Sep 1, 2021
Merged

Support adding columns to views #2794

merged 32 commits into from
Sep 1, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jun 21, 2021

This partially implements #2785 for:

  • setindex! with : as row selector
  • broadcasted assignment with : as row selector
  • insertcols!

as only these operations, as discussed, ensure that we ADD columns (and not replace them).

Before I add tests let us discuss what we think about this functionality.

@matthieugomez - in particular could you please comment if you would find it useful? (or your fundamental use case is transform! with groupby run on SubDataFrame and having only these methods is of not much value?)

The problematic point is that supporting select! and transfrorm!, as commented earlier creates a slight ambiguity (of course possible to resolve) when if we write :a => identity => :a as a transformation it is not clear if :a should be filled with missing in filtered-out rows or it should be left as is.

docs/src/lib/indexing.md Outdated Show resolved Hide resolved
docs/src/lib/indexing.md Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/other/broadcasting.jl Outdated Show resolved Hide resolved
src/subdataframe/subdataframe.jl Outdated Show resolved Hide resolved
src/other/broadcasting.jl Outdated Show resolved Hide resolved
bkamins and others added 3 commits June 27, 2021 09:51
@bkamins bkamins marked this pull request as draft June 27, 2021 11:26
@bkamins bkamins marked this pull request as ready for review August 8, 2021 12:02
Copy link
Contributor

@pdeffebach pdeffebach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I will leave Milan to comment more on the technical details.

On the API, I wonder if transform!(sdf, ...) should return parent(sdf) after the transformation. This is more in-line with what we do for transfrorm(gd::GroupedDataFrames, ...). We decided to automatically combine because the ungroup stickiness can lead to bugs in R. Maybe we should add a keyword argument to transform! about this?

docs/src/lib/indexing.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
docs/src/lib/indexing.md Show resolved Hide resolved
src/subdataframe/subdataframe.jl Show resolved Hide resolved
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some comments. I haven't looked at tests yet.

On the API, I wonder if transform!(sdf, ...) should return parent(sdf) after the transformation. This is more in-line with what we do for transfrorm(gd::GroupedDataFrames, ...). We decided to automatically combine because the ungroup stickiness can lead to bugs in R. Maybe we should add a keyword argument to transform! about this?

Given that SubDataFrame is an AbstractDataFrame I think it should continue behaving as much as possible like a DataFrame. But we could add a keyword argument later if that turns out to be useful (notably for piping).

NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
src/subdataframe/subdataframe.jl Outdated Show resolved Hide resolved
src/subdataframe/subdataframe.jl Outdated Show resolved Hide resolved
src/subdataframe/subdataframe.jl Outdated Show resolved Hide resolved
src/subdataframe/subdataframe.jl Show resolved Hide resolved
docs/src/lib/indexing.md Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Aug 23, 2021

Also, could you clarify why df.newcol = v is not allowed? I'm sure there is a reason but I always forget.

It is allowed.

What is not allowed is df.newcol .= v. It will be allowed in Julia 1.7. In Julia 1.6 it is not possible, as df.newcol is resolved before broadcasting is invoked. Only Julia 1.7 has feature allowing to delay resolving getproperty.

@bkamins
Copy link
Member Author

bkamins commented Aug 23, 2021

On the API, I wonder if transform!(sdf, ...) should return parent(sdf) after the transformation. (...) Maybe we should add a keyword argument to transform! about this?

In GroupedDataFrame we add an option to group/ungroup for performance reasons. Here, we could add a kwarg as you propose, but writing:

transform!(sdf, ..., parent=true)

is longer than just

parent(transform!(sdf, ...))

In general I would prefer to keep transform and transform! consistent in their return value.

bkamins and others added 2 commits August 25, 2021 08:14
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Aug 25, 2021

@nalimilan I have applied all the discussions we had. I have resolved all conversations that I thought I clear and left only three open.
The main issue is if we should use promote_type or Base.promote_typejoin when doing sdf[!, col] = v. I have discussed in the comments and in the code the potential consequences.

I will update the tests when we settle the design.

# This has an additional effect that for CategoricalVector levels
# and ordering will be retained or not depending on which code patch is taken.

# TODO: add tests when promote_type vs Base.promote_typejoin decision is made
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan - this is the performance optimization we have discussed for select! and transform!. As commented - it has a side effect for CategoricalArrays.jl but I think it is OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but this isn't just an optimization, as the difference can be observed from the outside, right? I think we should only apply tricks that are completely invisible for users, or the behavior will be to complex. Apparently I was wrong in thinking that select! and transform are amenable to such optimizations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - it is visible from outside if the column has metadata. I will restrict the optimization to Vector, as Vector has no metadata (as opposed to PooledVector or CategoricalVector)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it visible also for Vector? That is, if one holds a reference to the column, mutating it rather than replacing it can have consequences. (I'm not overly concerned about PooledVector and CategoricalVector if that only makes a difference in weird cases.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can have consequences if there are aliases to this vector. But OK - let us just copy always and stick to "safety first" rule (if someone wants it fast it is always easy-enough to achieve with indexing).

@bkamins bkamins mentioned this pull request Aug 26, 2021
docs/src/lib/indexing.md Outdated Show resolved Hide resolved
docs/src/lib/indexing.md Outdated Show resolved Hide resolved
newcol = Tables.allocatecolumn(Union{T, Missing}, nrow(dfp))
fill!(newcol, missing)
newcol[rows(df)] = item_new
item_new = newcol
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but item_new_df sounds like a weird name. How about something like item_new_orig?

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Aug 29, 2021

Why isn't : also mentioned here?

It is mentioned at the top in the following part:

The rules for a valid type of index into a row are the following:

  • a value, later denoted as row:
    • an Integer that is not Bool;
  • a vector, later denoted as rows:
    • a vector of Integer other than Bool (does not have to be a subtype of AbstractVector{<:Integer});
    • a vector of Bool that has to be a subtype of AbstractVector{Bool};
    • a Not expression;
    • a colon literal :;
  • an exclamation mark !.

@test df.a == [1.5, 2, 3, 4]
@test eltype(df.a) === Float64

# note that CategoricalVector is dropped as
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan - this part of tests probably deserves your special attention. Thank you!

@bkamins
Copy link
Member Author

bkamins commented Aug 29, 2021

@nalimilan - this should be good to have a look at again. Thank you!

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive set of tests!

Comment on lines 1614 to 1615
# we first copy old data and then add new data so "1" is in levels
# although it is not present in df.a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK "1" would be in the levels even if we only copied the new values, as all values carry the whole set of levels. The only difference can be in the ordering of levels, since there's no well-defined merged order here.

df = DataFrame(a=1:4)
a = df.a
sdf = @view df[1:1, :]
select!(sdf, :a => (x -> x) => :a)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test x -> [1] (here and below)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 153 to 155
df.x1 -= [1, 1, 1]
df.x2 -= [100, 100, 0]
@test df == refdf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply compare df with a literal DataFrame?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - changed

@@ -220,7 +224,8 @@ end
@test df[:, Not("x1")] == refdf[:, 2:end]

dfv = @view df[1:2, 2:end]
@test_throws ArgumentError dfv[!, 1] .+= [0, 1] .+ 1
dfv[!, 1] .+= [0, 1] .+ 1
@test df.x2 == [5.5, 7.5, 6.5]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the contents of the whole df while you're at it (here and elsewhere)? For example, the column could be re-added in the wrong place...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

test/broadcasting.jl Outdated Show resolved Hide resolved
@test_throws BoundsError sdf[:, 4] = ["a", "b", "c"]
@test_throws DimensionMismatch sdf[:, 1] = [1]
@test_throws MethodError sdf[:, 1] = 1
if DataFrames.is_column_insertion_allowed(sdf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any simple way to check this without relying on the internal function? That would allow checking that it's correct too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

test/indexing.jl Show resolved Hide resolved
tmpa = df.a
sdf[:, [:c, :b, :a]] = DataFrame(c=[5, 6], b=[1.0, 2.0], a=[13, 12])
@test df == DataFrame(a=[1, 12, 13, 4, 5],
b=[11.0, 2.0, 1.0, 14.0, 15.0],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect indentation here and below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Aug 29, 2021

Impressive set of tests!

Frankly - we are at the level of complexity of the logic and number of corner cases where I need the tests to convince myself that I can make a PR public.

@bkamins
Copy link
Member Author

bkamins commented Aug 29, 2021

All comments should be now resolved. Thank you!

NEWS.md Outdated Show resolved Hide resolved
@bkamins bkamins merged commit 3a71ae5 into main Sep 1, 2021
@bkamins bkamins deleted the bk/view_add_column branch September 1, 2021 07:01
@bkamins
Copy link
Member Author

bkamins commented Sep 1, 2021

Thank you!

@pdeffebach - now it is probably the time to experiment with this functionality in combination with what DataFramesMeta.jl could provide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants