Support adding columns to views #2794

bkamins · 2021-06-21T09:22:08Z

This partially implements #2785 for:

setindex! with : as row selector
broadcasted assignment with : as row selector
insertcols!

as only these operations, as discussed, ensure that we ADD columns (and not replace them).
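As a minimal sketch of the three operations listed above (assuming DataFrames.jl 1.3 or later, where this PR is available, and a view created with `:` as the column selector), the column names and values below are made up for illustration:

```julia
using DataFrames

df = DataFrame(a=1:4)
sdf = @view df[2:3, :]              # `:` as column selector is required

sdf[:, :b] = [10, 20]               # setindex! with `:` as row selector
sdf[:, :c] .= 0                     # broadcasted assignment with `:` as row selector
insertcols!(sdf, :d => ["x", "y"])  # insertcols!

# The new columns are created in the parent; rows outside the view hold `missing`.
println(df)
```

In each case the parent `df` gains the column, and the rows that the view filters out are filled with `missing`.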

Before I add tests let us discuss what we think about this functionality.

@matthieugomez - in particular, could you please comment on whether you would find it useful? (Or is your fundamental use case transform! with groupby run on a SubDataFrame, so that having only these methods is not of much value?)

The problematic point is that supporting select! and transform!, as commented earlier, creates a slight ambiguity (of course possible to resolve): if we write :a => identity => :a as a transformation, it is not clear whether :a should be filled with missing in filtered-out rows or left as is.

pdeffebach

Thanks!

I will leave Milan to comment more on the technical details.

On the API, I wonder if transform!(sdf, ...) should return parent(sdf) after the transformation. This is more in line with what we do for transform(gd::GroupedDataFrame, ...). We decided to automatically combine because the ungroup stickiness can lead to bugs in R. Maybe we should add a keyword argument to transform! about this?

nalimilan

Here are some comments. I haven't looked at tests yet.

On the API, I wonder if transform!(sdf, ...) should return parent(sdf) after the transformation. This is more in line with what we do for transform(gd::GroupedDataFrame, ...). We decided to automatically combine because the ungroup stickiness can lead to bugs in R. Maybe we should add a keyword argument to transform! about this?

Given that SubDataFrame is an AbstractDataFrame I think it should continue behaving as much as possible like a DataFrame. But we could add a keyword argument later if that turns out to be useful (notably for piping).

bkamins · 2021-08-23T13:41:21Z

Also, could you clarify why df.newcol = v is not allowed? I'm sure there is a reason but I always forget.

It is allowed.

What is not allowed is df.newcol .= v. It will be allowed in Julia 1.7. In Julia 1.6 it is not possible, as df.newcol is resolved before broadcasting is invoked. Only Julia 1.7 adds a feature that allows delaying the resolution of getproperty.
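A small sketch of the distinction, assuming DataFrames.jl 1.3 or later; the column names are illustrative. The `df.newcol .= v` form is guarded by a version check so the snippet also runs on Julia 1.6:

```julia
using DataFrames

df = DataFrame(a=1:3)

df.b = [10, 20, 30]   # plain assignment to a new column: allowed
df[!, :c] .= 0        # broadcasted assignment with explicit indexing: allowed

# Broadcasting into a not-yet-existing property needs Julia 1.7 or later,
# because earlier versions resolve `df.d` before broadcasting starts.
if VERSION >= v"1.7"
    df.d .= 1
end

println(names(df))
```

On Julia 1.6 the explicit `df[!, :c] .= 0` form is the portable way to create a column by broadcasting.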

bkamins · 2021-08-23T14:09:45Z

On the API, I wonder if transform!(sdf, ...) should return parent(sdf) after the transformation. (...) Maybe we should add a keyword argument to transform! about this?

In GroupedDataFrame we add an option to group/ungroup for performance reasons. Here, we could add a kwarg as you propose, but writing:

transform!(sdf, ..., parent=true)

is longer than just

parent(transform!(sdf, ...))

In general I would prefer to keep transform and transform! consistent in their return value.
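To illustrate the suggestion, here is a hedged sketch assuming DataFrames.jl 1.3 or later and a view created with `:` as the column selector. The transformation and the column name `:b` are made up for the example:

```julia
using DataFrames

df = DataFrame(a=1:4)
sdf = @view df[2:3, :]

# transform! returns the view it was given; wrapping the call in parent()
# recovers the full data frame without a dedicated keyword argument.
full = parent(transform!(sdf, :a => (x -> 2 .* x) => :b))

println(full)
```

Here `full` is the same object as `df`, so a pipeline that needs the parent can simply end with `parent`.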

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2021-08-25T07:54:47Z

@nalimilan I have applied the outcome of all the discussions we had. I have resolved all conversations that I thought were clear and left only three open.
The main issue is whether we should use promote_type or Base.promote_typejoin when doing sdf[!, col] = v. I have discussed the potential consequences in the comments and in the code.
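For readers unfamiliar with the two functions, the following sketch uses only Base Julia to show how they differ on a pair of element types. It illustrates the trade-off in general and is not taken from the PR code:

```julia
# promote_type picks a common type that values can be converted to.
t1 = promote_type(Int, Float64)

# promote_typejoin keeps values as they are and widens the container type
# to a common supertype (apart from special handling of Missing/Nothing).
t2 = Base.promote_typejoin(Int, Float64)

# Both agree when one side is Missing.
t3 = promote_type(Int, Missing)
t4 = Base.promote_typejoin(Int, Missing)

println((t1, t2, t3, t4))
```

With `promote_type` the existing integers would be converted to `Float64`, giving a concrete element type; with `Base.promote_typejoin` they would stay integers inside a column with the abstract element type `Real`.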

I will update the tests when we settle the design.

bkamins · 2021-08-25T07:56:14Z

src/subdataframe/subdataframe.jl

+        # This has an additional effect that for CategoricalVector levels
+        # and ordering will be retained or not depending on which code patch is taken.
+
+        # TODO: add tests when promote_type vs Base.promote_typejoin decision is made


@nalimilan - this is the performance optimization we have discussed for select! and transform!. As commented - it has a side effect for CategoricalArrays.jl but I think it is OK.

nalimilan · 2021-08-28

Ah, but this isn't just an optimization, as the difference can be observed from the outside, right? I think we should only apply tricks that are completely invisible to users, or the behavior will be too complex. Apparently I was wrong in thinking that select! and transform! are amenable to such optimizations.

bkamins · 2021-08-29

Yes - it is visible from outside if the column has metadata. I will restrict the optimization to Vector, as Vector has no metadata (as opposed to PooledVector or CategoricalVector).

nalimilan · 2021-08-29

Isn't it visible also for Vector? That is, if one holds a reference to the column, mutating it rather than replacing it can have consequences. (I'm not overly concerned about PooledVector and CategoricalVector if that only makes a difference in weird cases.)

bkamins · 2021-08-29

It can have consequences if there are aliases to this vector. But OK - let us just always copy and stick to the "safety first" rule (if someone wants it fast, it is always easy enough to achieve with indexing).
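The aliasing concern can be seen with ordinary data frame indexing. This sketch uses a plain DataFrame rather than the select!/transform! code path under discussion, so it only illustrates the general difference between mutating a column and replacing it:

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])
held = df.a                  # a reference to the stored column vector

df[:, :a] = [10, 20, 30]     # in-place update: the held reference sees it
after_inplace = copy(held)

df[!, :a] = [7, 8, 9]        # replacement: df now stores a different vector
after_replace = copy(held)

println((after_inplace, after_replace, df.a === held))
```

After the in-place update the held vector changes along with the data frame; after the replacement it keeps the old values and is no longer the stored column.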

docs/src/lib/indexing.md

nalimilan · 2021-08-28T15:49:52Z

src/dataframe/dataframe.jl

+            newcol = Tables.allocatecolumn(Union{T, Missing}, nrow(dfp))
+            fill!(newcol, missing)
+            newcol[rows(df)] = item_new
+            item_new = newcol


OK, but item_new_df sounds like a weird name. How about something like item_new_orig?
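The padding step in the hunk above can be sketched with Base Julia alone. This version swaps `Tables.allocatecolumn` for a plain `Vector` constructor, and the names `nparent` and `viewrows` are hypothetical stand-ins for `nrow(dfp)` and `rows(df)`:

```julia
T = Int
nparent = 5              # assumed number of rows in the parent data frame
viewrows = [2, 4]        # assumed parent rows selected by the view
item_new = [10, 20]      # values supplied for the view's rows

# Allocate a parent-length column that can hold missing, filled with missing,
# then write the supplied values into the rows the view covers.
newcol = Vector{Union{T, Missing}}(missing, nparent)
newcol[viewrows] = item_new

println(newcol)
```

The resulting vector has the parent's length, with `missing` in every filtered-out row.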

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2021-08-29T10:07:34Z

Why isn't : also mentioned here?

It is mentioned at the top in the following part:

The rules for a valid type of index into a row are the following:

a value, later denoted as row:

an Integer that is not Bool;

a vector, later denoted as rows:

a vector of Integer other than Bool (does not have to be a subtype of AbstractVector{<:Integer});

a vector of Bool that has to be a subtype of AbstractVector{Bool};

a Not expression;

a colon literal :;

an exclamation mark !.
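A short sketch of the row selectors listed above, applied to a small made-up data frame:

```julia
using DataFrames

df = DataFrame(a=1:4, b=["w", "x", "y", "z"])

r1 = df[2, :a]                            # an Integer: a single row
r2 = df[[1, 3], :a]                       # a vector of Integer
r3 = df[[true, false, true, false], :a]   # a vector of Bool
r4 = df[Not(1), :a]                       # a Not expression
r5 = df[:, :a]                            # colon: all rows, returns a copy
r6 = df[!, :a]                            # exclamation mark: all rows, no copy

println((r1, r2, r3, r4))
```

The difference between the last two is that `:` copies the column on read while `!` returns the stored vector itself.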

bkamins · 2021-08-29T10:31:13Z

test/subdataframe_mutation.jl

+    @test df.a == [1.5, 2, 3, 4]
+    @test eltype(df.a) === Float64
+
+    # note that CategoricalVector is dropped as


@nalimilan - this part of tests probably deserves your special attention. Thank you!
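The setup lines of this test are not shown in the excerpt, so the following is only a reconstruction that is consistent with the two assertions above. It assumes DataFrames.jl 1.3 or later and that a Float64 value is assigned through `sdf[!, :a]` on a one-row view:

```julia
using DataFrames

df = DataFrame(a=1:4)
sdf = @view df[1:1, :]

# Replacing an existing column through the view keeps the values in the
# filtered-out rows and widens the element type with promote_type.
sdf[!, :a] = [1.5]

println(df.a)
```

Because `promote_type(Int, Float64)` is `Float64`, the retained integers are converted and the column gets a concrete element type.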

bkamins · 2021-08-29T12:34:37Z

@nalimilan - this should be good to have a look at again. Thank you!

nalimilan

Impressive set of tests!

nalimilan · 2021-08-29T14:17:19Z

test/subdataframe_mutation.jl

+    # we first copy old data and then add new data so "1" is in levels
+    # although it is not present in df.a


AFAIK "1" would be in the levels even if we only copied the new values, as all values carry the whole set of levels. The only difference can be in the ordering of levels, since there's no well-defined merged order here.

nalimilan · 2021-08-29T14:20:38Z

test/subdataframe_mutation.jl

+    df = DataFrame(a=1:4)
+    a = df.a
+    sdf = @view df[1:1, :]
+    select!(sdf, :a => (x -> x) => :a)


Also test x -> [1] (here and below)?

nalimilan · 2021-08-29T20:28:33Z

test/broadcasting.jl

+    df.x1 -= [1, 1, 1]
+    df.x2 -= [100, 100, 0]
+    @test df == refdf


Why not simply compare df with a literal DataFrame?

ok - changed

nalimilan · 2021-08-29T20:31:21Z

test/broadcasting.jl

@@ -220,7 +224,8 @@ end
    @test df[:, Not("x1")] == refdf[:, 2:end]

    dfv = @view df[1:2, 2:end]
-    @test_throws ArgumentError dfv[!, 1] .+= [0, 1] .+ 1
+    dfv[!, 1] .+= [0, 1] .+ 1
+    @test df.x2 == [5.5, 7.5, 6.5]


Check the contents of the whole df while you're at it (here and elsewhere)? For example, the column could be re-added in the wrong place...
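A small sketch of why the whole-frame comparison is stronger than a per-column check. The data is made up, and the reordered frame stands in for a column that was re-added in the wrong place:

```julia
using DataFrames

expected = DataFrame(x1=[1, 2], x2=[3, 4])
reordered = select(expected, :x2, :x1)   # same columns, wrong order

# A per-column check cannot see the misplaced column ...
percolumn_ok = reordered.x2 == expected.x2

# ... while comparing whole data frames also compares column order.
whole_ok = reordered == expected

println((percolumn_ok, whole_ok))
```

The per-column comparison passes, whereas the whole-frame comparison fails, which is the kind of regression the suggested check would catch.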

test/broadcasting.jl

nalimilan · 2021-08-29T20:40:33Z

test/indexing.jl

        @test_throws BoundsError sdf[:, 4] = ["a", "b", "c"]
        @test_throws DimensionMismatch sdf[:, 1] = [1]
        @test_throws MethodError sdf[:, 1] = 1
+        if DataFrames.is_column_insertion_allowed(sdf)


Any simple way to check this without relying on the internal function? That would allow checking that it's correct too.

test/indexing.jl

nalimilan · 2021-08-29T20:55:43Z

test/subdataframe_mutation.jl

+        tmpa = df.a
+        sdf[:, [:c, :b, :a]] = DataFrame(c=[5, 6], b=[1.0, 2.0], a=[13, 12])
+        @test df == DataFrame(a=[1, 12, 13, 4, 5],
+                            b=[11.0, 2.0, 1.0, 14.0, 15.0],


Incorrect indentation here and below.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2021-08-29T21:11:26Z

Impressive set of tests!

Frankly - we are at the level of complexity of the logic and number of corner cases where I need the tests to convince myself that I can make a PR public.

bkamins · 2021-08-29T21:44:44Z

All comments should now be resolved. Thank you!

NEWS.md

bkamins · 2021-09-01T07:03:19Z

Thank you!

@pdeffebach - now it is probably the time to experiment with this functionality in combination with what DataFramesMeta.jl could provide.

bkamins added 4 commits June 11, 2021 19:50

add setindex! rules

289fadf

implement setindex! and broadcasting assignment

a253236

implement insertcols!

e33c605

add NEWS.md entry

7f1814a

bkamins requested review from nalimilan and pdeffebach June 21, 2021 09:22

bkamins mentioned this pull request Jun 22, 2021

Use standard Tables.Schema constructor instead of constructing directly #2797

Merged

bkamins added this to the 1.3 milestone Jun 23, 2021

nalimilan reviewed Jun 25, 2021

bkamins mentioned this pull request Jun 26, 2021

Assignment to SubDataFrame #2785

Closed

bkamins and others added 3 commits June 27, 2021 09:51

Apply suggestions from code review

4a22a1d

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

changes after code review part 2

b721a46

docs update

412b89c

bkamins marked this pull request as draft June 27, 2021 11:26

setindex! for ! and setproperty

b95f07a

bkamins mentioned this pull request Aug 1, 2021

Add spreadmissings, the backend for column-wise @passmissing JuliaData/DataFramesMeta.jl#276

Open

bkamins added 13 commits August 6, 2021 17:35

Merge branch 'main' into bk/view_add_column

af54c29

fix NEWS.md

dc0d241

another NEWS.md fix

9f02571

another small NEWS.md change

121bb54

finished tests for df[!, col] assignment and broadcasted assignment

1a83b61

some more tests

7d5a65b

done tests of ! assignment and broadcasting assignment

f614e58

finished assignment, broadcasted assignment and insertcols!

e886dc1

fix tests

59afd61

fix tests on Julia 1.7

700e65d

one more test fix

50d9f8b

finalize all required changes

6045034

fix 1.7 broadcasting

1ae0534

bkamins marked this pull request as ready for review August 8, 2021 12:02

pdeffebach reviewed Aug 19, 2021

docs/src/lib/indexing.md

NEWS.md

docs/src/lib/indexing.md

src/subdataframe/subdataframe.jl

nalimilan reviewed Aug 20, 2021

bkamins and others added 2 commits August 25, 2021 08:14

Apply suggestions from code review

971c282

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

apply suggestions after code review

c4cb1ae

bkamins commented Aug 25, 2021

fix fast path is select!/transform!

cadc128

bkamins mentioned this pull request Aug 26, 2021

Bk/add leftjoin! #2843

Merged

Merge branch 'main' into bk/view_add_column

bdbf09a

nalimilan reviewed Aug 28, 2021

Apply suggestions from code review

4ad940c

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

changes after code review and promote_type decision

8f134ef

bkamins commented Aug 29, 2021

fix tests

1f4aa78

nalimilan reviewed Aug 29, 2021

Apply suggestions from code review

55a6d75

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

apply changes after code review

93de064

nalimilan approved these changes Aug 31, 2021

Merge branch 'main' into bk/view_add_column

3e5d8d8

bkamins commented Sep 1, 2021

NEWS.md

Update NEWS.md

f25d333

bkamins merged commit 3a71ae5 into main Sep 1, 2021

bkamins deleted the bk/view_add_column branch September 1, 2021 07:01
