Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assignment to SubDataFrame #2785

Closed
bkamins opened this issue Jun 10, 2021 · 13 comments
Closed

Assignment to SubDataFrame #2785

bkamins opened this issue Jun 10, 2021 · 13 comments
Labels
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Jun 10, 2021

@pdeffebach following our discussion on WhereDataFrame I propose the following extra new rules for assignment to SubDataFrame.

They add complexity unfortunately, but the benefit is that they are doing what probably user would want (I hope).
The crucial decision is the following:

  • for adding new columns the issue is simple I think and it is clear we should add the functionality;
  • Stata and data.table DO NOT allow replacing existing columns; however, if we want we can (and my proposal assumes this); the point is that with df[:, col] = v and df[!, col] = v distinction we can support both functionalities cleanly. I thought users might find it useful to replace existing columns with filtered-out values rather than only being able to create a new column this way. But maybe you will feel that this is too flexible. Then we can only focus on adding functionalities for adding columns
  • also this means that what I propose now allows to remove columns from the parent via select!

In summary - this is a "full option" proposal. I am OK to limit its scope to only add things that we find really useful in practice (but I thought it is better to have at this stage all things laid out on a table and decide what we allow and what we disallow).

The extra rules are only for SubDataFrame that does not subset columns (uses : as column selector - the point is to use : as selector not to select all columns in their original order as this is a less strict requirements).
This is easily identified using the AbstractIndex in the type. Example:

julia> df = DataFrame(a=1,b=2,c=3)
1×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      2      3

julia> typeof(view(df, 1:1, 1:2)) # NOT ALLOWED
SubDataFrame{DataFrame, DataFrames.SubIndex{DataFrames.Index, UnitRange{Int64}, UnitRange{Int64}}, UnitRange{Int64}}   

julia> typeof(view(df, 1:1, :)) # ALLOWED
SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}

julia> typeof(view(df, 1:1, All())) # NOT ALLOWED
SubDataFrame{DataFrame, DataFrames.SubIndex{DataFrames.Index, UnitRange{Int64}, UnitRange{Int64}}, UnitRange{Int64}} 

Functions to cover:

  • insertcols! - natural (it does not allow column overwriting already)
  • select! and transform!: columns that are not transformed at all (i.e. are passed as bare :a or r"a", or column renaming, or identity function) are reused; all other operations create columns with missing in filtered-out rows
  • sdf[!, col] = v, sdf[!, cols] = v, sdf.col = v, sdf[!, col] .= v, sdf[!, cols] .= v, and sdf.col .= v are allowed and replaces columns putting missing in filtered-out rows (all other rules are the same as for DataFrame)
  • sdf[:, col] = v and sdf[:, col] .= v if col is not present in a data frame are allowed and replaces columns putting missing in filtered-out rows

(if I have missed any of the functions that add/remove columns from a data frame in-place please comment)

CC @nalimilan @matthieugomez

@bkamins bkamins added this to the 1.x milestone Jun 10, 2021
@nalimilan
Copy link
Member

Sounds good. I agree it's probably better to start with the more essential features, and evaluate in a second phase whether replacing columns should be allowed.

The extra rules are only for SubDataFrame that does not subset columns (uses : as column selector - the point is to use : as selector not to select all columns in their original order as this is a less strict requirements).

What extra rules are you referring to?

(Ref. #2467 (comment))

@bkamins
Copy link
Member Author

bkamins commented Jun 11, 2021

OK. Then I will add only essential set of methods allowing for adding columns.

Extra rules: means rules I describe in this issue. If SubDataFrame is created with something else than : as column selector then we will not allow any of the actions described in this issue - all will work as it currently does.

I think we do not need any mutable=true or something like this field.

The reason is that if you create a SubDataFrame with : then it inherits Index from its parent so all things are guarantted remain consistent (including - other SubDataFrames will not get invalidated under the "reduced set of rules" that only allows for adding of the columns).

Is this clear?

When I make a PR I will add testset showing the difference.

@nalimilan
Copy link
Member

nalimilan commented Jun 25, 2021

Something that I had not realized is that having e.g. transform(view(df, ...), :x => ByRow(x -> x < 0 ? 0 : x) => :x) set entries in df.x to missing for all rows not retained by view could be unexpected. In many cases, it would be more natural to keep their original value. This is what Stata does with replace ... if and people use that a lot there. Of course doing that would be problematic for us, as contrary to Stata we also allow creating new columns -- so our behavior would change depending on whether the columns exist or not, which could also be confusing.

Maybe it would be worth thinking about the use cases though. For example we could introduce an inplace=false keyword argument that when true would force reusing existing vectors (both for DataFrame and SubDataFrame inputs).

@pdeffebach
Copy link
Contributor

transform(view(df, ...), :x => (x -> x < 0 ? 0 : x) => :x)

How are you getting this? On the PR I have

julia> df = DataFrame(a = [1, 2, 3, 4], b = [5, 6, 7, 8]);

julia> transform(view(df, 1:2, :), :a => ByRow(a -> a < 0 ? 0 : a) => :a)
2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      5
   2 │     2      6

julia> df
4×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      5
   2 │     2      6
   3 │     3      7
   4 │     4      8

I agree that defaulting to in-place is the correct behavior. I hope we can accommodate it without violating semver though.

@bkamins
Copy link
Member Author

bkamins commented Jun 25, 2021

I assume you both meant transform! not transform. The reason is that transform creates a NEW data frame with only the rows present in the AbstractDataFrame passed.

I agree that defaulting to in-place is the correct behavior.

Regarding transform! we potentially can have three behaviors:

  • in-place
  • new column filling omitted rows with missing
  • new column filling omitted rows with values from old column

Because of these ambiguities in the PR I have not implemented any of them now. Indeed each of them can potentially be useful. In transform! and select! indeed we could use a kwarg to handle all options. The problem is which option should be used in:

sdf.x = v

if x were already present in sdf?

Though maybe having "new column filling omitted rows with values from old column" as the only option provided is good enough? (the resason it is better than in-place is because new values might not have types matching the eltype of old column)

@nalimilan
Copy link
Member

Yes indeed I meant transform!.

Though maybe having "new column filling omitted rows with values from old column" as the only option provided is good enough?

Yeah, probably. The fact that the result of transform! or sdf.x = v would depend on whether the column exists or not is a bit annoying, but maybe that's OK as anyway if you call transform! on a SubDataFrame you probably know its structure quite precisely. The advantage of that approach is that calling transform! or sdf.x = v would never change the contents of existing columns in the parent for rows not included in sdf.

(the resason it is better than in-place is because new values might not have types matching the eltype of old column)

Right. Also in-place as a default would be weird as it would be completely different from what happens with DataFrame.

@bkamins
Copy link
Member Author

bkamins commented Jun 26, 2021

The fact that the result of transform! or sdf.x = v would depend on whether the column exists or not is a bit annoying, but maybe that's OK

I was thinking about it, and I think it is OK. The reasoning is that we essentially will say:

the result will be filled with the contents of the column in the parent unless this column does not exist in which case it will be filled with missing

So, in a sense, non-existing columns would work "as if" they contained virtual missing value, which I think is OK.

@pdeffebach - please comment if we are in agreement here. If yes, I will continue with #2794 (adding what we discuss here + responding to @nalimilan suggestions).

As a side note - adding this functionality to transform! and select! should be easy, as I can handle this in post processing, so fortunately, this means that we will not have to complicate the logic of these functions.

@pdeffebach
Copy link
Contributor

Yes, this seems good. I think Stata's rules are easy to understand and it's good to emulate them.

One thing to note, though, is that this doesn't solve problems with if entirely.

  1. It's very heavy, creating a subdataframe creates views for every array iirc.
  2. It wouldn't be at a per-transformation basis, so we aren't really emulating Stata's
gen y = x + z if a == 5

So we will still have work to do on the API after this.

@bkamins
Copy link
Member Author

bkamins commented Jun 26, 2021

It's very heavy

Could you please clarify what you mean here? I agree that requiring first to create a view, and then operating on it is a bit heavy, but I feel that you mean something else.

It wouldn't be at a per-transformation basis

As a next step we could consider where kwarg in select/select!/transform/transform! but I would leave the discussion how to implement them in detail for later (as what we discuss here is a required basis).

@pdeffebach
Copy link
Contributor

pdeffebach commented Jun 26, 2021

If I have a data frame with 1000 columns and you do something like

sdf = subset(df, ...; view = true)
@transform!(sdf, y = f(:x))

Am I going to pay a cost of creating a view for all 1000 columns, or do I just pay the cost of making a view of d.f.x?

@bkamins
Copy link
Member Author

bkamins commented Jun 26, 2021

Only df.x. The process will be (for transform!):

  1. execute select(sdf, :x => f => :y)
  2. check if :y is present in df; and either produce a final column filling with missing or with what was in :y previously

PS. Preferably I propose not to use DataFramesMeta.jl syntax in DataFrames.jl discussions (in particular DataFramesMeta.jl does not have a finalized syntax yet).

@pdeffebach
Copy link
Contributor

Ah okay, I will remember that in the future.

I think this is exactly correct. I'm fine with groupby(df, ...) and tranform(df ...) being next to each other in a block in the case of a grouped operation, instead of by(df, ...) so I think continuing that pattern is great.

@bkamins
Copy link
Member Author

bkamins commented Sep 7, 2021

closing as this is now implemented

@bkamins bkamins closed this as completed Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants