Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow rename when selecting #1975

Closed
wants to merge 5 commits into from

Conversation

innerlee
Copy link
Contributor

@innerlee innerlee commented Oct 5, 2019

Based on the previous pr #1974.

Now it supports

julia> df = DataFrame(A = 1:3, B = 'A':'C', C = [:x, :y, :z])
3×3 DataFrame
│ Row │ A     │ B    │ C      │
│     │ Int64 │ Char │ Symbol │
├─────┼───────┼──────┼────────┤
│ 11'A'  │ x      │
│ 22'B'  │ y      │
│ 33'C'  │ z      │

julia> df[:, [:B, :C=>:A]]
3×2 DataFrame
│ Row │ B    │ A      │
│     │ Char │ Symbol │
├─────┼──────┼────────┤
│ 1'A'  │ x      │
│ 2'B'  │ y      │
│ 3'C'  │ z      │

Signed-off-by: lizz <lizz@sensetime.com>
innerlee and others added 2 commits October 5, 2019 21:38
Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>
Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>
@bkamins
Copy link
Member

bkamins commented Oct 5, 2019

Thank you for the proposal. I have two general comments:

First, the PR has also changes from #1974 in it, this probably should be cleaned up at some point (this is a minor issue).

Second (and this is a crucial point), I am not 100% sure that this composes well. I.e. maybe it is better to keep select only specialized for selection only without renaming. I am not saying we should not do it, but first we should decide the scope of this change. We could keep the functionality limited to renaming in vectors (this is what you propose), but I am wondering if we want to support something like:

select(df, :a=>:b, :b=>:a)

and then the problem kicks in that we interpret it as select(df, All(:a=>:b, :b=>:a)) (and various interactions now can happen). Any thoughts on this? (note that here I am referring to the version of DataFrames.jl on master, where you can pass multiple positional arguments to select)

@innerlee innerlee force-pushed the zz/selectwithrename branch 2 times, most recently from 79c7e57 to 1ebb978 Compare October 5, 2019 17:01
@innerlee
Copy link
Contributor Author

innerlee commented Oct 5, 2019

Thanks for the comment. Let me describe the background story and motivation for the change.

I started (for the first time) to use DataFrames.jl two days ago when the readdlm and writedlm is not complex enough for a task at hand.
I met a use case where I want this change.

For example, when you want to modify two columns:

julia> df = DataFrame(name=["vid1","vid2","vid3"], pred1=["walk","talk","run"],p1=[0.3,0.5,0.6],pred2=["walk","run","talk"],p2=[0.7,0.6,0.5],outpred=["","",""],outp=[0.,0.,0.])
3×7 DataFrame
│ Row │ name   │ pred1  │ p1      │ pred2  │ p2      │ outpred │ outp    │
│     │ String │ String │ Float64 │ String │ Float64 │ String  │ Float64 │
├─────┼────────┼────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ 1   │ vid1   │ walk   │ 0.3     │ walk   │ 0.7     │         │ 0.0     │
│ 2   │ vid2   │ talk   │ 0.5     │ run    │ 0.6     │         │ 0.0     │
│ 3   │ vid3   │ run    │ 0.6     │ talk   │ 0.5     │         │ 0.0     │

julia> mask = df.p2 .> 0.5
3-element BitArray{1}:
 1
 1
 0

julia> df[mask, [:outpred, :outp]] .= df[mask, [:pred2, :p2]] # their names must match
ERROR: ArgumentError: Column names in broadcasted data frames must match. Non matching column names are pred2, p2, outpred and outp
Stacktrace:
...

julia> df[mask, [:outpred, :outp]] .= df[mask, [:pred2=>:outpred, :p2=>:outp]] # <- look here
2×2 SubDataFrame
│ Row │ outpred │ outp    │
│     │ String  │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ walk    │ 0.7     │
│ 2   │ run     │ 0.6     │

julia> df
3×7 DataFrame
│ Row │ name   │ pred1  │ p1      │ pred2  │ p2      │ outpred │ outp    │
│     │ String │ String │ Float64 │ String │ Float64 │ String  │ Float64 │
├─────┼────────┼────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ 1   │ vid1   │ walk   │ 0.3     │ walk   │ 0.7     │ walk    │ 0.7     │
│ 2   │ vid2   │ talk   │ 0.5     │ run    │ 0.6     │ run     │ 0.6     │
│ 3   │ vid3   │ run    │ 0.6     │ talk   │ 0.5     │         │ 0.0

So, the motivation is for the getindex method. Now most of the getindex is implemented through select, so I implement the select first. In this case, renaming only for vectors is enough.

@innerlee innerlee changed the title Allow rename when selecting Allow rename in getindex Oct 5, 2019
Signed-off-by: lizz <lizz@sensetime.com>
Signed-off-by: lizz <lizz@sensetime.com>
@innerlee
Copy link
Contributor Author

innerlee commented Oct 6, 2019

Properly rebased. Only the last commit belong to this pr.

@bkamins
Copy link
Member

bkamins commented Oct 6, 2019

Thank you for all the input. What I would propose to do before we move forward with the review of the implementation is to make sure what is the target API so that JuliaData memebers can comment on it.

The reason is, as I have commented earlier that mixing selection (or getindex which is essentially a selection combined with filtering) with renaming breaks composability a bit (it does not mean that I think we should disallow it, as I understand the use case, but we must have a clear target design before moving forward).

So would be willing to write down the target API for this change (like what should go into the documentation)? In particular please note that it seems that this change will not combine will @view and more importantly @views and Base.dotview, which is slightly problematic also.

@bkamins
Copy link
Member

bkamins commented Oct 6, 2019

Just as a note rename! change is not problematic as it is a local function of DataFrame.jl doing only one thing (that is why I think we can move forward fast in #1974 as opposed to this PR). While getindex is from Base, and should be composable with other things users might define for it, while select is from DataAPI.jl, and ideally should be consistent across different packages.

@innerlee
Copy link
Contributor Author

innerlee commented Oct 6, 2019

Agree with you, but assessing the composability and consistency is beyond my knowledge (as I seldom use Data-related packages), so I'm willing to wait for inputs from anyone else.

My intuition comes from SQL where renaming is often used with selection, e.g.

SELECT day_of_order AS "Date"  # <- a rename
Customer As "Client",  # <- another rename
Product,  
Quantity,  
FROM orders; 

@innerlee
Copy link
Contributor Author

innerlee commented Oct 6, 2019

A quick search:

The @select macro is interesting because it allows transform columns when selecting, just like in SQL

SELECT Id AS Identifier, 
LastName + ', ' + FirstName AS CustomerName,  # <- also come with rename
FROM Customer

Maybe we can change the syntax from

@select(df, :c, x = :b + :c)

into

@select(df, :c, :b + :c => :x)

@bkamins
Copy link
Member

bkamins commented Oct 6, 2019

OK - I will think about it and write down what I think is safe to do as I think this functionality is useful.

Regarding the packages - the major alternative to DataFrames.jl is JuliaDB.jl related ecosystem. There you have it defined here.

We are in the process of synchronizing the API across packages. That is the point of DataAPI.jl (actually I thought we migrated select there already, but it seems my memory was wrong and this has not happened yet).

A major point to consider is that Pair{Selection => Function} in JuliaDB.jl is already taken in JuliaDB.jl to denote a transformation of a variable without renaming it.

Also maybe we should only provide renaming in select not in getindex. Then the problems with views and broadcasting would not occur.

Finally, why do you think the syntax:

@select(df, :c, x = :b + :c)

is not OK?

@innerlee
Copy link
Contributor Author

innerlee commented Oct 6, 2019

The current version of select in JuliaDB.jl differs from DataFrames.jl very much.

here wrote

select(t::Table, which::Selection)
`Selection` is a type union of many types that can select from a table. It can be:
...

* `Pair{Selection => Function}` – selects and maps a function over the selection, returns the result.
* `AbstractArray` – returns the array itself. This must be the same length as the table.  # <- look here

The AbstractArray version simply returns the array itself whereas in DataFrames.jl it is used as an array of indices.
The point is, inconsistency seems to be huge at current status.

Furthermore, it seems the form Pair{Selection => Function} only returns one column without a name. It is not a good api because other options return a table while this one returns an array.

For the @select(df, :c, x = :b + :c) syntax, it does not allow a function form, so I do not like it.
With type piracy, we can have a function form of select with the ability of calculated columns with rename

select(df, [:c, :b + :c => :x])

@bkamins
Copy link
Member

bkamins commented Oct 6, 2019

I agree there is a discrepancy, but my intuition is that it would be good to decrease the level of it with time.

Regarding your proposal here is what I think:

  • I would prefer getindex not to allow changing column names
  • select and select! can allow changing column names; a basic pattern is that can be now callded with a single argument that is either a Pair or an AbstractVector of Pairs, Signed, Unsigned or Symbols (I am talking about actual contents of the vector not the type). Then the change should support passing multiple arguments and proper combining of the column renaming expressions with Regex, Not and All results (in particular note that All allows for duplicate column names, which is intended, and then we would have to specify if we take the first or the last column with the same name if we find duplicates).

The syntax could be:

source_column => transformation => target_column_name :: Pair{<:ColumnIndex,Pair{<:Function,Symbol}}

which would take a source column, transform it using the specified transformation (a function accepting one argument and returning a value; here we should make sure to correctly handle copycols kwarg but it should not be problematic) and store it under target column name. The transformation as in JuliaDB.jl should accept scalars (not vectors). CC @nalimilan - do you agree?

with two shorthands:

source_column => target_column_name :: Pair{<:ColumnIndex,Symbol}

expanding to:

source_column => identity => target_column_name

and

source_column => transformation :: Pair{<:ColumnIndex,<:Function}

expanding to:

source_column => transformation => source_column_name

If we went this way the following things should be reviewed in the whole code base against this change:

  • all getindex, view and dotview methods defined that touch columns (by all I mean for all types that support them)
  • all select and select! methods (by all I mean for all types that support them)

In particular I already see that:

Base.getindex(df::DataFrame, row_ind::Colon, col_inds::Union{AbstractVector, Regex, Not, Between, All, Colon})

and

Base.getindex(df::DataFrame, row_ind::typeof(!), col_inds::Union{AbstractVector, Regex, Not, Between, All, Colon})

should be reimplemented

and all view-related methods should intercept the new pattern and throw an informative error message (now they would pass down the argument to SubIndex and probably some cryptic error for the user would be thrown).

Overall, after thinking of it, I think your proposal would greatly improve the functionality of select and select! and I like it. However, as you can see - making this change digs deep into the core infrastructure of DataFrames.jl, but what I have described would need to be done I think if we want to provide a consistent user experience.

A simpler approach would be to implement this only for a single Pair or a single AbstractVector argument to select and select! if you would prefer to stick to this only and I would later make another PR extending it to work consistently with all other allowed options (like All, multiple arguments etc.).

CC @nalimilan, @quinnj, @oxinabox, @pdeffebach, @piever (in case you wanted to comment on it as it is a major change 😄)

@bkamins
Copy link
Member

bkamins commented Oct 6, 2019

Also now as I review #1849 by @pdeffebach then as an additional functionality (this does not have to be done now, but I want to keep things in a single place if we discuss the target design issues) is to allow things like:

r"^m" => sin

or

: => sin

as arguments to select.

@pdeffebach (the difference here from what you proposed in #1849 is that it is => not .=> as a broadcast would get resolved before the call to select, so we should handle this without broadcasting inside select).

@pdeffebach
Copy link
Contributor

I've always been critical of data frames libraries that combine select and rename. I think it makes code less readable. But I understand it's a feature many people want. I agree with @bkamins that getindex should not allow renaming.

Yes, I apologize for not working on that more. I totally agree that we should make the by functionality incorporate the new regular expressions syntax. I will finish up the PR as is though and we can add it later.

@oxinabox
Copy link
Contributor

oxinabox commented Oct 6, 2019

This change scares me.
Indexing is indexing,
renaming is renaming.

I'm not entirely sure I understand the use case.

I've certainly done such operations in combination. But I don't think it would naturally occur to me to do them as one.
And I am not sure they are often just one kind of pattern.

One pattern I encounter is masking by one column then renaming other column.
And I worry that would be confusing in this form.

Trying it on.
Think a bank statement. With columns description, amount (possituve for income), date, transaction_id

was_expense  = acct.amount .< 0
expenses = acct[was_expense, [:date, :description=>expense, :amount=>:cost, :transaction_id]
expenses.cost .*= -1

Vs

was_expense  = acct.amount .< 0
expenses = acct[was_expense, :]
rename!(expenses, :description=>expense, :amount=>:cost])
expenses.cost .*= -1

I'm not sure.

In the case of your original example.
I would definitely have done that one column at a time.
Which at least in my way of thinking makes more sense than constructing a new dataframe with different names just to do the operation.

df.outpred[mask] .= df.pred2.[mask]
df.outp[mask] .= df.p2[mask]

@piever
Copy link

piever commented Oct 7, 2019

I agree with the comments above that renaming should not be part of indexing.

On the other hand, adding renaming to select was proposed in JuliaDB as well (on slack I'm afraid as I can't find the relevant issue). The advantage is that transform would then also incorporate rename, if passed a pair of symbols. Currently, :a => :b goes into the fallback Selection => Callable, but if we are sure that symbols are never callable, then there should be no ambiguity.

The syntax could be:

source_column => transformation => target_column_name :: Pair{<:ColumnIndex,Pair{<:Function,Symbol}}
which would take a source column, transform it using the specified transformation (a function accepting one argument and returning a value; here we should make sure to correctly handle copycols kwarg but it should not be problematic) and store it under target column name. The transformation as in JuliaDB.jl should accept scalars (not vectors).

In JuliaDB the syntax is target_colum_name => source_column => transformation, say:
select(iris, :LogSepalLength => :SepalLength => log), which has the unfortunate consequence of being inconsistent with rename(t, source_column => target_column_name). I'm not sure if there is a good solution as transform(df, newcolname => newvector) seems natural, but transform(df, newcolname => oldcolname) is at odds with rename. Maybe the "pair order" alone is a sufficient reason to keep things separate?

@innerlee innerlee changed the title Allow rename in getindex Allow rename when selecting Oct 7, 2019
@innerlee
Copy link
Contributor Author

innerlee commented Oct 7, 2019

I agree changing the semantic of getindex is a bad idea since it belongs to Base. Lets forget about getindex and reconsider select.

@bkamins
Copy link
Member

bkamins commented Oct 7, 2019

@piever I proposed source_column => transformation => target_column_name (with two shorthand notations added) as this is a natural way to read => sign.

In my mental model this is a difference between = which works like an assignment to LHS and => which rather works as assignment to RHS (like in Dict a key is mapped to the value) - so => represents the flow of data left to right.

Now if one accepts the:

source_column => transformation => target_column_name

approach then naturally all other cases follow i.e.

source_column => transformation # use the same column name as target
vector => target_column_name # assign a vector to a target column name without taking a source
source_column  => target_column_name # use identity as trasnformation

The major point if I understand the situation is that it would be breaking for JuliaDB.jl. Right?

EDIT
To be clearer what I mean is that I think that rather than transform(df, newcolname => newvector) a natural way to write this thing for me is transform(df, newvector => newcolname) as newvector gets assigned to newcolname.

@oxinabox
Copy link
Contributor

oxinabox commented Oct 7, 2019

I feel like most of the stuff relating to this doesn't apply to DataFrames as much as it does to JuliaDB.
Because DataFrames are column-tables (and mutable).
so it is just as fast to tranform 2 columns in sequence
and to tranform 2 at one (maybe faster even because cache locality stuff).
Where as JuliaDB is row-tables,
so every tranform that adds new columns means reallocating every row.

So in DataFrames it is always more natural to write a series of

df.newdcolname = tranformation(df.oldcolumnname)

and renames.

@bkamins
Copy link
Member

bkamins commented Oct 7, 2019

For DataFrames.jl we have a major decision:

  1. do we want select to perform only a selection operation (so we stick to "do one thing and do it well" philosophy); this is probably an approach that is natural for some people (Unix was designed around this so it seems to work 😄)
  2. we want select to also do transformations and renames - this is probably something that many people want; I can see a point of such thinking as it is convenient and I would be OK to add this (having this functionality does not force me to use it).

And I am stating, that if we went for option 2, the natural syntax for select would be source_column => transformation => target_column_name.

@Drvi
Copy link
Contributor

Drvi commented Nov 5, 2019

@bkamins The second option is exactly what I've been working on a branch of my Selections.jl package, using only the Tables.jl. I think select is suited for definition 2. as there is a long SQL tradition that packs all this functionality into SELECT clause, and we can dedicate other functions/verbs to restricted functionality (having select, rename, transform, permutecols...).

Here is the API I'm finishing right now for the select function:

key clause | if clause | transform clause |  example                                                          |  meaning
---------------------------------------------------------------------------------------------------------------------------------------------------------
     -     |     x     |        -         |  select(df, if_matches("a"))                                      |  select columns containing "a"
     x     |     x     |        -         |  select(df, if_matches("a") => key_prefix("a_"))                  |  select columns containing "a" and prepend "a_"
     -     |     x     |        x         |  select(df, if_matches("a") => bycol.(string))                     |  select columns containing "a" make their value strings
     x     |     x     |        x         |  select(df, if_matches("a") => key_prefix("a_") => bycol.(string)) |  prepend "a_" to all columns containing "a" and make their value strings

(There are sensible default constructors for different combinations of types in the pairs supplied to select)

where the key clause are "renamings" (functions applied to selected columns -- key_prefix (adds prefix), key_suffix (adds suffix), key_map (apply function to name), key_replace (replace pattern in name) -- or a vector/tuple of Symbols to replace the names of selected columns),

the "if clause" are the selections which can be Symbols, Ints, AbstractVector{Bool}, AbstractRange (+ colrange function handles even ranges of symbols), and functions returning bool like if_matches (checks if regex or string occurs in column name), if_keys (predicate function applied to column names), if_values (predicate function applied to columns themselves), if_pairs (predicate functions that takes both column names and actual columns as arguments) and if_eltype. All these selections can be inverted/negated so that they match the complement of the columns they would match otherwise. There are also special selections like all_cols(), other_cols() (the columns that were not selected in any other part of the selection query) and else_cols() (the columns that the previous selection query didn't capture).

The transform clause is the new thing I was working on -- here is a table that summarized what it can do:

constructor  | example                                                    | meaning
-------------------------------------------------------------------------------------------------
bycol(f)     | bycol(x -> x ./ norm(x))                                   | Scale selected column x by its norm
bycol!(f)    | bycol!(x -> x ./ norm(x))                                  | -- || -- inplace (if the resulting type permits)
bycol.(f)    | bycol.(x -> x + 1)                                         | add 1 to every element of selected column x
bycol!.(f)   | bycol!.(x -> x + 1)                                        | -- || -- inplace (if the resulting type permits)
byrow(f)     | byrow((rowtable,name) -> f(rowtable, name)                 | Apply function to a vector of namedtuples (the rowtable), with name being the name of the current column
byrow!(f)    | byrow!((rowtable,name) -> f(rowtable, name))               | -- || -- inplace (if the resulting type permits)
byrow.(f)    | byrow.((row,name) -> row.a + row[name])                    | Apply function to each row, name is the name of the current column
byrow!.(f)   | byrow!.((row,name) -> row.a + row[name])                   | -- || -- inplace (if the resulting type permits)
bytab(f)     | bytab((coltable,name) -> coltable.a .+ coltable[name])     | Add the column :a to the current column
bytab!(f)    | bytab!((coltable,name) -> coltable.a .+ coltable[name])     | -- || -- inplace (if the resulting type permits)
bytab.(f)    | error                                                      | Broadcasting over namedtuples is not defined, and I'm not sure what this should do
bytab!.(f)   | error                                                      | -- || -- but inplace:)

Broadcasting plays nicely with your ability to chain multiple selections together (you can e.g. if_matches(r"a") | if_eltype(Number) to select all the columns that contain the letter "a" or are numeric) in a sense that when you chain multiple broadcasted transformations like (a: => bycol.(x->x+1)) | (all_cols() => bycol.(x -> 2x)) the in the case of column :a, the transformations will be fused as if there was only one transformation x-> (x + 1) * 2.

Finally, the thing I'm still working on is the ability to create new columns. For this I'd like to use the key word arguments as so:

select(df, :y, if_eltype(Numeric), fitted_values = bytab((tab, _) -> fited_values(model(y = tab[1], tab[2:end]))))

This would create new column fitted_values from model using all numerical columns (except of y) on y. We know :y is the first column because it appeared first in the args... part of select. Of course, the current way of creating new columns will still be available.

So the message is: I'm still working on this, if you want to discuss this, I'm @drvi at julia slack.
Sorry for hijacking this thread!
CC @nalimilan, @quinnj

@bkamins
Copy link
Member

bkamins commented Dec 8, 2019

Just a mention. When allowing column renaming we will have to be careful with duplicate column names produced. This is something we did not have to care about in the past, but now if someone writes :a => fun1 => :c, :b => fun2 => :c the question is: a) if we should allow it at all, b) if yes which result should be stored, c) if yes - if both fun1 and fun2 should be run (as they can have side effects).

@bkamins
Copy link
Member

bkamins commented Dec 8, 2019

Another thought. For: for DataFrame and insertcols! we use :newcolname => value syntax, maybe it can stay, as it is consistent with "key-value" meaning in Dict, where the Pair originated?

Then we would say that => as pipe would be used only in transformations.

The only drawback is that it is imaginable that in select we could allow creation of new columns from a value like select(df, :a, :b, some_vector => :c) which would be unambiguous, but inconsistent with "key-value" approach described approach (as we would pipe value to key here).

Does anyone have some thoughts over this?

@nalimilan
Copy link
Member

When allowing column renaming we will have to be careful with duplicate column names produced. This is something we did not have to care about in the past, but now if someone writes :a => fun1 => :c, :b => fun2 => :c the question is: a) if we should allow it at all, b) if yes which result should be stored, c) if yes - if both fun1 and fun2 should be run (as they can have side effects).

Let's throw an error for now, and (as always) we can change that later if it turns out to be useful.

For: for DataFrame and insertcols! we use :newcolname => value syntax, maybe it can stay, as it is consistent with "key-value" meaning in Dict, where the Pair originated?

Then we would say that => as pipe would be used only in transformations.

The only drawback is that it is imaginable that in select we could allow creation of new columns from a value like select(df, :a, :b, some_vector => :c) which would be unambiguous, but inconsistent with "key-value" approach described approach (as we would pipe value to key here).

In the context of insertcols! there's no ambiguity so that's not a big deal, but yeah, several interpretations are possible. We could almost allow value => :newcolname too... Or require using a named tuple instead of pairs...

@bkamins
Copy link
Member

bkamins commented Dec 11, 2019

Or require using a named tuple instead of pairs...

NamedTuple is not ideal because it is problematic with non-standard column names.

The problem with value => :newcolname in insertcols! and DataFrame is that now Pair{Symbol, Symbol} is allowed in both of them due to implicit broadcasting they make. So having both patterns would be ambiguous.

So I now tend to thing that (summarizing the design):

  1. we can leave DataFrame and insertcols! as they are now, treating it as key-value dict mapping.
  2. change the order in describe
  3. for select we will allow (there is a delicate interaction with copycols kwarg, but I will hopefully handle it efficiently)
    • :oldcol => fun => :newcol (a full pattern)
    • :oldcol => :newcol (shorthand for :oldcol => identity => :newcol)
    • :oldcol => fun(shorthand for:oldcol => fun => :oldcol`)
  4. in combine I would allow :oldcol => fun => :newcol pattern only (without shorthands, at least for now), where :oldcol can also be a list of columns; here the only question is how do we indicate if we want to pass a NamedTuple of columns vs a SubDataFrame (but this is a separate design issue - if @nalimilan would open a PR for combine with this changes we can discuss it there)

Now as for duplicates of :newcol in select I think we can use the following rules (I will have to validate them when I implement the PR, because this seems to be a tricky issue):

  • we do not allow :newcol to appear twice in :oldcol => fun => :newcol pattern or in any other selection pattern
  • :oldcol => fun => :newcol pattern is not allowed to be nested (i.e. it must be a top-level positional argument to select)

(the only tricky case that I see now is when :newcol was present in source data frame, and it gets selected by some other selection mechanism, but I think that for now we should throw an error in such a case)

Does it sound good? If yes I will implement 2. (describe - a new PR) and 3. (select - also in a new PR) and would kindly ask @nalimilan to have a look at 4.

@pdeffebach
Copy link
Contributor

I can do the describe PR if we have settled on always having fun => :name from now on. @nalimilan let me know if I should start a branch.

@pdeffebach
Copy link
Contributor

pdeffebach commented Dec 11, 2019

Also, given how complicated this discussion has gotten, are we sure we don't want to just have 3 functions, rename which only renames, select which only selects, and transform which only transforms? it sounds like at this point all functions will have the same functionality. Might as well call it renameselecttransform(df, args...).

I should note that if select allows renaming, I would still mandate in all code reviews for my research projects that each function should do only what it's name implies, for readability. That is to say, I would not take advantage of these features if they were added.

@bkamins
Copy link
Member

bkamins commented Dec 11, 2019

Yeah - this is kind of what I had in my mind originally in #1975 (comment).

We have two options in general:

Option 1

Make select be SQL-like select (this is what we ended up discussing as a major alternative).
Then selects is a function that allows renaming, transformations and selection.

Option 2

Leave select and rename as they are and add transform function.
For transform we still need some kind of pattern like :oldcol => fun => :newcol with an :oldcol => fun shorthand for :oldcol => fun => :oldcol.

The only difference is that in transform we could stick to the :newcol = :oldcol => fun pattern as it is now used in combine. The reason is that transform would not have positional arguments except for :oldcol => :newcol, which would replace an old column with a new column.

The reason it is fully valid is that as of Julia 1.3 we can write:

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                               b = repeat([2, 1], outer=[4]),
                               c = 1:8);

julia> by(df, :a, var"c sum" = :c => sum)
4×2 DataFrame
│ Row │ a     │ c sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

which means that we do not have a problem that column name is not a proper identifier.

As noted above - I prefer Option 2 (it is a cleaner design for me and does not require changing the rest of the API), but I see the point of Option 1, as it is SQL-like select.

@bkamins
Copy link
Member

bkamins commented Dec 15, 2019

After a discussion with @nalimilan we have settled to use "powerful select" option.

In particular a simple transform of a column (adding a column to a data frame) will be performed by:

select(df, :, :oldcol => fun => :newcol) # adds column :newcol at the end of the data frame, but keeps :oldcol in its old position

or

select(df, :, :oldcol => fun) # replaces :oldcol in its old position by a result of the transformation

We just have to decide what we do when copycols=false and source is a SubDataFrame - do we disallow renaming and transforms or materialize a DataFrame containing the transforms (I am tending to prefer the former option - i.e. we do not allow changing columns of a view in place).

@nalimilan - would you want to add some more comment about your thinking about the issue (i.e. the issue of specialized vs powerful select 😄).

@piever
Copy link

piever commented Dec 17, 2019

Sorry if this has been already discussed to death (and I may have argued differently in the past), but I'm starting to worry that the :oldcol => fun => :newcol order may create some unintuitive behaviors. To me the main distinction is whether naming a selection is done via Selection => Symbol or Symbol => Selection. As far as I understand the plan is:

  • select(df, :col => v) with v a vector is Symbol => Selection
  • select(df, col = v) with v a vector is Symbol => Selection
  • select(df, :col1 => f => :col2) with v a vector is Selection => Symbol
  • select(df, col1 = f => :col2) I imagine is disallowed?
  • select(df, :, col1 = :col2 => f) is Symbol => Selection (but maybe disallowed?)

Starting from this, I find a bit hard to guess what select(df, :col1 => :col2) should do. I'm afraid users are equally likely to expect select(df, :col1 => :col2) == select(df, :col1 => df.col2) or select(df, :col1 => :col2) == select(df, :col1 => identity => :col2), but the two are incompatible. Similarly, what about select(df, col1 = :col2)? Would this be allowed?

Or is the argument that, in general, Selection => Symbol and Symbol => Selection are both allowed, and in the overlap Symbol => Symbol, it was necessary to pick one and the decision criterion was compatibility with rename?

I suspect it may be safer to even disallow Symbol => Symbol, esp. because the behavior is different from rename. Unline rename(df, :oldcol => :newcol), select(df, :, :oldcol => :newcol), if I understand correctly, would keep a column named :oldcol (just like select(df, :, :oldcol => identity => :newcol) would).

If one does not want select(df, :col1 => :col2), then I imagine the Symbol => Selection approach is more consistent across the board, so it may be a simpler option.

@bkamins
Copy link
Member

bkamins commented Dec 17, 2019

It is better to discuss it now than later. My understanding is:

  • select(df, :col => v) with v a vector is Symbol => Selection (disallowed)
  • select(df, col = v) with v a vector is Symbol => Selection (disallowed)
  • select(df, :col1 => f => :col2) with v a vector is Selection => Symbol (gets :col1 transforms it using f and stores the result in :col2)
  • select(df, col1 = f => :col2) I imagine is disallowed? (disallowed)
  • select(df, :, col1 = :col2 => f) is Symbol => Selection (but maybe disallowed?) (disallowed)

also you have not asked but there are two other "shorthands" that are planned to be allowed

  • select(df, :col1 => f) the same as select(df, :col1 => f => :col1)
  • select(df, :col1 => :col2) the same as select(df, :col1 => identity => :col2)

Also :col1 in all above is allowed to be any selector (single column, many columns, Not, Between, :` etc.)

@bkamins
Copy link
Member

bkamins commented Dec 17, 2019

Just to expand on two important design considerations:

  • we do not use kwargs to process data in select, kwargs are reserved for options passed to select; this will keep things consistent; hopefully this rule can be propagated to other functions in the future (the DataFrame constructor is probably the only exception that I think is OK to keep for convenience)
  • select disallows passing external data to a processed data frame, so col => value pattern is disallowed, however we currently allow this in insertcols!, DataFrame and DataFrame!; it is a slight inconsistency, but we can decide later what to do with this as it is non-conflicting (as a value is passed there and here we accept functions); we might decide to keep it as col => value is naturally produced when you splat a named tuple into a function; if we decide to keep it then the rule will be that col => value is allowed only because it is a value not a function (so nothing gets transformed there - it is kind of key-value mapping in that case);
  • describe uses col => function pattern where col is a target name; this should not be allowed will be fixed if we decide to make all those changes we discuss here

@nalimilan
Copy link
Member

I agree. The name of the resulting column (if provided) should always be the right-most element. So select(df, :col => v) should not be allowed.

OTOH we could use a special syntax to allow using external data, for example select(df, nothing => v => :col) (as in that case there nothing to select as input in the data frame) or even select(df, v => :col). This is not urgent, but it's nice to keep that possibility open, as it could be convenient in some cases to be able to apply several operations based on data from the data frame and from outside it in a single select call.

select(df, :col1 => f) the same as select(df, :col1 => f => :col1)

Currently in combine this is generates a column named :col1_f. This is similar to dplyr's mutate, which uses the parsed expression as a name by default (which often gives very complex names BTW). What's the best choice? The avantages of using a new name are that you won't get name conflicts if you do e.g. select(df, :col1 => abs, :col1 => -), and that select! won't overwrite existing variables too easily. The drawback is that with anonymous functions the generated name isn't very useful.

@bkamins
Copy link
Member

bkamins commented Dec 19, 2019

I drop some comments (this post does not try to propose a fully consistent specification but rather notes some observations - and all of them do not have to be introduced all at once, but maybe they should so that we are sure that we are consistent from the start).

In general my notes below base on the fact that whatever we pass to select it will be expanded to :oldcols => something => :newcol canonical form before any processing takes place later (so our thinking should be about how we transform specific expressions to their canonical equivalents).

So select(df, :col => v) should not be allowed.

I agree it should be not allowed not to create confusion, but I just realized that if we rewrite it as select(df, :col => v => :col) (which was the plan to do implicitly as noted above) this seems to be non problematic as it would be clear what it means (i.e. replace column :col with v). The only extension we would need to make is to allow :col not to exist before (as otherwise it would be an error). We just need to make sure that we accept v to be an AbstractVector (this restriction is to disambiguate it from :newcol - the third argument, which will be always a Symbol or an Integer)

So essentially select(df, :col => v), select(df, :col => v => :col) and select(df, v => :col) will expand to have exactly the same end-effect.

example select(df, nothing => v => :col)

Actually it should be select(df, () => v => :col) I think (passing an empty tuple in this case or an empty vector [] as leftmost value). the same would apply to select(df, () => fun => :col) if we passed fun that took no arguments.

or even select(df, v => :col)

So the only decision here is to what select(df, v => :col) should expand to. There are two options:

  • select(df, () => v => :col)
  • select(df, : => v => :col)

in case of vector v they are equivalent, but we should decide which one is used in the case of select(df, fun => :col) (as we have to know what we pass to fun).

Currently in combine this is generates a column named :col1_f.

I agree this is my major design painpoint. A first comment is that we should add makeunique kwarg to select that will serve the same purpose as in any other function in DataFrames.jl that takes it.

Now which default we take with an implicit column name is something that is non obvious. Especially as select(df, cols => fun can mean that cols is a collection of columns like : or Not(r"x"). In combine we generate :x1 etc. names. I think that a reasonable rule is that we keep being consistent with combine, which means that changing column value under the same name will require writing :col => fun => :col explicitly.

@nalimilan
Copy link
Member

+1

So the only decision here is to what select(df, v => :col) should expand to. There are two options:

* `select(df, () => v => :col)`

* `select(df, : => v => :col)`

in case of vector v they are equivalent, but we should decide which one is used in the case of select(df, fun => :col) (as we have to know what we pass to fun).

Probably the second one is more useful. But let's disallow that for now.

Now which default we take with an implicit column name is something that is non obvious. Especially as select(df, cols => fun can mean that cols is a collection of columns like : or Not(r"x"). In combine we generate :x1 etc. names. I think that a reasonable rule is that we keep being consistent with combine, which means that changing column value under the same name will require writing :col => fun => :col explicitly.

Something we might want to change is that for two-arguments functions we could concatenate the name of the function and the names of the columns, e.g. select(df, [:x, :y] => ((x, y) -> x > 0 ? x : y)) would give :x_y_fun rather than :x1. Same in combine (I think I was just a bit lazy when I restricted that to single-argument functions). But if we do that we need to decide where to put the threshold on the number of arguments.

@pdeffebach
Copy link
Contributor

select(df, [:x, :y] => ((x, y) -> x > 0 ? x : y)) would give :x_y_fun rather than :x1. Same in combine (I think I was just a bit lazy when I restricted that to single-argument functions). But if we do that we need to decide where to put the threshold on the number of arguments.

We should take advantage of broadcasting the => in this case I think. It would allow the user to say "Here are a bunch of arguments, but apply the function to each argument rather than have a function that takes in multiple arguments.

select(df, [:x, :y] => ((x, y) -> x > 0 ? x : y)) # works fine
select(df, [:x, :y] .=> ((x, y) -> x > 0 ? x : y)) #1404 

@nalimilan
Copy link
Member

Sorry, I don't get your point. Since the function takes two arguments, why would you want to broadcast? Also, what's "1404"?

@pdeffebach
Copy link
Contributor

Sorry, I don't get your point. Since the function takes two arguments, why would you want to broadcast? Also, what's "1404"?

sorry I think I misunderstood the above conversation. #1404 was just dumb github autocomplete.

Nevermind my above comment, though I will advocate for broadcasted => after more details are hammered out.

@bkamins
Copy link
Member

bkamins commented Dec 19, 2019

We should add broadcasting into consideration now, to be able to define a comprehensive solution:

select(df, [:x, :y] .=> abs)

Should be rewritten as:

select(df, [:x => abs, :y => abs])

which should be rewritten as:

select(df, [:x => abs => :x_abs, :y => abs => :y_abs])

and should be equivalent to (this point is new in the discussion as it assumes we would unpack AbstractVector{<:Pair} as if it were splatted):

select(df, :x => abs => :x_abs, :y => abs => :y_abs)

Now the additional thing is that if the first argument is wrapped in Col it still should be allowed (I understand we settle that we prefer wrapping sourcecols in Col or similar name). This is doable but will be a bit tricky as Col should then inherit axes from Base.broadcastable called on what it wraps.

@bkamins
Copy link
Member

bkamins commented Dec 25, 2019

This is an initial writeup of the new functionality we discussed:

struct Col{T}
    c::T
end

normalize_selection(idx::AbstractIndex, sel) = idx[sel]
normalize_selection(idx::AbstractIndex, sel::AbstractVector{<:Integer}) = idx[sel]

normalize_selection(idx::AbstractIndex, sel::Pair{<:Any,<:Pair{<:Function,Symbol}}) =
    (sel isa Col ? Col(idx[first(sel.c)]) : idx[first(sel)]) => last(sel)

function normalize_selection(idx::AbstractIndex, sel::Pair{<:Any, <:Function})
    c = idx[first(sel isa Col ? sel.c : sel)]
    fun = last(sel)
    n = String.(_names(idx)[c])
    if length(n) == 1
        newcol = Symbol(n[1], "_", funname(fun))
        return (sel isa Col ? Col(c) : c) => fun => newcol
    else
        throw(ArgumentError("target column name "))
    end
end

normalize_selection(idx::AbstractIndex, sel::Pair{<:ColumnIndex, <:Symbol}) =
    Col(idx[first(sel)]) => identity => last(sel)

function select(df::DataFrame, cs...; copycols::Bool=true)
    isempty(cs) && throw ArgumentError("no selectors passed")
    ncs = normalize_selection.(cs)
    newdf = DataFrame()
    trans_cols = Set{Symbol}()
    for nc in ncs
        if nc isa Pair
            newname = last(last(nc))
            @assert newname isa Symbol
            if haskey(trans_cols, newname)
                throw(ArgumentError("duplicate column names selected"))
            end
            push!(trans_cols, newname)
        end
    end
    for nc in ncs
        if nc isa Int
            newname = _names(df)[nc]
            if !hasproperty(newdf, newname) && !haskey(trans_cols, newname)
                newdf[!, newname] = copycols ? df[:, nc] : df[!, nc]
            end
        elseif nc isa AbstractVector{Int}
            allunique(nc) || throw(ArgumentError("duplicate column names selected"))
            for i in nc
                newname = _names(df)[i]
                if !hasproperty(newdf, newname) && !haskey(trans_cols, newname)
                    newdf[!, newname] = copycols ? df[:, i] : df[!, i]
                end
            end
        elseif nc isa Pair{Int, <:Pair{<:Function, Symbol}}
            newname = last(last(nc))
            # a copy will be always performed independent of copycols
            newdf[!, newname] = first(last(nc)).(df[!, first(nc)])
        elseif nc isa Pair{AbstractVector{Int}, <:Pair{<:Function, Symbol}}
            newname = last(last(nc))
            # we pass a a Vector of NamedTuples as an argument
            newdf[!, newname] = first(last(nc)).(Tables.rowtable(df[!, first(nc)]))
        elseif nc isa Pair{Col{Int}, <:Pair{<:Function, Symbol}}
            newname = last(last(nc))
            oldcol = df[!, first(nc).c]
            newcol = first(last(nc))(oldcol)
            # handle the case that the function could return the passed column as-is
            if newcol === oldcol && copycols
                newdf[:, newname] = newcol
            else
                newdf[!, newname] = newcol
            end
        elseif nc isa Pair{Col{<:AbstractVector{Int}}, <:Pair{<:Function, Symbol}}
            newname = last(last(nc))
            # we pass a NamedTuple of columns as an argument
            newdf[!, newname] = first(last(nc))(Tables.columntable(df[!, first(nc)]))
        else
            throw(ExceptionError("this should not happen"))
        end
    end
end

What is missing / important notes (left it out to have a gradual implementation):

  • support for broadcasting (in general and for Col in particular, but I want to add it in the second step when we also have broadcasting support for All, :, Not and Between)
  • a related issue is that => notation is now not allowed to be nested (it has to be top-level); again - this will have to be changed after we add broadcasting
  • implementation of select for SubDataFrame
  • implementation of select! for DataFrame
  • note that duplicate names in selection like select(df, [:a, :a]) are not allowed but in select(df, :a, :a) they are allowed with the exception that if some target name appears in :source => fun => :target then :target is not allowed to be a duplicate name
  • possibly some performance improvements can be added, especially to replace Tables.rowtable with a more efficient iterator (probably a barrier function can be used here)
  • I do not check if the new data frame has the same number of rows as the source data frame, which means that something like select(df, Col(:x1) => sum) is allowed;
  • the rules for naming of the columns if we pass zero or more than one column in the form cols => fun and Col(cols) => fun have to be finalized as noted by @nalimilan (now I give an implementation for only one source column)

@bkamins
Copy link
Member

bkamins commented Jan 6, 2020

In #2080 I have proposed a first step towards a full blown select/transform combo

@bkamins
Copy link
Member

bkamins commented Feb 12, 2020

@innerlee - I would close this PR given #2080. OK?

@bkamins bkamins removed the decision label Feb 12, 2020
@bkamins bkamins removed this from the 1.0 milestone Feb 12, 2020
@innerlee
Copy link
Contributor Author

Okay, thanks for the effort! I'm still busy these days, so please go ahead

@innerlee innerlee closed this Feb 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants