Allow rename when selecting #1975

innerlee · 2019-10-05T13:31:30Z

Based on the previous pr #1974.

Now it supports

julia> df = DataFrame(A = 1:3, B = 'A':'C', C = [:x, :y, :z])
3×3 DataFrame
│ Row │ A     │ B    │ C      │
│     │ Int64 │ Char │ Symbol │
├─────┼───────┼──────┼────────┤
│ 1   │ 1     │ 'A'  │ x      │
│ 2   │ 2     │ 'B'  │ y      │
│ 3   │ 3     │ 'C'  │ z      │

julia> df[:, [:B, :C=>:A]]
3×2 DataFrame
│ Row │ B    │ A      │
│     │ Char │ Symbol │
├─────┼──────┼────────┤
│ 1   │ 'A'  │ x      │
│ 2   │ 'B'  │ y      │
│ 3   │ 'C'  │ z      │

Signed-off-by: lizz <lizz@sensetime.com>

Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>

bkamins · 2019-10-05T14:11:19Z

Thank you for the proposal. I have two general comments:

First, the PR has also changes from #1974 in it, this probably should be cleaned up at some point (this is a minor issue).

Second (and this is a crucial point), I am not 100% sure that this composes well. I.e. maybe it is better to keep select only specialized for selection only without renaming. I am not saying we should not do it, but first we should decide the scope of this change. We could keep the functionality limited to renaming in vectors (this is what you propose), but I am wondering if we want to support something like:

select(df, :a=>:b, :b=>:a)

and then the problem kicks in that we interpret it as select(df, All(:a=>:b, :b=>:a)) (and various interactions now can happen). Any thoughts on this? (note that here I am referring to the version of DataFrames.jl on master, where you can pass multiple positional arguments to select)

innerlee · 2019-10-05T17:08:58Z

Thanks for the comment. Let me describe the background story and motivation for the change.

I started (for the first time) to use DataFrames.jl two days ago when the readdlm and writedlm is not complex enough for a task at hand.
I met a use case where I want this change.

For example, when you want to modify two columns:

julia> df = DataFrame(name=["vid1","vid2","vid3"], pred1=["walk","talk","run"],p1=[0.3,0.5,0.6],pred2=["walk","run","talk"],p2=[0.7,0.6,0.5],outpred=["","",""],outp=[0.,0.,0.])
3×7 DataFrame
│ Row │ name   │ pred1  │ p1      │ pred2  │ p2      │ outpred │ outp    │
│     │ String │ String │ Float64 │ String │ Float64 │ String  │ Float64 │
├─────┼────────┼────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ 1   │ vid1   │ walk   │ 0.3     │ walk   │ 0.7     │         │ 0.0     │
│ 2   │ vid2   │ talk   │ 0.5     │ run    │ 0.6     │         │ 0.0     │
│ 3   │ vid3   │ run    │ 0.6     │ talk   │ 0.5     │         │ 0.0     │

julia> mask = df.p2 .> 0.5
3-element BitArray{1}:
 1
 1
 0

julia> df[mask, [:outpred, :outp]] .= df[mask, [:pred2, :p2]] # their names must match
ERROR: ArgumentError: Column names in broadcasted data frames must match. Non matching column names are pred2, p2, outpred and outp
Stacktrace:
...

julia> df[mask, [:outpred, :outp]] .= df[mask, [:pred2=>:outpred, :p2=>:outp]] # <- look here
2×2 SubDataFrame
│ Row │ outpred │ outp    │
│     │ String  │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ walk    │ 0.7     │
│ 2   │ run     │ 0.6     │

julia> df
3×7 DataFrame
│ Row │ name   │ pred1  │ p1      │ pred2  │ p2      │ outpred │ outp    │
│     │ String │ String │ Float64 │ String │ Float64 │ String  │ Float64 │
├─────┼────────┼────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ 1   │ vid1   │ walk   │ 0.3     │ walk   │ 0.7     │ walk    │ 0.7     │
│ 2   │ vid2   │ talk   │ 0.5     │ run    │ 0.6     │ run     │ 0.6     │
│ 3   │ vid3   │ run    │ 0.6     │ talk   │ 0.5     │         │ 0.0     │

So, the motivation is for the getindex method. Now most of the getindex is implemented through select, so I implement the select first. In this case, renaming only for vectors is enough.

Signed-off-by: lizz <lizz@sensetime.com>

innerlee · 2019-10-06T04:03:25Z

Properly rebased. Only the last commit belong to this pr.

bkamins · 2019-10-06T07:56:46Z

Thank you for all the input. What I would propose to do before we move forward with the review of the implementation is to make sure what is the target API so that JuliaData memebers can comment on it.

The reason is, as I have commented earlier that mixing selection (or getindex which is essentially a selection combined with filtering) with renaming breaks composability a bit (it does not mean that I think we should disallow it, as I understand the use case, but we must have a clear target design before moving forward).

So would be willing to write down the target API for this change (like what should go into the documentation)? In particular please note that it seems that this change will not combine will @view and more importantly @views and Base.dotview, which is slightly problematic also.

bkamins · 2019-10-06T07:58:45Z

Just as a note rename! change is not problematic as it is a local function of DataFrame.jl doing only one thing (that is why I think we can move forward fast in #1974 as opposed to this PR). While getindex is from Base, and should be composable with other things users might define for it, while select is from DataAPI.jl, and ideally should be consistent across different packages.

innerlee · 2019-10-06T09:30:26Z

Agree with you, but assessing the composability and consistency is beyond my knowledge (as I seldom use Data-related packages), so I'm willing to wait for inputs from anyone else.

My intuition comes from SQL where renaming is often used with selection, e.g.

SELECT day_of_order AS "Date"  # <- a rename
Customer As "Client",  # <- another rename
Product,  
Quantity,  
FROM orders;

innerlee · 2019-10-06T10:03:23Z

A quick search:

DataAPI.jl does not contains select
Tables.jl does not contains select
DataFrames.jl directly exports them without importing
CSV.jl does not contains select
DataFramesMeta.jl has a very simple select function and also has a @select macro

The @select macro is interesting because it allows transform columns when selecting, just like in SQL

SELECT Id AS Identifier, 
LastName + ', ' + FirstName AS CustomerName,  # <- also come with rename
FROM Customer

Maybe we can change the syntax from

@select(df, :c, x = :b + :c)

into

@select(df, :c, :b + :c => :x)

bkamins · 2019-10-06T10:37:19Z

OK - I will think about it and write down what I think is safe to do as I think this functionality is useful.

Regarding the packages - the major alternative to DataFrames.jl is JuliaDB.jl related ecosystem. There you have it defined here.

We are in the process of synchronizing the API across packages. That is the point of DataAPI.jl (actually I thought we migrated select there already, but it seems my memory was wrong and this has not happened yet).

A major point to consider is that Pair{Selection => Function} in JuliaDB.jl is already taken in JuliaDB.jl to denote a transformation of a variable without renaming it.

Also maybe we should only provide renaming in select not in getindex. Then the problems with views and broadcasting would not occur.

Finally, why do you think the syntax:

@select(df, :c, x = :b + :c)

is not OK?

innerlee · 2019-10-06T11:12:22Z

The current version of select in JuliaDB.jl differs from DataFrames.jl very much.

here wrote

select(t::Table, which::Selection)
`Selection` is a type union of many types that can select from a table. It can be:
...

* `Pair{Selection => Function}` – selects and maps a function over the selection, returns the result.
* `AbstractArray` – returns the array itself. This must be the same length as the table.  # <- look here

The AbstractArray version simply returns the array itself whereas in DataFrames.jl it is used as an array of indices.
The point is, inconsistency seems to be huge at current status.

Furthermore, it seems the form Pair{Selection => Function} only returns one column without a name. It is not a good api because other options return a table while this one returns an array.

For the @select(df, :c, x = :b + :c) syntax, it does not allow a function form, so I do not like it.
With type piracy, we can have a function form of select with the ability of calculated columns with rename

select(df, [:c, :b + :c => :x])

bkamins · 2019-10-06T14:41:13Z

I agree there is a discrepancy, but my intuition is that it would be good to decrease the level of it with time.

Regarding your proposal here is what I think:

I would prefer getindex not to allow changing column names
select and select! can allow changing column names; a basic pattern is that can be now callded with a single argument that is either a Pair or an AbstractVector of Pairs, Signed, Unsigned or Symbols (I am talking about actual contents of the vector not the type). Then the change should support passing multiple arguments and proper combining of the column renaming expressions with Regex, Not and All results (in particular note that All allows for duplicate column names, which is intended, and then we would have to specify if we take the first or the last column with the same name if we find duplicates).

The syntax could be:

source_column => transformation => target_column_name :: Pair{<:ColumnIndex,Pair{<:Function,Symbol}}

which would take a source column, transform it using the specified transformation (a function accepting one argument and returning a value; here we should make sure to correctly handle copycols kwarg but it should not be problematic) and store it under target column name. The transformation as in JuliaDB.jl should accept scalars (not vectors). CC @nalimilan - do you agree?

with two shorthands:

source_column => target_column_name :: Pair{<:ColumnIndex,Symbol}

expanding to:

source_column => identity => target_column_name

and

source_column => transformation :: Pair{<:ColumnIndex,<:Function}

expanding to:

source_column => transformation => source_column_name

If we went this way the following things should be reviewed in the whole code base against this change:

all getindex, view and dotview methods defined that touch columns (by all I mean for all types that support them)
all select and select! methods (by all I mean for all types that support them)

In particular I already see that:

Base.getindex(df::DataFrame, row_ind::Colon, col_inds::Union{AbstractVector, Regex, Not, Between, All, Colon})

and

Base.getindex(df::DataFrame, row_ind::typeof(!), col_inds::Union{AbstractVector, Regex, Not, Between, All, Colon})

should be reimplemented

and all view-related methods should intercept the new pattern and throw an informative error message (now they would pass down the argument to SubIndex and probably some cryptic error for the user would be thrown).

Overall, after thinking of it, I think your proposal would greatly improve the functionality of select and select! and I like it. However, as you can see - making this change digs deep into the core infrastructure of DataFrames.jl, but what I have described would need to be done I think if we want to provide a consistent user experience.

A simpler approach would be to implement this only for a single Pair or a single AbstractVector argument to select and select! if you would prefer to stick to this only and I would later make another PR extending it to work consistently with all other allowed options (like All, multiple arguments etc.).

CC @nalimilan, @quinnj, @oxinabox, @pdeffebach, @piever (in case you wanted to comment on it as it is a major change 😄)

bkamins · 2019-10-06T14:48:43Z

Also now as I review #1849 by @pdeffebach then as an additional functionality (this does not have to be done now, but I want to keep things in a single place if we discuss the target design issues) is to allow things like:

r"^m" => sin

or

: => sin

as arguments to select.

@pdeffebach (the difference here from what you proposed in #1849 is that it is => not .=> as a broadcast would get resolved before the call to select, so we should handle this without broadcasting inside select).

pdeffebach · 2019-10-06T15:13:26Z

I've always been critical of data frames libraries that combine select and rename. I think it makes code less readable. But I understand it's a feature many people want. I agree with @bkamins that getindex should not allow renaming.

Yes, I apologize for not working on that more. I totally agree that we should make the by functionality incorporate the new regular expressions syntax. I will finish up the PR as is though and we can add it later.

oxinabox · 2019-10-06T15:35:33Z

This change scares me.
Indexing is indexing,
renaming is renaming.

I'm not entirely sure I understand the use case.

I've certainly done such operations in combination. But I don't think it would naturally occur to me to do them as one.
And I am not sure they are often just one kind of pattern.

One pattern I encounter is masking by one column then renaming other column.
And I worry that would be confusing in this form.

Trying it on.
Think a bank statement. With columns description, amount (possituve for income), date, transaction_id

was_expense  = acct.amount .< 0
expenses = acct[was_expense, [:date, :description=>expense, :amount=>:cost, :transaction_id]
expenses.cost .*= -1

Vs

was_expense  = acct.amount .< 0
expenses = acct[was_expense, :]
rename!(expenses, :description=>expense, :amount=>:cost])
expenses.cost .*= -1

I'm not sure.

In the case of your original example.
I would definitely have done that one column at a time.
Which at least in my way of thinking makes more sense than constructing a new dataframe with different names just to do the operation.

df.outpred[mask] .= df.pred2.[mask]
df.outp[mask] .= df.p2[mask]

piever · 2019-10-07T10:11:50Z

I agree with the comments above that renaming should not be part of indexing.

On the other hand, adding renaming to select was proposed in JuliaDB as well (on slack I'm afraid as I can't find the relevant issue). The advantage is that transform would then also incorporate rename, if passed a pair of symbols. Currently, :a => :b goes into the fallback Selection => Callable, but if we are sure that symbols are never callable, then there should be no ambiguity.

The syntax could be:

source_column => transformation => target_column_name :: Pair{<:ColumnIndex,Pair{<:Function,Symbol}}
which would take a source column, transform it using the specified transformation (a function accepting one argument and returning a value; here we should make sure to correctly handle copycols kwarg but it should not be problematic) and store it under target column name. The transformation as in JuliaDB.jl should accept scalars (not vectors).

In JuliaDB the syntax is target_colum_name => source_column => transformation, say:
select(iris, :LogSepalLength => :SepalLength => log), which has the unfortunate consequence of being inconsistent with rename(t, source_column => target_column_name). I'm not sure if there is a good solution as transform(df, newcolname => newvector) seems natural, but transform(df, newcolname => oldcolname) is at odds with rename. Maybe the "pair order" alone is a sufficient reason to keep things separate?

innerlee · 2019-10-07T11:49:25Z

I agree changing the semantic of getindex is a bad idea since it belongs to Base. Lets forget about getindex and reconsider select.

bkamins · 2019-10-07T14:21:33Z

@piever I proposed source_column => transformation => target_column_name (with two shorthand notations added) as this is a natural way to read => sign.

In my mental model this is a difference between = which works like an assignment to LHS and => which rather works as assignment to RHS (like in Dict a key is mapped to the value) - so => represents the flow of data left to right.

Now if one accepts the:

source_column => transformation => target_column_name

approach then naturally all other cases follow i.e.

source_column => transformation # use the same column name as target
vector => target_column_name # assign a vector to a target column name without taking a source
source_column  => target_column_name # use identity as trasnformation

The major point if I understand the situation is that it would be breaking for JuliaDB.jl. Right?

EDIT
To be clearer what I mean is that I think that rather than transform(df, newcolname => newvector) a natural way to write this thing for me is transform(df, newvector => newcolname) as newvector gets assigned to newcolname.

oxinabox · 2019-10-07T15:02:24Z

I feel like most of the stuff relating to this doesn't apply to DataFrames as much as it does to JuliaDB.
Because DataFrames are column-tables (and mutable).
so it is just as fast to tranform 2 columns in sequence
and to tranform 2 at one (maybe faster even because cache locality stuff).
Where as JuliaDB is row-tables,
so every tranform that adds new columns means reallocating every row.

So in DataFrames it is always more natural to write a series of

df.newdcolname = tranformation(df.oldcolumnname)

and renames.

bkamins · 2019-10-07T15:11:44Z

For DataFrames.jl we have a major decision:

do we want select to perform only a selection operation (so we stick to "do one thing and do it well" philosophy); this is probably an approach that is natural for some people (Unix was designed around this so it seems to work 😄)
we want select to also do transformations and renames - this is probably something that many people want; I can see a point of such thinking as it is convenient and I would be OK to add this (having this functionality does not force me to use it).

And I am stating, that if we went for option 2, the natural syntax for select would be source_column => transformation => target_column_name.

Drvi · 2019-11-05T08:33:59Z

@bkamins The second option is exactly what I've been working on a branch of my Selections.jl package, using only the Tables.jl. I think select is suited for definition 2. as there is a long SQL tradition that packs all this functionality into SELECT clause, and we can dedicate other functions/verbs to restricted functionality (having select, rename, transform, permutecols...).

Here is the API I'm finishing right now for the select function:

key clause | if clause | transform clause |  example                                                          |  meaning
---------------------------------------------------------------------------------------------------------------------------------------------------------
     -     |     x     |        -         |  select(df, if_matches("a"))                                      |  select columns containing "a"
     x     |     x     |        -         |  select(df, if_matches("a") => key_prefix("a_"))                  |  select columns containing "a" and prepend "a_"
     -     |     x     |        x         |  select(df, if_matches("a") => bycol.(string))                     |  select columns containing "a" make their value strings
     x     |     x     |        x         |  select(df, if_matches("a") => key_prefix("a_") => bycol.(string)) |  prepend "a_" to all columns containing "a" and make their value strings

(There are sensible default constructors for different combinations of types in the pairs supplied to select)

where the key clause are "renamings" (functions applied to selected columns -- key_prefix (adds prefix), key_suffix (adds suffix), key_map (apply function to name), key_replace (replace pattern in name) -- or a vector/tuple of Symbols to replace the names of selected columns),

the "if clause" are the selections which can be Symbols, Ints, AbstractVector{Bool}, AbstractRange (+ colrange function handles even ranges of symbols), and functions returning bool like if_matches (checks if regex or string occurs in column name), if_keys (predicate function applied to column names), if_values (predicate function applied to columns themselves), if_pairs (predicate functions that takes both column names and actual columns as arguments) and if_eltype. All these selections can be inverted/negated so that they match the complement of the columns they would match otherwise. There are also special selections like all_cols(), other_cols() (the columns that were not selected in any other part of the selection query) and else_cols() (the columns that the previous selection query didn't capture).

The transform clause is the new thing I was working on -- here is a table that summarized what it can do:

constructor  | example                                                    | meaning
-------------------------------------------------------------------------------------------------
bycol(f)     | bycol(x -> x ./ norm(x))                                   | Scale selected column x by its norm
bycol!(f)    | bycol!(x -> x ./ norm(x))                                  | -- || -- inplace (if the resulting type permits)
bycol.(f)    | bycol.(x -> x + 1)                                         | add 1 to every element of selected column x
bycol!.(f)   | bycol!.(x -> x + 1)                                        | -- || -- inplace (if the resulting type permits)
byrow(f)     | byrow((rowtable,name) -> f(rowtable, name)                 | Apply function to a vector of namedtuples (the rowtable), with name being the name of the current column
byrow!(f)    | byrow!((rowtable,name) -> f(rowtable, name))               | -- || -- inplace (if the resulting type permits)
byrow.(f)    | byrow.((row,name) -> row.a + row[name])                    | Apply function to each row, name is the name of the current column
byrow!.(f)   | byrow!.((row,name) -> row.a + row[name])                   | -- || -- inplace (if the resulting type permits)
bytab(f)     | bytab((coltable,name) -> coltable.a .+ coltable[name])     | Add the column :a to the current column
bytab!(f)    | bytab!((coltable,name) -> coltable.a .+ coltable[name])     | -- || -- inplace (if the resulting type permits)
bytab.(f)    | error                                                      | Broadcasting over namedtuples is not defined, and I'm not sure what this should do
bytab!.(f)   | error                                                      | -- || -- but inplace:)

Broadcasting plays nicely with your ability to chain multiple selections together (you can e.g. if_matches(r"a") | if_eltype(Number) to select all the columns that contain the letter "a" or are numeric) in a sense that when you chain multiple broadcasted transformations like (a: => bycol.(x->x+1)) | (all_cols() => bycol.(x -> 2x)) the in the case of column :a, the transformations will be fused as if there was only one transformation x-> (x + 1) * 2.

Finally, the thing I'm still working on is the ability to create new columns. For this I'd like to use the key word arguments as so:

select(df, :y, if_eltype(Numeric), fitted_values = bytab((tab, _) -> fited_values(model(y = tab[1], tab[2:end]))))

This would create new column fitted_values from model using all numerical columns (except of y) on y. We know :y is the first column because it appeared first in the args... part of select. Of course, the current way of creating new columns will still be available.

So the message is: I'm still working on this, if you want to discuss this, I'm @drvi at julia slack.
Sorry for hijacking this thread!
CC @nalimilan, @quinnj

bkamins · 2019-12-08T06:27:52Z

Just a mention. When allowing column renaming we will have to be careful with duplicate column names produced. This is something we did not have to care about in the past, but now if someone writes :a => fun1 => :c, :b => fun2 => :c the question is: a) if we should allow it at all, b) if yes which result should be stored, c) if yes - if both fun1 and fun2 should be run (as they can have side effects).

bkamins · 2019-12-08T15:44:25Z

Another thought. For: for DataFrame and insertcols! we use :newcolname => value syntax, maybe it can stay, as it is consistent with "key-value" meaning in Dict, where the Pair originated?

Then we would say that => as pipe would be used only in transformations.

The only drawback is that it is imaginable that in select we could allow creation of new columns from a value like select(df, :a, :b, some_vector => :c) which would be unambiguous, but inconsistent with "key-value" approach described approach (as we would pipe value to key here).

Does anyone have some thoughts over this?

nalimilan · 2019-12-11T10:27:00Z

When allowing column renaming we will have to be careful with duplicate column names produced. This is something we did not have to care about in the past, but now if someone writes :a => fun1 => :c, :b => fun2 => :c the question is: a) if we should allow it at all, b) if yes which result should be stored, c) if yes - if both fun1 and fun2 should be run (as they can have side effects).

Let's throw an error for now, and (as always) we can change that later if it turns out to be useful.

For: for DataFrame and insertcols! we use :newcolname => value syntax, maybe it can stay, as it is consistent with "key-value" meaning in Dict, where the Pair originated?

Then we would say that => as pipe would be used only in transformations.

The only drawback is that it is imaginable that in select we could allow creation of new columns from a value like select(df, :a, :b, some_vector => :c) which would be unambiguous, but inconsistent with "key-value" approach described approach (as we would pipe value to key here).

In the context of insertcols! there's no ambiguity so that's not a big deal, but yeah, several interpretations are possible. We could almost allow value => :newcolname too... Or require using a named tuple instead of pairs...

bkamins · 2019-12-11T13:22:46Z

Or require using a named tuple instead of pairs...

NamedTuple is not ideal because it is problematic with non-standard column names.

The problem with value => :newcolname in insertcols! and DataFrame is that now Pair{Symbol, Symbol} is allowed in both of them due to implicit broadcasting they make. So having both patterns would be ambiguous.

So I now tend to thing that (summarizing the design):

we can leave DataFrame and insertcols! as they are now, treating it as key-value dict mapping.
change the order in describe
for select we will allow (there is a delicate interaction with copycols kwarg, but I will hopefully handle it efficiently)
- :oldcol => fun => :newcol (a full pattern)
- :oldcol => :newcol (shorthand for :oldcol => identity => :newcol)
- :oldcol => fun(shorthand for:oldcol => fun => :oldcol`)
in combine I would allow :oldcol => fun => :newcol pattern only (without shorthands, at least for now), where :oldcol can also be a list of columns; here the only question is how do we indicate if we want to pass a NamedTuple of columns vs a SubDataFrame (but this is a separate design issue - if @nalimilan would open a PR for combine with this changes we can discuss it there)

Now as for duplicates of :newcol in select I think we can use the following rules (I will have to validate them when I implement the PR, because this seems to be a tricky issue):

we do not allow :newcol to appear twice in :oldcol => fun => :newcol pattern or in any other selection pattern
:oldcol => fun => :newcol pattern is not allowed to be nested (i.e. it must be a top-level positional argument to select)

(the only tricky case that I see now is when :newcol was present in source data frame, and it gets selected by some other selection mechanism, but I think that for now we should throw an error in such a case)

Does it sound good? If yes I will implement 2. (describe - a new PR) and 3. (select - also in a new PR) and would kindly ask @nalimilan to have a look at 4.

pdeffebach · 2019-12-11T18:49:19Z

I can do the describe PR if we have settled on always having fun => :name from now on. @nalimilan let me know if I should start a branch.

pdeffebach · 2019-12-11T18:52:51Z

Also, given how complicated this discussion has gotten, are we sure we don't want to just have 3 functions, rename which only renames, select which only selects, and transform which only transforms? it sounds like at this point all functions will have the same functionality. Might as well call it renameselecttransform(df, args...).

I should note that if select allows renaming, I would still mandate in all code reviews for my research projects that each function should do only what it's name implies, for readability. That is to say, I would not take advantage of these features if they were added.

bkamins · 2019-12-11T21:08:14Z

Yeah - this is kind of what I had in my mind originally in #1975 (comment).

We have two options in general:

Option 1

Make select be SQL-like select (this is what we ended up discussing as a major alternative).
Then selects is a function that allows renaming, transformations and selection.

Option 2

Leave select and rename as they are and add transform function.
For transform we still need some kind of pattern like :oldcol => fun => :newcol with an :oldcol => fun shorthand for :oldcol => fun => :oldcol.

The only difference is that in transform we could stick to the :newcol = :oldcol => fun pattern as it is now used in combine. The reason is that transform would not have positional arguments except for :oldcol => :newcol, which would replace an old column with a new column.

The reason it is fully valid is that as of Julia 1.3 we can write:

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                               b = repeat([2, 1], outer=[4]),
                               c = 1:8);

julia> by(df, :a, var"c sum" = :c => sum)
4×2 DataFrame
│ Row │ a     │ c sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

which means that we do not have a problem that column name is not a proper identifier.

As noted above - I prefer Option 2 (it is a cleaner design for me and does not require changing the rest of the API), but I see the point of Option 1, as it is SQL-like select.

bkamins · 2019-12-15T10:44:07Z

After a discussion with @nalimilan we have settled to use "powerful select" option.

In particular a simple transform of a column (adding a column to a data frame) will be performed by:

select(df, :, :oldcol => fun => :newcol) # adds column :newcol at the end of the data frame, but keeps :oldcol in its old position

or

select(df, :, :oldcol => fun) # replaces :oldcol in its old position by a result of the transformation

We just have to decide what we do when copycols=false and source is a SubDataFrame - do we disallow renaming and transforms or materialize a DataFrame containing the transforms (I am tending to prefer the former option - i.e. we do not allow changing columns of a view in place).

@nalimilan - would you want to add some more comment about your thinking about the issue (i.e. the issue of specialized vs powerful select 😄).

piever · 2019-12-17T13:05:34Z

Sorry if this has been already discussed to death (and I may have argued differently in the past), but I'm starting to worry that the :oldcol => fun => :newcol order may create some unintuitive behaviors. To me the main distinction is whether naming a selection is done via Selection => Symbol or Symbol => Selection. As far as I understand the plan is:

select(df, :col => v) with v a vector is Symbol => Selection
select(df, col = v) with v a vector is Symbol => Selection
select(df, :col1 => f => :col2) with v a vector is Selection => Symbol
select(df, col1 = f => :col2) I imagine is disallowed?
select(df, :, col1 = :col2 => f) is Symbol => Selection (but maybe disallowed?)

Starting from this, I find a bit hard to guess what select(df, :col1 => :col2) should do. I'm afraid users are equally likely to expect select(df, :col1 => :col2) == select(df, :col1 => df.col2) or select(df, :col1 => :col2) == select(df, :col1 => identity => :col2), but the two are incompatible. Similarly, what about select(df, col1 = :col2)? Would this be allowed?

Or is the argument that, in general, Selection => Symbol and Symbol => Selection are both allowed, and in the overlap Symbol => Symbol, it was necessary to pick one and the decision criterion was compatibility with rename?

I suspect it may be safer to even disallow Symbol => Symbol, esp. because the behavior is different from rename. Unline rename(df, :oldcol => :newcol), select(df, :, :oldcol => :newcol), if I understand correctly, would keep a column named :oldcol (just like select(df, :, :oldcol => identity => :newcol) would).

If one does not want select(df, :col1 => :col2), then I imagine the Symbol => Selection approach is more consistent across the board, so it may be a simpler option.

bkamins · 2019-12-17T13:24:54Z

It is better to discuss it now than later. My understanding is:

select(df, :col => v) with v a vector is Symbol => Selection (disallowed)
select(df, col = v) with v a vector is Symbol => Selection (disallowed)
select(df, :col1 => f => :col2) with v a vector is Selection => Symbol (gets :col1 transforms it using f and stores the result in :col2)
select(df, col1 = f => :col2) I imagine is disallowed? (disallowed)
select(df, :, col1 = :col2 => f) is Symbol => Selection (but maybe disallowed?) (disallowed)

also you have not asked but there are two other "shorthands" that are planned to be allowed

select(df, :col1 => f) the same as select(df, :col1 => f => :col1)
select(df, :col1 => :col2) the same as select(df, :col1 => identity => :col2)

Also :col1 in all above is allowed to be any selector (single column, many columns, Not, Between, :` etc.)

bkamins · 2019-12-17T14:09:14Z

Just to expand on two important design considerations:

we do not use kwargs to process data in select, kwargs are reserved for options passed to select; this will keep things consistent; hopefully this rule can be propagated to other functions in the future (the DataFrame constructor is probably the only exception that I think is OK to keep for convenience)
select disallows passing external data to a processed data frame, so col => value pattern is disallowed, however we currently allow this in insertcols!, DataFrame and DataFrame!; it is a slight inconsistency, but we can decide later what to do with this as it is non-conflicting (as a value is passed there and here we accept functions); we might decide to keep it as col => value is naturally produced when you splat a named tuple into a function; if we decide to keep it then the rule will be that col => value is allowed only because it is a value not a function (so nothing gets transformed there - it is kind of key-value mapping in that case);
describe uses col => function pattern where col is a target name; this should not be allowed will be fixed if we decide to make all those changes we discuss here

nalimilan · 2019-12-19T09:22:51Z

I agree. The name of the resulting column (if provided) should always be the right-most element. So select(df, :col => v) should not be allowed.

OTOH we could use a special syntax to allow using external data, for example select(df, nothing => v => :col) (as in that case there nothing to select as input in the data frame) or even select(df, v => :col). This is not urgent, but it's nice to keep that possibility open, as it could be convenient in some cases to be able to apply several operations based on data from the data frame and from outside it in a single select call.

select(df, :col1 => f) the same as select(df, :col1 => f => :col1)

Currently in combine this is generates a column named :col1_f. This is similar to dplyr's mutate, which uses the parsed expression as a name by default (which often gives very complex names BTW). What's the best choice? The avantages of using a new name are that you won't get name conflicts if you do e.g. select(df, :col1 => abs, :col1 => -), and that select! won't overwrite existing variables too easily. The drawback is that with anonymous functions the generated name isn't very useful.

bkamins · 2019-12-19T12:44:22Z

I drop some comments (this post does not try to propose a fully consistent specification but rather notes some observations - and all of them do not have to be introduced all at once, but maybe they should so that we are sure that we are consistent from the start).

In general my notes below base on the fact that whatever we pass to select it will be expanded to :oldcols => something => :newcol canonical form before any processing takes place later (so our thinking should be about how we transform specific expressions to their canonical equivalents).

So select(df, :col => v) should not be allowed.

I agree it should be not allowed not to create confusion, but I just realized that if we rewrite it as select(df, :col => v => :col) (which was the plan to do implicitly as noted above) this seems to be non problematic as it would be clear what it means (i.e. replace column :col with v). The only extension we would need to make is to allow :col not to exist before (as otherwise it would be an error). We just need to make sure that we accept v to be an AbstractVector (this restriction is to disambiguate it from :newcol - the third argument, which will be always a Symbol or an Integer)

So essentially select(df, :col => v), select(df, :col => v => :col) and select(df, v => :col) will expand to have exactly the same end-effect.

example select(df, nothing => v => :col)

Actually it should be select(df, () => v => :col) I think (passing an empty tuple in this case or an empty vector [] as leftmost value). the same would apply to select(df, () => fun => :col) if we passed fun that took no arguments.

or even select(df, v => :col)

So the only decision here is to what select(df, v => :col) should expand to. There are two options:

select(df, () => v => :col)
select(df, : => v => :col)

in case of vector v they are equivalent, but we should decide which one is used in the case of select(df, fun => :col) (as we have to know what we pass to fun).

Currently in combine this is generates a column named :col1_f.

I agree this is my major design painpoint. A first comment is that we should add makeunique kwarg to select that will serve the same purpose as in any other function in DataFrames.jl that takes it.

Now which default we take with an implicit column name is something that is non obvious. Especially as select(df, cols => fun can mean that cols is a collection of columns like : or Not(r"x"). In combine we generate :x1 etc. names. I think that a reasonable rule is that we keep being consistent with combine, which means that changing column value under the same name will require writing :col => fun => :col explicitly.

nalimilan · 2019-12-19T13:47:45Z

+1

So the only decision here is to what select(df, v => :col) should expand to. There are two options:
* `select(df, () => v => :col)`

* `select(df, : => v => :col)`
in case of vector v they are equivalent, but we should decide which one is used in the case of select(df, fun => :col) (as we have to know what we pass to fun).

Probably the second one is more useful. But let's disallow that for now.

Now which default we take with an implicit column name is something that is non obvious. Especially as select(df, cols => fun can mean that cols is a collection of columns like : or Not(r"x"). In combine we generate :x1 etc. names. I think that a reasonable rule is that we keep being consistent with combine, which means that changing column value under the same name will require writing :col => fun => :col explicitly.

Something we might want to change is that for two-arguments functions we could concatenate the name of the function and the names of the columns, e.g. select(df, [:x, :y] => ((x, y) -> x > 0 ? x : y)) would give :x_y_fun rather than :x1. Same in combine (I think I was just a bit lazy when I restricted that to single-argument functions). But if we do that we need to decide where to put the threshold on the number of arguments.

pdeffebach · 2019-12-19T14:22:39Z

select(df, [:x, :y] => ((x, y) -> x > 0 ? x : y)) would give :x_y_fun rather than :x1. Same in combine (I think I was just a bit lazy when I restricted that to single-argument functions). But if we do that we need to decide where to put the threshold on the number of arguments.

We should take advantage of broadcasting the => in this case I think. It would allow the user to say "Here are a bunch of arguments, but apply the function to each argument rather than have a function that takes in multiple arguments.

select(df, [:x, :y] => ((x, y) -> x > 0 ? x : y)) # works fine
select(df, [:x, :y] .=> ((x, y) -> x > 0 ? x : y)) #1404

nalimilan · 2019-12-19T14:26:24Z

Sorry, I don't get your point. Since the function takes two arguments, why would you want to broadcast? Also, what's "1404"?

pdeffebach · 2019-12-19T14:30:04Z

Sorry, I don't get your point. Since the function takes two arguments, why would you want to broadcast? Also, what's "1404"?

sorry I think I misunderstood the above conversation. #1404 was just dumb github autocomplete.

Nevermind my above comment, though I will advocate for broadcasted => after more details are hammered out.

bkamins · 2019-12-19T20:33:12Z

We should add broadcasting into consideration now, to be able to define a comprehensive solution:

select(df, [:x, :y] .=> abs)

Should be rewritten as:

select(df, [:x => abs, :y => abs])

which should be rewritten as:

select(df, [:x => abs => :x_abs, :y => abs => :y_abs])

and should be equivalent to (this point is new in the discussion as it assumes we would unpack AbstractVector{<:Pair} as if it were splatted):

select(df, :x => abs => :x_abs, :y => abs => :y_abs)

Now the additional thing is that if the first argument is wrapped in Col it still should be allowed (I understand we settle that we prefer wrapping sourcecols in Col or similar name). This is doable but will be a bit tricky as Col should then inherit axes from Base.broadcastable called on what it wraps.

bkamins · 2019-12-25T11:00:55Z

This is an initial writeup of the new functionality we discussed:

struct Col{T}
    c::T
end

normalize_selection(idx::AbstractIndex, sel) = idx[sel]
normalize_selection(idx::AbstractIndex, sel::AbstractVector{<:Integer}) = idx[sel]

normalize_selection(idx::AbstractIndex, sel::Pair{<:Any,<:Pair{<:Function,Symbol}}) =
    (sel isa Col ? Col(idx[first(sel.c)]) : idx[first(sel)]) => last(sel)

function normalize_selection(idx::AbstractIndex, sel::Pair{<:Any, <:Function})
    c = idx[first(sel isa Col ? sel.c : sel)]
    fun = last(sel)
    n = String.(_names(idx)[c])
    if length(n) == 1
        newcol = Symbol(n[1], "_", funname(fun))
        return (sel isa Col ? Col(c) : c) => fun => newcol
    else
        throw(ArgumentError("target column name "))
    end
end

normalize_selection(idx::AbstractIndex, sel::Pair{<:ColumnIndex, <:Symbol}) =
    Col(idx[first(sel)]) => identity => last(sel)

function select(df::DataFrame, cs...; copycols::Bool=true)
    isempty(cs) && throw ArgumentError("no selectors passed")
    ncs = normalize_selection.(cs)
    newdf = DataFrame()
    trans_cols = Set{Symbol}()
    for nc in ncs
        if nc isa Pair
            newname = last(last(nc))
            @assert newname isa Symbol
            if haskey(trans_cols, newname)
                throw(ArgumentError("duplicate column names selected"))
            end
            push!(trans_cols, newname)
        end
    end
    for nc in ncs
        if nc isa Int
            newname = _names(df)[nc]
            if !hasproperty(newdf, newname) && !haskey(trans_cols, newname)
                newdf[!, newname] = copycols ? df[:, nc] : df[!, nc]
            end
        elseif nc isa AbstractVector{Int}
            allunique(nc) || throw(ArgumentError("duplicate column names selected"))
            for i in nc
                newname = _names(df)[i]
                if !hasproperty(newdf, newname) && !haskey(trans_cols, newname)
                    newdf[!, newname] = copycols ? df[:, i] : df[!, i]
                end
            end
        elseif nc isa Pair{Int, <:Pair{<:Function, Symbol}}
            newname = last(last(nc))
            # a copy will be always performed independent of copycols
            newdf[!, newname] = first(last(nc)).(df[!, first(nc)])
        elseif nc isa Pair{AbstractVector{Int}, <:Pair{<:Function, Symbol}}
            newname = last(last(nc))
            # we pass a a Vector of NamedTuples as an argument
            newdf[!, newname] = first(last(nc)).(Tables.rowtable(df[!, first(nc)]))
        elseif nc isa Pair{Col{Int}, <:Pair{<:Function, Symbol}}
            newname = last(last(nc))
            oldcol = df[!, first(nc).c]
            newcol = first(last(nc))(oldcol)
            # handle the case that the function could return the passed column as-is
            if newcol === oldcol && copycols
                newdf[:, newname] = newcol
            else
                newdf[!, newname] = newcol
            end
        elseif nc isa Pair{Col{<:AbstractVector{Int}}, <:Pair{<:Function, Symbol}}
            newname = last(last(nc))
            # we pass a NamedTuple of columns as an argument
            newdf[!, newname] = first(last(nc))(Tables.columntable(df[!, first(nc)]))
        else
            throw(ExceptionError("this should not happen"))
        end
    end
end

What is missing / important notes (left it out to have a gradual implementation):

support for broadcasting (in general and for Col in particular, but I want to add it in the second step when we also have broadcasting support for All, :, Not and Between)
a related issue is that => notation is now not allowed to be nested (it has to be top-level); again - this will have to be changed after we add broadcasting
implementation of select for SubDataFrame
implementation of select! for DataFrame
note that duplicate names in selection like select(df, [:a, :a]) are not allowed but in select(df, :a, :a) they are allowed with the exception that if some target name appears in :source => fun => :target then :target is not allowed to be a duplicate name
possibly some performance improvements can be added, especially to replace Tables.rowtable with a more efficient iterator (probably a barrier function can be used here)
I do not check if the new data frame has the same number of rows as the source data frame, which means that something like select(df, Col(:x1) => sum) is allowed;
the rules for naming of the columns if we pass zero or more than one column in the form cols => fun and Col(cols) => fun have to be finalized as noted by @nalimilan (now I give an implementation for only one source column)

bkamins · 2020-01-06T22:01:25Z

In #2080 I have proposed a first step towards a full blown select/transform combo

bkamins · 2020-02-12T10:53:15Z

@innerlee - I would close this PR given #2080. OK?

innerlee · 2020-02-12T11:52:57Z

Okay, thanks for the effort! I'm still busy these days, so please go ahead

Allow permutaion of names in rename!

06410cb

Signed-off-by: lizz <lizz@sensetime.com>

innerlee mentioned this pull request Oct 5, 2019

Allow permutaion of names in rename! #1974

Merged

innerlee and others added 2 commits October 5, 2019 21:38

Update src/other/index.jl

67b4995

Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>

Update src/other/index.jl

b77c40c

Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>

innerlee force-pushed the zz/selectwithrename branch 2 times, most recently from 79c7e57 to 1ebb978 Compare October 5, 2019 17:01

innerlee changed the title ~~Allow rename when selecting~~ Allow rename in getindex Oct 5, 2019

innerlee added 2 commits October 6, 2019 11:56

let 1.0 pass

a740b42

Signed-off-by: lizz <lizz@sensetime.com>

Allow rename when selecting

93219f4

Signed-off-by: lizz <lizz@sensetime.com>

innerlee force-pushed the zz/selectwithrename branch from 1ebb978 to 93219f4 Compare October 6, 2019 04:01

innerlee changed the title ~~Allow rename in getindex~~ Allow rename when selecting Oct 7, 2019

pdeffebach mentioned this pull request Dec 19, 2019

Standardizing working with multiple columns #2016

Closed

Drvi mentioned this pull request Dec 22, 2019

Selections.jl + DataFrames.jl #1936

Closed

This was referenced Jan 1, 2020

Allow containers with no columns in combine #2066

Merged

Add transformation and renaming to select and select! #2080

Merged

pdeffebach mentioned this pull request Jan 17, 2020

Standardizing where the target column goes in transform, insertcols and constructors #2084

Closed

bkamins removed the decision label Feb 12, 2020

bkamins removed this from the 1.0 milestone Feb 12, 2020

innerlee closed this Feb 12, 2020

Allow rename when selecting #1975

Allow rename when selecting #1975

Conversation

innerlee commented Oct 5, 2019

bkamins commented Oct 5, 2019

innerlee commented Oct 5, 2019

innerlee commented Oct 6, 2019

bkamins commented Oct 6, 2019

bkamins commented Oct 6, 2019

innerlee commented Oct 6, 2019 • edited Loading

innerlee commented Oct 6, 2019

bkamins commented Oct 6, 2019

innerlee commented Oct 6, 2019 • edited Loading

bkamins commented Oct 6, 2019

bkamins commented Oct 6, 2019 • edited Loading

pdeffebach commented Oct 6, 2019

oxinabox commented Oct 6, 2019

piever commented Oct 7, 2019

innerlee commented Oct 7, 2019

bkamins commented Oct 7, 2019 • edited Loading

oxinabox commented Oct 7, 2019

bkamins commented Oct 7, 2019

Drvi commented Nov 5, 2019 • edited Loading

bkamins commented Dec 8, 2019

bkamins commented Dec 8, 2019

nalimilan commented Dec 11, 2019

bkamins commented Dec 11, 2019

pdeffebach commented Dec 11, 2019

pdeffebach commented Dec 11, 2019 • edited Loading

bkamins commented Dec 11, 2019 • edited Loading

Option 1

Option 2

bkamins commented Dec 15, 2019 • edited Loading

piever commented Dec 17, 2019

bkamins commented Dec 17, 2019 • edited Loading

bkamins commented Dec 17, 2019

nalimilan commented Dec 19, 2019

bkamins commented Dec 19, 2019

nalimilan commented Dec 19, 2019

pdeffebach commented Dec 19, 2019

nalimilan commented Dec 19, 2019

pdeffebach commented Dec 19, 2019

bkamins commented Dec 19, 2019

bkamins commented Dec 25, 2019

bkamins commented Jan 6, 2020

bkamins commented Feb 12, 2020

innerlee commented Feb 12, 2020

innerlee commented Oct 6, 2019 •

edited

Loading

innerlee commented Oct 6, 2019 •

edited

Loading

bkamins commented Oct 6, 2019 •

edited

Loading

bkamins commented Oct 7, 2019 •

edited

Loading

Drvi commented Nov 5, 2019 •

edited

Loading

pdeffebach commented Dec 11, 2019 •

edited

Loading

bkamins commented Dec 11, 2019 •

edited

Loading

bkamins commented Dec 15, 2019 •

edited

Loading

bkamins commented Dec 17, 2019 •

edited

Loading