add an option to intersect arguments passed to Cols #3224

bkamins · 2022-11-15T17:56:27Z

Cols selector provides an easy way to take a union of column names.
I think it is natural to add names(df, cols1, cols2) that returns an intersection of column names selected by cols1 and cols2.
Why is it natural? names(df) currently returns all columns. names(df, cols) adds one condition on the columns. So it is natural to extend it to more conditions.

In this way we will have an easy way to do both column union and intersection. Why no special object for intersection:

not to require users to learn another object.
most common case for intersection is when you want conditions of different nature, e.g. column type and column name; and column types are accessible only via names as they are context sensitive

Example:

julia> select(df, names(df, Int, r"b"))
1×1 DataFrame
 Row │ b1    
     │ Int64
─────┼───────
   1 │     2

@yjunechoe - what is your opinion on this proposal?

CC @pdeffebach

pdeffebach · 2022-11-15T19:19:00Z

I'm iffy on this. It's not obvious from reading code (aside from documentation) that multiple arguments should mean AND rather than OR.

It is awkward that Cols is an OR operation. But it's not a good idea to add an AND struct equivalent, since that complicates the mini-language and Cols is in DataAPI.jl.

I think OP's complaint is not that big a deal, tbh. using names twice isn't the end of the world.

bkamins · 2022-11-15T19:25:02Z

> I think OP's complaint is not that big a deal, tbh. using names twice isn't the end of the world.

Also, you can use ∩ instead of calling intersect. Still, I want to put this on the table so we can discuss the issue.
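For reference, the workaround mentioned here can be sketched with the existing API. This is a minimal example on a hypothetical three-column data frame (the `df` below is an assumption, not the one from the original post); it only uses `names` with a single selector plus Base `intersect` and `∩`.

```julia
using DataFrames

# Hypothetical data: two Int columns and two columns whose name contains "b".
df = DataFrame(a1=1, b1=2, b2="x")

int_cols = names(df, Int)    # columns whose element type is a subtype of Int
b_cols = names(df, r"b")     # columns whose name matches the regex

# Intersection of the two selections, written in two equivalent ways.
both = intersect(int_cols, b_cols)
both_op = int_cols ∩ b_cols

select(df, both)             # keeps only column b1
```

Both forms call `names` twice, which is the extra typing the proposal is trying to avoid.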

yjunechoe · 2022-11-15T19:49:19Z

I'm happy with leaving the whole set operation vs. boolean algebra issue as a philosophical difference, where you just pick one and stick with it! In R, data.table sticks to the former while dplyr/tidyselect sticks to the latter - I accept their respective choices and people don't complain within their chosen bubbles.

Anyways if/when DF.jl is seriously considering supporting boolean algebra for column selection, I think it might be worth getting input from existing users to see their preferences/difficulties - maybe sets are more intuitive for Julia's more math-y users!

But really thank you for giving my ramble a serious thought :)

Edit: I forgot to give an actual thought about the proposal but just in case - I like this extension of names() and would use it myself. To me this just extends names() to take multiple predicates as opposed to just 1, so not much overhead for grokking this new feature. Collecting the conditions with AND also makes sense to me, as I understand it like "adding more conditions" with a consequence of "selecting a more restricted set"

bkamins · 2022-11-16T17:03:22Z

Let us wait for @nalimilan to comment (as usual 😄). Plus maybe let me add the three options to vote:

👍 add it
👎 do not add it
👀 can be added, but it is not crucial to have it, so it is also OK not to add it

(I vote for :three but I add all options below to make voting easy)

bkamins · 2022-11-16T22:28:15Z

Following a discussion on Slack, these alternatives could be added instead:

make All accept passed arguments and return their intersection (Cols returns union); it would be logical and not introduce a new type, but the problem is that All had a different meaning in the past, so it would require DataAPI.jl 2.0 release. Fortunately I do not think that any package depends on the old meaning now (in DataFrames.jl we error if someone tries to pass the old behavior)
add Condition selector (name to be discussed) that would select the columns based on their contents, so e.g. one would write

Condition(x -> eltype(x) >: Missing) # pick all columns that allow missings
Condition(x -> mean(x) > 0) # pick all columns that have mean greater than 0
Condition(x -> !any(ismissing, x)) # pick all columns that do not have any missing values actually stored
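`Condition` is only a proposed name at this point in the thread. As a rough sketch, the same three selections can already be written with `eachcol` and a Boolean mask passed to `names`. The data frame below is hypothetical, and `skipmissing` is added in the mean-based case (an assumption on my part) so that the comparison always yields a `Bool`.

```julia
using DataFrames
using Statistics

# Hypothetical data: a has no missings, b stores a missing,
# c allows missings but does not store any.
df = DataFrame(a=[1, -5], b=[2, missing], c=Union{Int, Missing}[-3, -4])

# Columns whose element type allows missing values.
allow_missing = names(df, [eltype(col) >: Missing for col in eachcol(df)])

# Columns whose mean (ignoring missings) is greater than 0.
positive_mean = names(df, [mean(skipmissing(col)) > 0 for col in eachcol(df)])

# Columns that do not actually store any missing value.
no_missing = names(df, [!any(ismissing, col) for col in eachcol(df)])
```

Each mask requires scanning the column values, which is why this kind of selection is harder to support in a name-only index.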

adienes · 2022-11-17T00:27:11Z

For context: I like the correspondence between names(df, pred) and names(df[!, Cols(pred)]) and making names take multiple arguments as an implicit AND would break that. It seems fairly natural to have All function as AND and Cols function as OR, but the names are not nicely matched so might have to think about what names fit.

One question. Why does Condition need to be explicit in the case of checking types? Would it be possible to pass as a condition to Cols / All / Not anything we can pass to names? Imagine if you could write df[!, Not(String)]

bkamins · 2022-11-17T07:44:05Z

> Why does Condition need to be explicit in the case of checking types?

It currently is required. We could change it (this requires a careful consideration, but should be doable).

So the current list of things to do is:

Add support for passing types in Cols and Not selectors.
Add Condition (or something similar) to allow passing conditions based on column contents.
Deprecate (but not actually remove in the long term) Cols in favor of AnyOf; similarly, All() should become AllOf(), and AllOf() with arguments would take an intersection of the passed selectors. (AnyOf and AllOf are tentative names.)

nalimilan · 2022-11-17T21:16:00Z

I'm not a fan of passing multiple arguments to names either, it's not obvious that it should take the intersection rather than the union. Maybe we could allow something like union(Cols(...), Cols(...)) and intersect(Cols(...), Cols(...)) (and corresponding operators)?

Note that dplyr uses all_of and any_of for something different: these are equivalent except that the former throws an error when a column doesn't exist. So it would be confusing to reuse these names for a different purpose IMO.

Regarding Condition, something like that would indeed be useful, but I find it hard to find a good name. dplyr uses where for that, which is too general (like "condition"). Maybe Cols(values=pred)? Or ColsSOMETHING(...)?

bkamins · 2022-11-17T22:28:45Z

Yes - I think the concept of passing multiple arguments to names was not the best given the feedback.

union(Cols(...), Cols(...)) is not needed, it is just Cols.

intersect(Cols(...), Cols(...)) - this could be added, but we would need to call it Intersect (to enforce non-standard behavior, as intersect is a standard function in Base Julia).

Cols(values=pred) - this would require least changes.

Let us wait for others to comment. There is no rush with making a decision here.

Thank you!

nalimilan · 2022-11-23T08:16:19Z

> intersect(Cols(...), Cols(...)) - this could be added, but we would need to call it Intersect (to enforce non-standard behavior, as intersect is a standard function in Base Julia).

Currently intersect(Cols(...), Cols(...)) throws an error, so at least defining a method for it wouldn't be incompatible with the definition of intersect in Base. Actually it wouldn't be too different from the function definition "Construct the set containing those elements which appear in all of the arguments.": it would just be lazy, as Cols contains expressions which can only be resolved when actually trying to access columns.

bkamins · 2022-11-23T09:37:39Z

But the problem is that intersect([1,2,3], ["a", "b", "c"]) would produce an incorrect result (in other words, we need Intersect to make sure we pass normalized column names to it).
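A small sketch of the normalization issue, on a hypothetical three-column data frame: Base `intersect` compares the raw selector values, so integer positions and string names never match, even when they refer to the same columns. Resolving each selector through `names` first gives the intended result.

```julia
using DataFrames

# Hypothetical data frame whose columns 1, 2, 3 are named a, b, c.
df = DataFrame(a=1, b=2, c=3)

# Base intersect sees integers on one side and strings on the other.
raw = intersect([1, 2, 3], ["a", "b", "c"])

# Normalizing both selectors to column names first.
normalized = intersect(names(df, [1, 2, 3]), names(df, ["a", "b", "c"]))
```

Here `raw` is empty, while `normalized` contains all three column names.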

bkamins · 2022-11-28T14:34:42Z

Given the comments, the to-do list would be as follows (the idea is not to add any new exported names):

Add support for passing types in Cols selector. (relatively easy)
Add support for Cols(values=pred), where pred is applied to columns (note that then it is not allowed to pass any other condition in Cols);
Add support for Cols(args...; operation::Symbol=:union) where operation can be :union or :intersection; by default :union is performed.
Deprecate passing a type or predicate to names (require passing a valid column selector; the deprecation will be to wrap them in Cols). This way we will have a bit more typing, but the design will be more consistent: names and indexing will work in the same way.

nalimilan · 2022-11-29T07:27:50Z

Sounds good. I'm just not sure the last point (deprecation) is worth it. We allow select(df, r"^x" => f), so it makes sense to keep allowing names(df, r"x"). That said, we don't allow select(df, r"^x" => Int), which is a bit inconsistent with names(df, Int).

bkamins · 2022-11-29T16:13:51Z

@nalimilan - Deprecation is indeed optional. However, can you clarify why you do not think it is a good idea?

My reasoning is:

keep allowing names(df, r"x") as we allow select(df, r"x" => fun) and df[:, r"x"] (r"x" is a valid column selector);
do not allow names(df, Int) as we do not allow select(df, Int => fun) and df[:, Int] (Int is not a valid column selector); start requiring names(df, Cols(Int));
do not allow names(df, startswith("x")) as we do not allow select(df, startswith("x") => fun) and df[:, startswith("x")] (startswith("x") is not a valid column selector); start requiring names(df, Cols(startswith("x")));
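The consistency argument above can be sketched on a hypothetical data frame: a regex is accepted by `names`, `select`, and indexing alike, and a predicate behaves the same way once it is wrapped in `Cols`. The example uses only selectors that are valid today; it does not show the proposed deprecation itself.

```julia
using DataFrames

# Hypothetical data frame with two columns starting with "x".
df = DataFrame(x1=1, x2=2, y=3)

# A regex is a valid column selector everywhere.
by_regex_names = names(df, r"x")
by_regex_select = names(select(df, r"x"))
by_regex_index = names(df[:, r"x"])

# A predicate wrapped in Cols is also a valid selector everywhere.
pred = Cols(startswith("x"))
by_pred_names = names(df, pred)
by_pred_select = names(select(df, pred))
by_pred_index = names(df[:, pred])
```

Under the proposal, `names(df, startswith("x"))` would be deprecated in favor of the wrapped form, so that `names` accepts exactly what `select` and indexing accept.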

adienes · 2022-11-29T19:02:28Z

With the recent ability to slurp in the middle of an argument list, would you consider allowing Cols(args..., operation::Symbol), so that it can be called like Cols(r"x", r"y", :union) without the operation kwarg?

bkamins · 2022-11-29T19:19:55Z

We need to keep Julia 1.6 compatibility.

bkamins · 2022-12-02T21:31:33Z

OK - now I remember the problem with Cols(Int) and Cols(values=...) -> the issue is that AbstractIndex is not aware of column values (only of column names).

See #3034

@krynju - if we added this, could DTables.jl support it efficiently? (I fear not, in which case we have to stick with name-only selectors and an eachcol-based style for value-based selection.) But maybe it is OK.

Even if this is not doable we can add operation kwarg to Cols.

krynju · 2022-12-03T19:03:17Z

Under the hood I take col names and col types from Tables.schema, so if the underlying table provides types when calling schema on it then it wouldn't be an issue to support type based selectors in general

On type based selectors: I guess it may be useful in some specific cases? Personally I'd rather stick to name based just to keep my confidence high and not depend on the input type, which may be suddenly parsed differently from version to version (the String to InlineStrings transition that happened at some point)

On multiple column selectors: Confusing - at first look I thought that would return a union in the example from the OP.
#3224 (comment) sounds alright

bkamins · 2022-12-03T20:37:14Z

@krynju - just to clarify. It is not only types, but also column values, so what currently is done like this:

julia> df = DataFrame(x1=[1, missing, missing], x2=[3, 2, 4], x3=[3, missing, 2], x4=Union{Int, Missing}[2, 4, 4])
3×4 DataFrame
 Row │ x1       x2     x3       x4
     │ Int64?   Int64  Int64?   Int64?
─────┼─────────────────────────────────
   1 │       1      3        3       2
   2 │ missing      2  missing       4
   3 │ missing      4        2       4

julia> names(df, any.(ismissing, eachcol(df))) # pick columns that contain missing values
2-element Vector{String}:
 "x1"
 "x3"

would get some special syntax, e.g. Cols(values = x -> any(ismissing, x)). And I feared that this is something that would be problematic to support via schema.

(but in general I understand that you agree that just sticking to column names based selectors is safer for now - right?).

krynju · 2022-12-03T21:39:12Z

> @krynju - just to clarify. It is not only types, but also column values, so what currently is done like this:
> julia> df = DataFrame(x1=[1, missing, missing], x2=[3, 2, 4], x3=[3, missing, 2], x4=Union{Int, Missing}[2, 4, 4])
> 3×4 DataFrame
>  Row │ x1       x2     x3       x4
>      │ Int64?   Int64  Int64?   Int64?
> ─────┼─────────────────────────────────
>    1 │       1      3        3       2
>    2 │ missing      2  missing       4
>    3 │ missing      4        2       4
>
> julia> names(df, any.(ismissing, eachcol(df))) # pick columns that contain missing values
> 2-element Vector{String}:
>  "x1"
>  "x3"
> would get some special syntax, e.g. Cols(values = x -> any(ismissing, x)). And I feared that this is something that would be problematic to support via schema.

Alright, I get it. Types are still ok

For values: I technically could support this by interpreting the input adequately, but I see little value in spending time on this as this seems like a niche use case and I'd rather have the user write the DTables code to figure this out and make it as simple as possible

For DTables running a full column check against all columns is just wasteful.

> (but in general I understand that you agree that just sticking to column names based selectors is safer for now - right?).

Yes, names are always there and they're reliable.
Types are also ok, but I think they're not mandatory, so that's always a concern

bkamins · 2022-12-04T08:18:13Z

I have also just realized that we allow names for DataFrameRow and GroupedDataFrame (and we want to). The same goes for all other column selectors. In these cases it is even more problematic to do column selection using column values. Eltype would be acceptable, though.

bkamins · 2022-12-04T22:27:10Z

After the discussions in this PR I am going to limit myself to adding operation kwarg to Cols. Other issues will be decided separately.

First JuliaData/DataAPI.jl#58 needs to be decided and released.

bkamins · 2022-12-17T18:55:58Z

Thank you!

bkamins requested a review from nalimilan November 15, 2022 17:56

bkamins added the feature label Nov 15, 2022

bkamins added this to the 1.5 milestone Nov 15, 2022

bkamins added a commit to JuliaData/DataAPI.jl that referenced this pull request Dec 4, 2022

Add operation kwarg to Cols

1e844ae

x-ref JuliaData/DataFrames.jl#3224

bkamins mentioned this pull request Dec 4, 2022

Add operator kwarg to Cols JuliaData/DataAPI.jl#58

Merged

bkamins changed the title from "add an option to pass multiple column selectors to names" to "add an option to intersect arguments passed to Cols" Dec 5, 2022

add intersect support

935c363

bkamins force-pushed the bk/names branch from e6ca7b8 to 935c363 Compare December 5, 2022 08:19

nalimilan approved these changes Dec 11, 2022

Apply suggestions from code review

a4e7218

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins added 3 commits December 11, 2022 23:50

Apply suggestions from code review

139602e

change operator to operation

661ccf0

Merge branch 'bk/names' of https://github.com/JuliaData/DataFrames.jl …

d5a03f9

…into bk/names

bkamins closed this Dec 16, 2022

bkamins reopened this Dec 16, 2022

fix tests

3679722

nalimilan approved these changes Dec 17, 2022

bkamins merged commit 83285f8 into main Dec 17, 2022

bkamins deleted the bk/names branch December 17, 2022 18:55
