Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add an option to intersect arguments passed to Cols #3224

Merged
merged 6 commits into from
Dec 17, 2022
Merged

add an option to intersect arguments passed to Cols #3224

merged 6 commits into from
Dec 17, 2022

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Nov 15, 2022

Cols selector provides an easy way to take a union of column names.
I think it is natural to add names(df, cols1, cols2) that returns an intersection of column names selected by cols1 and cols2.
Why it is natural? names(df) returns all columns currently. names(df, cols) adds one condition on all columns. So it is natural to extend it to more conditions.

In this way we will have an easy way to do both column union and intersection. Why no special object for intersection:

  1. not to require users to learn another object.
  2. most common case for intersection is when you want conditions of different nature, e.g. column type and column name; and column types are accessible only via names as they are context sensitive

Example:

julia> select(df, names(df, Int, r"b"))
1×1 DataFrame
 Row │ b1    
     │ Int64
─────┼───────
   1 │     2

@yjunechoe - what is your opinion on this proposal?

CC @pdeffebach

@bkamins bkamins added this to the 1.5 milestone Nov 15, 2022
@pdeffebach
Copy link
Contributor

I'm iffy on this. It's not obvious from reading code (aside from documentation) that multiple arguments should mean AND rather than OR.

But it is awkward that Cols is an OR operation. But its not a good idea to add an AND struct equivalent since that complicates the mini-language and Cols is in DataAPI.jl.

I think OP's complaint is not that big a deal, tbh. using names twice isn't the end of the world.

@bkamins
Copy link
Member Author

bkamins commented Nov 15, 2022

I think OP's complaint is not that big a deal, tbh. using names twice isn't the end of the world.

Also you can use instead of calling intersect. Still - I want to put it on a table to discuss the issue.

@yjunechoe
Copy link

yjunechoe commented Nov 15, 2022

I'm happy with leaving the whole set operation vs. boolean algebra issue as a philosophical difference, where you just pick one and stick with it! In R, data.table sticks to the former while dplyr/tidyselect sticks to the latter - I accept their respective choices and people don't complain within their chosen bubbles.

Anyways if/when DF.jl is seriously considering supporting boolean algebra for column selection, I think it might be worth getting input from existing users to see their preferences/difficulties - maybe sets are more intuitive for Julia's more math-y users!

But really thank you for giving my ramble a serious thought :)

Edit: I forgot to give an actual thought about the proposal but just in case - I like this extension of names() and would use it myself. To me this just extends names() to take multiple predicates as opposed to just 1, so not much overhead for grokking this new feature. Collecting the conditions with AND also makes sense to me, as I understand it like "adding more conditions" with a consequence of "selecting a more restricted set"

@bkamins
Copy link
Member Author

bkamins commented Nov 16, 2022

Let us wait for @nalimilan to comment (as usual 😄). Plus maybe let me add the three options to vote:

  • 👍 add it
  • 👎 do not add it
  • 👀 can be added, but it is not crucial to have it, so it is also OK not to add it

(I vote for :three but I add all options below to make easy voting)

@bkamins
Copy link
Member Author

bkamins commented Nov 16, 2022

Given a discussion on Slack the following thing could be added alternatively:

  1. make All accept passed arguments and return their intersection (Cols returns union); it would be logical and not introduce a new type, but the problem is that All had a different meaning in the past, so it would require DataAPI.jl 2.0 release. Fortunately I do not think that any package depends on the old meaning now (in DataFrames.jl we error if someone tries to pass the old behavior)
  2. add Condition selector (name to be discussed) that would select the columns based on their contents, so e.g. one would write
Condition(x -> eltype(x) >: Missing) # pick all columns that allow missings
Condition(x -> mean(x) > 0) # pick all columns that have mean greater than 0
Condition(x -> !any(ismissing, x)) # pick all columns that do not have any missing values actually stored

@adienes
Copy link
Contributor

adienes commented Nov 17, 2022

For context: I like the correspondence between names(df, pred) and names(df[!, Cols(pred)]) and making names take multiple arguments as an implicit AND would break that. It seems fairly natural to have All function as AND and Cols function as OR, but the names are not nicely matched so might have to think about what names fit.

One question. Why does Condition need to be explicit in the case of checking types? Would it be possible to pass as a condition to Cols / All / Not anything we can pass to names? Imagine if you could write df[!, Not(String)]

@bkamins
Copy link
Member Author

bkamins commented Nov 17, 2022

Why does Condition need to be explicit in the case of checking types?

It currently is required. We could change it (this requires a careful consideration, but should be doable).

So the current list of things to do is:

  1. Add support for passing types in Cols and Not selectors.
  2. Add Condition (or something similar) to allow passing conditions based on column contents.
  3. Deprecate (but not really remove in the long term) Cols in favor of AnyOf and similarly All() should be AllOf() and add AllOf() that would take an intersection of passed selectors. (AnyOf and AllOf are tentative names)

@nalimilan
Copy link
Member

I'm not a fan of passing multiple arguments to names either, it's not obvious that it should take the intersection rather than the union. Maybe we could allow something like union(Cols(...), Cols(...)) and intersect(Cols(...), Cols(...)) (and corresponding operators)?

Note that dplyr uses all_of and any_of for something different: these are equivalent except that the former throws an error when a column doesn't exist. So it would be confusing to reuse these names for a different purpose IMO.

Regarding Condition, something like that would indeed be useful, but I find it hard to find a good name. dplyr uses where for that, which is too general (like "condition"). Maybe Cols(values=pred)? Or ColsSOMETHING(...)?

@bkamins
Copy link
Member Author

bkamins commented Nov 17, 2022

Yes - I think the concept of passing multiple arguments to names was not the best given the feedback.

union(Cols(...), Cols(...)) is not needed, it is just Cols.

intersect(Cols(...), Cols(...)) - this could be added, but we would need to call it Intersect (to enforce non-standard behavior, as intersect is a standard function in Base Julia).

Cols(values=pred) - this would require least changes.

Let us wait for others to comment. There is no rush with making a decision here.

Thank you!

@nalimilan
Copy link
Member

intersect(Cols(...), Cols(...)) - this could be added, but we would need to call it Intersect (to enforce non-standard behavior, as intersect is a standard function in Base Julia).

Currently intersect(Cols(...), Cols(...)) throws an error, so at least defining a method for it wouldn't be incompatible with the definition of intersect in Base. Actually it wouldn't be too different from the function definition "Construct the set containing those elements which appear in all of the arguments.": it would just be lazy, as Cols contains expressions which can only be resolved when actually trying to access columns.

@bkamins
Copy link
Member Author

bkamins commented Nov 23, 2022

but the problem is that intersect([1,2,3], ["a", "b", "c"]) would produce an incorrect result (in other words - we need Intersect to make sure we pass normalized column names to it).

@bkamins
Copy link
Member Author

bkamins commented Nov 28, 2022

Given the comments the to-do list would be (the idea is not to add any new exported names):

  • Add support for passing types in Cols selector. (relatively easy)
  • Add support for Cols(values=pred), where pred is applied to columns (note that then it is not allowed to pass any other condition in Cols);
  • Add support for Cols(args...; operation::Symbol=:union) where operation can be :union or :intersection; by default :union is performed.
  • Deprecate passing type or predicate to names (require passing a valid column selector, the deprecation will be to wrap them in Cols (in this way we will have a bit more typing, but the design will be more consistent - names and indexing will work in the same way)

@nalimilan
Copy link
Member

Sounds good. I'm just not sure the last point (deprecation) is worth it. We allow select(df, r"^x" => f) so it makes sense to keep allowing names(df, r"x"). That said, we don't allow,select(df, r"^x" => Int), which is a bit inconsistent with names(df, Int).

@bkamins
Copy link
Member Author

bkamins commented Nov 29, 2022

@nalimilan - Deprecation is indeed optional. However, can you clarify your last point why you do not think it is a good idea.

My reasoning is:

  • keep allowing names(df, r"x") as we allow select(df, r"x" => fun) and df[:, r"x"] (r"x" is a valid column selector);
  • do not allow names(df, Int) as we do not allow select(df, Int => fun) and df[:, Int] (Int is not a valid column selector); start requiring names(df, Cols(Int));
  • do not allow names(df, startswith("x")) as we do not allow select(df, startswith("x") => fun) and df[:, startswith("x")] (startswith("x") is not a valid column selector); start requiring names(df, Cols(startswith("x")));

@adienes
Copy link
Contributor

adienes commented Nov 29, 2022

with the recent ability to middle slurp would it be considered to allow Cols(args..., operation::Symbol) so that it can be accessed like Cols(r"x", r"y", :union) and omit the operation kwarg?

@bkamins
Copy link
Member Author

bkamins commented Nov 29, 2022

We need to keep Julia 1.6 compatibility.

@bkamins
Copy link
Member Author

bkamins commented Dec 2, 2022

OK - now I remember the problem with Cols(Int) and Cols(values=...) -> the issue is that AbstractIndex is not aware of column values (only of column names).

See #3034

@krynju - if we added this could DTables.jl efficiently support this? (I fear that not - and we have to stick with name-only selectors and eachcol-based style for value-based selection). But maybe it is OK.

Even if this is not doable we can add operation kwarg to Cols.

@krynju
Copy link

krynju commented Dec 3, 2022

Under the hood I take col names and col types from Tables.schema, so if the underlying table provides types when calling schema on it then it wouldn't be an issue to support type based selectors in general

On type based selectors: I guess it may be useful in some specific cases? Personally I'd rather stick to name based just to keep my confidence high and not depend on the input type, which may be suddenly parsed differently from version to version (the String to InlineStrings transition that happened at some point)

On multiple column selectors: Confusing - at first look I thought that would return a union in the example from the OP.
#3224 (comment) sounds alright

@bkamins
Copy link
Member Author

bkamins commented Dec 3, 2022

@krynju - just to clarify. It is not only types, but also column values, so what currently is done like this:

julia> df = DataFrame(x1=[1, missing, missing], x2=[3, 2, 4], x3=[3, missing, 2], x4=Union{Int, Missing}[2, 4, 4])
3×4 DataFrame
 Row │ x1       x2     x3       x4
     │ Int64?   Int64  Int64?   Int64?
─────┼─────────────────────────────────
   1 │       1      3        3       2
   2 │ missing      2  missing       4
   3 │ missing      4        2       4

julia> names(df, any.(ismissing, eachcol(df))) # pick columns that contain missing values
2-element Vector{String}:
 "x1"
 "x3"

would get some special syntax, e.g. Cols(values = x -> any(ismissing, x)). And I feared that this is something that would be problematic to support via schema.

(but in general I understand that you agree that just sticking to column names based selectors is safer for now - right?).

@krynju
Copy link

krynju commented Dec 3, 2022

@krynju - just to clarify. It is not only types, but also column values, so what currently is done like this:

julia> df = DataFrame(x1=[1, missing, missing], x2=[3, 2, 4], x3=[3, missing, 2], x4=Union{Int, Missing}[2, 4, 4])
3×4 DataFrame
 Row │ x1       x2     x3       x4
     │ Int64?   Int64  Int64?   Int64?
─────┼─────────────────────────────────
   1 │       1      3        3       2
   2 │ missing      2  missing       4
   3 │ missing      4        2       4

julia> names(df, any.(ismissing, eachcol(df))) # pick columns that contain missing values
2-element Vector{String}:
 "x1"
 "x3"

would get some special syntax, e.g. Cols(values = x -> any(ismissing, x)). And I feared that this is something that would be problematic to support via schema.

Alright, I get it. Types are still ok

For values: I technically could support this by interpreting the input adequately, but I see little value in spending time on this as this seems like a niche use case and I'd rather have the user write the DTables code to figure this out and make it as simple as possible

For DTables running a full column check against all columns is just wasteful.

(but in general I understand that you agree that just sticking to column names based selectors is safer for now - right?).

Yes, names are always there and they're reliable.
Types are also ok, but I think they're not mandatory, so that's always a concern

@bkamins
Copy link
Member Author

bkamins commented Dec 4, 2022

I also have just realized that we allow for names for DataFrameRow and GroupedDataFrame (and we want to). The same with all other column selectors. In this case it is even more problematic to do column selection using column values. Eltype would be acceptable though.

bkamins added a commit to JuliaData/DataAPI.jl that referenced this pull request Dec 4, 2022
@bkamins
Copy link
Member Author

bkamins commented Dec 4, 2022

After the discussions in this PR I am going to limit myself to adding operation kwarg to Cols. Other issues will be decided separately.

First JuliaData/DataAPI.jl#58 needs to be decided an released.

@bkamins bkamins changed the title add an option to pass multiple column selectors to names add an option to intersect arguments passed to Cols Dec 5, 2022
src/other/index.jl Outdated Show resolved Hide resolved
src/other/index.jl Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
NEWS.md Outdated Show resolved Hide resolved
src/other/index.jl Outdated Show resolved Hide resolved
test/index.jl Outdated Show resolved Hide resolved
@bkamins bkamins closed this Dec 16, 2022
@bkamins bkamins reopened this Dec 16, 2022
@bkamins bkamins merged commit 83285f8 into main Dec 17, 2022
@bkamins bkamins deleted the bk/names branch December 17, 2022 18:55
@bkamins
Copy link
Member Author

bkamins commented Dec 17, 2022

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants