Return data frame unaltered when Not only includes columns that are not in data frame #2197

Closed
CameronBieganek opened this issue Apr 16, 2020 · 8 comments
Labels
decision feature non-breaking The proposed change is not breaking

@CameronBieganek

CameronBieganek commented Apr 16, 2020

Currently we have this behavior:

julia> df = DataFrame(a=1:2);

julia> select(df, Not(:b))
ERROR: ArgumentError: column name :b not found in the data frame; existing most similar names are: :a
Stacktrace:
 [1] lookupname at /Users/bieganek/.julia/packages/DataFrames/S3ZFo/src/other/index.jl:253 [inlined]
 [2] getindex at /Users/bieganek/.julia/packages/DataFrames/S3ZFo/src/other/index.jl:259 [inlined]
 [3] getindex at /Users/bieganek/.julia/packages/DataFrames/S3ZFo/src/other/index.jl:192 [inlined]
 [4] select(::DataFrame, ::InvertedIndex{Symbol}; copycols::Bool) at /Users/bieganek/.julia/packages/DataFrames/S3ZFo/src/dataframe/dataframe.jl:881
 [5] select(::DataFrame, ::InvertedIndex{Symbol}) at /Users/bieganek/.julia/packages/DataFrames/S3ZFo/src/dataframe/dataframe.jl:881
 [6] top-level scope at REPL[3]:1

However, it seems like the data frame should be returned unaltered if the Not only includes columns that don't exist in the data frame.

My motivation for this is that I want to apply select!(df, Not(...)) to a bunch of data frames, some of which have the columns I'm dropping, and some of which do not have the columns I'm dropping.
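
One way to get this effect with the current behavior is to compute the columns to keep first, so that only existing names are ever passed to select!. A minimal sketch, assuming DataFrames.jl is installed; drop_if_present! is a hypothetical helper name, not part of the package:

```julia
using DataFrames

# Hypothetical helper: drop the listed columns from `df` in place,
# silently ignoring any names that are not present.
function drop_if_present!(df::DataFrame, cols::AbstractVector{Symbol})
    # propertynames(df) returns the column names as Symbols
    keep = setdiff(propertynames(df), cols)
    return select!(df, keep)
end

df1 = DataFrame(a=1:2, b=3:4)
df2 = DataFrame(a=1:2)

# The same call works whether or not :b exists.
drop_if_present!(df1, [:b])
drop_if_present!(df2, [:b])
```

This sidesteps Not entirely, at the cost of an extra helper for each code base that needs it.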

@bkamins bkamins added decision feature non-breaking The proposed change is not breaking labels Apr 16, 2020
@bkamins bkamins added this to the 1.x milestone Apr 16, 2020
@bkamins
Member

bkamins commented Apr 16, 2020

The initial design is defensive (this is what @nalimilan likes) so that post 1.0 we can change an error into working code if we prefer to.

What you propose makes sense, but on the other hand I can imagine that some people would prefer it to error. Note that this is the behaviour implemented in InvertedIndices.jl:

julia> x = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> x[Not(4)]
ERROR: BoundsError: attempt to access 3-element Array{Int64,1} at index [Not(4)]

I will open an issue there as the decision should be first made in the source package.

@bkamins
Member

bkamins commented Apr 16, 2020

OK - so probably we can add it.

@CameronBieganek
Author

I looked at dplyr for inspiration, and I think the solution is to add Any. Except of course that conflicts with the Any type, so maybe AnyOf? Here is the dplyr behavior:

> library(dplyr)

> df <- data.frame(a=1:2); df
#   a
# 1 1
# 2 2

> df %>% select(-b)
# Error: Can't subset columns that don't exist.
# ✖ The column `b` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.

> df %>% select(-all_of('b'))
# Error: Can't subset columns that don't exist.
# ✖ The column `b` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.

> df %>% select(-any_of('b'))
#   a
# 1 1
# 2 2

So, all_of is strict and any_of is not strict. To be consistent, this might require renaming All to AllOf. 😂 Then the correct idiom in Julia would be this:

df = DataFrame(a=1:2);
select(df, Not(AnyOf(:b)))

@CameronBieganek
Author

I've been trying to reconcile the various operations in terms of set operations. Unfortunately it doesn't really hold together because of bounds checking. I was initially framing the set operations in the context of Ω, the set of all possible names. Let N be the set of column names in the data frame and let S be an arbitrary set of names. Column selection works by taking the intersection N ∩ S. As the documentation says, All(S1, S2) is the union of S1 and S2. Then Not(S) should be the complement of S (which I'll denote compl(S)) relative to Ω. So selection with Not(S) is equivalent to N ∩ compl(S) which is equivalent to the set difference N \ S. So when S and N are disjoint, N \ S == N. Thus all columns are selected, i.e. select(df, Not(:b)) == df. But in order for this analogy to completely work, we would have to remove bounds errors. For example, df[:, :b] would need to return an empty data frame, since the intersection of {:a} and {:b} is the null set. And similarly, df[:, All(:a, :b)] should return df, since the intersection of {:a} with {:a, :b} is {:a}.

The problems with the set interpretation go away if Ω = N, so maybe that's an argument for leaving things the way they are now. Perhaps it would be better just to add a clarifying sentence to the docs along the lines of "The set operations represented by All and Not are restricted to the set of names currently in the data frame. Attempts to use names outside this set will produce a bounds error."

@CameronBieganek
Author

CameronBieganek commented Apr 17, 2020

Wait, I think I figured out a consistent way to interpret this. My previous idea was slightly more complicated than it needed to be. I think this approach makes sense:

  • Ω is the set of all possible names.
  • N is the set of all names in the data frame.
  • Other capital letters represent arbitrary subsets of Ω.
  • select has the signature select(df, S), where S ⊆ Ω. If S ⊈ N, then a bounds error is thrown.
  • AllOf(A, B) := A ∪ B
  • Not(A) := N \ A
  • AnyOf(A) := N ∩ A

Examples:

# Note that I'm mixing set notation and Julia syntax here. 

N := {a, b}  # df has cols a, b

select(df, :a)  # works as usual, since {a} ⊆ N

S = Not(c) = {a, b} \ {c} = N
select(df, S) == df

S = AllOf(b, c) = {b, c}
select(df, S)  # bounds error since {b, c} ⊈ N 

S = AnyOf(b, c) = {a, b} ∩ {b, c} = {b}
select(df, S) == df[:, [:b]]
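
The definitions above can be checked with plain sets. The following is only a sketch of the proposed semantics using Base Julia: the names allof, not, anyof, and select_names are hypothetical stand-ins, and none of them is an existing DataFrames.jl API. Because Not and AnyOf depend on N, the sketch passes N explicitly.

```julia
# Column names are modelled as Sets of Symbols; no data frame is involved.

allof(sets::Set{Symbol}...) = union(sets...)             # AllOf(A, B) := A ∪ B
not(N::Set{Symbol}, A::Set{Symbol}) = setdiff(N, A)      # Not(A) := N \ A
anyof(N::Set{Symbol}, A::Set{Symbol}) = intersect(N, A)  # AnyOf(A) := N ∩ A

# select(df, S): return S when S ⊆ N, otherwise throw,
# mirroring the bounds error described above.
function select_names(N::Set{Symbol}, S::Set{Symbol})
    if !issubset(S, N)
        throw(ArgumentError("names not found: $(collect(setdiff(S, N)))"))
    end
    return S
end

N = Set([:a, :b])  # df has cols a, b

select_names(N, Set([:a]))                # works, since {a} ⊆ N
select_names(N, not(N, Set([:c]))) == N   # Not(c) selects every column
select_names(N, anyof(N, Set([:b, :c])))  # only b survives the intersection
# select_names(N, allof(Set([:b]), Set([:c]))) would throw, since {b, c} ⊈ N
```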

In summary:

Extending Not as described in my original post does allow for a simple set interpretation of the actions of Not, AllOf, and AnyOf. In fact, with this interpretation, AnyOf is not needed to get the behavior of Not desired in the original post, although it still might be useful in some situations.

@bkamins
Member

bkamins commented Apr 17, 2020

Thank you for the detailed analysis. We will get back to it later, as it is non-breaking. Currently, the massive rework following #1926 is the priority for getting the 0.21 release out.

@pdeffebach
Contributor

I agree that this should be added. Excited about #1926, though!

@bkamins
Member

bkamins commented Oct 15, 2022

The practice that has settled here is to use Cols with a predicate. It is a more general pattern, but it covers this use case:

julia> select(df, Cols(!=(:b)))
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
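
The same pattern extends to dropping several columns from data frames that may or may not contain them. A minimal sketch, assuming DataFrames.jl 1.x and assuming, as the DataAPI.jl documentation for Cols describes, that the predicate receives each column name as a String; under that assumption the comparison is written against strings rather than Symbols. The unwanted tuple and the dfs vector are hypothetical:

```julia
using DataFrames

unwanted = ("b", "c")  # hypothetical names to drop; some may be absent

dfs = [DataFrame(a=1:2, b=3:4), DataFrame(a=1:2)]

# Keep every column whose name is not in `unwanted`; no error is raised
# for names that a given data frame does not have.
cleaned = [select(df, Cols(name -> name ∉ unwanted)) for df in dfs]
```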

If you feel the issue should be reopened, please comment.

@bkamins bkamins closed this as completed Oct 15, 2022