Return data frame unaltered when Not only includes columns that are not in data frame #2197

CameronBieganek · 2020-04-16T20:29:40Z

Currently we have this behavior:

julia> df = DataFrame(a=1:2);

julia> select(df, Not(:b))
ERROR: ArgumentError: column name :b not found in the data frame; existing most similar names are: :a
Stacktrace:
 [1] lookupname at /Users/bieganek/.julia/packages/DataFrames/S3ZFo/src/other/index.jl:253 [inlined]
 [2] getindex at /Users/bieganek/.julia/packages/DataFrames/S3ZFo/src/other/index.jl:259 [inlined]
 [3] getindex at /Users/bieganek/.julia/packages/DataFrames/S3ZFo/src/other/index.jl:192 [inlined]
 [4] select(::DataFrame, ::InvertedIndex{Symbol}; copycols::Bool) at /Users/bieganek/.julia/packages/DataFrames/S3ZFo/src/dataframe/dataframe.jl:881
 [5] select(::DataFrame, ::InvertedIndex{Symbol}) at /Users/bieganek/.julia/packages/DataFrames/S3ZFo/src/dataframe/dataframe.jl:881
 [6] top-level scope at REPL[3]:1

However, it seems like the data frame should be returned unaltered if the Not only includes columns that don't exist in the data frame.

My motivation for this is that I want to apply select!(df, Not(...)) to a bunch of data frames, some of which have the columns I'm dropping, and some of which do not have the columns I'm dropping.
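Until something like this is supported, one possible workaround is to compute the columns to keep explicitly. This is only a sketch: the helper name `drop_if_present!` is hypothetical, and it assumes a DataFrames.jl version in which `propertynames(df)` returns the column names as Symbols.

```julia
using DataFrames

# Hypothetical helper: drop the listed columns from `df`,
# silently ignoring any that are not present.
function drop_if_present!(df::DataFrame, cols::AbstractVector{Symbol})
    keep = setdiff(propertynames(df), cols)  # keeps the original column order
    return select!(df, keep)
end

df1 = DataFrame(a=1:2, b=3:4)  # has one of the columns to drop
df2 = DataFrame(a=1:2)         # has none of them

for df in (df1, df2)
    drop_if_present!(df, [:b, :c])
end
```

After the loop both data frames contain only column `a`, and no error is raised for the missing names.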

bkamins · 2020-04-16T21:33:36Z

The initial design is defensive (this is what @nalimilan likes) so that post 1.0 we can change an error into working code if we prefer to.

What you propose makes sense, but on the other hand I can imagine that some people would prefer it to error. Note that this is the behaviour implemented in InvertedIndices.jl:

julia> x = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> x[Not(4)]
ERROR: BoundsError: attempt to access 3-element Array{Int64,1} at index [Not(4)]

I will open an issue there as the decision should be first made in the source package.

bkamins · 2020-04-16T23:42:49Z

OK - so probably we can add it.

CameronBieganek · 2020-04-16T23:45:08Z

I looked at dplyr for inspiration, and I think the solution is to add Any. Except of course that conflicts with the Any type, so maybe AnyOf? Here is the dplyr behavior:

> library(dplyr)

> df <- data.frame(a=1:2); df
#   a
# 1 1
# 2 2

> df %>% select(-b)
# Error: Can't subset columns that don't exist.
# ✖ The column `b` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.

> df %>% select(-all_of('b'))
# Error: Can't subset columns that don't exist.
# ✖ The column `b` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.

> df %>% select(-any_of('b'))
#   a
# 1 1
# 2 2

So, all_of is strict and any_of is not strict. To be consistent, this might require renaming All to AllOf. 😂 Then the correct idiom in Julia would be this:

df = DataFrame(a=1:2);
select(df, Not(AnyOf(:b)))

CameronBieganek · 2020-04-17T02:04:18Z

I've been trying to reconcile the various operations in terms of set operations. Unfortunately it doesn't really hold together because of bounds checking. I was initially framing the set operations in the context of Ω, the set of all possible names. Let N be the set of column names in the data frame and let S be an arbitrary set of names. Column selection works by taking the intersection N ∩ S. As the documentation says, All(S1, S2) is the union of S1 and S2. Then Not(S) should be the complement of S (which I'll denote compl(S)) relative to Ω. So selection with Not(S) is equivalent to N ∩ compl(S) which is equivalent to the set difference N \ S. So when S and N are disjoint, N \ S == N. Thus all columns are selected, i.e. select(df, Not(:b)) == df. But in order for this analogy to completely work, we would have to remove bounds errors. For example, df[:, :b] would need to return an empty data frame, since the intersection of {:a} and {:b} is the null set. And similarly, df[:, All(:a, :b)] should return df, since the intersection of {:a} with {:a, :b} is {:a}.

The problems with the set interpretation go away if Ω = N, so maybe that's an argument for leaving things the way they are now. Perhaps it would be better just to add a clarifying sentence to the docs along the lines of "The set operations represented by All and Not are restricted to the set of names currently in the data frame. Attempts to use names outside this set will produce a bounds error."

CameronBieganek · 2020-04-17T03:56:57Z

Wait, I think I figured out a consistent way to interpret this. My previous idea was slightly more complicated than it needed to be. I think this approach makes sense:

Ω is the set of all possible names.
N is the set of all names in the data frame.
Other capital letters represent arbitrary subsets of Ω.
select has the signature select(df, S), where S ⊆ Ω. If S ⊈ N, then a bounds error is thrown.
AllOf(A, B) := A ∪ B
Not(A) := N \ A
AnyOf(A) := N ∩ A

Examples:

# Note that I'm mixing set notation and Julia syntax here. 

N := {a, b}  # df has cols a, b

select(df, :a)  # works as usual, since {a} ⊆ N

S = Not(c) = {a, b} \ {c} = N
select(df, S) == df

S = AllOf(b, c) = {b, c}
select(df, S)  # bounds error since {b, c} ⊈ N 

S = AnyOf(b, c) = {a, b} ∩ {b, c} = {b}
select(df, S) == df[:, :b]

In Summary:

Extending Not as described in my original post does allow for a simple set interpretation of the actions of Not, AllOf, and AnyOf. In fact, with this interpretation, AnyOf is not needed to get the behavior of Not desired in the original post, although it still might be useful in some situations.
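The interpretation above can be checked with plain Julia sets. This is a minimal sketch, not DataFrames.jl code: `set_allof`, `set_not`, `set_anyof`, and `select_names` are hypothetical stand-ins for the proposed `AllOf`, `Not`, `AnyOf`, and `select`, and they operate on sets of names rather than on a data frame.

```julia
# Hypothetical stand-ins for the proposed selectors, acting on sets of names.
# N is the set of names in the data frame; A and B are arbitrary sets of names.
set_allof(A, B) = union(A, B)      # AllOf(A, B) := A ∪ B
set_not(N, A)   = setdiff(N, A)    # Not(A)      := N \ A
set_anyof(N, A) = intersect(N, A)  # AnyOf(A)    := N ∩ A

# Stand-in for select(df, S): throw unless S ⊆ N, otherwise return S.
function select_names(N, S)
    issubset(S, N) || throw(ArgumentError("names not found: $(collect(setdiff(S, N)))"))
    return S
end

N = Set([:a, :b])  # df has cols a, b

r_not = select_names(N, set_not(N, Set([:c])))        # Not(c): nothing to remove
r_any = select_names(N, set_anyof(N, Set([:b, :c])))  # AnyOf(b, c): only b is present

# AllOf(b, c) = {b, c} is not a subset of N, so selection throws.
err = try
    select_names(N, set_allof(Set([:b]), Set([:c])))
    nothing
catch e
    e
end
```

Here `r_not` equals `N`, `r_any` equals `Set([:b])`, and `err` holds the `ArgumentError`, matching the three examples.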

bkamins · 2020-04-17T06:04:09Z

Thank you for a detailed analysis. We will get back to it later as it is non-breaking. Currently a massive rework following #1926 is the priority, to get the 0.21 release out.

pdeffebach · 2020-04-17T16:30:54Z

I agree that this should be added. Excited about #1926 though!

bkamins · 2022-10-15T22:03:41Z

The practice that has settled here is to use Cols with predicate. It is a more general pattern, but covers this use-case:

julia> select(df, Cols(!=("b")))  # the predicate receives column names as strings
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2

If you feel the issue should be re-opened please comment.
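For the original use case (several names, several data frames) the predicate pattern might look like the sketch below. The names `drop`, `keep_pred`, and `dfs` are illustrative only. The predicate converts its argument with `Symbol(name)` so that it behaves the same whether it is handed a string or a Symbol.

```julia
using DataFrames

drop = [:b, :c]  # names to remove wherever they occur

# Symbol(name) makes the comparison independent of whether the
# predicate receives column names as strings or as Symbols.
keep_pred = name -> Symbol(name) ∉ drop

dfs = [DataFrame(a=1:2, b=3:4), DataFrame(a=1:2, c=5:6), DataFrame(a=1:2)]

# Drop :b and :c in place from every data frame; absent names are ignored.
foreach(df -> select!(df, Cols(keep_pred)), dfs)
```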

bkamins added the decision, feature, and non-breaking (the proposed change is not breaking) labels Apr 16, 2020

bkamins added this to the 1.x milestone Apr 16, 2020

bkamins mentioned this issue Apr 16, 2020

Allow out of bounds indices in Not JuliaData/InvertedIndices.jl#17

Closed

pdeffebach mentioned this issue May 5, 2020

Ignore Not which column not in data frame #2228

Closed

bkamins closed this as completed Oct 15, 2022
