Ignore `Not` which column not in data frame #2228

pdeffebach · 2020-05-05T19:10:44Z

df = DataFrame(a = [1], b = [2])
select(df, Not(:c)) # returns df

I still have to add tests.

bkamins · 2020-05-05T21:14:37Z

The rule you propose is incorrect as what is in Not can be any selector, e.g. Not(r"x") or Not(Between(:x1, :x2)), so unfortunately this PR will not be that easy (I have not spent much time on designing a good set of rules - maybe it is enough to add a third branch).

pdeffebach · 2020-05-24T01:39:46Z

Thanks for the feedback. I think the simplest thing to do in this context is add a bunch of if else statements. The tests show the behavior I want. Let me know if you have more feedback.

bkamins · 2020-05-24T06:03:52Z

src/other/index.jl

+            toskip = x[notidx.first]:x[notidx.last]
+            return setdiff(1:length(x), toskip)
+        elseif !(haskey(x, notidx.skip.first)) && !(haskey(x, notidx.skip.last))
+            return 1:length(x)


why do you think it should not error?

I would be fine having this error.

Here is my reasoning for the current behavior. Imagine a dataframe with infinite columns, and a contiguous set of columns comprises your data frame. When you do Not(Between(:x, :z)), either

:x and :z are both to the left of your columns of interest. Then no error, do nothing

:x and :z are both to the right of your columns of interest. Then no error, do nothing.

:x is on the left of your columns of interest, :z is on the right. In which case you should delete all columns.

This PR, currently, only assumes options 1 and 2, and thus does not error.

But if !(haskey(x, notidx.skip.first)) && !(haskey(x, notidx.skip.last)) is true then you are not sure if

:x is on the left and :z is on the right (and do what you proposed)

or

:z is on the left and :x is on the right (and then exactly the opposite should happen)

That is why I felt it should also error.

If we agree then Between should be removed from the list in my main comment.

src/other/index.jl

test/index.jl

bkamins · 2020-05-24T06:33:14Z

It looks almost good. Thank you!

The tests show the behavior I want.

Right 😄. We need to think where to put a precise description of the behaviour you implement. Probably https://juliadata.github.io/DataFrames.jl/latest/lib/indexing/ is the right place. The key thing to highlight there is that we will have different rules what errors if you select rows and what errors if you select columns. As now we are consistent:

julia> df = DataFrame(rand(2,2))
2×2 DataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.784868 │ 0.736577 │
│ 2   │ 0.230237 │ 0.628254 │

julia> df[Not(3), :]
ERROR: BoundsError: attempt to access 2-element Base.OneTo{Int64} at index [Not(3)]

julia> df[:, Not(3)]
ERROR: BoundsError: attempt to access data frame with 2 columns
  at index [3]

but this would not be the case any more after this PR.

Thinking about this - maybe (I am not 100% sure what would be best). An alternative of the only case we should accept for Not with an invalid column index should be Not(::Symbol), and Not(::AbstractString), as in practice these seem to be the real cases (or more flexibly - it should not be an error if column name is passed, but should be an error when column index is passed). I am not 100% sure here what is best here, as we have a tension between flexibility and consistency. I will have to think about it more.

@CameronBieganek and @nalimilan - maybe you have some thoughts.

src/other/index.jl

CameronBieganek · 2020-05-27T02:54:06Z

An alternative of the only case we should accept for Not with an invalid column index should be Not(::Symbol), and Not(::AbstractString), as in practice these seem to be the real cases (or more flexibly - it should not be an error if column name is passed, but should be an error when column index is passed). I am not 100% sure here what is best here, as we have a tension between flexibility and consistency.

If I understand correctly, you're asking whether Not(3) should be an error and Not(:x3) should be allowed in your above example? I think that might be a good idea. As you mentioned, it's inconsistent with Not(:x3), but at least it's consistent with df[Not(3), :].

By the way, it's not clear from the discussion on this PR so far, but I think Not(::Vector{Symbol}) should also be supported. I think that was how I originally came across this. 🙂

bkamins · 2020-05-27T05:58:42Z

My point is to consider supporting missing column in only:

Not(::Union{Symbol, AbstractString, AbstractVector{Symbol}, AbstractVector{<:AbstractString}, Between{Symbol, Symbol})

(i.e. cases where you pass a column name explicitly and it is not a nested selection)

nalimilan · 2020-05-27T08:45:11Z

Having Not(3) and Not(:x3) be inconsistent sounds problematic to me. I wonder whether we could use something like Not(Cols(:x3)) for that, since Cols is already more flexible by dropping duplicates silently.

pdeffebach · 2020-05-28T18:37:46Z

I responded to all comments about the implementation.

I think it's too much to require users to always to Not(Cols(:x)). Personally I am fine with a bit of inconsistency between row indexing and column indexing.

pdeffebach · 2020-05-28T19:04:31Z

Finally, should I refactor this to just do dispatch?

bkamins · 2020-05-28T19:11:59Z

Having reviewed the revision I still feel that we should only special case:

Not(::Union{Symbol, AbstractString, AbstractVector{Symbol}, AbstractVector{<:AbstractString}, Between{Symbol, Symbol})

and handle everything else like we do currently.

The reason is that now I see that there is a bug with handling Bool indexing. Then it is also clear that it will be hard to handle other corner cases of indexing.

Simply the proposed design (that is general) does not compose well with indexing of DataFrames.

If you feel we should special case some more cases except Not(::Union{Symbol, AbstractString, AbstractVector{Symbol}, AbstractVector{<:AbstractString}, Between{Symbol, Symbol}) then please just add them to the list, but I would be hesitant to allow an implementation that ignores type of what is stored in Not as proposed now. Simply the indexing rules of DataFrame are too complex (of course they could be replicated but it will lead to a very messy code).

bkamins · 2020-05-28T19:15:18Z

Finally, should I refactor this to just do dispatch?

This is what I would do, and in dispatch I would only allow the types that we are sure we handle correctly.

pdeffebach · 2020-05-28T21:12:24Z

Thanks! I refactored so it uses dispatch.

I also changed the Between behavior to be more strict. It errors unless both elements are in the index.

Final decision is how to type this up in docs and if we should update getindex for rows as well. I don't think we should, also because we want to maintain consistency with the behavior of Not with ordinary vectors.

src/other/index.jl

test/index.jl

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

pdeffebach · 2020-05-30T15:15:40Z

Okay I added all tests. I think all cases are covered.

bkamins · 2020-05-30T15:59:42Z

Thank you! I have restarted the CI as for some reason it did not spit up correctly after your last commit.

src/other/index.jl

bkamins · 2020-05-30T17:18:57Z

CI fails

test/index.jl

src/other/index.jl

bkamins · 2020-05-30T19:58:41Z

Apart from comment https://github.com/JuliaData/DataFrames.jl/pull/2228/files#r432882413 it looks good. So the only thing left is documentation update.

pdeffebach · 2020-05-31T14:10:11Z

Thanks for the catch. I have removed the methods. I now only check isunique for string and symbol vectors and the integers get the check in the default getindex method.

This should be ready to go.

bkamins · 2020-05-31T16:05:11Z

Looks good now. I approve it as it can be merged from the functionality perspective.

However, can you please add update to the documentation? The manual page describing indexing should cover special cases where we allow invalid indices.

docs/src/man/getting_started.md

bkamins · 2020-05-31T20:36:29Z

Thank you for the update of the getting started example.

What I meant earlier though is that it is crucial to update docs/src/lib/indexing.md with a description of precise rules - in the line where Not is introduced, as an additional comment (as this is a documentation and it has to be definitive description of the behaviour, while getting started - as you did it now - should give an intuitive explanation of the rule).

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

bkamins · 2020-06-01T06:12:08Z

docs/src/lib/indexing.md

@@ -29,6 +29,12 @@ The rules for a valid type of index into a column are the following:
    * an `All` or `Between` expression (see [DataAPI.jl](https://github.com/JuliaData/DataAPI.jl));
    * a colon literal `:`.

+Indexing with `Not` has special behavior, in particular:
+* `String`s and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:String}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame; 


Suggested change

* `String`s and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:String}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame;

* strings and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:AbstractString}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame;

I would also add that the same applies to the case if they are present in All expressions.

On the other hand - the three rules you described below are not special - they follow the standard rules for Not from InvertedIndices.jl. So we can do two things - either just remove them, or if you think it is worth to mention it, change the line above to be more verbose and say something like:

Indexing with `Not` selects indices that are a complement of the selection specified by its argument, in particular:

nalimilan · 2020-06-01T14:10:06Z

Sorry for not reacting again earlier. After discussing this deeper with @bkamins I'm still not happy with the inconsistency, and I've also realized that we need something more general to allow trying to select columns that may not exist.

Even dplyr, which is quite flexible, doesn't allow negating non-existent names with ! as noted by @CameronBieganek at #2197 (comment). What they do instead is to require you to do !any_of(...), whose documentation says "It is especially useful with negative selections, when you would like to make sure a variable is removed." That any_of function has the advantage that it also works when positively selecting columns.

We agreed with @bkamins that it would be better to adopt a powerful design along these lines. We already have All and recently introduced Cols in DataAPI, so we could use these to parallel the dplyr distinction. But our preferred solution so far is to add an argument to Cols to specify that it's fine for columns not to exist. Then you could do something like Not(Cols(...; strict=false)). That would probably be easier to grasp than subtle differences between All and Cols, and it would avoid introducing yet another name like AnyOf. We could even print an error message pointing to that argument when a column doesn't exist. What do you think?

pdeffebach · 2020-06-02T00:47:12Z

Conditional on not allowing Not(:zzz), which I still think is a good idea and an acceptable inconsistency, I think the second best solution would be petition to add a keyword strict to Not types in InvertedIndices.jl.

nalimilan · 2020-06-07T09:57:41Z

Let's try that and see what they think.

bkamins

@pdeffebach - do we want to resurrect this PR. I think we should assume that we cannot count on changes in external packages. What we do must be self contained to DataFrames.jl

pdeffebach · 2021-05-18T14:01:01Z

Let's close this. R and Stata don't support this, and typos are common.

pdeffebach added 2 commits May 5, 2020 15:08

Initial Commit

90d10b9

Fix project.toml

16bc8f1

bkamins added the non-breaking The proposed change is not breaking label May 5, 2020

bkamins added this to the 1.x milestone May 5, 2020

pdeffebach added 2 commits May 23, 2020 21:37

Rework and add tests

634ec20

Fix diff

3e6a75a

bkamins reviewed May 24, 2020

View reviewed changes

src/other/index.jl Outdated Show resolved Hide resolved

bkamins reviewed May 24, 2020

View reviewed changes

test/index.jl Outdated Show resolved Hide resolved

CameronBieganek reviewed May 27, 2020

View reviewed changes

src/other/index.jl Outdated Show resolved Hide resolved

respond to comments

24f2a71

refactor for dispatch

ff8ca22

bkamins reviewed May 28, 2020

View reviewed changes

src/other/index.jl Outdated Show resolved Hide resolved

bkamins reviewed May 28, 2020

View reviewed changes

src/other/index.jl Outdated Show resolved Hide resolved

bkamins reviewed May 28, 2020

View reviewed changes

src/other/index.jl Outdated Show resolved Hide resolved

bkamins reviewed May 28, 2020

View reviewed changes

test/index.jl Show resolved Hide resolved

pdeffebach and others added 3 commits May 28, 2020 21:07

Update src/other/index.jl

73eb277

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

Update src/other/index.jl

66f17d9

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

add tests for errors

26d445d

test everything

de2c776

bkamins closed this May 30, 2020

bkamins reopened this May 30, 2020

bkamins reviewed May 30, 2020

View reviewed changes

src/other/index.jl Show resolved Hide resolved

makeunique check

fa8a252

fix tests

d8dc59b

bkamins reviewed May 30, 2020

View reviewed changes

test/index.jl Show resolved Hide resolved

fix select.jl tests

985f491

bkamins reviewed May 30, 2020

View reviewed changes

src/other/index.jl Outdated Show resolved Hide resolved

remove extram methods and re-do some tests

847e16a

bkamins approved these changes May 31, 2020

View reviewed changes

add docs

8025e57

bkamins reviewed May 31, 2020

View reviewed changes

docs/src/man/getting_started.md Outdated Show resolved Hide resolved

pdeffebach and others added 2 commits May 31, 2020 18:06

add more detailed docs

51b5d33

Update docs/src/man/getting_started.md

780bcf0

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

bkamins reviewed Jun 1, 2020

View reviewed changes

pdeffebach mentioned this pull request Jun 8, 2020

Add some sort of Strict option JuliaData/InvertedIndices.jl#20

Open

bkamins requested changes May 17, 2021

View reviewed changes

pdeffebach closed this May 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ignore `Not` which column not in data frame #2228

Ignore `Not` which column not in data frame #2228

pdeffebach commented May 5, 2020

bkamins commented May 5, 2020

pdeffebach commented May 24, 2020

bkamins May 24, 2020

pdeffebach May 28, 2020 •

edited

Loading

bkamins May 28, 2020

bkamins commented May 24, 2020

CameronBieganek commented May 27, 2020

bkamins commented May 27, 2020

nalimilan commented May 27, 2020

pdeffebach commented May 28, 2020

pdeffebach commented May 28, 2020

bkamins commented May 28, 2020

bkamins commented May 28, 2020

pdeffebach commented May 28, 2020

pdeffebach commented May 30, 2020

bkamins commented May 30, 2020

bkamins commented May 30, 2020

bkamins commented May 30, 2020

pdeffebach commented May 31, 2020

bkamins commented May 31, 2020

bkamins commented May 31, 2020

bkamins Jun 1, 2020

bkamins Jun 1, 2020

nalimilan commented Jun 1, 2020

pdeffebach commented Jun 2, 2020

nalimilan commented Jun 7, 2020

bkamins left a comment

pdeffebach commented May 18, 2021

	* `String`s and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:String}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame;
	* strings and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:AbstractString}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame;

Ignore Not which column not in data frame #2228

Ignore Not which column not in data frame #2228

Conversation

pdeffebach commented May 5, 2020

bkamins commented May 5, 2020

pdeffebach commented May 24, 2020

bkamins May 24, 2020

Choose a reason for hiding this comment

pdeffebach May 28, 2020 • edited Loading

Choose a reason for hiding this comment

bkamins May 28, 2020

Choose a reason for hiding this comment

bkamins commented May 24, 2020

CameronBieganek commented May 27, 2020

bkamins commented May 27, 2020

nalimilan commented May 27, 2020

pdeffebach commented May 28, 2020

pdeffebach commented May 28, 2020

bkamins commented May 28, 2020

bkamins commented May 28, 2020

pdeffebach commented May 28, 2020

pdeffebach commented May 30, 2020

bkamins commented May 30, 2020

bkamins commented May 30, 2020

bkamins commented May 30, 2020

pdeffebach commented May 31, 2020

bkamins commented May 31, 2020

bkamins commented May 31, 2020

bkamins Jun 1, 2020

Choose a reason for hiding this comment

bkamins Jun 1, 2020

Choose a reason for hiding this comment

nalimilan commented Jun 1, 2020

pdeffebach commented Jun 2, 2020

nalimilan commented Jun 7, 2020

bkamins left a comment

Choose a reason for hiding this comment

pdeffebach commented May 18, 2021

Ignore `Not` which column not in data frame #2228

Ignore `Not` which column not in data frame #2228

pdeffebach May 28, 2020 •

edited

Loading