Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore Not which column not in data frame #2228

Closed
wants to merge 20 commits into from

Conversation

pdeffebach
Copy link
Contributor

Fixed #2197

df = DataFrame(a = [1], b = [2])
select(df, Not(:c)) # returns df

I still have to add tests.

@bkamins
Copy link
Member

bkamins commented May 5, 2020

The rule you propose is incorrect as what is in Not can be any selector, e.g. Not(r"x") or Not(Between(:x1, :x2)), so unfortunately this PR will not be that easy (I have not spent much time on designing a good set of rules - maybe it is enough to add a third branch).

@bkamins bkamins added the non-breaking The proposed change is not breaking label May 5, 2020
@bkamins bkamins added this to the 1.x milestone May 5, 2020
@pdeffebach
Copy link
Contributor Author

Thanks for the feedback. I think the simplest thing to do in this context is add a bunch of if else statements. The tests show the behavior I want. Let me know if you have more feedback.

toskip = x[notidx.first]:x[notidx.last]
return setdiff(1:length(x), toskip)
elseif !(haskey(x, notidx.skip.first)) && !(haskey(x, notidx.skip.last))
return 1:length(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you think it should not error?

Copy link
Contributor Author

@pdeffebach pdeffebach May 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be fine having this error.

Here is my reasoning for the current behavior. Imagine a dataframe with infinite columns, and a contiguous set of columns comprises your data frame. When you do Not(Between(:x, :z)), either

  1. :x and :z are both to the left of your columns of interest. Then no error, do nothing
  2. :x and :z are both to the right of your columns of interest. Then no error, do nothing.
  3. :x is on the left of your columns of interest, :z is on the right. In which case you should delete all columns.

This PR, currently, only assumes options 1 and 2, and thus does not error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if !(haskey(x, notidx.skip.first)) && !(haskey(x, notidx.skip.last)) is true then you are not sure if

:x is on the left and :z is on the right (and do what you proposed)

or

:z is on the left and :x is on the right (and then exactly the opposite should happen)

That is why I felt it should also error.

If we agree then Between should be removed from the list in my main comment.

src/other/index.jl Outdated Show resolved Hide resolved
test/index.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented May 24, 2020

It looks almost good. Thank you!

The tests show the behavior I want.

Right 😄. We need to think where to put a precise description of the behaviour you implement. Probably https://juliadata.github.io/DataFrames.jl/latest/lib/indexing/ is the right place. The key thing to highlight there is that we will have different rules what errors if you select rows and what errors if you select columns. As now we are consistent:

julia> df = DataFrame(rand(2,2))
2×2 DataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.784868 │ 0.736577 │
│ 2   │ 0.230237 │ 0.628254 │

julia> df[Not(3), :]
ERROR: BoundsError: attempt to access 2-element Base.OneTo{Int64} at index [Not(3)]

julia> df[:, Not(3)]
ERROR: BoundsError: attempt to access data frame with 2 columns
  at index [3]

but this would not be the case any more after this PR.

Thinking about this - maybe (I am not 100% sure what would be best). An alternative of the only case we should accept for Not with an invalid column index should be Not(::Symbol), and Not(::AbstractString), as in practice these seem to be the real cases (or more flexibly - it should not be an error if column name is passed, but should be an error when column index is passed). I am not 100% sure here what is best here, as we have a tension between flexibility and consistency. I will have to think about it more.

@CameronBieganek and @nalimilan - maybe you have some thoughts.

src/other/index.jl Outdated Show resolved Hide resolved
@CameronBieganek
Copy link

An alternative of the only case we should accept for Not with an invalid column index should be Not(::Symbol), and Not(::AbstractString), as in practice these seem to be the real cases (or more flexibly - it should not be an error if column name is passed, but should be an error when column index is passed). I am not 100% sure here what is best here, as we have a tension between flexibility and consistency.

If I understand correctly, you're asking whether Not(3) should be an error and Not(:x3) should be allowed in your above example? I think that might be a good idea. As you mentioned, it's inconsistent with Not(:x3), but at least it's consistent with df[Not(3), :].

By the way, it's not clear from the discussion on this PR so far, but I think Not(::Vector{Symbol}) should also be supported. I think that was how I originally came across this. 🙂

@bkamins
Copy link
Member

bkamins commented May 27, 2020

My point is to consider supporting missing column in only:

Not(::Union{Symbol, AbstractString, AbstractVector{Symbol}, AbstractVector{<:AbstractString}, Between{Symbol, Symbol})

(i.e. cases where you pass a column name explicitly and it is not a nested selection)

@nalimilan
Copy link
Member

Having Not(3) and Not(:x3) be inconsistent sounds problematic to me. I wonder whether we could use something like Not(Cols(:x3)) for that, since Cols is already more flexible by dropping duplicates silently.

@pdeffebach
Copy link
Contributor Author

I responded to all comments about the implementation.

I think it's too much to require users to always to Not(Cols(:x)). Personally I am fine with a bit of inconsistency between row indexing and column indexing.

@pdeffebach
Copy link
Contributor Author

Finally, should I refactor this to just do dispatch?

@bkamins
Copy link
Member

bkamins commented May 28, 2020

Having reviewed the revision I still feel that we should only special case:

Not(::Union{Symbol, AbstractString, AbstractVector{Symbol}, AbstractVector{<:AbstractString}, Between{Symbol, Symbol})

and handle everything else like we do currently.

The reason is that now I see that there is a bug with handling Bool indexing. Then it is also clear that it will be hard to handle other corner cases of indexing.

Simply the proposed design (that is general) does not compose well with indexing of DataFrames.

If you feel we should special case some more cases except Not(::Union{Symbol, AbstractString, AbstractVector{Symbol}, AbstractVector{<:AbstractString}, Between{Symbol, Symbol}) then please just add them to the list, but I would be hesitant to allow an implementation that ignores type of what is stored in Not as proposed now. Simply the indexing rules of DataFrame are too complex (of course they could be replicated but it will lead to a very messy code).

@bkamins
Copy link
Member

bkamins commented May 28, 2020

Finally, should I refactor this to just do dispatch?

This is what I would do, and in dispatch I would only allow the types that we are sure we handle correctly.

@pdeffebach
Copy link
Contributor Author

Thanks! I refactored so it uses dispatch.

I also changed the Between behavior to be more strict. It errors unless both elements are in the index.

Final decision is how to type this up in docs and if we should update getindex for rows as well. I don't think we should, also because we want to maintain consistency with the behavior of Not with ordinary vectors.

src/other/index.jl Outdated Show resolved Hide resolved
src/other/index.jl Outdated Show resolved Hide resolved
src/other/index.jl Outdated Show resolved Hide resolved
test/index.jl Show resolved Hide resolved
pdeffebach and others added 3 commits May 28, 2020 21:07
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
@pdeffebach
Copy link
Contributor Author

Okay I added all tests. I think all cases are covered.

@bkamins bkamins closed this May 30, 2020
@bkamins bkamins reopened this May 30, 2020
@bkamins
Copy link
Member

bkamins commented May 30, 2020

Thank you! I have restarted the CI as for some reason it did not spit up correctly after your last commit.

@bkamins
Copy link
Member

bkamins commented May 30, 2020

CI fails

test/index.jl Show resolved Hide resolved
src/other/index.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented May 30, 2020

Apart from comment https://github.com/JuliaData/DataFrames.jl/pull/2228/files#r432882413 it looks good. So the only thing left is documentation update.

@pdeffebach
Copy link
Contributor Author

Thanks for the catch. I have removed the methods. I now only check isunique for string and symbol vectors and the integers get the check in the default getindex method.

This should be ready to go.

@bkamins
Copy link
Member

bkamins commented May 31, 2020

Looks good now. I approve it as it can be merged from the functionality perspective.

However, can you please add update to the documentation? The manual page describing indexing should cover special cases where we allow invalid indices.

@bkamins
Copy link
Member

bkamins commented May 31, 2020

Thank you for the update of the getting started example.

What I meant earlier though is that it is crucial to update docs/src/lib/indexing.md with a description of precise rules - in the line where Not is introduced, as an additional comment (as this is a documentation and it has to be definitive description of the behaviour, while getting started - as you did it now - should give an intuitive explanation of the rule).

pdeffebach and others added 2 commits May 31, 2020 18:06
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
@@ -29,6 +29,12 @@ The rules for a valid type of index into a column are the following:
* an `All` or `Between` expression (see [DataAPI.jl](https://github.com/JuliaData/DataAPI.jl));
* a colon literal `:`.

Indexing with `Not` has special behavior, in particular:
* `String`s and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:String}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `String`s and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:String}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame;
* strings and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:AbstractString}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also add that the same applies to the case if they are present in All expressions.

On the other hand - the three rules you described below are not special - they follow the standard rules for Not from InvertedIndices.jl. So we can do two things - either just remove them, or if you think it is worth to mention it, change the line above to be more verbose and say something like:

Indexing with `Not` selects indices that are a complement of the selection specified by its argument, in particular:

@nalimilan
Copy link
Member

Sorry for not reacting again earlier. After discussing this deeper with @bkamins I'm still not happy with the inconsistency, and I've also realized that we need something more general to allow trying to select columns that may not exist.

Even dplyr, which is quite flexible, doesn't allow negating non-existent names with ! as noted by @CameronBieganek at #2197 (comment). What they do instead is to require you to do !any_of(...), whose documentation says "It is especially useful with negative selections, when you would like to make sure a variable is removed." That any_of function has the advantage that it also works when positively selecting columns.

We agreed with @bkamins that it would be better to adopt a powerful design along these lines. We already have All and recently introduced Cols in DataAPI, so we could use these to parallel the dplyr distinction. But our preferred solution so far is to add an argument to Cols to specify that it's fine for columns not to exist. Then you could do something like Not(Cols(...; strict=false)). That would probably be easier to grasp than subtle differences between All and Cols, and it would avoid introducing yet another name like AnyOf. We could even print an error message pointing to that argument when a column doesn't exist. What do you think?

@pdeffebach
Copy link
Contributor Author

Conditional on not allowing Not(:zzz), which I still think is a good idea and an acceptable inconsistency, I think the second best solution would be petition to add a keyword strict to Not types in InvertedIndices.jl.

@nalimilan
Copy link
Member

Let's try that and see what they think.

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdeffebach - do we want to resurrect this PR. I think we should assume that we cannot count on changes in external packages. What we do must be self contained to DataFrames.jl

@pdeffebach
Copy link
Contributor Author

Let's close this. R and Stata don't support this, and typos are common.

@pdeffebach pdeffebach closed this May 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Return data frame unaltered when Not only includes columns that are not in data frame
4 participants