Ignore `Not` which column not in data frame #2228
The rule you propose is incorrect as what is in
Thanks for the feedback. I think the simplest thing to do in this context is add a bunch of
src/other/index.jl
Outdated
        toskip = x[notidx.first]:x[notidx.last]
        return setdiff(1:length(x), toskip)
    elseif !(haskey(x, notidx.skip.first)) && !(haskey(x, notidx.skip.last))
        return 1:length(x)
why do you think it should not error?
I would be fine having this error.
Here is my reasoning for the current behavior. Imagine a data frame with infinitely many columns, and a contiguous set of columns comprises your data frame. When you do `Not(Between(:x, :z))`, either
1. `:x` and `:z` are both to the left of your columns of interest. Then no error, do nothing.
2. `:x` and `:z` are both to the right of your columns of interest. Then no error, do nothing.
3. `:x` is on the left of your columns of interest, `:z` is on the right. In which case you should delete all columns.
This PR, currently, only assumes options 1 and 2, and thus does not error.
But if `!(haskey(x, notidx.skip.first)) && !(haskey(x, notidx.skip.last))` is true then you are not sure if
* `:x` is on the left and `:z` is on the right (and do what you proposed), or
* `:z` is on the left and `:x` is on the right (and then exactly the opposite should happen).
That is why I felt it should also error.
If we agree then `Between` should be removed from the list in my main comment.
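The ambiguity raised in this thread can be sketched in plain Julia, without DataFrames.jl. The helper `not_between` below is hypothetical (it is neither the PR's code nor a library API); it models the column index as a vector of `Symbol`s and errors whenever an endpoint is missing, because in that case the range cannot be placed relative to the existing columns.

```julia
# Hypothetical helper, not part of DataFrames.jl: positions that remain after
# excluding the inclusive range of columns from `lo` to `hi`.
function not_between(cols::AbstractVector{Symbol}, lo::Symbol, hi::Symbol)
    i = findfirst(==(lo), cols)
    j = findfirst(==(hi), cols)
    # If an endpoint is absent we cannot tell whether the range lies to the
    # left of, to the right of, or across the existing columns, so we error.
    if i === nothing || j === nothing
        throw(ArgumentError("column $lo or $hi not found"))
    end
    # Both endpoints exist: drop the inclusive range between them.
    return setdiff(1:length(cols), i:j)
end

cols = [:a, :b, :c, :d]
not_between(cols, :b, :c)   # keeps positions 1 and 4
```

Erroring here is the conservative choice argued for in the review: it avoids guessing which of the three scenarios applies.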
It looks almost good. Thank you!
Right 😄. We need to think about where to put a precise description of the behaviour you implement. Probably https://juliadata.github.io/DataFrames.jl/latest/lib/indexing/ is the right place. The key thing to highlight there is that we will have different rules for what errors when you select rows and what errors when you select columns. As of now we are consistent:
but this would not be the case any more after this PR. Thinking about this - maybe (I am not 100% sure what would be best). An alternative of the only case we should accept for @CameronBieganek and @nalimilan - maybe you have some thoughts.
If I understand correctly, you're asking whether By the way, it's not clear from the discussion on this PR so far, but I think
My point is to consider supporting a missing column only in:
(i.e. cases where you pass a column name explicitly and it is not a nested selection)
Having
I responded to all comments about the implementation. I think it's too much to require users to always to
Finally, should I refactor this to just do dispatch?
Having reviewed the revision I still feel that we should only special case:
and handle everything else like we do currently. The reason is that now I see that there is a bug with handling Simply the proposed design (that is general) does not compose well with indexing of If you feel we should special case some more cases except
This is what I would do, and in dispatch I would only allow the types that we are sure we handle correctly.
Thanks! I refactored so it uses dispatch. I also changed the Final decision is how to type this up in docs and if we should update getindex for rows as well. I don't think we should, also because we want to maintain consistency with the behavior of
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
Okay I added all tests. I think all cases are covered.
Thank you! I have restarted the CI as for some reason it did not spin up correctly after your last commit.
CI fails
Apart from comment https://github.com/JuliaData/DataFrames.jl/pull/2228/files#r432882413 it looks good. So the only thing left is the documentation update.
Thanks for the catch. I have removed the methods. I now only check This should be ready to go.
Looks good now. I approve it as it can be merged from the functionality perspective. However, can you please add an update to the documentation? The manual page describing indexing should cover special cases where we allow invalid indices.
Thank you for the update of the getting started example. What I meant earlier though is that it is crucial to update docs/src/lib/indexing.md with a description of precise rules - in the line where
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
@@ -29,6 +29,12 @@ The rules for a valid type of index into a column are the following:
* an `All` or `Between` expression (see [DataAPI.jl](https://github.com/JuliaData/DataAPI.jl));
* a colon literal `:`.

Indexing with `Not` has special behavior, in particular:
* `String`s and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:String}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame;
- * `String`s and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:String}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame;
+ * strings and `Symbol`s that do not belong in the column names of the data frame are ignored. The same applies for elements of `AbstractVector{<:AbstractString}` and `AbstractVector{Symbol}` that do not belong in the column names of the data frame;
I would also add that the same applies to the case if they are present in `All` expressions.
On the other hand - the three rules you described below are not special - they follow the standard rules for `Not` from InvertedIndices.jl. So we can do two things - either just remove them, or if you think it is worth mentioning, change the line above to be more verbose and say something like:
Indexing with `Not` selects indices that are a complement of the selection specified by its argument, in particular:
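To make the rule under review concrete, here is a minimal sketch in plain Julia of what "ignoring names that are not in the data frame" would mean for a complement selection. Both `lenient_not` and `strict_not` are hypothetical names chosen for illustration; they model columns as a vector of `Symbol`s and are not DataFrames.jl or InvertedIndices.jl code.

```julia
# Hypothetical sketch of the rule proposed in this PR: names listed for
# exclusion that are absent from `cols` are simply ignored.
lenient_not(cols::AbstractVector{Symbol}, skip::AbstractVector{Symbol}) =
    findall(c -> !(c in skip), cols)

# Strict counterpart for comparison: any unknown name is an error.
function strict_not(cols::AbstractVector{Symbol}, skip::AbstractVector{Symbol})
    unknown = setdiff(skip, cols)
    isempty(unknown) || throw(ArgumentError("columns not found: $(unknown)"))
    return findall(c -> !(c in skip), cols)
end

cols = [:a, :b, :c]
lenient_not(cols, [:b, :zz])   # :zz is ignored; positions 1 and 3 remain
```

The two functions agree whenever every excluded name exists; they differ only in how an unknown name such as `:zz` is treated.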
Sorry for not reacting again earlier. After discussing this deeper with @bkamins I'm still not happy with the inconsistency, and I've also realized that we need something more general to allow trying to select columns that may not exist. Even dplyr, which is quite flexible, doesn't allow negating non-existent names with We agreed with @bkamins that it would be better to adopt a powerful design along these lines. We already have
Conditional on not allowing
Let's try that and see what they think.
@pdeffebach - do we want to resurrect this PR? I think we should assume that we cannot count on changes in external packages. What we do must be self-contained within DataFrames.jl.
Let's close this. R and Stata don't support this, and typos are common.
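The typo concern behind closing the PR can be sketched as follows, again in plain Julia rather than with the DataFrames.jl API. `drop_columns` is a hypothetical strict helper over a vector of column names; the sketch assumes a caller who wants leniency opts in explicitly by intersecting the requested names with the existing ones.

```julia
# Hypothetical strict helper (illustration only): return the names in `cols`
# that remain after dropping `skip`, erroring on any name that does not exist.
function drop_columns(cols::Vector{Symbol}, skip::Vector{Symbol})
    unknown = setdiff(skip, cols)
    isempty(unknown) || throw(ArgumentError("columns not found: $(unknown)"))
    return setdiff(cols, skip)
end

cols = [:income, :age, :state]

# A typo (:incmoe) is reported instead of silently keeping :income:
# drop_columns(cols, [:incmoe])   # throws ArgumentError

# Explicit opt-in to leniency: only names that actually exist are dropped.
drop_columns(cols, intersect([:income, :zipcode], cols))
```

With strict behavior as the default, a misspelled name surfaces immediately, while the lenient case remains expressible in one extra call.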
Fixed #2197
I still have to add tests.