Add a skipmissing kwarg to select/transform/combine #2258
Comments
This would essentially be a shorthand for The benefit would be that you would have automatically resolved The problem is that it would break the invariant of Given the functionality is easy enough to be obtained via
In my mind it would not change the number of rows. It would just compute the function using the rows where none of the columns is missing (and return a missing value for the rows where one of the columns is missing).
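A minimal sketch of the row-preserving behavior described in this comment, using only Base Julia. `apply_complete_rows` is a hypothetical helper written for illustration; it is not a DataFrames.jl function and not a proposed implementation. It assumes the function either returns a scalar (copied to every complete row) or a vector with one element per complete row.

```julia
# Hypothetical helper (not a DataFrames.jl function): apply `f` to the rows
# where no input column is missing, and return `missing` for the other rows.
function apply_complete_rows(f, cols::AbstractVector...)
    n = length(first(cols))
    all(c -> length(c) == n, cols) ||
        throw(DimensionMismatch("all columns must have the same length"))
    # complete[i] is true when row i has no missing value in any column
    complete = [all(c -> !ismissing(c[i]), cols) for i in 1:n]
    # Keep only the complete rows; `identity.` narrows the element type
    args = map(c -> identity.(c[complete]), cols)
    res = f(args...)
    out = Any[missing for _ in 1:n]
    if res isa AbstractVector
        length(res) == count(complete) ||
            throw(DimensionMismatch("result must have one element per complete row"))
        out[complete] = res
    else
        # Scalar result: copy it to every complete row
        out[complete] .= Ref(res)
    end
    return identity.(out)  # narrow Vector{Any} to a concrete element type
end

x = [1.0, missing, 3.0, 4.0]
y = [10.0, 20.0, missing, 40.0]

rowwise = apply_complete_rows((a, b) -> a .+ b, x, y)      # [11.0, missing, missing, 44.0]
reduced = apply_complete_rows((a, b) -> sum(a .* b), x, y) # [170.0, missing, missing, 170.0]
```

In both calls the output keeps the original four rows, which is the point made above: the number of rows does not change.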
Something like
Note that currently
Yes I guess this would be quite convenient. In practice most people who work with data would almost never have to use the But to be efficient, the implementation cannot always collect the result. The only way to be really fast in all cases is to wrap the columns in
Alternatively, it could apply the function on a view of the columns, no?
No, since you would need to go over all values first just to find out whether they are missing or not. That's quite costly if all you want to do is e.g. compute the mean. That's why we have
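The cost argument can be illustrated with Base's `skipmissing`, which returns a lazy iterator rather than a copy. This is only a sketch of the general trade-off with made-up data; it does not show how DataFrames.jl handles it internally.

```julia
using Statistics  # standard library, provides `mean`

x = [1.0, missing, 3.0, missing, 5.0]

# Lazy: `skipmissing` wraps `x` without copying, and `mean` skips the
# missing entries while it iterates, so the data is traversed once.
lazy_mean = mean(skipmissing(x))

# Eager: `collect` first allocates a new Vector{Float64} holding the
# non-missing values, then `mean` makes a second pass over that copy.
eager_mean = mean(collect(skipmissing(x)))
```

Both give the same value; the difference is the extra pass and allocation in the eager version, which matters for a cheap reduction such as a mean.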
I just looked over the
Above we took the case of a function returning a scalar, in which case the value can just be broadcasted to match the expected number of rows. But we haven't discussed the situation where you would pass a function returning the same number of rows as the input, like Implementing this is easy in the So we would have to lose a bit of performance to be able to support this. Maybe it's not the end of the world given that it would make things simpler: we would be able to pass the user-provided function a Finally, note that this approach won't work for
I am answering #2314 (comment) here, at @nalimilan's suggestion.
This is not what
Again
as with
Well they are accepted because they follow standard rules:
(so in short - please remember that we ALWAYS transform any expression to a canonical form
I understood your question as asking about the desired behavior, not how to implement it with the current implementation of these functions in DataFrames.
As for `combine(df, :x => identity, :y => identity, skipmissing = true)`, it would report "ERROR: ArgumentError: New columns must have the same length as old columns". Exactly the same as what happens when doing `combine(df, :x => collect ∘ skipmissing, :y => collect ∘ skipmissing)`.
Then for However, for
Did I understand correctly what you would like to have?
No, I think
Yes, exactly.
Ah - OK, so then
Actually, if we define the rule you propose, it will make sense 😄 - depending on what you want in the output you will use
I think However see #2211. Are we over-thinking this? Perhaps we can just operate on a
I think the difference is relevant with piping. When we make
and the issue is that you have to use Having
so I would be ok with adding this But for me (as you probably know 😄) the crucial thing is to precisely understand the desired functionality and the dimensions here are:
I agree with @matthieugomez's interpretation.
Sorry to have this discussion across 2 issues, but I think it would be good to "un-filter" automatically in a What I think you are saying, though, is that this would require a special type because it is a slightly different logic than what one would expect from a normal
Yes
I was thinking that it would be nice to have a skipmissing keyword argument in transform/select/combine.
The idea is that it would only use rows where none of the observations for the variables used in the call are missing. (This is possible because the user must already specify which variables are used in a call to transform/select/combine.)
That would make it easier to skip missing without, and, in particular, to handle the case of several variables (or weights).
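To make the several-variables case concrete, here is a sketch in Base Julia of a weighted mean that uses only the rows where neither the value nor the weight is missing. `weighted_mean_complete` is a hypothetical name used for illustration, and the data is made up; it only shows the kind of joint row filtering the proposed keyword would do automatically.

```julia
# Hypothetical helper: weighted mean over the rows where both the value
# and the weight are present. Rows with a missing in either are skipped.
function weighted_mean_complete(x, w)
    length(x) == length(w) ||
        throw(DimensionMismatch("x and w must have the same length"))
    num = 0.0
    den = 0.0
    for (xi, wi) in zip(x, w)
        (ismissing(xi) || ismissing(wi)) && continue
        num += xi * wi
        den += wi
    end
    return num / den
end

x = [1.0, missing, 3.0, 4.0]
w = [2.0, 1.0, missing, 2.0]

# Only rows 1 and 4 are complete: (1*2 + 4*2) / (2 + 2)
wm = weighted_mean_complete(x, w)  # 2.5
```

Note that calling `skipmissing` on each column separately would not work here, because it would drop different rows from `x` and `w` and misalign values and weights.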