-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add custom functions to describe
#1664
Conversation
I posted on discourse about a thought related to this. One application I could see for including your own functions into
In Stata you can do this exact operation quite easily by specifying We have the opportunity to be more flexible here, I think. But above I mention how I opted to have an automatic Maybe it's not worth doing custom functions at all, if more of these roadblocks present themselves. |
Before going into details of the review I would like to discuss the design. Given current behavior of
Then:
I do not have a clear view about |
Skipping missing values by default would be consistent with what we currently do for default statistics, which is justified since we also print the number of missing values. I think we'd better defer the issue of the weighted mean: we need to find a general solution even outside of |
Just a note, currently everything is specified in the One thing I considered when writing this PR was the order of arguments. Going back to the example of a weighted mean, a user might want to have the weighted mean and the un-weighted mean side by side in the resulting DataFrame for comparison. |
I agree this would be breaking, but it would be more consistent with the rest of the package. Using a keyword argument with a vector is not a standard approach in Julia in general AFAIK. Of course this would mean that we would have to go through a deprecation period, in which a special case of only Order of arguments is a good point. But in this case I would go for no keyword arguments at all and using |
Indeed, passing pairs would be consistent with the new One issue with varargs is that it would force the specialization of the function on the specified stats, which is counter-productive here since we iterate over columns. Though that can be avoided by passing |
@pdeffebach The change was removal of indentation in module block. If you prefer that I do the rebasing then please let me know (in general you have to use the current upstream master and reapply only the changes you have introduced in the PR). |
@bkamins Let me know if I did this right! I'm still new at this. |
@pdeffebach Thank you for your work. You have performed git merge and I have suggested git rebase. The difference is that after merge we have thousands of lines of differences to compare and it is very hard to track down your changes. As a consequence in your PR you have now a lot of commits that are unrelated and already commited to master. For me it would be much simpler, if it is OK with you, that you revert the merge locally and force push the old version (before the merge) and I will resolve the merge conflicts for you. If you want to learn how to do it yourself the simplest option is to use GitHub GUI to resolve merge conflicts instead of performing full merge of master into your branch. Sorry for the problems (that is why I offer to fix them myself 😄), but I hope you can see in "Files changed" tab that now this PR is very hard to analyze. |
165296c
to
1be8e0f
Compare
Ah sorry! I swear I did I have force pushed after undo-ing a few things. |
Thanks - now we are clean with master 👍. Going back to the PR there are two things API and the implementation:
It would be also good to hear if @nalimilan approves the above before we move forward. After we have agreed on the design then in implementation I would add:
Thank you for pushing this forward! |
I didn't say we should avoid breaking things. :-) On the contrary, that's essential if we want to stabilize the API soon. I just think we should keep deprecations for a long time when they don't hinder progress. I'm fine with the implementation you suggest |
@bkamins I should do this change in a separate PR, though, right? Or should I keep pushing ahead with this one? |
I would use this PR. Now it does not have merge conflicts, so it should not be a problem to push a commit adding the features? Thank you for your effort as |
function StatsBase.describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...) | ||
|
||
predefined_funs = Symbol[s for s in stats if s isa Symbol] | ||
if :all in predefined_funs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that when passing :all
we should assume that it is the only Symbol
-type of argument.
StatsBase.describe(df::AbstractDataFrame) = describe(df, :mean, :min, :median, | ||
:max, :nunique, :nmissing, | ||
:eltype) | ||
function StatsBase.describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably it would be better to write Union{Symbol, Pair{Symbol}}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note my previous remark about the implementation to avoid mostly useless specialization (second paragraph): #1664 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this mean I should add another function barrier _describe
that is _describe(df::AbstractDataFrame, Vector{<:Union{Symbol, Pair{Symbol}})
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
custom_funs = Pair[s for s in stats if s isa Pair] | ||
|
||
# Get the names in the order they appear | ||
ordered_names = [stat isa Symbol ? stat : stat[1] for stat in stats] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be fixed to handle :all
case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also it would be good to make sure that allunique(ordered_names)
is true
as it will simplify the thinking about the code later simpler (and I guess we can assume that they must be unique).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check and now throw an error.
# for each statistic, loop through the columns array to find values | ||
# letting the comprehension choose the appropriate type | ||
data[stat] = [column_stats_dict[stat] for column_stats_dict in column_stats_dicts] | ||
end | ||
|
||
# re-order columns according to the names from above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment is not needed here I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
Thanks for the comments! Making a function wrapper also means I get to abandon overly-specific type signatures, which is good. I updated the tests (and simplified them a bit). Now I need to update the docs. One API question: What do we want for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thank you.
Regarding your question - I would stick to what you have implemented, i.e. add only columns explicitly specified.
Edited docs! and added a Let me know how I can improve the docs some more! |
I think this is ready to be merged! |
Looks good. @nalimilan - OK? |
`:all`, all summary statistics are reported. | ||
* `df` : the `AbstractDataFrame` | ||
* stats::Union{Symbol, Pair{Symbol}}... : the summary statistics to report. | ||
* Arguments can be symbols from the following: `:mean`, `:std`, `:min`, `:q25`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Arguments can be" should be moved outside of the list as it applies to all bullet points. Then you can go straight to the point for each type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
`:nmissing`. The default statistics used when no `Symbol`s or `Pair`s are provided | ||
are `:mean`, `:min`, `:median`, `:max`, `:nunique`, `:nmissing`, and `:eltype`. | ||
* Alternatively, specify `:all` as the only `Symbol` argument to return all statistics. | ||
* Finally, users can provide their own functions in the form of a `Pair{Symbol, Any}`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pair{Symbol, Any}
should be Pair{Symbol, <:Any}
. But better use something more readable like "`name => function` pairs, with `name` a symbol"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@@ -365,39 +369,58 @@ If the column does not allow missing values, `nothing` is returned. | |||
Consequently, `nmissing = 0` indicates that the column allows | |||
missing values, but does not currently contain any. | |||
|
|||
Custom functions perform call `skipmissing` on columns of eltype `Union{T, Missing}`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest something like this:
If custom functions are provided, they are called repeatedly with the vector corresponding
to each column as the only argument. For columns allowing for missing values,
the vector is wrapped in a call to [`skipmissing`](@ref): custom functions must therefore
support such objects (and not only vectors), and cannot access missing values.
The last sentence isn't true right now since we allocate a vector without missing values, but we could avoid doing that in the future when neither the median nor quantiles need to be computed. That would be much faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
i = findfirst(s -> s == :all, stats) | ||
splice!(stats, i, allowed_fields) # insert in the stats vector to get a good order | ||
elseif :all in predefined_funs | ||
throw(ArgumentError("If the user specifies `:all` it must be the only `Symbol` argument.")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid talking about "the user" to the user. :-) Anyway, avoid any unnecessary text.
throw(ArgumentError("If the user specifies `:all` it must be the only `Symbol` argument.")) | |
throw(ArgumentError("`:all` must be the only `Symbol` argument.")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
throw(ArgumentError(":$stats not allowed." * allowed_msg)) | ||
else | ||
stats = [stats] | ||
# todo: fix the printing of this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better printing.
ordered_names = [s isa Symbol ? s : s[1] for s in stats] | ||
|
||
if !allunique(ordered_names) | ||
d = StatsBase.countmap(ordered_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really a heavy solution. setdiff(ordered_names, unique(ordered_names))
should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this earlier - performance wise it is ~ the same 😄. But your solution is more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this work? I don't think it does. I improved the printing though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK it doesn't work. What we do currently in names!
is unique(nms[nonunique(DataFrame(nms=nms))])
. The reason why I'd rather avoid using countmap
is that StatsBase is supposed to be moved to Statistics or other packages, so it may be deprecated at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used this and fixed printing.
test/dataframe.jl
Outdated
describe_output = DataFrame(variable = [:number, :number_missing, :string, | ||
:string_missing, :dates, :catarray], | ||
describe_output = DataFrame(variable = [:number, :number_missing, :string, | ||
:string_missing, :dates, :catarray], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect alignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
5c5ef32
to
97061ce
Compare
Responded to all these comments, and fixed last printing issues. I think this is ready to be merged. |
d[:first] = isempty(col) ? nothing : first(col) | ||
end | ||
|
||
if :last in stats | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@@ -480,6 +508,12 @@ function get_stats(col::AbstractVector, stats::AbstractVector{Symbol}) | |||
return d | |||
end | |||
|
|||
function get_stats!(d::Dict, col::AbstractVector, stats::AbstractVector{Pair}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function get_stats!(d::Dict, col::AbstractVector, stats::AbstractVector{Pair}) | |
function get_stats!(d::Dict, col::AbstractVector, stats::AbstractVector{<:Pair}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
function StatsBase.describe(df::AbstractDataFrame; stats::Union{Symbol,AbstractVector{Symbol}} = | ||
[:mean, :min, :median, :max, :nunique, :nmissing, :eltype]) | ||
# Check that people don't specify the wrong fields. | ||
StatsBase.describe(df::AbstractDataFrame) = _describe(df, [:mean, :min, :median, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better put a line break here so that the array is on a single line. That will also fix the indentation below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
StatsBase.describe(df::AbstractDataFrame) = _describe(df, [:mean, :min, :median, | ||
:max, :nunique, :nmissing, | ||
:eltype]) | ||
StatsBase.describe(df::AbstractDataFrame, stats::Union{Symbol, Pair{Symbol}}...) = _describe(df, collect(stats)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StatsBase.describe(df::AbstractDataFrame, stats::Union{Symbol, Pair{Symbol}}...) = _describe(df, collect(stats)) | |
StatsBase.describe(df::AbstractDataFrame, stats::Union{Symbol, Pair{Symbol}}...) = | |
_describe(df, collect(stats)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
elseif :all in predefined_funs | ||
throw(ArgumentError("`:all` must be the only `Symbol` argument.")) | ||
else | ||
if !issubset(predefined_funs, allowed_fields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this an elseif
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
_describe(df, [:mean, :min, :median, | ||
:max, :nunique, :nmissing, | ||
:eltype]) | ||
elseif stats === :all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch should also print the deprecation AFAICT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
In line https://github.com/JuliaData/DataFrames.jl/pull/1664/files#diff-6108c932ccdff77aea536b99a80d4a79R493 |
Probably a better fix after thinking about it is to move:
inside the earlier |
Fixed all the above comments, and changed the |
@deprecate showcols(io::IO, df::AbstractDataFrame, all::Bool=false, values::Bool=true) show(io, describe(df, stats = [:eltype, :nmissing, :first, :last]), all) | ||
@deprecate showcols(df::AbstractDataFrame, all::Bool=false, values::Bool=true) describe(df, :eltype, :nmissing, :first, :last) | ||
@deprecate showcols(io::IO, df::AbstractDataFrame, all::Bool=false, values::Bool=true) show(io, describe(df, :eltype, :nmissing, :first, :last), all) | ||
function StatsBase.describe(df::AbstractDataFrame; stats=nothing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this function overwrite the one defined above, printing a warning? If so, the other one should be commented out for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does overwrite, but I thought that the warning was not printed. But I think that commenting it out for now with TODO note is best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry im a bit confused. What do I comment out and which TODO do I write?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment out the overwritten function and above it write TODO to remember to uncomment it when the deprecation period is finished. Have a look at abstractdataframe/iteration.jl for an example (nothing is commented out there but this is the idea). When we have such TODO notes it is easy to grep the whole repository to find things that need fixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by "overwritten function" I mean "overwritten method definition" to be precise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I think I did this right but let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you 👍. This is what will allow to avoid the problem @nalimilan mentioned above.
`:nmissing`. The default statistics used | ||
are `:mean`, `:min`, `:median`, `:max`, `:nunique`, `:nmissing`, and `:eltype`. | ||
* `:all` as the only `Symbol` argument to return all statistics. | ||
* Finally, users can provide their own functions in the form of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"a ... pairs"
I'd also remove "users can provide their own functions": it breaks the logic of the bullets and it's quite verbose. Just say that function
is a custom function. Also for name
it could be useful to say that it's the name of the resulting column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
This should be ready for merging. Then we can fix that odd EDIT: as in, this PR fixes the |
Thank you. I will wait for @nalimilan to merge, as he has looked more into this PR 😄. |
This resolves #1436 by allowing the user to specify a
Pair
in their vector ofstats
.One design choice that needs consensus is whether or not we do the
skipmissing
for the user. I vote yes because we do that for:mean
and we dont make the user doskipmissing
there either.