Add custom functions to describe #1664

Merged
merged 19 commits into from
Mar 6, 2019


pdeffebach
Contributor

This resolves #1436 by allowing the user to specify a Pair in their vector of stats.

One design choice that needs consensus is whether or not we do the skipmissing for the user. I vote yes because we do that for :mean and we don't make the user do skipmissing there either.

@pdeffebach
Contributor Author

pdeffebach commented Jan 4, 2019

I posted on discourse about a thought related to this.

One application I could see for including your own functions in describe is weighted means. For example, one might do

describe(df, stats = [:mean, :mean_w1 => x -> mean(x, weights(w1)), :mean_w2 => x -> mean(x, weights(w2))])

In Stata you can do this exact operation quite easily by specifying pw = w1 etc. in the summarize command.

We have the opportunity to be more flexible here, I think. But above I mention how I opted to have an automatic skipmissing be done to the columns. This would make weighted means, and a lot of other operations, pretty worthless.
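
To make this concern concrete, here is a minimal sketch in plain Julia of why an automatic skipmissing gets in the way of a weighted mean. The helper `wmean` is a hypothetical stand-in for `mean(x, weights(w))` from StatsBase, and the data are made up for illustration.

```julia
# Hypothetical data: one column with a missing value and one weight per row.
x = [1.0, missing, 3.0, 4.0]
w = [0.1, 0.2, 0.3, 0.4]

# Hand-rolled weighted mean, standing in for StatsBase's mean(x, weights(w)).
wmean(x, w) = sum(x .* w) / sum(w)

# What a custom function would receive after an automatic skipmissing:
x_seen = collect(skipmissing(x))

# The column now has 3 elements but the weights still have 4, so the
# two no longer line up row by row.
lengths_match = length(x_seen) == length(w)

# Dropping the same rows from both keeps values and weights aligned.
keep = .!ismissing.(x)
aligned = wmean(x[keep], w[keep])
```

In other words, a custom function that closes over a weights vector would need to know which rows were dropped, and the function never sees that information once the column has been filtered.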

Maybe it's not worth doing custom functions at all, if more of these roadblocks present themselves.

@bkamins
Member

bkamins commented Jan 4, 2019

Before going into details of the review I would like to discuss the design. Given current behavior of by I think it would be more natural to have the following signature of describe:

describe(::AbstractDataFrame, args...; kwargs...)

Then:

  1. any element of args can be:
    • a symbol from a predefined list of symbols except :all which is handled in point 3 (and here we use the current mechanics and this gives us some kind of backward compatibility)
    • a function, then column name is generated using funname
  2. kwargs have a form column_name=function
  3. we have a special call describe(::AbstractDataFrame, :all) for backward compatibility
  4. we set makeunique=true
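
A rough sketch of how such a signature could normalize its arguments, written in plain Julia without DataFrames. The names `predefined_stats` and `normalize_stats` are hypothetical, the table of predefined statistics is a small made-up subset, and `nameof` is assumed to play the role of `funname`. The `:all` special case and `makeunique` are not covered.

```julia
# Hypothetical table of predefined statistics (a small subset, Base functions only).
predefined_stats() = Dict{Symbol,Function}(:min => minimum, :max => maximum,
                                           :first => first, :last => last)

# Turn positional and keyword arguments into an ordered list of name => function pairs.
function normalize_stats(args...; kwargs...)
    table = predefined_stats()
    stats = Pair{Symbol,Any}[]
    for a in args
        if a isa Symbol
            haskey(table, a) || throw(ArgumentError("unknown statistic :$a"))
            push!(stats, a => table[a])
        elseif a isa Function
            push!(stats, nameof(a) => a)      # column name taken from the function name
        else
            throw(ArgumentError("unsupported argument of type $(typeof(a))"))
        end
    end
    for (name, f) in kwargs                   # column_name = function
        push!(stats, name => f)
    end
    return stats
end

stats = normalize_stats(:min, maximum; spread = x -> maximum(x) - minimum(x))
names_in_order = first.(stats)
```

Note that keyword arguments always end up after the positional ones in this sketch, so the caller cannot interleave them.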

I do not have a clear view about skipmissing by default (@nalimilan - any opinion?) as I can see pros and cons of both approaches.

@nalimilan
Member

Skipping missing values by default would be consistent with what we currently do for default statistics, which is justified since we also print the number of missing values. I think we'd better defer the issue of the weighted mean: we need to find a general solution even outside of describe anyway (JuliaLang/julia#30596 and on Discourse).

@pdeffebach
Contributor Author

we have a special call describe(::AbstractDataFrame, :all) for backward compatibility

Just a note, currently everything is specified in the stats keyword argument. I've got no problem with getting rid of that, but the switch from a keyword argument to args... would be breaking.

One thing I considered when writing this PR was the order of arguments. Going back to the example of a weighted mean, a user might want to have the weighted mean and the un-weighted mean side by side in the resulting DataFrame for comparison.

@bkamins
Member

bkamins commented Jan 5, 2019

Just a note, currently everything is specified in the stats keyword argument.

I agree this would be breaking, but it would be more consistent with the rest of the package. Using a keyword argument with a vector is not a standard approach in Julia in general AFAIK.

Of course this would mean that we would have to go through a deprecation period, in which a special case of only stats keyword argument passed should be handled as it is now and print a deprecation warning (I know it is a pain but it cannot be helped).

Order of arguments is a good point. But in this case I would go for no keyword arguments at all and using Pairs as you have proposed, e.g. like this describe(df, :mean, fun, :custom=>fun2). I simply do not see a big value out of using stats and a vector. But I am not very strongly advocating that it should be removed - let us just judge which is cleaner.

@nalimilan
Member

Indeed, passing pairs would be consistent with the new combine API. Like for combine, we could allow passing either one or more functions or col=>fun pairs, or a vector thereof ("functions" could also be a symbol among the predefined list). We can deprecate the stats keyword argument, simply making it positional.

One issue with varargs is that it would force the specialization of the function on the specified stats, which is counter-productive here since we iterate over columns. Though that can be avoided by passing collect(args) to the method taking a vector.
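
As a minimal sketch of that pattern (not the package's code): the varargs method only collects its arguments and forwards them, so the method doing the work is specialized on the vector's type rather than on each combination of arguments. The names `summarize` and `_summarize` are hypothetical, and a single column stands in for the data frame.

```julia
# Vector method: does the actual work on a vector of statistics.
function _summarize(col::AbstractVector, stats::AbstractVector)
    out = Pair{Symbol,Any}[]
    for s in stats
        if s isa Symbol
            f = s === :min ? minimum :
                s === :max ? maximum :
                throw(ArgumentError("unknown statistic :$s"))
            push!(out, s => f(col))
        else
            push!(out, first(s) => last(s)(col))   # custom name => function pair
        end
    end
    return out
end

# Varargs method: a thin wrapper that only collects its arguments.
summarize(col::AbstractVector, stats::Union{Symbol, Pair{Symbol}}...) =
    _summarize(col, collect(stats))

result = summarize([3, 1, 2], :min, :spread => x -> maximum(x) - minimum(x))
```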

@bkamins bkamins mentioned this pull request Jan 15, 2019
31 tasks
@bkamins
Member

bkamins commented Jan 22, 2019

@pdeffebach
Independent of @nalimilan comments I have recently merged #1691. Because of this the test files require rebasing as their major clean-up was performed.

The change was removal of indentation in module block. If you prefer that I do the rebasing then please let me know (in general you have to use the current upstream master and reapply only the changes you have introduced in the PR).

@pdeffebach
Contributor Author

@bkamins Let me know if I did this right! I'm still new at this.

@bkamins
Member

bkamins commented Jan 28, 2019

@pdeffebach Thank you for your work. You have performed git merge and I have suggested git rebase. The difference is that after merge we have thousands of lines of differences to compare and it is very hard to track down your changes.

As a consequence, your PR now has a lot of commits that are unrelated and already committed to master.

For me it would be much simpler, if it is OK with you, that you revert the merge locally and force push the old version (before the merge) and I will resolve the merge conflicts for you.

If you want to learn how to do it yourself the simplest option is to use GitHub GUI to resolve merge conflicts instead of performing full merge of master into your branch.

Sorry for the problems (that is why I offer to fix them myself 😄), but I hope you can see in "Files changed" tab that now this PR is very hard to analyze.

@pdeffebach
Contributor Author

Ah sorry! I swear I did rebase in gitkraken. I think it asked me to do a pull first and that's where it got messed up.

I have force pushed after undo-ing a few things.

@bkamins
Member

bkamins commented Jan 28, 2019

Thanks - now we are clean with master 👍.

Going back to the PR, there are two things to consider, the API and the implementation:

  • regarding the API I can accept what you propose as it is fully non-breaking and recently @nalimilan convinced me to slow down breaking things 😄 (although I would prefer an API more similar to combine)
  • regarding the implementation - I would find it cleaner if we had a get_stat function with two methods: one taking a DataFrame and a Symbol, expecting a standard symbol and doing the predefined calculations; the second taking a DataFrame and a Pair and calculating a custom function. Both methods would return a pair: (column_name, column_values). Then describe, in its main loop, would simply iterate through the vector of statistics and add more columns to the output data frame. Such a design would be easier to reason about; you could also more easily handle the case when the user requests adding several columns with the same name (BTW: what do we want to do then: throw an error, overwrite, or generate a surrogate column name as with makeunique?)
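
The two-method design could look roughly like the sketch below. It is only an illustration under assumptions: a NamedTuple of vectors stands in for the DataFrame, only two predefined statistics are shown, and `describe_sketch` is a hypothetical name. Duplicate column names are not handled here.

```julia
# Stand-in for a data frame: a NamedTuple of column vectors (an assumption for this sketch).
columns = (a = [1, 2, 3], b = [4, 5, 6])

# Predefined statistic named by a Symbol (only two are sketched here).
function get_stat(cols::NamedTuple, stat::Symbol)
    f = stat === :min ? minimum :
        stat === :max ? maximum :
        throw(ArgumentError("unknown statistic :$stat"))
    return stat => [f(c) for c in values(cols)]
end

# Custom statistic given as a name => function pair.
get_stat(cols::NamedTuple, stat::Pair{Symbol}) =
    first(stat) => [last(stat)(c) for c in values(cols)]

# Main loop: one output column per requested statistic, in the order given.
function describe_sketch(cols::NamedTuple, stats::AbstractVector)
    out = Pair{Symbol,Any}[:variable => collect(keys(cols))]
    for s in stats
        push!(out, get_stat(cols, s))
    end
    return out
end

table = describe_sketch(columns, [:min, :spread => c -> maximum(c) - minimum(c)])
```

Because both methods return a `(column_name, column_values)` pair, the main loop does not need to know which kind of statistic it is handling.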

It would be also good to hear if @nalimilan approves the above before we move forward.

After we have agreed on the design then in implementation I would add:

  • updated docstrings;
  • an update in the documentation (manual);
  • added tests.

Thank you for pushing this forward!

@nalimilan
Member

regarding the API I can accept what you propose as it is fully non-breaking and recently @nalimilan convinced me to slow down breaking things 😄 (although I would prefer an API more similar to combine)

I didn't say we should avoid breaking things. :-) On the contrary, that's essential if we want to stabilize the API soon. I just think we should keep deprecations for a long time when they don't hinder progress.

I'm fine with the implementation you suggest

@pdeffebach
Contributor Author

@bkamins I should do this change in a separate PR, though, right? Or should I keep pushing ahead with this one?

@bkamins
Member

bkamins commented Feb 6, 2019

I would use this PR. Now it does not have merge conflicts, so it should not be a problem to push a commit adding the features?
(If you are asking whether the "housekeeping" things, like documentation, should be left for a later PR: I would definitely recommend doing it in one shot. In the past we had some PRs that did not update such things, and now we have holes in the package that need patching.)

Thank you for your effort as describe becomes a swiss army knife with it 😄.

function StatsBase.describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...)

    predefined_funs = Symbol[s for s in stats if s isa Symbol]
    if :all in predefined_funs
Member

I think that when passing :all we should assume that it is the only Symbol-type of argument.

StatsBase.describe(df::AbstractDataFrame) = describe(df, :mean, :min, :median,
                                                     :max, :nunique, :nmissing,
                                                     :eltype)
function StatsBase.describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...)
Member

Probably it would be better to write Union{Symbol, Pair{Symbol}}?

Member

Also note my previous remark about the implementation to avoid mostly useless specialization (second paragraph): #1664 (comment)

Contributor Author

Does this mean I should add another function barrier _describe, i.e. _describe(df::AbstractDataFrame, stats::Vector{<:Union{Symbol, Pair{Symbol}}})?

Member

Yes.

custom_funs = Pair[s for s in stats if s isa Pair]

# Get the names in the order they appear
ordered_names = [stat isa Symbol ? stat : stat[1] for stat in stats]
Member

this should be fixed to handle :all case.

Member

Also, it would be good to make sure that allunique(ordered_names) is true, as it will make the code simpler to reason about later (and I guess we can assume that they must be unique).

Contributor Author

I added a check and now throw an error.

# for each statistic, loop through the columns array to find values
# letting the comprehension choose the appropriate type
data[stat] = [column_stats_dict[stat] for column_stats_dict in column_stats_dicts]
end

# re-order columns according to the names from above
Member

this comment is not needed here I think.

Contributor Author

fixed.

@pdeffebach
Contributor Author

Thanks for the comments! Making a function wrapper also means I get to abandon overly-specific type signatures, which is good.

I updated the tests (and simplified them a bit). Now I need to update the docs.

One API question: what do we want for describe(df, :newcol => fun)? Should we add it to our default list of columns or should we just return a DataFrame with a :fun column?

Member

bkamins left a comment

Looks good. Thank you.

Regarding your question - I would stick to what you have implemented, i.e. add only columns explicitly specified.

@pdeffebach
Contributor Author

Edited the docs and added a @test_throws test for the case of describe(df, :all, :mean).

Let me know how I can improve the docs some more!

@pdeffebach
Contributor Author

I think this is ready to be merged!

@bkamins
Member

bkamins commented Feb 12, 2019

Looks good. @nalimilan - OK?

src/abstractdataframe/abstractdataframe.jl (outdated)
`:all`, all summary statistics are reported.
* `df` : the `AbstractDataFrame`
* stats::Union{Symbol, Pair{Symbol}}... : the summary statistics to report.
* Arguments can be symbols from the following: `:mean`, `:std`, `:min`, `:q25`,
Member

"Arguments can be" should be moved outside of the list as it applies to all bullet points. Then you can go straight to the point for each type.

Contributor Author

fixed.

`:nmissing`. The default statistics used when no `Symbol`s or `Pair`s are provided
are `:mean`, `:min`, `:median`, `:max`, `:nunique`, `:nmissing`, and `:eltype`.
* Alternatively, specify `:all` as the only `Symbol` argument to return all statistics.
* Finally, users can provide their own functions in the form of a `Pair{Symbol, Any}`.
Member

Pair{Symbol, Any} should be Pair{Symbol, <:Any}. But better use something more readable like "`name => function` pairs, with `name` a symbol".

Contributor Author

fixed.

@@ -365,39 +369,58 @@ If the column does not allow missing values, `nothing` is returned.
Consequently, `nmissing = 0` indicates that the column allows
missing values, but does not currently contain any.

Custom functions perform call `skipmissing` on columns of eltype `Union{T, Missing}`,
Member

I suggest something like this:

If custom functions are provided, they are called repeatedly with the vector corresponding
to each column as the only argument. For columns allowing for missing values,
the vector is wrapped in a call to [`skipmissing`](@ref): custom functions must therefore
support such objects (and not only vectors), and cannot access missing values.

The last sentence isn't true right now since we allocate a vector without missing values, but we could avoid doing that in the future when neither the median nor quantiles need to be computed. That would be much faster.
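
For illustration, here is a small plain-Julia sketch of what "supporting such objects" means for a custom function: it should rely on iteration rather than on vector-only operations. The helpers `spread` and `n_present` are hypothetical examples, not part of the package.

```julia
col = [1, missing, 3, 8]
wrapped = skipmissing(col)          # lazy wrapper that skips missing values

# Works on vectors and on skipmissing wrappers: only needs to iterate over the values.
spread(x) = maximum(x) - minimum(x)
s = spread(wrapped)

# Counting by iteration works for any iterator, including skipmissing wrappers.
n_present(x) = count(_ -> true, x)
n = n_present(wrapped)
```

A function written this way should keep working whether it receives an allocated vector without missing values (the current behavior described above) or a lazy `skipmissing` wrapper.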

Contributor Author

fixed.

    i = findfirst(s -> s == :all, stats)
    splice!(stats, i, allowed_fields) # insert in the stats vector to get a good order
elseif :all in predefined_funs
    throw(ArgumentError("If the user specifies `:all` it must be the only `Symbol` argument."))
Member

Avoid talking about "the user" to the user. :-) Anyway, avoid any unnecessary text.

Suggested change
throw(ArgumentError("If the user specifies `:all` it must be the only `Symbol` argument."))
throw(ArgumentError("`:all` must be the only `Symbol` argument."))

Contributor Author

fixed.
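
A self-contained sketch of the `:all` handling discussed in this thread: `:all` is replaced in place by the allowed statistics so that custom pairs keep their position, and any other symbol alongside `:all` is an error. The `allowed` list here is a short hypothetical stand-in for the package's full list, and `expand_all` is a made-up name.

```julia
# Hypothetical short list standing in for the package's allowed statistics.
allowed = [:mean, :min, :max]

function expand_all(stats::AbstractVector, allowed::Vector{Symbol})
    stats = Any[s for s in stats]                 # copy that can hold Symbols and Pairs
    symbols = [s for s in stats if s isa Symbol]
    if symbols == [:all]
        i = findfirst(s -> s === :all, stats)
        splice!(stats, i, allowed)                # replace :all in place to keep the order
    elseif :all in symbols
        throw(ArgumentError("`:all` must be the only `Symbol` argument."))
    end
    return stats
end

expanded = expand_all([:all, :spread => identity], allowed)
```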

throw(ArgumentError(":$stats not allowed." * allowed_msg))
else
stats = [stats]
# todo: fix the printing of this
Member

?

Contributor Author

better printing.

ordered_names = [s isa Symbol ? s : s[1] for s in stats]

if !allunique(ordered_names)
    d = StatsBase.countmap(ordered_names)
Member

This is really a heavy solution. setdiff(ordered_names, unique(ordered_names)) should be enough.

Member

I was thinking about this earlier - performance wise it is ~ the same 😄. But your solution is more readable.

Contributor Author

does this work? I don't think it does. I improved the printing though.

Member

OK it doesn't work. What we do currently in names! is unique(nms[nonunique(DataFrame(nms=nms))]). The reason why I'd rather avoid using countmap is that StatsBase is supposed to be moved to Statistics or other packages, so it may be deprecated at some point.

Contributor Author

used this and fixed printing.
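
For reference, a Base-only sketch of finding the duplicated names without `countmap`, together with a check of why the `setdiff` idea comes back empty. The helper `duplicated_names` is hypothetical and is not what the PR ended up using (the thread above settles on the `nonunique`-based expression from `names!`).

```julia
# Return each name that occurs more than once, in order of first repetition.
function duplicated_names(names::AbstractVector{Symbol})
    seen = Set{Symbol}()
    dups = Symbol[]
    for n in names
        if n in seen
            n in dups || push!(dups, n)
        else
            push!(seen, n)
        end
    end
    return dups
end

ordered_names = [:mean, :min, :mean, :max, :min]
dups = duplicated_names(ordered_names)

# setdiff removes every occurrence of the names in its second argument,
# so this comes back empty instead of listing the duplicates.
naive = setdiff(ordered_names, unique(ordered_names))
```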

test/dataframe.jl (outdated)
describe_output = DataFrame(variable = [:number, :number_missing, :string,
:string_missing, :dates, :catarray],
describe_output = DataFrame(variable = [:number, :number_missing, :string,
:string_missing, :dates, :catarray],
Member

Incorrect alignment.

Contributor Author

fixed

nalimilan and others added 4 commits February 15, 2019 09:47
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
@pdeffebach
Contributor Author

Responded to all these comments, and fixed last printing issues.

I think this is ready to be merged.

d[:first] = isempty(col) ? nothing : first(col)
end

if :last in stats

Member

Suggested change

Contributor Author

fixed.

@@ -480,6 +508,12 @@ function get_stats(col::AbstractVector, stats::AbstractVector{Symbol})
return d
end

function get_stats!(d::Dict, col::AbstractVector, stats::AbstractVector{Pair})
Member

Suggested change
function get_stats!(d::Dict, col::AbstractVector, stats::AbstractVector{Pair})
function get_stats!(d::Dict, col::AbstractVector, stats::AbstractVector{<:Pair})

Contributor Author

fixed.

function StatsBase.describe(df::AbstractDataFrame; stats::Union{Symbol,AbstractVector{Symbol}} =
[:mean, :min, :median, :max, :nunique, :nmissing, :eltype])
# Check that people don't specify the wrong fields.
StatsBase.describe(df::AbstractDataFrame) = _describe(df, [:mean, :min, :median,
Member

Better put a line break here so that the array is on a single line. That will also fix the indentation below.

Contributor Author

fixed.

StatsBase.describe(df::AbstractDataFrame) = _describe(df, [:mean, :min, :median,
:max, :nunique, :nmissing,
:eltype])
StatsBase.describe(df::AbstractDataFrame, stats::Union{Symbol, Pair{Symbol}}...) = _describe(df, collect(stats))
Member

Suggested change
StatsBase.describe(df::AbstractDataFrame, stats::Union{Symbol, Pair{Symbol}}...) = _describe(df, collect(stats))
StatsBase.describe(df::AbstractDataFrame, stats::Union{Symbol, Pair{Symbol}}...) =
_describe(df, collect(stats))

Contributor Author

fixed.

elseif :all in predefined_funs
    throw(ArgumentError("`:all` must be the only `Symbol` argument."))
else
    if !issubset(predefined_funs, allowed_fields)
Member

Make this an elseif.

Contributor Author

fixed.

_describe(df, [:mean, :min, :median,
:max, :nunique, :nmissing,
:eltype])
elseif stats === :all
Member

This branch should also print the deprecation AFAICT.

Contributor Author

fixed.

@bkamins bkamins mentioned this pull request Feb 21, 2019
@bkamins
Member

bkamins commented Feb 21, 2019

In line https://github.com/JuliaData/DataFrames.jl/pull/1664/files#diff-6108c932ccdff77aea536b99a80d4a79R493
you have to write d[:std] = try std(col) catch end, since reusing m leads to the problem in #1730. Could you please change it?

@bkamins
Member

bkamins commented Feb 21, 2019

Probably a better fix after thinking about it is to move:

if :std in stats
    d[:std] = try std(col, mean = m) catch end
end

inside the earlier if where we calculate the mean. Then we can reuse precalculated m.
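
A sketch of how that arrangement could look, using only the Statistics standard library. The function name `add_mean_std!` and the exact shape of the surrounding `if` are assumptions made for illustration, not the code in the PR.

```julia
using Statistics

function add_mean_std!(d::Dict, col, stats)
    if :mean in stats || :std in stats
        m = try mean(col) catch end        # `nothing` when mean is not defined for col
        if :mean in stats
            d[:mean] = m
        end
        if :std in stats
            # Reuse the precalculated mean; evaluates to `nothing` if std fails.
            d[:std] = try std(col, mean = m) catch end
        end
    end
    return d
end

numeric = add_mean_std!(Dict{Symbol,Any}(), [1.0, 2.0, 3.0], [:mean, :std])
textual = add_mean_std!(Dict{Symbol,Any}(), ["a", "b"], [:mean, :std])
```

With this layout `m` is always defined by the time `std` is called, because both statistics live inside the same branch.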

@pdeffebach
Contributor Author

Fixed all the above comments, and changed the if :std to prevent the segfault

@deprecate showcols(io::IO, df::AbstractDataFrame, all::Bool=false, values::Bool=true) show(io, describe(df, stats = [:eltype, :nmissing, :first, :last]), all)
@deprecate showcols(df::AbstractDataFrame, all::Bool=false, values::Bool=true) describe(df, :eltype, :nmissing, :first, :last)
@deprecate showcols(io::IO, df::AbstractDataFrame, all::Bool=false, values::Bool=true) show(io, describe(df, :eltype, :nmissing, :first, :last), all)
function StatsBase.describe(df::AbstractDataFrame; stats=nothing)
Member

Doesn't this function overwrite the one defined above, printing a warning? If so, the other one should be commented out for now.

Member

It does overwrite, but I thought that the warning was not printed. But I think that commenting it out for now with TODO note is best.

Contributor Author

Sorry, I'm a bit confused. What do I comment out and which TODO do I write?

Member

Comment out the overwritten function and above it write TODO to remember to uncomment it when the deprecation period is finished. Have a look at abstractdataframe/iteration.jl for an example (nothing is commented out there but this is the idea). When we have such TODO notes it is easy to grep the whole repository to find things that need fixing.

Member

by "overwritten function" I mean "overwritten method definition" to be precise.

Contributor Author

Done. I think I did this right but let me know.

Member

Thank you 👍. This is what will allow us to avoid the problem @nalimilan mentioned above.

`:nmissing`. The default statistics used
are `:mean`, `:min`, `:median`, `:max`, `:nunique`, `:nmissing`, and `:eltype`.
* `:all` as the only `Symbol` argument to return all statistics.
* Finally, users can provide their own functions in the form of a
Member

"a ... pairs"

I'd also remove "users can provide their own functions": it breaks the logic of the bullets and it's quite verbose. Just say that function is a custom function. Also for name it could be useful to say that it's the name of the resulting column.

Contributor Author

fixed.

Member

nalimilan left a comment

Thanks!

@pdeffebach
Contributor Author

pdeffebach commented Mar 6, 2019

This should be ready for merging. Then we can fix that odd unreachable reached error.

EDIT: as in, this PR fixes the unreachable reached error.

@bkamins
Member

bkamins commented Mar 6, 2019

Thank you. I will wait for @nalimilan to merge, as he has looked more into this PR 😄.

Successfully merging this pull request may close these issues.

allow describe function to take arbitrary functions
3 participants