Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for more auto histogram methods #533

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

nayyarv
Copy link

@nayyarv nayyarv commented Oct 20, 2019

Hi,

I've taken a stab at adding auto-binning for histograms in StatsBase.jl. Given that sturges estimator was the default, I've added the other 2 options available in R, the "scott" estimator and the "FD" estimator. This was inspired by #524 and the fact I've been drawing histograms and sturges is not really that good for large non-normal datasets.

While I'm confident in the implementations of the estimators, I'm less confident about things like API changes and default behaviour changes, especially with things like julia's multiple dispatch and kwargs being sort of unclear (to me). Things like the if statement on the string is a very pythonic thing to do, but I'm a little less confident with the julia way of doing things.

I've got some tests to add before this goes much further, but I'd also like to get a sense on how interested the maintainers would be in having this in StatsBase.jl.

I can also add the extra estimators available in numpy if there's interest there

@wildart
Copy link
Contributor

wildart commented Oct 20, 2019

nbins is an integer parameter for specifying particular number of bins. But, you could create a new keyword binestimator parameter to pass appropriate function like that

fit(::Type{Histogram{T}},v::AbstractVector, wv::AbstractWeights;
    closed::Symbol=:left, binestimator=sturges, nbins=binestimator(v))

@nalimilan
Copy link
Member

nbins is an integer parameter for specifying particular number of bins. But, you could create a new keyword binestimator parameter to pass appropriate function like that

fit(::Type{Histogram{T}},v::AbstractVector, wv::AbstractWeights;
    closed::Symbol=:left, binestimator=sturges, nbins=binestimator(v))

Another solution would be to deprecate nbins in favor of just bins, which would take a callable. Then one would do e.g. bins=Sturges(nbins), with Sturges a special type.

@nalimilan
Copy link
Member

See also previous work at https://github.com/JuliaPlots/Plots.jl/blob/648f26f075731940fad7319ac1bcef21ed444cc4/src/recipes.jl#L612-L665

@nayyarv nayyarv changed the title Adding support for more Adding support for more auto histogram methods Oct 20, 2019
@nayyarv
Copy link
Author

nayyarv commented Oct 22, 2019

Another solution would be to deprecate nbins in favor of just bins, which would take a callable. Then one would do e.g. bins=Sturges(nbins), with Sturges a special type.

Hmm, I also don't want to remove the option to pass in a number of bins as an integer. Like numpyand Plots.jl have a bins parameter that then decides based on type. Since nbins is a kwarg, we can't do any kind of multiple dispatch (AFAIK)

nbins is an integer parameter for specifying particular number of bins. But, you could create a new keyword binestimator parameter to pass appropriate function like that.

I dislike this, since we're now creating new syntax that is completely unique to StatsBase.jl and I can see myself needing to look it up every time I need to use it.

It feels somewhat awkward that I can pass edges as an arg, but not number of bins as an arg. I personally believe that the number of bins is a very important parameter in histograms and would suggest a couple of new signatures (and maybe extra versions with weights too?) to be added.

fit(::Type{Histogram{T}},v::AbstractVector, bins::Symbol; closed::Symbol=:left) where {T} =
    fit(Histogram{T},(v,), (edg,), closed=closed, nbins=bin_estimator(bins, v))
fit(::Type{Histogram{T}},v::AbstractVector, bins::Int; closed::Symbol=:left) where {T} =
    fit(Histogram{T},(v,), (edg,), closed=closed, nbins=bins)

This would match R, numpy, Plots.jl as a signature and we could deprecate nbins as a kwarg too for fit?

I will temper the above by saying that histrange does some work to ensure the intervals are multiples of 2 and 5 (like R) which I'm not personally a fan of, but could cause a bit of confusion when users are trying to plot something with 20 bins, but only get 11 or something.

@oschulz
Copy link
Contributor

oschulz commented Oct 22, 2019

CC @mkborregaard

@oschulz
Copy link
Contributor

oschulz commented Oct 22, 2019

Instead of using hard-coded options, like if (estimator == :sturges) - wouldn't it be more extensible to use dispatch here? We could have an abstract functor-like type [Abstract]BinningAlgorithm that binning algo's would extend.

@oschulz
Copy link
Contributor

oschulz commented Oct 22, 2019

In Plots, I did it with keywords, but it was always meant as kind of an interim solution.

@mkborregaard
Copy link
Contributor

I like the idea of using dispatch for extensibility. Though I know a change would be breaking I would also consider whether the default should really be Sturgess.

Take a look at the implementation in Plots (or consider just moving that - it should be here rather than there and was always meant to); we've caught a few bugs along the way, and there are some safeguards in there for that. The implementation here also doesn't seem to cover the multidimensional case, but we put some effort into getting that right.

In StatsPlots there's also some code for Wand's binning (which doesn't align with integers (https://github.com/JuliaPlots/StatsPlots.jl/blob/master/src/hist.jl#L70-L187), which I've ported into the MIT-licensed StatsPlots package with Wand's permission. I don't know if we want that here, but it's a good example of how dispatch would make this more extensible.

@nalimilan
Copy link
Member

Another solution would be to deprecate nbins in favor of just bins, which would take a callable. Then one would do e.g. bins=Sturges(nbins), with Sturges a special type.

Hmm, I also don't want to remove the option to pass in a number of bins as an integer. Like numpyand Plots.jl have a bins parameter that then decides based on type. Since nbins is a kwarg, we can't do any kind of multiple dispatch (AFAIK)

We can perfectly allow passing either a callable or an integer to bins. What we can't do is dispatch on keyword arguments, but we can handle different types internally (and the compiler will even optimize that).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants