-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support for more auto histogram methods #533
base: master
Are you sure you want to change the base?
Conversation
fit(::Type{Histogram{T}},v::AbstractVector, wv::AbstractWeights;
closed::Symbol=:left, binestimator=sturges, nbins=binestimator(v)) |
Another solution would be to deprecate |
Hmm, I also don't want to remove the option to pass in a number of bins as an integer. Like
I dislike this, since we're now creating new syntax that is completely unique to StatsBase.jl and I can see myself needing to look it up every time I need to use it. It feels somewhat awkward that I can pass fit(::Type{Histogram{T}},v::AbstractVector, bins::Symbol; closed::Symbol=:left) where {T} =
fit(Histogram{T},(v,), (edg,), closed=closed, nbins=bin_estimator(bins, v))
fit(::Type{Histogram{T}},v::AbstractVector, bins::Int; closed::Symbol=:left) where {T} =
fit(Histogram{T},(v,), (edg,), closed=closed, nbins=bins) This would match R, numpy, Plots.jl as a signature and we could deprecate I will temper the above by saying that histrange does some work to ensure the intervals are multiples of 2 and 5 (like R) which I'm not personally a fan of, but could cause a bit of confusion when users are trying to plot something with 20 bins, but only get 11 or something. |
Instead of using hard-coded options, like |
In Plots, I did it with keywords, but it was always meant as kind of an interim solution. |
I like the idea of using dispatch for extensibility. Though I know a change would be breaking I would also consider whether the default should really be Sturgess. Take a look at the implementation in Plots (or consider just moving that - it should be here rather than there and was always meant to); we've caught a few bugs along the way, and there are some safeguards in there for that. The implementation here also doesn't seem to cover the multidimensional case, but we put some effort into getting that right. In StatsPlots there's also some code for Wand's binning (which doesn't align with integers (https://github.com/JuliaPlots/StatsPlots.jl/blob/master/src/hist.jl#L70-L187), which I've ported into the MIT-licensed StatsPlots package with Wand's permission. I don't know if we want that here, but it's a good example of how dispatch would make this more extensible. |
We can perfectly allow passing either a callable or an integer to |
Hi,
I've taken a stab at adding auto-binning for histograms in StatsBase.jl. Given that sturges estimator was the default, I've added the other 2 options available in R, the "scott" estimator and the "FD" estimator. This was inspired by #524 and the fact I've been drawing histograms and sturges is not really that good for large non-normal datasets.
While I'm confident in the implementations of the estimators, I'm less confident about things like API changes and default behaviour changes, especially with things like julia's multiple dispatch and kwargs being sort of unclear (to me). Things like the if statement on the string is a very pythonic thing to do, but I'm a little less confident with the julia way of doing things.
I've got some tests to add before this goes much further, but I'd also like to get a sense on how interested the maintainers would be in having this in StatsBase.jl.
I can also add the extra estimators available in numpy if there's interest there