Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fit nbis=2 produces results of nbins=3 #410

Open
kakila opened this issue Sep 8, 2018 · 26 comments
Open

fit nbis=2 produces results of nbins=3 #410

kakila opened this issue Sep 8, 2018 · 26 comments
Labels

Comments

@kakila
Copy link

kakila commented Sep 8, 2018

When fitting a histogram to data with nbins=2 I get the results corresponding to 3 bins.
Same happens if I pass 0.0:0.5:1.5 and closed=:left instead. I get 3 bins.

julia> h = fit(Histogram, [0,1,0,0,1,1,1,1], nbins=2, closed=:right)
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
  -0.5:0.5:1.0
weights: [3, 0, 5]
closed: right
isdensity: false

julia> h = fit(Histogram, [0,1,0,0,1,1,1,1], nbins=3, closed=:right)
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
  -0.5:0.5:1.0
weights: [3, 0, 5]
closed: right
isdensity: false

I guess this is a bug, otherwise how to avoid this issue?

@nalimilan
Copy link
Member

I'm not familiar with the histogram code, but AFAICT that's expected, and I get the same thing with R. The number of bins is only a suggestion, and it's not always possible to achieve the requested number. Maybe the documentation needs to be more explicit (it already says "approximate number of bins").

@kakila
Copy link
Author

kakila commented Sep 11, 2018

Ok, I do not how R handles the issue.

Then, to specify a trivial 2 bin histogram, Is one forced to do something like the following?

h = fit(Histogram, [0,1,0,0,1,1,1,1], -0.02:0.51:1.0, closed=:right)

Why can the function not resolve nbins=2 in this case to any valid edges (the above just one out of an infinite)?

@jamblejoe
Copy link

Just to revive an old issue or "feature": Having a function, guessing how many bins to chose if one do not provide anything, is useful. BUT if I do provide the number of bins, I want to be sure that I exactly get that number of bins in my histogram. What is the point of not having this behaviour?

@Moelf
Copy link
Contributor

Moelf commented Mar 3, 2021

I think this can lead to unnecessary grief for the users: imagine two parties trying to sync bin counts...
cc @oschulz for perspective

I think we can make nbins=X strictly enforced since in real life people are probably counting on nbins being very closely realized anyways.

@jamblejoe
Copy link

@Moelf , could you be a bit pro precise on your example? I dont understand how two parties are involved in plotting a histogram. Could give an example?

@Moelf
Copy link
Contributor

Moelf commented Mar 3, 2021

it's common in science/data science people who processed data independently want to verify the result are the same (sync).

In this case, a naive understanding of nbins would lead to n bins (for example, if someone compares to numpy.histogram)

@jamblejoe
Copy link

@Moelf , yes sure! I misunderstood your comment. You suggest nbins specifying the exact number of bins.

@oschulz
Copy link
Contributor

oschulz commented Mar 3, 2021

I think we can make nbins=X strictly enforced

I fully agree: if the user specifies nbins = 2, then they should get two bins. Getting 3 bins is surprising to the user, and a histogram with 2 bins (and even one bin with nbins = 1) is a perfectly valid thing.

@nalimilan
Copy link
Member

nalimilan commented Mar 3, 2021

At the very least it would be interesting to start providing an argument to force using exactly the requested number of bins. That wouldn't be breaking. Then in the next breaking release we could change the default if we're happy with how the new argument works.

@thompsonmj
Copy link

It seems to be convention to make 'nice' numbers for bin widths. I like the suggestion by @nalimilan to add a new argument like nbins_exact. This could be implemented to match functionality between

h = fit(Histogram, x, range(minimum(x), stop=maximum(x), length=29))
and
h = fit(Histogram, x, nbins_exact=28)

@oschulz
Copy link
Contributor

oschulz commented Jul 28, 2021

Hm, having two competing nbins kwargs makes things ambiguous, though. Maybe some kind of annotation like nbins = (100, :exact) or so instead?

@Moelf
Copy link
Contributor

Moelf commented Jul 28, 2021

I believe the change is not breaking, reason: we said it's approximate anyways, which means

  • we can change the heuristics
  • one of the option is to use exact numbers as is :)

@oschulz
Copy link
Contributor

oschulz commented Jul 28, 2021

I think the cleanest solution would be to have binning algorithms, represented by types/structs that can then have parameters like exact. @mkborregaard and me did add a several binning algorithms to Plots a long time ago (https://github.com/JuliaPlots/Plots.jl/blob/fd46fd4addd040c4e5b73b00e71109a46eea620e/src/recipes.jl#L789), but specified as keywords. This had always been intended to be an interim solution until we added them to StatsBase.Histogram in a proper way (with types), we just never got around to it. Maybe it's really time now.

@Moelf
Copy link
Contributor

Moelf commented Jul 28, 2021

I actually have a histogram package because StatsBase seems reluctant to support more advanced usage such as:

hist1 / hist2

# thread-safe
push!(hist, val, weight)

@oschulz
Copy link
Contributor

oschulz commented Jul 29, 2021

It would be really nice to support the thread-safe push! directly in StatsBase.Histogram

@Moelf
Copy link
Contributor

Moelf commented Jul 29, 2021

Yes, but I think the idea is this is "Base" histogram. Well, I'm happy to be part of StatsBase for sure since I'm already doing (<: AbstractHistogram) but I don't know what the devs here think.

@thompsonmj
Copy link

More advanced usage can maybe go in a new issue ... I just want my bin counts 🙃

@nalimilan
Copy link
Member

See also #533.

I actually have a histogram package because StatsBase seems reluctant to support more advanced usage such as:

hist1 / hist2

# thread-safe
push!(hist, val, weight)

Who said this? Actually at #650 I said almost the contrary. Improvements to histograms in StatsBase would be welcome (though I admit finding reviewers isn't always easy).

@mkborregaard
Copy link
Contributor

Oh wow @oschulz have those methods not been moved yet? Now is clearly the time :-)

@oschulz
Copy link
Contributor

oschulz commented Aug 31, 2021

You're so right @mkborregaard . We should do it dispatched-based, with binning algorithms structs (so they can, in principle, take parameters), not with keyword-symbols like in Plots. And it could tie in with #514 (still on my to-do list).

@feanor12
Copy link

feanor12 commented Oct 13, 2021

If the number of bins is approximated it would be good to document how it is approximated.
Even better would be the possibility to select the approximation algorithm. This could then also have the option to get the exact number of bins.

A few options I found are:

  • exact
  • sturges
  • rice
  • scott

@oschulz
Copy link
Contributor

oschulz commented Oct 13, 2021

Yes, that's the kind of methods we have in Plot now and that we should move over to StatsBase (with algorithm structs, not keywords like in Plots).

@thompsonmj
Copy link

Gentle reminder if schedules permit:

@oschulz ...

I think the cleanest solution would be to have binning algorithms, represented by types/structs that can then have parameters like exact. @mkborregaard and me did add a several binning algorithms to Plots a long time ago (https://github.com/JuliaPlots/Plots.jl/blob/fd46fd4addd040c4e5b73b00e71109a46eea620e/src/recipes.jl#L789), but specified as keywords. This had always been intended to be an interim solution until we added them to StatsBase.Histogram in a proper way (with types), we just never got around to it. Maybe it's really time now.

@mkborregaard ...

Oh wow @oschulz have those methods not been moved yet? Now is clearly the time :-)

@oschulz ...

You're so right @mkborregaard . We should do it dispatched-based, with binning algorithms structs (so they can, in principle, take parameters), not with keyword-symbols like in Plots. And it could tie in with #514 (still on my to-do list).

@oschulz
Copy link
Contributor

oschulz commented Jan 17, 2022

Thanks for the reminder! It's been on the back of my mind, I'll try to get on it soonish.

@huixinzhang
Copy link

This code works now: h = fit(Histogram, x, range(minimum(x), stop=maximum(x), length=nbins)).

@thompsonmj
Copy link

Hi @huixinzhang, I think the core of the discussion centers on the default response to using the nbins keyword argument. The workaround using range() is useful, but I believe nbins should use the exact number of bins asked for if the code is to speak for itself!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants