Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using intervals to represent histogram binning #514

Open
oschulz opened this issue Aug 22, 2019 · 11 comments
Open

Using intervals to represent histogram binning #514

oschulz opened this issue Aug 22, 2019 · 11 comments

Comments

@oschulz
Copy link
Contributor

oschulz commented Aug 22, 2019

Since I'll have to touch the structure of Histogram for 513 anyhow - there's something I've been thinking a bit for a while, and it might be easiest to do this in one go:

I've always been a bit unhappy about the way histogram edges are handled - the fact that edge vectors are one entry longer than the weights each axis always struck me as somewhat inelegant. It also makes histogram analysis code a bit awkward at times.

I think the natural representations of bins are intervals. So the binning of an edge of a histogram should be a vector of (usually contiguous) intervals. Internally, it should in general still be stored the current way, as a vector of real numbers (length + 1), but it should look like a vector of intervals. So I'd like to propose the following:

  • A new type ContiguousIntervals that implements a view of an AbstractVector{T} with length L + 1 as a vector of Interval{:closed,:open,T} (resp. Interval{:open,:closed,T}) with length L. Maybe this could be contibuted to IntervalSets (@timholy, do you think that would fit in)?

  • Histogram's field edges::E gets replaced by binning::E with E<:NTuple{N,AbstractVector{<:IntervalSets.Interval}}. By default, we'd use ContiguousIntervals, but we'd allow any vector of intervals.

  • We use Base.getproperty and friends to provide a virtual property edges, for backward compatibility.

Operations like getting all left/right bin edges or all bin centers would simply become
broadcasts of minimum/maximum/mean over the binning intervals, etc. Stuff like plotting and histogram analysis code would become simpler and more elegant in general.

I'd like to implement this, but as it's not a trivial change, I'd be glad to get a provisional Ok, before I go ahead (@nalimilan @ararslan @quinnj?).

@andreasnoack
Copy link
Member

To me, this seems like complicating the code with very limited practical benefit. Introducing new types always has some overhead and I don't really see the benefit in this case. In particular, I question that allowing flexibility wrt which endpoints are included is worth the trouble. The underlying code would have to take this into account. Histograms are for continuous distributions so shouldn't really matter. Do you have applications where it would be useful?

@oschulz
Copy link
Contributor Author

oschulz commented Sep 11, 2019

Do you have applications where it would be useful?

Sure - we frequently do detailed analysis of histograms. It would be much more natural if StatsBase.Histogram had a representation of a bin on an axis, instead of the current implicit (resp by-convention) way of saying "take axis entries i and i+1". It's kinda hacky, the current way, as a user-facing API. I think intervals would be much cleaner and more natural, this wouldn't come with any runtime/performance cost and I think I could do this without complicating the code overall. Actually, except for the definition of a ContiguousIntervals (that could like in IntervalSets if @timholy agrees), this might actually simplify parts of the Histogram code.

@andreasnoack
Copy link
Member

I'm well aware that you do a lot of work with histograms so my questions was more if you could give an example that could help me understand a situation where it would be easier to use intervals. I'd mainly like to understand where the new types would be useful and I'd like to avoid that users would have to bother with :open and :closed.

@oschulz
Copy link
Contributor Author

oschulz commented Sep 11, 2019

if you could give an example that could help me understand a situation where it would be easier to use intervals

For example, when scanning a histogram for certain patterns (peaks, etc.), one often needs to take left and right bound of each bin (or it's width and center) into account. So you'll have code that looks at each bin and it's weight - but currently, we don't have a one-to-one relationship between weight-indices and bin-indices, the user has to reconstruct the bin information in an i and i+1 fashion. And when checking for bounds (e.g before using @inbounds) one needs to check that the one index-axis must be exactly one entry longer than the other. This is all kinda inconvenient and inelegant.

I think when dealing with histograms, we should be able to have a representation of a bin on an axis - and the natural representation is IMHO an interval. One could even broadcast over bins and weights and stuff like that.

I'd mainly like to understand where the new types would be useful

Well, Interval wouldn't be a new type since we have a well-established type for that in IntervalSets, which would be a very lightweight dependency (doesn't pull in any non-stdlib packages, package load time 0.05s). ContiguousIntervals would be a new type - but it would simply do what we do now in functions like _edge_binvolume or in various places in the histogram recipes in Plots.jl. So it would help make histogram codes DRYer across packages.

I'd like to avoid that users would have to bother with :open and :closed.

I don't think users would have to bother with that at all - they would create and use histograms as they do now. The only API-change would be that long-term we'd deprecate accessing hist.edges, which would be replaced by hist.binning. hist.binning[1][n] would simply provide the n-th bin on the first axis, as an Interval - but I don't think many users would bother with the type parameters of that Interval object, except in cases where they're really interested in whether it's closed left or right. In such cases, however, the user would now be able to dispatch on this if necessary.

@simonbyrne
Copy link
Member

It would be interesting to try this to see how it works: it would be nice to if the weights and intervals were of the same length.

Would be good to see if this could be build on top of https://github.com/JuliaMath/IntervalSets.jl, which was intended to be a common "interval" package (though I assume more functionality would be required than what is available there).

@oschulz
Copy link
Contributor Author

oschulz commented Sep 16, 2019

It would be interesting to try this to see how it works

Ok, in that case I will at least do a prototype, and then try to convince @andreasnoack based on that. :-)

Will take a bit of time, have lot's of stuff (with deadlines) going on at me moment.

Would be good to see if this could be build on top of https://github.com/JuliaMath/IntervalSets.jl ... (though I assume more functionality would be required than what is available there)

Yes, that's exactly what I have in mind.

@oschulz
Copy link
Contributor Author

oschulz commented Oct 17, 2019

Update: Not abandoned, just deferred due to time constraints. I still definitely plan to pursue this.

@oschulz
Copy link
Contributor Author

oschulz commented Mar 17, 2020

Just a life sign - I'm still interested in this, just overloaded, but it's definitely on my to-do list.

@oschulz
Copy link
Contributor Author

oschulz commented Jun 12, 2020

Haven't forgotten about this.

@oschulz
Copy link
Contributor Author

oschulz commented Nov 14, 2020

Still not forgotten, just overloaded ...

@oschulz
Copy link
Contributor Author

oschulz commented Jan 30, 2021

Sorry for the long silence on this - I've been thinking a bit: Now that view-like wrapper objects don't have to be stack-allocated anymore, we can just add a method binning(hist) or getbinning(hist) that returns the binning as a vector of intervals, wrapping the edges without copying anything, even if the edges are actual vectors and not ranges. And we could add ctors that accept vectors-of-intervals as binning, in addition to edge vectors. So we wouldn't need to change the structure of Histogram right now (but might, longer term, also for #513).

See also #650.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants