Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Move histograms to separate package #650

Open
oschulz opened this issue Jan 30, 2021 · 14 comments
Open

Proposal: Move histograms to separate package #650

oschulz opened this issue Jan 30, 2021 · 14 comments

Comments

@oschulz
Copy link
Contributor

oschulz commented Jan 30, 2021

Histograms are kind of a standalone thing in StatsBase right now - they're not used anywhere in StatsBase, and they use little to no functionality from StatsBase themselves. So I propose to factor them out into a separate package "JuliaStats/Histograms.jl".

StatsBase.jl could reexport Histograms.jl for a while, to keep compatibility, and then stop reexporting it with the next breaking release of StatsBase.

@andreasnoack , @simonbyrne , what do you think?

@oschulz
Copy link
Contributor Author

oschulz commented Jan 30, 2021

I'd be willing to take care of the complete transition.

@nalimilan
Copy link
Member

What would be the advantage of doing this? I mean, most parts of StatsBase could live independently from each other, but that makes it hard for users to find where the functionality lives.

@oschulz
Copy link
Contributor Author

oschulz commented Jan 31, 2021

Hm, I had the impression that other parts of StatsBase were coupled more tightly. And it would make it easier to evolve Histograms since StatsBase is such a central package that 0.x updates with breaking changes take ages to propagate though the ecosystem - so it takes a long time before higher-level packages can access the new version. Far, far fewer packages use histograms than use StatsBase. And just in general I thought more modularity would be a good thing. Also, thought (I speak under correction) that there was a longer term plan to restructure/modularize StatsBase anyhow (move parts to Statistics, other structural changes)?

@nalimilan
Copy link
Member

Yes the plan has been for a long time to move essential features to Statistics and other parts to specialized packages. So I assume you think histograms belong to the latter rather than the former?

And it would make it easier to evolve Histograms since StatsBase is such a central package that 0.x updates with breaking changes take ages to propagate though the ecosystem - so it takes a long time before higher-level packages can access the new version.

Do you mean that you would like to make breaking changes to histogram code? Otherwise we can tag bugfix releases of StatsBase. We could also tag 1.0 so that we can make minor releases when we add new features.

@oschulz
Copy link
Contributor Author

oschulz commented Jan 31, 2021

So I assume you think histograms belong to the latter rather than the former?

Yes, I think both StatsBase and a new "Histograms" package could both depends on Statistics without depending on each other.

Do you mean that you would like to make breaking changes to histogram code?

Well, #513 and #514 (dragged them on too long, but I really want to work on those now) will require changes to the Histogram struct that may be too substantial for a bugfix release. If there are breaking will depend on how I do it exactly - and also on whether addition of new type parameters (array type, for #513) would be considered breaking or not.

This can all be done within StateBase, of course, I just thought that if I'm going to do a bit of an overhaul on Histogram, it would be a good time to factor it out. StatsBase is a bit of a monster by now ... and maybe package load times would profit as well from increased modularity?

@simonbyrne
Copy link
Member

I can't remember why exactly, but KernelDensity.jl has its own implementation:
https://github.com/JuliaStats/KernelDensity.jl/blob/2c91f3f815900fc0dbcb2aef638b89a200604a16/src/univariate.jl#L99-L119
it would be good to make them able to use the same underlying functions.

@oschulz
Copy link
Contributor Author

oschulz commented Feb 1, 2021

I agree.

@nalimilan
Copy link
Member

Well, #513 and #514 (dragged them on too long, but I really want to work on those now) will require changes to the Histogram struct that may be too substantial for a bugfix release. If there are breaking will depend on how I do it exactly - and also on whether addition of new type parameters (array type, for #513) would be considered breaking or not.

Adding type parameters or changing the internal fields of the struct isn't considered as breaking. Anyway regarding #514 it seems that @andreasnoack wanted to see a proposal first and that didn't happen yet. So better decide whether we want to make changes before considering them as breaking. Also since we want to move to 1.0, being breaking isn't the end of the world, that's a good occasion to stop using 0.x.

This can all be done within StateBase, of course, I just thought that if I'm going to do a bit of an overhaul on Histogram, it would be a good time to factor it out. StatsBase is a bit of a monster by now ... and maybe package load times would profit as well from increased modularity?

"Monster" sounds like an exaggeration. StatsBase seems to load quite quickly, and it's much smaller than many packages. Yes it's a collection of many small functions in different areas, but that makes them easier to discover.

I can't remember why exactly, but KernelDensity.jl has its own implementation:
https://github.com/JuliaStats/KernelDensity.jl/blob/2c91f3f815900fc0dbcb2aef638b89a200604a16/src/univariate.jl#L99-L119
it would be good to make them able to use the same underlying functions.

Sure. But KernelDensity already depends on StatsBase so that can perfectly be done now by implementing the relevant functions there.

Overall I think the only relevant discussion is about the general organization of the stats packages. If we are certain we don't want histograms to live in Statistics at some point then it could make sense to move them to a separate package. However that's a quite limited set of features so I'm not sure that makes sense. We're not going to create one package for each page in the manual, that would be a nightmare for users and maintainers.

@oschulz
Copy link
Contributor Author

oschulz commented Feb 1, 2021

Adding type parameters or changing the internal fields of the struct isn't considered as breaking.

Thanks!

Anyway regarding #514 it seems that @andreasnoack wanted to see a proposal first and that didn't happen yet.

Sure, I'll have to flesh that out, of course - was planning to do that in the form of a PR.

"Monster" sounds like an exaggeration. StatsBase seems to load quite quickly

Yes, I admit that may have been a bit strong. :-) Package load time is about 0.35 s on my system (on v1.6-beta1), which is certainly not horrible (but also not negligible, IMHO).

Overall I think the only relevant discussion is about the general organization of the stats packages.

I agree. I wasn't trying to argue that splitting off histograms is in any way necessary for impromevments like #513 and #514. It's just while I was finally finding some time to get on that, that I realized that a separate Histograms package might be a good idea in general. But this proposal is indeed not coupled to #513 and #514.

@oschulz
Copy link
Contributor Author

oschulz commented Feb 1, 2021

@Moelf , you're also interested in histograms, what's your take on this?

@Moelf
Copy link
Contributor

Moelf commented Feb 3, 2021

@oschulz thx for pinning me! I'm leaning towards "yes". I think a few good things can spin out of it:

There are quite a few things we want to have in a comprehensive/"SOTA" histogram pkg:

  • More functionality: error handling and propagation
  • More computing backend: GPU
  • Niche things: sampling from a histogram, sophisticated auto binning algorithms, underflow/overflow bin
  • Specialization: faster algorithms for specific kind of histogram (manual, special heuristics)

These can be part of the StatsBase but it also may feel too heavy to maintain maybe.


An alternative would be to keep some basic things in StatsBase but separate and greatly re-write / extend a new Histograms.jl package. This is how I ~always envisioned. (Abstract)Histogram as an interface if you want to call it.

(I actually have a very crude fast 1D histogram implementation w/ error that is precisely based on AbstractHistograms in StatsBase.)

I have strong interests in contributing to this, I will keep an eye on this thread.

@oschulz
Copy link
Contributor Author

oschulz commented Feb 3, 2021

Thanks for your input, @Moelf !

This would actually be a good motivation for a standalone Histograms package - some of the stuff you want could be done within StatsBase, but the rest would IMHO better fit into a separate package. Maybe your "nice things" could be included too, without making it too heavy. Otherwise more special histogram types could of course go into separate packages, as you suggest.

@pdeffebach
Copy link

I dont know if its necessary to get move histograms into its own package. If it's really self contained than that makes sense. But I don't think it should be a moving target of a package.

However I wouldn't mind an overhaul for how histograms support missings. See here.

@oschulz
Copy link
Contributor Author

oschulz commented Feb 17, 2021

But I don't think it should be a moving target of a package.

Oh, no. But histograms are really pretty much self-contained in StatsBase.

However I wouldn't mind an overhaul for how histograms support missings. See here.

Good point!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants