Pre-calculate mean instead of nothing for efficiency #15

yuehhua · 2019-12-01T02:29:48Z

No description provided.

mschauer · 2019-12-01T09:03:21Z

This only changes the place where the mean is calculated. It now calculated a bit later in
https://github.com/JuliaLang/Statistics.jl/blob/da6057baf849cbc803b952ef7adf979ae3a9f9d2/src/Statistics.jl#L186
This has the advantage that a more efficient algorithm can be used to compute var and mean at the same time.

chengchingwen · 2019-12-03T02:53:59Z

@mschauer but most of the api called something beforehand, so that efficient algorithm never has the chance to run and even we provide the mean it will still calculate it again.

mschauer · 2019-12-03T08:07:18Z

The algorithm is only implemented for dims=:. That is code path which ends in _var if dims=:.

chengchingwen · 2019-12-03T14:39:36Z

@mschauer It's not only implemented for dims=: but also only for iterable. For AbstractArray type, the problem I mentioned above is not solved. The mean should not be calculated twice.

And this is how the AbstractArray calculate var & mean with dims=:
https://github.com/JuliaLang/Statistics.jl/blob/52fc391ecf85c38be83d66a256e0917860153786/src/Statistics.jl#L341-L342
Not using the efficient algorithm at all.

yuehhua · 2019-12-03T15:38:48Z

As mentioned above, the efficient implementation for var and mean simultaneously maybe mean this:
https://github.com/JuliaLang/Statistics.jl/blob/da6057baf849cbc803b952ef7adf979ae3a9f9d2/src/Statistics.jl#L186-L200

However, it happens in the case that mean is nothing. That's fine.
But the place I try to change is using something, which try to determine whether it is nothing or not.
If not, mean is calculated first. Thus, there is no chance to use the efficient implementation inside _var.

Another way is to accept nothing and let the algorithm to deal with it.

As my point of view, the implementation here mixes the implementation and interface. It may need refactoring.

nalimilan · 2019-12-04T17:01:29Z

Actually IIRC Welford's algorithm isn't more efficient, it's just the only way to compute the variance if you are not allowed to make two passes over the data (for general iterators). But for arrays it's better to do two efficient passes (using SIMD) than a single slow pass. Wikipedia notes "This algorithm is much less prone to loss of precision due to catastrophic cancellation, but might not be as efficient because of the division operation inside the loop." Anyway, this can be checked quite easily if you want.

But TBH I don't really understand what problem this PR tries to address: is it efficiency or just API simplicity? Anyway, this PR would break the current API which allows passing nothing, so it cannot be merged as-is before 2.0. Also note that the docstring also needs updating.

chengchingwen · 2019-12-04T17:53:45Z

@nalimilan The problem this PR tries to address is that we will calculate the mean in the function no matter we have the mean value beforehand. They called something on two stuff: the mean argument and also the recalculated mean. That recalculated mean shouldn’t be calculated if the mean argument has value. Therefore, this is more about the efficiency, not API simplicity.

ALSO, the nothing compatibility can be fixed by merely add another dispatch on ::Nothing then everything would be fine. I think @yuehhua can add that to fix this issue.

But TBH the current API is a little messy. We shouldn’t use keyword argument on internal functions since they can have some impact on the performance. We would need a throughout internal API redesign.

nalimilan · 2019-12-05T11:28:49Z

OK, I see. Good catch. I guess the simplest solution then it to replace something(mean, Statistics.mean(A, dims=dims)) with mean === nothing ? Statistics.mean(A, dims=dims) : mean.

The performance impact of keyword arguments is much lower than what it used to be AFAIK, so better check that it really makes a difference before redesigning the whole thing. Also we're talking about functions which are relatively expensive so the overhead should be negligible.

yuehhua · 2019-12-12T06:21:17Z

I open another PR to refactor API directly. #17
We can close this PR.

Pre-calculate mean instead of nothing

52fc391

andreasnoack closed this May 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pre-calculate mean instead of nothing for efficiency #15

Pre-calculate mean instead of nothing for efficiency #15

yuehhua commented Dec 1, 2019

mschauer commented Dec 1, 2019

chengchingwen commented Dec 3, 2019

mschauer commented Dec 3, 2019

chengchingwen commented Dec 3, 2019

yuehhua commented Dec 3, 2019 •

edited

Loading

nalimilan commented Dec 4, 2019

chengchingwen commented Dec 4, 2019

nalimilan commented Dec 5, 2019

yuehhua commented Dec 12, 2019

Pre-calculate mean instead of nothing for efficiency #15

Pre-calculate mean instead of nothing for efficiency #15

Conversation

yuehhua commented Dec 1, 2019

mschauer commented Dec 1, 2019

chengchingwen commented Dec 3, 2019

mschauer commented Dec 3, 2019

chengchingwen commented Dec 3, 2019

yuehhua commented Dec 3, 2019 • edited Loading

nalimilan commented Dec 4, 2019

chengchingwen commented Dec 4, 2019

nalimilan commented Dec 5, 2019

yuehhua commented Dec 12, 2019

yuehhua commented Dec 3, 2019 •

edited

Loading