Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-calculate mean instead of nothing for efficiency #15

Closed
wants to merge 1 commit into from

Conversation

yuehhua
Copy link

@yuehhua yuehhua commented Dec 1, 2019

No description provided.

@mschauer
Copy link
Member

mschauer commented Dec 1, 2019

This only changes the place where the mean is calculated. It now calculated a bit later in
https://github.com/JuliaLang/Statistics.jl/blob/da6057baf849cbc803b952ef7adf979ae3a9f9d2/src/Statistics.jl#L186
This has the advantage that a more efficient algorithm can be used to compute var and mean at the same time.

@chengchingwen
Copy link

@mschauer but most of the api called something beforehand, so that efficient algorithm never has the chance to run and even we provide the mean it will still calculate it again.

@mschauer
Copy link
Member

mschauer commented Dec 3, 2019

The algorithm is only implemented for dims=:. That is code path which ends in _var if dims=:.

@chengchingwen
Copy link

@mschauer It's not only implemented for dims=: but also only for iterable. For AbstractArray type, the problem I mentioned above is not solved. The mean should not be calculated twice.

And this is how the AbstractArray calculate var & mean with dims=:
https://github.com/JuliaLang/Statistics.jl/blob/52fc391ecf85c38be83d66a256e0917860153786/src/Statistics.jl#L341-L342
Not using the efficient algorithm at all.

@yuehhua
Copy link
Author

yuehhua commented Dec 3, 2019

As mentioned above, the efficient implementation for var and mean simultaneously maybe mean this:
https://github.com/JuliaLang/Statistics.jl/blob/da6057baf849cbc803b952ef7adf979ae3a9f9d2/src/Statistics.jl#L186-L200

However, it happens in the case that mean is nothing. That's fine.
But the place I try to change is using something, which try to determine whether it is nothing or not.
If not, mean is calculated first. Thus, there is no chance to use the efficient implementation inside _var.

Another way is to accept nothing and let the algorithm to deal with it.

As my point of view, the implementation here mixes the implementation and interface. It may need refactoring.

@nalimilan
Copy link
Member

Actually IIRC Welford's algorithm isn't more efficient, it's just the only way to compute the variance if you are not allowed to make two passes over the data (for general iterators). But for arrays it's better to do two efficient passes (using SIMD) than a single slow pass. Wikipedia notes "This algorithm is much less prone to loss of precision due to catastrophic cancellation, but might not be as efficient because of the division operation inside the loop." Anyway, this can be checked quite easily if you want.

But TBH I don't really understand what problem this PR tries to address: is it efficiency or just API simplicity? Anyway, this PR would break the current API which allows passing nothing, so it cannot be merged as-is before 2.0. Also note that the docstring also needs updating.

@chengchingwen
Copy link

@nalimilan The problem this PR tries to address is that we will calculate the mean in the function no matter we have the mean value beforehand. They called something on two stuff: the mean argument and also the recalculated mean. That recalculated mean shouldn’t be calculated if the mean argument has value. Therefore, this is more about the efficiency, not API simplicity.

ALSO, the nothing compatibility can be fixed by merely add another dispatch on ::Nothing then everything would be fine. I think @yuehhua can add that to fix this issue.

But TBH the current API is a little messy. We shouldn’t use keyword argument on internal functions since they can have some impact on the performance. We would need a throughout internal API redesign.

@nalimilan
Copy link
Member

OK, I see. Good catch. I guess the simplest solution then it to replace something(mean, Statistics.mean(A, dims=dims)) with mean === nothing ? Statistics.mean(A, dims=dims) : mean.

The performance impact of keyword arguments is much lower than what it used to be AFAIK, so better check that it really makes a difference before redesigning the whole thing. Also we're talking about functions which are relatively expensive so the overhead should be negligible.

@yuehhua
Copy link
Author

yuehhua commented Dec 12, 2019

I open another PR to refactor API directly. #17
We can close this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants