Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solve the overflow in mean() on integers by promoting accumulator #25

Merged
merged 19 commits into from
Apr 14, 2020
Merged

Solve the overflow in mean() on integers by promoting accumulator #25

merged 19 commits into from
Apr 14, 2020

Conversation

kagalenko-m-b
Copy link
Contributor

@kagalenko-m-b kagalenko-m-b commented Mar 1, 2020

This PR corrects the issue #22, the bug in mean() on integer inputs.

julia> x = [1,1]*typemax(Int);

julia> mean(x)
-1.0

julia> mean(x,dims=1)[]
2.147483647e9

Implementation and tests are largely as proposed by @stevengj

src/Statistics.jl Outdated Show resolved Hide resolved
src/Statistics.jl Outdated Show resolved Hide resolved
src/Statistics.jl Outdated Show resolved Hide resolved
src/Statistics.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated
@@ -710,7 +724,7 @@ end
x = Any[1, 2, 4, 10]
y = Any[1, 2, 4, 10//1]
@test var(x) === 16.25
@test var(y) === 65//4
@test var(y) == 65//4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@test var(y) == 65//4
@test var(y) === 16.25

Technically this qualifies as a breaking change I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, however it is not obvious to me whether one or the other is more "right"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the point is that I'm not sure whether our policy allows changing things like this in a minor release.

Copy link
Contributor Author

@kagalenko-m-b kagalenko-m-b Mar 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this change restores consistency

julia> y = Any[1, 2, 4, 10//1];var(y,dims=1)[]===16.25
true

So maybe calling it "breaking" is too strong.

test/runtests.jl Outdated Show resolved Hide resolved
kagalenko-m-b and others added 2 commits March 4, 2020 11:58
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Suggestions from code review

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Triage agreed that overflow should be avoided. Just a few more stylistic comments before merging this PR.

src/Statistics.jl Outdated Show resolved Hide resolved
src/Statistics.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
kagalenko-m-b and others added 2 commits March 20, 2020 17:54
Code style fixes suggested by reivewer

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I forgot that you also need to change the mean function for generic iterators above in the file.

@@ -41,7 +41,12 @@ julia> mean(skipmissing([1, missing, 3]))
2.0
```
"""
mean(itr) = mean(identity, itr)
function mean(itr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change isn't OK:

  • collect(A) called by mean_promote will make a copy, which should be avoided at all costs. So we really need two different methods, one for arrays, and one for general iterators.
  • mean(f, itr) should do the same promotion as mean(itr). There's no reason it should have a different behavior, and the equivalence between mean(itr) and mean(identity, itr) seems essential.

Copy link
Contributor Author

@kagalenko-m-b kagalenko-m-b Mar 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change isn't OK:
* collect(A) called by mean_promote will make a copy, which should be avoided at all costs. So we really need two different methods, one for arrays, and one for general iterators.

I added the call to collect() to count the elements of "skipmissing" arrays.
Isn't there a method to count the elements without making a copy? Having duplicate implementations is bad style and multiplies the possibilities for bugs, when those two implementations go out of sync.

* `mean(f, itr)` should do the same promotion as `mean(itr)`. 

That is something I have to question. mean(f, itr) is user explicitly specifying the promotion. Should the function be second-guessing him in that case?

Anyways, that objection is easy to accommodate by adding an optional argument to _mean_promote() with the default value identity.

no reason it should have a different behavior, and the equivalence between mean(itr) and mean(identity, itr) seems essential

The second form means a user is telling the function "don't do the accumulator promotion".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, mean() on empty tuple throws MethodError and this behaviour is enshrined in tests. On the other hand, mean() on zero-length array returns NaN. That looks inconsistent to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the call to collect() to count the elements of "skipmissing" arrays.
Isn't there a method to count the elements without making a copy? Having duplicate implementations is bad style and multiplies the possibilities for bugs, when those two implementations go out of sync.

Unfortunately that's not possible, as some iterators do not allow going over their elements twice. So you need to count the elements as you process them.

That is something I have to question. mean(f, itr) is user explicitly specifying the promotion. Should the function be second-guessing him in that case?

Anyways, that objection is easy to accommodate by adding an optional argument to _mean_promote() with the default value identity.

Well the user isn't really specifying the promotion, s/he's merely saying that elements should be transformed first. For example, it's common to call mean(abs, x), and it could be surprising that the result overflows while mean(x) doesn't.

Right now, mean() on empty tuple throws MethodError and this behaviour is enshrined in tests. On the other hand, mean() on zero-length array returns NaN. That looks inconsistent to me.

That's because an empty array provides the element type so you know what's the type of zero that you should use. But an empty tuple doesn't have any element type, so the only choice we have is throw an error. Fixing the inconsistency by also throwing an error for arrays would be really annoying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That explanation makes sense, thanks.

Copy link
Contributor Author

@kagalenko-m-b kagalenko-m-b Mar 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented promotion for generic iterators and unified promotion for AbstractArray in one function. If you take a look at the lines #164-165 in Statistics.jl, the commented out line 165 passes all tests for mean(), but breaks some tests of var().

src/Statistics.jl Outdated Show resolved Hide resolved
src/Statistics.jl Outdated Show resolved Hide resolved
src/Statistics.jl Outdated Show resolved Hide resolved
src/Statistics.jl Outdated Show resolved Hide resolved
src/Statistics.jl Outdated Show resolved Hide resolved
@nalimilan
Copy link
Member

Can you add tests for mean(f, A, dims=...) methods?

@StefanKarpinski
Copy link
Contributor

That sounds better thought through that what I had come up with but yes, you can break a list into small enough pieces that each one can't overflow and then sum those recursively.

@nalimilan
Copy link
Member

Actually I hadn't realized you could break arrays in chunks. In practice, is it really necessary? Even for Int32, the condition I wrote in my previous comment works for up to 4e9 entries. We can leave splitting as a future optimization.

@stevengj
Copy link
Contributor

stevengj commented Apr 8, 2020

I'm skeptical that trying to optimize the small-integer case is worth the effort (= code complexity, fragility). For what real application is mean the performance-limiting step, such that a constant-factor slowdown is a big deal?

@kagalenko-m-b
Copy link
Contributor Author

That is my feeling on this matter, as well.

@nalimilan
Copy link
Member

No idea. But adding that fast path to avoid a performance regression isn't very costly either. Also, you never know what kind of operation people are going to benchmark against other languages.

@stevengj
Copy link
Contributor

stevengj commented Apr 8, 2020

But adding that fast path to avoid a performance regression isn't very costly either.

It is if you get it wrong, and the probability of that increases as you try to make the fast path more and more complicated in order to make it more and more general. And all code has a maintenance cost.

@nalimilan
Copy link
Member

@stevengj Are you OK with the PR as it is then?

src/Statistics.jl Outdated Show resolved Hide resolved
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
test/runtests.jl Outdated Show resolved Hide resolved
src/Statistics.jl Outdated Show resolved Hide resolved
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan
Copy link
Member

@StefanKarpinski Your call?

@StefanKarpinski
Copy link
Contributor

I think this is not my call, but if you and @stevengj are happy with this, I say go ahead.

@nalimilan
Copy link
Member

OK. Merging then. Adding fast paths for arrays of small integers can be discussed later.

@kagalenko-m-b Thanks for finishing this!

@nalimilan nalimilan merged commit 97c743d into JuliaStats:master Apr 14, 2020
@kagalenko-m-b
Copy link
Contributor Author

Took longer than I expected, but in the end we much improved the original version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants