Solve the overflow in mean() on integers by promoting accumulator #25

kagalenko-m-b · 2020-03-01T11:35:16Z

This PR fixes issue #22, the integer overflow in mean() on integer inputs. The two code paths currently disagree:

julia> x = [1,1]*typemax(Int);

julia> mean(x)
-1.0

julia> mean(x,dims=1)[]
2.147483647e9

Implementation and tests are largely as proposed by @stevengj
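To make the failure mode concrete, here is a minimal sketch of the idea behind the fix: accumulate in floating point rather than in the integer element type. The helper name `mean_promoted` is hypothetical and the always-Float64 accumulator is a simplifying assumption; this is not the code merged in this PR.

```julia
# Hypothetical helper, not the merged implementation: accumulate in Float64.
function mean_promoted(x::AbstractArray{<:Integer})
    isempty(x) && return NaN      # mirrors mean() returning NaN on an empty array
    acc = 0.0                     # floating-point accumulator does not wrap around
    for v in x
        acc += v                  # each integer is converted to Float64 before adding
    end
    return acc / length(x)
end

x = [1, 1] * typemax(Int)
naive = sum(x) / length(x)        # integer sum wraps to -2 first, so this is -1.0
promoted = mean_promoted(x)       # equals Float64(typemax(Int))
```

The naive version reproduces the `-1.0` reported above, while the promoted accumulator returns the expected value up to floating-point rounding.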

nalimilan · 2020-03-04T08:49:16Z

test/runtests.jl

@@ -710,7 +724,7 @@ end
    x = Any[1, 2, 4, 10]
    y = Any[1, 2, 4, 10//1]
    @test var(x) === 16.25
-    @test var(y) === 65//4
+    @test var(y) == 65//4


Suggested change

-    @test var(y) == 65//4
+    @test var(y) === 16.25

Technically this qualifies as a breaking change I guess.

True; however, it is not obvious to me whether one or the other is more "right".

Yes, the point is that I'm not sure whether our policy allows changing things like this in a minor release.

Note that this change restores consistency:

julia> y = Any[1, 2, 4, 10//1]; var(y, dims=1)[] === 16.25
true

So maybe calling it "breaking" is too strong.

nalimilan

Triage agreed that overflow should be avoided. Just a few more stylistic comments before merging this PR.

nalimilan

Actually, I forgot that you also need to change the mean function for generic iterators above in the file.

nalimilan · 2020-03-22T18:36:28Z

src/Statistics.jl

@@ -41,7 +41,12 @@ julia> mean(skipmissing([1, missing, 3]))
 2.0
 ```
 """
-mean(itr) = mean(identity, itr)
+function mean(itr)


This change isn't OK:
* `collect(A)` called by `mean_promote` will make a copy, which should be avoided at all costs. So we really need two different methods, one for arrays, and one for general iterators.
* `mean(f, itr)` should do the same promotion as `mean(itr)`. There's no reason it should have a different behavior, and the equivalence between `mean(itr)` and `mean(identity, itr)` seems essential.

> This change isn't OK:
> * collect(A) called by mean_promote will make a copy, which should be avoided at all costs. So we really need two different methods, one for arrays, and one for general iterators.

I added the call to collect() to count the elements of "skipmissing" arrays.
Isn't there a method to count the elements without making a copy? Having duplicate implementations is bad style and multiplies the possibilities for bugs, when those two implementations go out of sync.

> * `mean(f, itr)` should do the same promotion as `mean(itr)`.

That is something I have to question. With mean(f, itr), the user is explicitly specifying the promotion. Should the function second-guess them in that case?

Anyways, that objection is easy to accommodate by adding an optional argument to _mean_promote() with the default value identity.

> no reason it should have a different behavior, and the equivalence between mean(itr) and mean(identity, itr) seems essential

The second form means a user is telling the function "don't do the accumulator promotion".

Right now, mean() on an empty tuple throws MethodError and this behaviour is enshrined in tests. On the other hand, mean() on a zero-length array returns NaN. That looks inconsistent to me.

> I added the call to collect() to count the elements of "skipmissing" arrays.
> Isn't there a method to count the elements without making a copy? Having duplicate implementations is bad style and multiplies the possibilities for bugs, when those two implementations go out of sync.

Unfortunately that's not possible, as some iterators do not allow going over their elements twice. So you need to count the elements as you process them.
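The single-pass requirement can be sketched with Julia's `iterate` protocol: the element count is updated in the same loop that accumulates the sum, so the iterator is never traversed twice and never copied. The function name `mean_single_pass` and the `ArgumentError` on empty input are assumptions made for this illustration, not the behaviour of the merged code.

```julia
# Hypothetical sketch: one pass over any iterator, counting elements as they are consumed.
function mean_single_pass(f, itr)
    y = iterate(itr)
    y === nothing && throw(ArgumentError("cannot take the mean of an empty iterator"))
    value, state = y
    total = float(f(value))       # promote the first element to floating point
    n = 1
    y = iterate(itr, state)
    while y !== nothing
        value, state = y
        total += f(value)         # accumulate in floating point
        n += 1                    # count while summing, no second pass
        y = iterate(itr, state)
    end
    return total / n
end

from_skipmissing = mean_single_pass(identity, skipmissing([1, missing, 3]))
from_generator = mean_single_pass(abs, (v for v in -3:3))

# A Channel can only be consumed once, so counting up front is not an option.
ch = Channel{Int}(4)
foreach(v -> put!(ch, v), 1:4)
close(ch)
from_channel = mean_single_pass(identity, ch)
```

A `Channel` is a convenient example of an iterator that cannot be traversed twice: once its values have been taken, they are gone.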

> That is something I have to question. With mean(f, itr), the user is explicitly specifying the promotion. Should the function second-guess them in that case?
>
> Anyways, that objection is easy to accommodate by adding an optional argument to _mean_promote() with the default value identity.

Well, the user isn't really specifying the promotion; s/he's merely saying that elements should be transformed first. For example, it's common to call mean(abs, x), and it could be surprising that the result overflows while mean(x) doesn't.
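A small illustration of why a transformation such as `abs` does not remove the need for promotion: the transformed values are still integers, so summing them in the integer type can wrap. The variable names are chosen for this sketch only, and the naive expressions stand in for an unpromoted mean.

```julia
x = [-typemax(Int), typemax(Int)]

plain = sum(x) / length(x)                                # 0.0, the two values cancel
naive_abs = sum(abs, x) / length(x)                       # integer sum wraps, giving -1.0
promoted_abs = sum(v -> float(abs(v)), x) / length(x)     # floating-point sum does not wrap
```

Here the plain sum happens to be safe, but the sum of absolute values overflows unless the accumulator is promoted.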

> Right now, mean() on an empty tuple throws MethodError and this behaviour is enshrined in tests. On the other hand, mean() on a zero-length array returns NaN. That looks inconsistent to me.

That's because an empty array provides the element type, so you know which type of zero to use. But an empty tuple doesn't have any element type, so the only choice we have is to throw an error. Fixing the inconsistency by also throwing an error for arrays would be really annoying.
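The difference between the two empty containers can be checked directly with `eltype`. This is only an illustration of the argument above, not code from the PR.

```julia
empty_array = Int[]
empty_tuple = ()

array_eltype = eltype(empty_array)     # Int: the array still carries its element type
tuple_eltype = eltype(empty_tuple)     # Union{}: there is no type to build a zero from

# With a known element type, a zero of that type divided by a zero length gives NaN.
empty_mean = zero(array_eltype) / length(empty_array)
```

With `Union{}` as the element type there is no sensible zero to start from, which is why an error is the only consistent option for the empty tuple.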

That explanation makes sense, thanks.

I have implemented promotion for generic iterators and unified promotion for AbstractArray in one function. If you take a look at lines 164-165 in Statistics.jl, the commented-out line 165 passes all tests for mean(), but breaks some tests of var().

nalimilan · 2020-03-24T14:23:35Z

Can you add tests for mean(f, A, dims=...) methods?

StefanKarpinski · 2020-04-06T14:03:23Z

That sounds better thought through than what I had come up with but yes, you can break a list into small enough pieces that each one can't overflow and then sum those recursively.
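As a rough sketch of the chunking idea, the following sums each chunk of an `Int32` vector in `Int64` and combines the per-chunk sums in floating point. It is a flat, non-recursive variant, and the function name `chunked_mean` and the default chunk size are assumptions made for this example; nothing like it is part of this PR.

```julia
# Hypothetical sketch of chunked summation for Int32 data.
# Each chunk is small enough that its Int64 sum cannot wrap:
# chunk * 2^31 stays far below 2^63 for the default chunk size.
function chunked_mean(x::AbstractVector{Int32}; chunk::Int = 1 << 20)
    isempty(x) && return NaN
    total = 0.0                                   # chunk sums are combined in Float64
    for start in firstindex(x):chunk:lastindex(x)
        stop = min(start + chunk - 1, lastindex(x))
        partial = Int64(0)                        # fast integer accumulation per chunk
        for i in start:stop
            partial += x[i]
        end
        total += partial
    end
    return total / length(x)
end

big = fill(typemax(Int32), 3)
chunked = chunked_mean(big; chunk = 2)            # tiny chunk size to exercise the loop
```

The inner loop stays in integer arithmetic, which is the part a fast path would try to keep cheap.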

nalimilan · 2020-04-06T14:11:35Z

Actually I hadn't realized you could break arrays in chunks. In practice, is it really necessary? Even for Int32, the condition I wrote in my previous comment works for up to 4e9 entries. We can leave splitting as a future optimization.
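The condition referred to here is not shown in this thread. One plausible form, given purely as an assumption, is a bound on the number of elements that guarantees an `Int64` accumulator cannot wrap for a narrower integer type; the name `fits_in_int64` is hypothetical.

```julia
# Hypothetical check: can n values of type T be summed in Int64 without wrapping?
# The largest magnitude of a T value is typemax(T) + 1 (that is, -typemin(T)).
fits_in_int64(::Type{T}, n::Integer) where {T<:Union{Int8,Int16,Int32}} =
    n <= typemax(Int64) ÷ (Int64(typemax(T)) + 1)

limit_int32 = typemax(Int64) ÷ (Int64(typemax(Int32)) + 1)   # 4294967295, roughly 4e9
ok_small = fits_in_int64(Int32, 4_000_000_000)
too_large = fits_in_int64(Int32, 5_000_000_000)
```

Under this assumed bound, the `Int32` limit works out to about 4.29e9 elements, consistent with the "up to 4e9 entries" figure quoted in the comment.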

stevengj · 2020-04-08T15:37:38Z

I'm skeptical that trying to optimize the small-integer case is worth the effort (= code complexity, fragility). For what real application is mean the performance-limiting step, such that a constant-factor slowdown is a big deal?

kagalenko-m-b · 2020-04-08T15:40:26Z

That is my feeling on this matter, as well.

nalimilan · 2020-04-08T15:44:07Z

No idea. But adding that fast path to avoid a performance regression isn't very costly either. Also, you never know what kind of operation people are going to benchmark against other languages.

stevengj · 2020-04-08T16:04:24Z

> But adding that fast path to avoid a performance regression isn't very costly either.

It is if you get it wrong, and the probability of that increases as you try to make the fast path more and more complicated in order to make it more and more general. And all code has a maintenance cost.

nalimilan · 2020-04-10T10:10:32Z

@stevengj Are you OK with the PR as it is then?

nalimilan · 2020-04-13T17:20:29Z

@StefanKarpinski Your call?

StefanKarpinski · 2020-04-13T18:27:13Z

I think this is not my call, but if you and @stevengj are happy with this, I say go ahead.

nalimilan · 2020-04-14T09:27:20Z

OK. Merging then. Adding fast paths for arrays of small integers can be discussed later.

@kagalenko-m-b Thanks for finishing this!

kagalenko-m-b · 2020-04-14T11:03:56Z

Took longer than I expected, but in the end we improved the original version considerably.

Solve the overflow in mean() on integers by promoting accumulator to float

72e84b1

nalimilan reviewed Mar 1, 2020

Avoid trying to extend Base.promote()

5c652da

stevengj reviewed Mar 3, 2020

nalimilan reviewed Mar 4, 2020

kagalenko-m-b and others added 2 commits March 4, 2020 11:58

Update src/Statistics.jl according to maintainer's suggestions

0143bb3

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Apply suggestions from code review

73756cd

Suggestions from code review Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

nalimilan added the triage label Mar 4, 2020

nalimilan reviewed Mar 20, 2020

kagalenko-m-b and others added 2 commits March 20, 2020 17:54

Apply suggestions from code review

ac17e43

Code style fixes suggested by reivewer Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Syntax change to avoid a spurious error

3290a2b

nalimilan requested review from stevengj and StefanKarpinski March 20, 2020 15:16

nalimilan removed the triage label Mar 20, 2020

nalimilan approved these changes Mar 20, 2020

nalimilan requested changes Mar 21, 2020

Implement accumulator promotion for generic iterators

4d36284

kagalenko-m-b requested a review from nalimilan March 21, 2020 21:47

Summation with promoted accumulator in a single function

116476f

nalimilan reviewed Mar 22, 2020

kagalenko-m-b added 3 commits March 23, 2020 17:02

Unify implementation of promotion for AbstractArray

8e62e92

Promote accumulator for generic iterators

851a79d

Merge branch 'feature_request'

82da02e

kagalenko-m-b requested a review from nalimilan March 23, 2020 16:42

nalimilan reviewed Mar 23, 2020

Corrected promotion for iterators

92a483c

kagalenko-m-b requested a review from nalimilan March 24, 2020 12:50

nalimilan reviewed Mar 24, 2020

Unified implementations with and without 'dims' keyword

2d61356

nalimilan reviewed Apr 10, 2020

Update src/Statistics.jl

a522add

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

stevengj reviewed Apr 10, 2020

nalimilan reviewed Apr 11, 2020

Avoid creation of intermediate array

53962d2

nalimilan reviewed Apr 12, 2020

Test for return type on empty iterators

792f3f9

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

nalimilan approved these changes Apr 12, 2020

Use + instead of add_sum

a9030d2

stevengj approved these changes Apr 13, 2020

nalimilan merged commit 97c743d into JuliaStats:master Apr 14, 2020

stevengj mentioned this pull request May 1, 2020

Integer overflow in mean() #22

Closed

bkamins mentioned this pull request May 8, 2020

Problems in groupreduce_init JuliaData/DataFrames.jl#2241

Closed

kimikage mentioned this pull request Jun 21, 2020

Fix reductions in Statistics JuliaMath/FixedPointNumbers.jl#183

Merged

nalimilan mentioned this pull request Aug 9, 2020

Statistic.mean(f, A) calls f length(A)+1 times #49

Open

nalimilan mentioned this pull request Apr 22, 2021

Improve the performance of describe() in the case of missing values. JuliaData/DataFrames.jl#2731

Open

maleadt mentioned this pull request May 4, 2021

mean can overflow on integer inputs JuliaGPU/CUDA.jl#885

Open

stevengj mentioned this pull request Jun 19, 2023

Mean overflows when using smaller types (e.g. Float16) #140

Open

kimikage mentioned this pull request Apr 7, 2024

Public API corresponding to _mean_promote #165

Open
