-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Solve the overflow in mean() on integers by promoting accumulator #25
Conversation
test/runtests.jl
Outdated
@@ -710,7 +724,7 @@ end | |||
x = Any[1, 2, 4, 10] | |||
y = Any[1, 2, 4, 10//1] | |||
@test var(x) === 16.25 | |||
@test var(y) === 65//4 | |||
@test var(y) == 65//4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@test var(y) == 65//4 | |
@test var(y) === 16.25 |
Technically this qualifies as a breaking change I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, however it is not obvious to me whether one or the other is more "right"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the point is that I'm not sure whether our policy allows changing things like this in a minor release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this change restores consistency
julia> y = Any[1, 2, 4, 10//1];var(y,dims=1)[]===16.25
true
So maybe calling it "breaking" is too strong.
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Suggestions from code review Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Triage agreed that overflow should be avoided. Just a few more stylistic comments before merging this PR.
Code style fixes suggested by reivewer Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I forgot that you also need to change the mean
function for generic iterators above in the file.
src/Statistics.jl
Outdated
@@ -41,7 +41,12 @@ julia> mean(skipmissing([1, missing, 3])) | |||
2.0 | |||
``` | |||
""" | |||
mean(itr) = mean(identity, itr) | |||
function mean(itr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change isn't OK:
collect(A)
called bymean_promote
will make a copy, which should be avoided at all costs. So we really need two different methods, one for arrays, and one for general iterators.mean(f, itr)
should do the same promotion asmean(itr)
. There's no reason it should have a different behavior, and the equivalence betweenmean(itr)
andmean(identity, itr)
seems essential.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change isn't OK:
*collect(A)
called bymean_promote
will make a copy, which should be avoided at all costs. So we really need two different methods, one for arrays, and one for general iterators.
I added the call to collect()
to count the elements of "skipmissing" arrays.
Isn't there a method to count the elements without making a copy? Having duplicate implementations is bad style and multiplies the possibilities for bugs, when those two implementations go out of sync.
* `mean(f, itr)` should do the same promotion as `mean(itr)`.
That is something I have to question. mean(f, itr)
is user explicitly specifying the promotion. Should the function be second-guessing him in that case?
Anyways, that objection is easy to accommodate by adding an optional argument to _mean_promote()
with the default value identity
.
no reason it should have a different behavior, and the equivalence between
mean(itr)
andmean(identity, itr)
seems essential
The second form means a user is telling the function "don't do the accumulator promotion".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, mean()
on empty tuple throws MethodError and this behaviour is enshrined in tests. On the other hand, mean()
on zero-length array returns NaN. That looks inconsistent to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the call to
collect()
to count the elements of "skipmissing" arrays.
Isn't there a method to count the elements without making a copy? Having duplicate implementations is bad style and multiplies the possibilities for bugs, when those two implementations go out of sync.
Unfortunately that's not possible, as some iterators do not allow going over their elements twice. So you need to count the elements as you process them.
That is something I have to question.
mean(f, itr)
is user explicitly specifying the promotion. Should the function be second-guessing him in that case?Anyways, that objection is easy to accommodate by adding an optional argument to
_mean_promote()
with the default valueidentity
.
Well the user isn't really specifying the promotion, s/he's merely saying that elements should be transformed first. For example, it's common to call mean(abs, x)
, and it could be surprising that the result overflows while mean(x)
doesn't.
Right now,
mean()
on empty tuple throws MethodError and this behaviour is enshrined in tests. On the other hand,mean()
on zero-length array returns NaN. That looks inconsistent to me.
That's because an empty array provides the element type so you know what's the type of zero that you should use. But an empty tuple doesn't have any element type, so the only choice we have is throw an error. Fixing the inconsistency by also throwing an error for arrays would be really annoying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That explanation makes sense, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented promotion for generic iterators and unified promotion for AbstractArray in one function. If you take a look at the lines #164-165 in Statistics.jl, the commented out line 165 passes all tests for mean()
, but breaks some tests of var()
.
Can you add tests for |
That sounds better thought through that what I had come up with but yes, you can break a list into small enough pieces that each one can't overflow and then sum those recursively. |
Actually I hadn't realized you could break arrays in chunks. In practice, is it really necessary? Even for |
I'm skeptical that trying to optimize the small-integer case is worth the effort (= code complexity, fragility). For what real application is |
That is my feeling on this matter, as well. |
No idea. But adding that fast path to avoid a performance regression isn't very costly either. Also, you never know what kind of operation people are going to benchmark against other languages. |
It is if you get it wrong, and the probability of that increases as you try to make the fast path more and more complicated in order to make it more and more general. And all code has a maintenance cost. |
@stevengj Are you OK with the PR as it is then? |
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@StefanKarpinski Your call? |
I think this is not my call, but if you and @stevengj are happy with this, I say go ahead. |
OK. Merging then. Adding fast paths for arrays of small integers can be discussed later. @kagalenko-m-b Thanks for finishing this! |
Took longer than I expected, but in the end we much improved the original version. |
This PR corrects the issue #22, the bug in mean() on integer inputs.
Implementation and tests are largely as proposed by @stevengj