numpy.mean along multiple axis gives wrong result for large arrays #8869
I can confirm the results are the same on
---
A more alarming case:

```python
In [18]: test_array[:,:,:1].sum(axis=(0,1))
Out[18]: array([ 40000000.], dtype=float32)

In [19]: test_array[:,:,:2].sum(axis=(0,1))
Out[19]: array([ 16777216.,  16777216.], dtype=float32)
```

---
This again boils down to the fact that the more robust summation algorithm is only applied along the trailing axes.

---
That is true: it works for axes (1, 2) but not for (0, 1):

```python
>>> test_array = np.ones((4,10000000,4), dtype=np.float32)
>>> test_array.sum(axis=(0,1))
array([ 16777216.,  16777216.,  16777216.,  16777216.], dtype=float32)
>>> test_array.sum(axis=(1,2))
array([ 40000000.,  40000000.,  40000000.,  40000000.], dtype=float32)
```

---
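A hedged workaround sketch for the failing case above: NumPy reductions accept a `dtype` argument for the accumulator, so requesting float64 accumulation sidesteps the float32 saturation at 2**24. The shape here is an assumption, scaled down from the examples above just enough to cross that threshold:

```python
import numpy as np

n = 2**24 + 16  # just past float32's exact-integer limit of 2**24
a = np.ones((n, 2), dtype=np.float32)

# Summing over the leading axis accumulates one float32 addend at a
# time per column, so the running total sticks at 16777216.0:
wrong = a.sum(axis=0)

# Requesting a float64 accumulator avoids the precision loss without
# changing the input array's dtype:
right = a.sum(axis=0, dtype=np.float64)

print(wrong)  # [16777216. 16777216.]
print(right)  # [16777232. 16777232.]
```

---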
I wonder whether the following is related to this bug?

```python
>>> aa = np.ones((5001 * 5001, 3), dtype=np.float32)
>>> bb = np.ones((5001 * 5001 * 3), dtype=np.float32)
>>> np.sum(aa, axis=0), np.sum(bb, axis=0), 5001 * 5001 * 3
(array([16777216., 16777216., 16777216.], dtype=float32), 75030000.0, 75030003)
```

In the 2nd result, 3 is missing compared with the 3rd result.

```python
for i in range(-10, 10):
    s = (4 * 1024) ** 2 + i
    bb = np.ones((s,), dtype=np.float32)
    res = int(np.sum(bb, axis=0))
    print(res, s, res == s)
```

gives

---
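The flip around (4 * 1024) ** 2 = 2**24 in the loop above is expected: beyond 2**24 float32 can only represent even integers, so even the accurate pairwise 1-D sum must round odd totals. A small sketch of the boundary (the `<= 1` tolerance is the float32 spacing at that magnitude):

```python
import numpy as np

n = 4096 ** 2  # 2**24, the largest integer float32 can count to exactly

exact = int(np.ones(n, dtype=np.float32).sum())        # still exact
rounded = int(np.ones(n + 1, dtype=np.float32).sum())  # 2**24 + 1 is odd: must round

print(exact == n)                   # True
print(abs(rounded - (n + 1)) <= 1)  # True: off by at most one rounding step
```

---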
To aid in merging #22, where we've collapsed the xyz dimensions into an S dimension. Tracing back to a numpy core bug with large arrays: numpy/numpy#8869 (comment)

---
This issue still exists in numpy 1.14.5.

---
Similar issue. Google brought me here. How do we resolve this?
---
Now I think this could be an overflow issue. np.mean very likely sums things up first, and if that sum goes out of the range of np.float32, the result is not accurate anymore. Using float64 might solve this problem in most cases.

---
Would it make sense to change the implementation of np.mean so that it is calculated iteratively over the appropriate dimensions, e.g.
would return the result of 'test_array.mean(axis=0).mean(axis=0)'?

---
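A sketch of what that proposal buys. The shape here is an assumption, chosen so that the combined reduction exceeds 2**24 addends per output element while each single-axis reduction stays far below it:

```python
import numpy as np

a = np.ones((4096, 4100, 2), dtype=np.float32)

# One-shot reduction: 4096 * 4100 = 16_793_600 addends per output
# element, which saturates the float32 running total at 2**24:
one_shot = a.mean(axis=(0, 1))

# Axis-by-axis reduction: 4096 addends, then 4100 addends, each far
# below 2**24, so every partial sum stays exact:
stepwise = a.mean(axis=0).mean(axis=0)

print(one_shot)  # slightly below [1. 1.]
print(stepwise)  # [1. 1.]
```

---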
I implemented the approach described above (pull request). Happy to get some feedback.

---
@eric-wieser already pointed to the correct cause: the summation algorithm is "smart" only along contiguous memory regions. (No overflow here @YubinXie, but catastrophic precision loss.)

smart:
naive:

Please note that
confirming that indeed

The approach proposed by @mproszewska does not solve this problem; it only alleviates it in the special case in which we sum along multiple axes but the sum along a single axis does not suffer from precision loss. In other words, it will not resolve

---
@vfdev-5 I would not call your example a bug: even with a "smart" summation algorithm, some precision loss is to be expected. Indeed, the integer 16777217 has no exact floating-point representation in single precision:
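That representability claim is easy to check directly:

```python
import numpy as np

# float32 has a 24-bit significand, so the spacing between consecutive
# representable values reaches 2 at 2**24; 16777217 = 2**24 + 1 falls
# into that gap and gets rounded to the nearest representable value.
x = np.float32(2**24)
print(x + np.float32(1.0) == x)  # True: the +1 is rounded away
print(np.float32(16777217))      # 16777216.0
```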
---
Will this issue be solved at some point? It was opened in 2017 and it is still relevant. I have just opened a similar issue here: #25909. I saw that when you copy the data, you have a problem with the sum:
However, you don't have such a problem with a single column:

---
The docs say that:
However, in this case this is not true:

---
A mean over an array containing only ones should obviously return only ones. However,
This returns the correct result:
I guess the reason for this problem is some overflow, because it does not appear when I use a test array with dtype=np.float64. I would expect numpy either to give the correct result or at least to issue a warning whenever such an overflow happens.
Surprisingly, the mean along all axes gives the correct result again:
(I used numpy 1.12.1 with Python 3.5 on Ubuntu 16)
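Since the report's code blocks were lost above, here is an assumed reconstruction of the pattern, with a shape chosen to push the per-element addend count past 2**24, plus the `dtype=np.float64` workaround:

```python
import numpy as np

a = np.ones((4200, 4100, 2), dtype=np.float32)

# 4200 * 4100 = 17_220_000 addends per output element: precision loss.
m32 = a.mean(axis=(0, 1))

# Accumulating in float64 restores the exact answer:
m64 = a.mean(axis=(0, 1), dtype=np.float64)

# The mean over *all* axes reduces the contiguous flat array, where
# pairwise summation applies, so it comes out exact as reported:
full = a.mean()

print(m32)   # below [1. 1.]
print(m64)   # [1. 1.]
print(full)  # 1.0
```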