Inconsistent results when calculating sums on float32 arrays w/ bottleneck installed #2370
Comments
After #2236, I am actually not sure that automatically casting to float64 or switching to the numpy functions is the correct path. My proposal is to make these methods more explicit, e.g. supporting …
Perhaps we could make it possible to set the ops engine (to either numpy or bottleneck) and the dtype (…).
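Neither knob exists in xarray as an explicit keyword today, so purely as a hypothetical sketch of the idea, a user-level helper with explicit `engine` and `dtype` arguments might look like this (the `engine` keyword is invented for illustration, not an xarray parameter):

```python
import numpy as np

try:
    import bottleneck as bn
except ImportError:
    bn = None


def nansum(values, engine="numpy", dtype=None):
    """NaN-skipping sum with an explicit engine and accumulation dtype.

    engine: "numpy" or "bottleneck" -- a hypothetical spelling of the proposal.
    dtype:  accumulation dtype, e.g. np.float64 to avoid float32 round-off.
    """
    values = np.asarray(values)
    if engine == "bottleneck":
        if bn is None:
            raise ValueError("bottleneck requested but not installed")
        # bottleneck accumulates in the input dtype, so cast first if asked.
        if dtype is not None:
            values = values.astype(dtype)
        return bn.nansum(values)
    return np.nansum(values, dtype=dtype)
```

Usage would then be explicit, e.g. `nansum(da.values, engine="numpy", dtype=np.float64)`.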
I didn't notice that.
I think this is a reasonable option. Personally, I think we could consider stopping the use of bottleneck entirely, or making it completely optional.
There has been discussion about changing the conda-forge dependencies for xarray: conda-forge/xarray-feedstock#5. Bottleneck definitely isn't a true required dependency. Does it work to simply specify an explicit dtype in the sum? I also wonder if it's really worth the hassle of using bottleneck here, given these numerical precision issues and how it can't be used with dask. But I do think it still probably offers a meaningful speedup in many cases...
Yes. If the original array is in np.float32 and we specify …
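The rest of that comment is cut off above; my reading is that it refers to passing an explicit `dtype` to the reduction, as asked in the previous comment. A minimal sketch of that (the array here is made up):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.full(10_000_000, 0.1, dtype=np.float32))

da.sum()                  # may lose precision via bottleneck's float32 nansum
da.sum(dtype=np.float64)  # accumulates in float64 and sidesteps the issue
```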
How about making the numpy functions the default, and using bottleneck only when it is specified explicitly?
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the `stale` label.
Code Sample, a copy-pastable example if possible
Data file used is here: test.nc.zip
Output from each statement is commented out.
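The code block itself did not come through above; a reconstruction of the kind of snippet described (the variable name `data` is a guess, and the commented observations paraphrase the problem description rather than the actual numbers):

```python
import xarray as xr

ds = xr.open_dataset("test.nc")   # float32 data file attached above
da = ds["data"]                   # variable name is a placeholder

print(da.min().values)   # lower bound of the data
print(da.max().values)   # upper bound of the data
print(da.mean().values)  # with bottleneck installed: falls outside [min, max]
print(da.std().values)   # with bottleneck installed: ~two orders of magnitude too high
```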
Problem description
As you can see above, the mean falls outside the range of the data, and the standard deviation is nearly two orders of magnitude higher than it should be. This is because a significant loss of precision occurs when using bottleneck's `nansum()` on data with a `float32` dtype. I demonstrated this effect here: pydata/bottleneck#193. Naturally, this means that converting the data to `float64` or any `int` dtype will give the correct result, as will using numpy's built-in functions instead or uninstalling bottleneck. An example is shown below.

Expected Output
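The example itself was lost above; a sketch of the workarounds described (same placeholder variable name), each of which yields the expected statistics:

```python
import numpy as np
import xarray as xr

da = xr.open_dataset("test.nc")["data"]   # placeholder variable name

# 1. Convert to float64 (or an int dtype) before reducing
da.astype(np.float64).mean()
da.astype(np.float64).std()

# 2. Request a float64 accumulator in the reduction itself
da.mean(dtype=np.float64)

# 3. Use numpy's built-in functions directly, bypassing bottleneck
np.nanmean(da.values)
np.nanstd(da.values)
```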
Output of `xr.show_versions()`
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.0
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: None
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 10.0.1
conda: None
pytest: None
IPython: 6.5.0
sphinx: None
Unfortunately this will probably not be fixed downstream anytime soon, so I think it would be nice if xarray provided some sort of automatic workaround for this rather than having to remember to manually convert my data if it's `float32`. I am thinking making `float64` the default (as discussed in #2304) would be nice, but perhaps it might also be good if there was at least a warning whenever bottleneck's `nansum()` is used on `float32` arrays.
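As a purely user-side illustration of the warning idea (none of this is xarray API), one could wrap the reductions like this:

```python
import warnings
import numpy as np


def reduce_with_warning(da, method, **kwargs):
    """Call an xarray reduction by name, warning when float32 data could
    hit bottleneck's low-precision nansum path. Illustrative only."""
    if da.dtype == np.float32 and "dtype" not in kwargs:
        warnings.warn(
            "Reducing float32 data; with bottleneck installed this can lose "
            "precision -- consider passing dtype=np.float64.",
            stacklevel=2,
        )
    return getattr(da, method)(**kwargs)


# e.g. reduce_with_warning(da, "mean")                   -> warns
#      reduce_with_warning(da, "mean", dtype=np.float64) -> no warning
```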