Improve array interface of `Raster` for consistency with `np.ma.masked_array` operations #302

rhugonnet · 2022-09-06T19:16:56Z

Summary

This PR adds a nodata-aware array interface to work with Raster objects and their subclasses. All of the ~90 universal functions of NumPy are now supported (see ufuncs: https://numpy.org/doc/stable/reference/ufuncs.html), and also ~25 NumPy array functions defined on a per-need basis (such as np.min, np.max, np.mean, np.median, np.std, np.var, np.percentile, np.quantile). All these functions are wrapped in a way that accounts for possible nodata in Raster.data.

From `__array_interface` to `array_ufunc` and `array_function__`

In #210, we added __array_interface__ as a property to perform operations on Raster classes as if on array. However, this interface points to the array of the masked array .data.data, and therefore gives inconsistent results depending on nodata values.
For example:

np.min(raster.data) # minimum value is around 340, nodata value is -9999
Out[1]: 341.2

np.min(raster) 
Out[2]: -9999

This PR tries to expand the addition of @erikmannerfelt to apply to the masked array (Note: there is a mask key in the dict of __array_interface__, but I didn't understand how this is supposed to work, see https://numpy.org/doc/stable/reference/arrays.interface.html).
Instead, this PR uses __array__, __array_ufunc__ and __array_function__ methods of NumPy, that allow to define the comportement of the Raster class when functions are applied to it, respectively, np.array(), any universal functions ufunc, and any array function function. See https://numpy.org/doc/stable/user/basics.dispatch.html#basics-dispatch.

Choosing the behaviour of NumPy functions on `Raster` objects

Important note to start: I did not find a way to define the comportement of np.ma functions (which are not covered by __array_function__ of NumPy, and no equivalent seem to exists in the ma module). So this PR only concerns NumPy functions with the format np.function.

For defining behaviour, I thought the best option would be to enforce the exact same behaviour as when applying to the masked array .data. NumPy function generally account for masks even if not called from np.ma. However, there are a couple exceptions that ignore the mask (raising a UserWarning), those are:

np.median
np.quantile
np.percentile

And so, because of those three, I thought preferable to make the behaviour of NumPy functions on Raster always nodata-aware. This means that the behaviour is:

All NumPy functions except the 3 listed above are cast to the masked array Raster.data. Then, if the function preserves shape (all ufuncs), the output masked array is encapsulated in a Raster object.
For "median", the NumPy function is redefined into np.ma.median to be nodata-aware and cast on the masked array .data,
For "quantile" and "percentile", no np.ma function exists, so the NumPy function is left the same and cast on valid values .data.data[~.data.mask]

Seeing how NumPy has made most classic functions outside of np.ma account for mask, it could be only a matter of years until they add np.median, np.percentile and np.quantile. Then the behaviour of our array interface would become fully consistent without the need for specific cases.

For documentation

In the documentation (once we set it up a bit better), we would recommend three options for array calculations, that always remain fully consistent:

Applying any np.function to the Raster object,
Applying any np.ma.function to the masked array .data of the Raster object,
Applying any np.nanfunction to Raster.get_nanarray() once implemented (see Add a get_nan_array method to the Raster class. #277).

Tests

A specific test class TestArrayInterface is added.
It tests all combinations of:

np.function (all ufuncs and supported array funcs),
dtype supported by Raster (possibly two loops for two inputs),
nodata, either None or the default value of a dtype (possibly two nodata for two inputs),

which are applied to Raster object with a 5x5 masked array with random values of the dtype, and with a mask containing at least one True and one False value.

For all combinations, the test check that the output of the NumPy function applied to Raster.data is fully consistent with the same function applied to Raster, at the exception of the special case of "median", "percentile" and "quantile" checked separately.

Additionally, the PR checks that NumPy TypeError or ValueError are always raised through application to Raster when supposed to, i.e. for a wrong combination of input dtypes, and that this error behaviour is fully consistent with the one for the function being applied to Raster.data.

The many combinations increases our number of test reported by pytest by about 13,500. This is a lot but helps tracing quickly exactly which combination is failing (which occured quite a bit during the drafting phase of the PR). If we feel it's too much, we could move some of these test loops inside the test_ functions instead of as @pytest.mark.parametrize loops.

Little addition

I have uncommented tests on reflective arithmetic operations with 3D np.ndarrays that use to fail due to #256. I have also added 2D tests and updated the _overloading_check function to enable #304.

Linked issues

Resolves #261
Resolves #256
Resolves #304

…face

…be changed)

…or reflective arithmetic operations

geoutils/georaster/raster.py

adehecq · 2022-09-12T11:17:03Z

geoutils/georaster/raster.py

+            if ufunc.nout == 1:
+                return self.__class__(
+                    {
+                        "data": getattr(ufunc, method)(inputs[0].data, **kwargs),  # type: ignore


Can't you define the type of the output (I assume np.ndarray) rather than ignoring mypy complain?

The type of the output is already defined for __array_ufunc__ and __array_function__ in the function definition.

Mypy still complains about something else specific to those lines, I think it's the indexing of inputs or something similar. I couldn't figure it out so I silenced it.

geoutils/georaster/raster.py

tests/test_georaster.py

adehecq

Apart from a few minor comments, it looks all good. Thanks for fixing this !!
Just of curiosity, did you check how this is done in pandas?

rhugonnet · 2022-09-13T06:58:40Z

I checked how this is recommended by NumPy (https://numpy.org/doc/stable/user/basics.dispatch.html#basics-dispatch), which is how it's typically done for dask arrays or cupy arrays (pandas does not encapsulate np.ndarray, so it's not exactly the same use cases).

adehecq · 2022-09-13T07:17:34Z

I checked how this is recommended by NumPy (https://numpy.org/doc/stable/user/basics.dispatch.html#basics-dispatch), which is how it's typically done for dask arrays or cupy arrays (pandas does not encapsulate np.ndarray, so it's not exactly the same use cases).

Ok cool, it's nicely explained on the NumPy doc indeed !

Incremental commit on array interface

d90bfa8

rhugonnet marked this pull request as draft September 6, 2022 19:17

rhugonnet added 12 commits September 6, 2022 22:46

Start tests

147db67

Incremental commit on array interface

edb967c

Replace float128 by longdouble

befc4a8

Merge remote-tracking branch 'upstream/main' into improve_array_inter…

a837901

…face

Incremental commit on array interface

9a150cb

Finalization of __array_ufunc__ interface and tests

674750f

Add a mask to __array_ufunc__ tests

4984d56

Increment commit on array functions

587e571

Finalize array functions and linting

3a7db58

Add regex ignore codespell on nin (which is a numpy variable, cannot …

7ed901c

…be changed)

Update descriptions

4d84a72

Making black and isort happy simultaneously

79e3bf6

rhugonnet marked this pull request as ready for review September 8, 2022 08:19

This was referenced Sep 8, 2022

Numpy representation of Rasters causes issue with arithmetic overloading #256

Closed

Asymmetry of arithmetic operations when broadcasting between 2D and 3D arrays #304

Closed

rhugonnet requested review from adehecq and erikmannerfelt September 8, 2022 11:43

rhugonnet mentioned this pull request Sep 10, 2022

Fix tests with GeoUtils data.setter and __array_interface__ update GlacioHack/xdem#301

Merged

Uncomment reflectivity test with np.ndarray and enable 2D broadcast f…

0a1b551

…or reflective arithmetic operations