-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve array interface of Raster
for consistency with np.ma.masked_array
operations
#302
Improve array interface of Raster
for consistency with np.ma.masked_array
operations
#302
Conversation
…or reflective arithmetic operations
geoutils/georaster/raster.py
Outdated
if ufunc.nout == 1: | ||
return self.__class__( | ||
{ | ||
"data": getattr(ufunc, method)(inputs[0].data, **kwargs), # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you define the type of the output (I assume np.ndarray
) rather than ignoring mypy complain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type of the output is already defined for __array_ufunc__
and __array_function__
in the function definition.
Mypy still complains about something else specific to those lines, I think it's the indexing of inputs
or something similar. I couldn't figure it out so I silenced it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from a few minor comments, it looks all good. Thanks for fixing this !!
Just of curiosity, did you check how this is done in pandas?
I checked how this is recommended by NumPy (https://numpy.org/doc/stable/user/basics.dispatch.html#basics-dispatch), which is how it's typically done for dask arrays or cupy arrays (pandas does not encapsulate np.ndarray, so it's not exactly the same use cases). |
Ok cool, it's nicely explained on the NumPy doc indeed ! |
Summary
This PR adds a nodata-aware array interface to work with
Raster
objects and their subclasses. All of the ~90 universal functions of NumPy are now supported (see ufuncs: https://numpy.org/doc/stable/reference/ufuncs.html), and also ~25 NumPy array functions defined on a per-need basis (such as np.min, np.max, np.mean, np.median, np.std, np.var, np.percentile, np.quantile). All these functions are wrapped in a way that accounts for possiblenodata
inRaster.data
.From
__array_interface__
to__array_ufunc__
and__array_function__
In #210, we added
__array_interface__
as a property to perform operations onRaster
classes as if on array. However, this interface points to the array of the masked array.data.data
, and therefore gives inconsistent results depending onnodata
values.For example:
This PR tries to expand the addition of @erikmannerfelt to apply to the masked array (Note: there is a
mask
key in the dict of__array_interface__
, but I didn't understand how this is supposed to work, see https://numpy.org/doc/stable/reference/arrays.interface.html).Instead, this PR uses
__array__
,__array_ufunc__
and__array_function__
methods of NumPy, that allow to define the comportement of theRaster
class when functions are applied to it, respectively,np.array()
, any universal functionsufunc
, and any array functionfunction
. See https://numpy.org/doc/stable/user/basics.dispatch.html#basics-dispatch.Choosing the behaviour of NumPy functions on
Raster
objectsImportant note to start: I did not find a way to define the comportement of
np.ma
functions (which are not covered by__array_function__
of NumPy, and no equivalent seem to exists in thema
module). So this PR only concerns NumPy functions with the formatnp.function
.For defining behaviour, I thought the best option would be to enforce the exact same behaviour as when applying to the masked array
.data
. NumPy function generally account for masks even if not called fromnp.ma
. However, there are a couple exceptions that ignore the mask (raising aUserWarning
), those are:np.median
np.quantile
np.percentile
And so, because of those three, I thought preferable to make the behaviour of NumPy functions on
Raster
always nodata-aware. This means that the behaviour is:Raster.data
. Then, if the function preserves shape (all ufuncs), the output masked array is encapsulated in aRaster
object.np.ma.median
to be nodata-aware and cast on the masked array.data
,np.ma
function exists, so the NumPy function is left the same and cast on valid values.data.data[~.data.mask]
Seeing how NumPy has made most classic functions outside of
np.ma
account for mask, it could be only a matter of years until they addnp.median
,np.percentile
andnp.quantile
. Then the behaviour of our array interface would become fully consistent without the need for specific cases.For documentation
In the documentation (once we set it up a bit better), we would recommend three options for array calculations, that always remain fully consistent:
np.function
to theRaster
object,np.ma.function
to the masked array.data
of theRaster
object,np.nanfunction
toRaster.get_nanarray()
once implemented (see Add aget_nan_array
method to the Raster class. #277).Tests
A specific test class
TestArrayInterface
is added.It tests all combinations of:
np.function
(all ufuncs and supported array funcs),dtype
supported byRaster
(possibly two loops for two inputs),nodata
, either None or the default value of adtype
(possibly two nodata for two inputs),which are applied to
Raster
object with a 5x5 masked array with random values of thedtype
, and with a mask containing at least oneTrue
and oneFalse
value.For all combinations, the test check that the output of the NumPy function applied to
Raster.data
is fully consistent with the same function applied toRaster
, at the exception of the special case of "median", "percentile" and "quantile" checked separately.Additionally, the PR checks that NumPy
TypeError
orValueError
are always raised through application toRaster
when supposed to, i.e. for a wrong combination of inputdtypes
, and that this error behaviour is fully consistent with the one for the function being applied toRaster.data
.The many combinations increases our number of test reported by
pytest
by about 13,500. This is a lot but helps tracing quickly exactly which combination is failing (which occured quite a bit during the drafting phase of the PR). If we feel it's too much, we could move some of these test loops inside thetest_
functions instead of as@pytest.mark.parametrize
loops.Little addition
I have uncommented tests on reflective arithmetic operations with 3D np.ndarrays that use to fail due to #256. I have also added 2D tests and updated the
_overloading_check
function to enable #304.Linked issues
Resolves #261
Resolves #256
Resolves #304