
argmin / argmax behavior doesn't match documentation #1388

Closed
lamorton opened this issue Apr 27, 2017 · 11 comments
@lamorton

lamorton commented Apr 27, 2017

The documentation reads

Returns:
reduced : DataArray
New DataArray object with argmin applied to its data and the indicated dimension(s) removed.

However, what happens is that the numpy argmin output (single index into the flattened array at which the maximum/minimum is found) is wrapped in a dummy DataArray (similar behavior for Datasets also):

[In1]: x = np.linspace(0,1,5)

[In2]: y = np.linspace(0,1,6)

[In3]: z = np.random.randn(5,6)

[In4]: example = xr.DataArray(z, {'x':x,'y':y},('x','y'))

[In5]: print(example.argmin())

[Out5]: <xarray.DataArray ()>

        array(10)

[In6]: example_ds = xr.Dataset(data_vars={'ex':example})

[In7]: print(example_ds.argmin())

[Out7]: <xarray.Dataset>

        Dimensions:  ()

        Data variables:

        ex       int64 10

I realize that maybe for compatibility reasons it is necessary to make Datasets/DataArrays do this, but it seems like the documented behavior would be nicer. At any rate, the documentation should match the behavior.

Specs:
python 2.7
xarray 0.9.1
numpy 1.11.3
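
The reported behavior can be sketched at the NumPy level (this is plain numpy, not the xarray API, to show exactly what the returned flat index means):

```python
import numpy as np

# NumPy-level sketch of the reported behavior: argmin() returns a single
# index into the flattened array, which must be unraveled by hand before
# it says anything about the original dimensions.
z = np.arange(30).reshape(5, 6)[::-1]   # minimum (0) sits at position (4, 0)
flat = int(z.argmin())                  # index into the flattened array
ij = np.unravel_index(flat, z.shape)    # recover per-dimension indices
print(flat, ij)                         # -> 24 (4, 0)
```

For a labeled array, that flat integer is the least useful possible form: the caller has to know the array's shape and memory layout to map it back onto dimensions, let alone coordinates.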

@shoyer
Member

shoyer commented Apr 27, 2017

Agreed, the current implementation of argmin() only gives the correct result when reducing over a single dimension. It's not entirely obvious what argmin() should yield when applied over multiple dimensions, but the current behavior is certainly not very useful. I would prefer raising an error to the current behavior.

@lamorton
Author

Well, xarray at least agrees with numpy's implementation of that function, but that's not to say it is 'correct.' It would be nice if numpy.argmin worked intuitively. That aside, it seems to me that applying min() to an xr.DataArray should return a reduced array with length 1 in each dimension; then you could just query this object to find the coordinate/dimension values. Perhaps argmin() would then return a tuple of axis indices, such that arr[*arr.argmin()] == arr.min() would hold.

The next question is, what happens if you start supplying coordinate/dimension optional arguments to argmin? It doesn't make sense to minimize over a coordinate, so only dimensions should be accepted. This should result in a tuple of lists, the way numpy.where does.

Does that seem reasonable?
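
The proposed invariant can be sketched at the NumPy level with np.unravel_index (just the index arithmetic the proposal relies on, not an xarray API):

```python
import numpy as np

arr = np.array([[3.0, 1.0], [2.0, 0.5]])
# A tuple of per-axis indices, as proposed, instead of one flat index:
idx = np.unravel_index(arr.argmin(), arr.shape)
assert arr[idx] == arr.min()  # the proposed invariant holds
print(idx)                    # -> (1, 1)
```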

@shoyer
Member

shoyer commented Apr 30, 2017

I agree that arr[arr.argmin(dim)] == arr.min(dim) is a useful invariant, but currently xarray's indexing works a little differently from NumPy. Probably arr.isel_points(**arr.argmin(dim)) == arr.min(dim) is the better invariant for now, and in the future arr[arr.argmin(dim)] will change to work consistently (#974).

The main downside of returning a tuple or dict from argmin() is that it makes the common case of taking the max/min over one dimension a little harder. So possibly it would be better to write two methods:

  • argmin would work like it does currently (returning an xarray object), but error if reducing over multiple dimensions.
  • argmin_indices would return a dict suitable for use in indexing.
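
A minimal sketch of the second bullet, assuming a hypothetical `argmin_indices` that reduces over all dimensions and returns a dict keyed by dimension name (the name and return type follow the proposal above; this is not an existing xarray method):

```python
import numpy as np

def argmin_indices(data, dims):
    # Hypothetical helper: map each dimension name to the integer index
    # of the overall minimum (reducing over all dimensions for simplicity).
    flat = int(np.asarray(data).argmin())
    idx = np.unravel_index(flat, np.shape(data))
    return {d: int(i) for d, i in zip(dims, idx)}

z = np.array([[3, 1], [2, 0]])
print(argmin_indices(z, ('x', 'y')))  # -> {'x': 1, 'y': 1}
```

A dict of this shape is exactly what keyword-expansion into an indexing method like isel_points expects.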

@fujiisoup
Member

fujiisoup commented Jun 18, 2017

I'm working on a fix and would like to settle some design decisions:

  1. What should max() look like?
    I guess this method should also work for multi-dimensional data.
    To satisfy the arr.isel_points(**arr.argmin_indices(dim)) == arr.min(dim) relation,
    should the result array have the proper coordinates?

  2. Multiple dim arguments
    Currently, the docs say argmin accepts multiple axes, but np.argmin does not.
    Can we limit argmin's argument to a single str rather than a sequence of strs?

Edit:

  3. Multi-dimensional arrays passed to isel_points
    Currently, isel_points only accepts 1-dimensional arrays, while the result of argmin_indexes can be multi-dimensional, e.g.
xr.DataArray(np.random.randn(4, 3, 2), dims=['x', 'y', 'z']).argmin_indexes(dims=['x'])

Do we need special treatment for this (maybe in isel_points), or should we just raise an error (the current behavior)?

@fujiisoup fujiisoup mentioned this issue Jul 1, 2017
@fujiisoup
Member

I am thinking again about how argmin should work with our new vectorized indexing #1639.
It would be great if arr.isel(**arr.argmin(dim)) == arr.min(dim) could be satisfied even with a multi-dimensional array, although the behavior is different from numpy.argmin.
(Maybe our current min should be replaced by arr.isel(**arr.argmin(dim)) so that it preserves the coordinates.)

(We discussed the name for this new method in #1469 but here I just use argmin for the simplicity.)

For example with a three dimensional array with dims=['x', 'y', 'z'], such as
arr = xr.DataArray(np.random.randn(4, 3, 2), dims=['x', 'y', 'z'])
I am thinking that...

  • arr.argmin() would return a xr.Dataset which contains 'x', 'y', 'z' as its data_vars.
    1. ds = arr.argmin(dims=None) case:
      • ds['x'], ds['y'], ds['z'] would be 0d-integers.
    2. ds = arr.argmin(dims=['x', 'y']) case:
      • ds['x'], ds['y'], ds['z'] would be 1d-integer-arrays.
      • The dimension of these three arrays would be 'z_argmin', where ds['z_argmin'] == arr['z'].
    3. ds = arr.argmin(dims='x') case:
      • ds['x'], ds['y'], ds['z'] would be 2d-integer-arrays.
      • The dimensions of these three arrays are 'y_argmin' and 'z_argmin', where ds['y_argmin'] == arr['y'] and ds['z_argmin'] == arr['z'].

The above proposal for case 2 (and case 3) is not quite clean: if the result is used as an argument to isel, it appends a new coordinate 'z_argmin', which is just a duplicate of arr['z'], i.e.
arr.isel(**arr.argmin(dims=['x', 'y']))['z_argmin'] == arr['z'].
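
Case 2 of the proposal can be sketched at the NumPy level (the shape mirrors the example array above; the reshape/unravel approach is just one way to compute the proposed index arrays, not xarray internals):

```python
import numpy as np

arr = np.random.default_rng(0).standard_normal((4, 3, 2))  # dims x, y, z

# Reduce over ('x', 'y'): for each z, find the flat (x*y) index of the
# minimum, then unravel it into 1d integer arrays of x- and y-indices.
flat = arr.reshape(-1, arr.shape[2]).argmin(axis=0)
x_idx, y_idx = np.unravel_index(flat, arr.shape[:2])

# These 1d index arrays satisfy the isel/min invariant discussed above:
z_idx = np.arange(arr.shape[2])
assert np.array_equal(arr[x_idx, y_idx, z_idx], arr.min(axis=(0, 1)))
```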

Any thoughts are welcome.

@gajomi
Contributor

gajomi commented Feb 2, 2018

I just came across the various argmax/idxmax (and related min) issues recently in a project I have been working on. In addition to agreeing that the docs should be updated where appropriate, here are my two or three cents:

  • As someone new to xarray I like the idea of having both argmax/argmin and argmax_indices/argmin_indices, with the former returning the coordinate indices and the latter the underlying numpy indices, analogous to the numpy.argmax/numpy.argmin functions. This makes migrating from numpy ndarrays plus a collection of associated index arrays obvious (a common path into the xarray world, I think).
  • I can also get that idxmax/idxmin might make a better name given that one can have multi-indexed coordinates. If both argmax and idxmax methods are retained probably good to have docs cross reference.
  • In any case, to respond to @fujiisoup's above proposal, I like the idea of retaining the dimension names in the output, and adding a dimension to hold argmax dims, but think it might make more sense to output a DataArray. By way of example, if I had something like:
size = (2,2,2,2)
dims = list("wxyz")
data = np.random.rand(*size)
coords = {dim: ["{0}_{1}".format(dim, i) for i in range(s)] for dim, s in zip(dims, size)}
da = xr.DataArray(data, dims=dims, coords=coords)

>>>da
<xarray.DataArray (w: 2, x: 2, y: 2, z: 2)>
array([[[[ 0.149945,  0.230338],
         [ 0.626969,  0.299918]],

        [[ 0.351764,  0.286436],
         [ 0.130604,  0.982152]]],


       [[[ 0.262667,  0.950426],
         [ 0.76655 ,  0.681631]],

        [[ 0.635468,  0.735071],
         [ 0.901116,  0.601303]]]])
Coordinates:
  * w        (w) <U3 'w_0' 'w_1'
  * x        (x) <U3 'x_0' 'x_1'
  * y        (y) <U3 'y_0' 'y_1'
  * z        (z) <U3 'z_0' 'z_1'

I would like to get something like the following

>>>argmax(da)
<xarray.DataArray '_argmax' (argmaxdim: 4)>
array(['w_0', 'x_1', 'y_1', 'z_1'],
      dtype='<U3')
Coordinates:
  * argmaxdim  (argmaxdim) <U1 'w' 'x' 'y' 'z'

>>>argmax(da, dim=list("wy"))
<xarray.DataArray '_argmax' (x: 2, z: 2, argmaxdim: 2)>
array([[['w_1', 'y_1'],
        ['w_1', 'y_0']],

       [['w_1', 'y_1'],
        ['w_0', 'y_1']]], dtype=object)
Coordinates:
  * x          (x) object 'x_0' 'x_1'
  * z          (z) object 'z_0' 'z_1'
  * argmaxdim  (argmaxdim) <U1 'w' 'y'

where the dims in the unreduced and argmax dimensions appear in the same order as above. For reference, in case these examples aren't enough to generalize from, here is a horribly inefficient implementation of the above (assuming unique maximum indices):

def _argmaxstackeddim(dastacked, ind):
    keepdims = dastacked.indexes['keepdims'].names
    values = dastacked.keepdims.values[ind]
    coords = {keepdim:[val] for keepdim,val in zip(keepdims,values)}
    result = dastacked.sel(keepdims=values)\
                      .pipe(argmax)\
                      .expand_dims(keepdims)\
                      .assign_coords(**coords)
    return result 

def argmax(da, dim=None):
    daname = "" if da.name is None else da.name
    name = daname+"_argmax"
    if dim is None:
        maxda = da.where(da == da.max(),drop=True)
        dims = list(maxda.dims)
        dimmaxvals = [maxda.coords[dim].values[0] for dim in dims]
        result = xr.DataArray(dimmaxvals,
                              dims='argmaxdim',
                              coords={'argmaxdim':dims},
                              name = name)
        return result
    else:
        if isinstance(dim,str):
            dim = [dim]
        keepdims = [d for d in da.dims if d not in dim]
        dastacked = da.stack(keepdims = keepdims)
        slices = [_argmaxstackeddim(dastacked,i) for i in range(len(dastacked.keepdims))]
        return  xr.merge(slices)[name]

@fujiisoup
Member

fujiisoup commented Feb 4, 2018

@gajomi

Sorry for my late response and thank you for the proposal.

Aside from my previous proposal, I have been thinking about whether such aggregation methods (including argmin) should propagate coordinates.
For example, as you pointed out, in theory we could track the x coordinate at the argmin index after da.argmin(dim='x').
But that is not reasonable for da.mean(dim='x').
It may be reasonable for da.max(dim='x'), but not for da.median(dim='x').

Such specific rules may be confusing and bring additional complexity.
I think the rule
we do not track coordinates after aggregations
would be much simpler and easier to understand.

If we adopt the above rule, I think the argmin would give just an array of indices,

In [1]: import xarray as xr
   ...: da = xr.DataArray([[0, 3, 2], [2, 1, 4]], dims=['x', 'y'],
   ...:                   coords={'x': [1, 2], 'y': ['a', 'b', 'c']})
   ...:

In [4]: da.argmin(dim='x')
Out[4]: 
<xarray.DataArray (y: 3)>
array([0, 1, 0])
Coordinates:
  * y        (y) <U1 'a' 'b' 'c'

In [3]: da.isel(x=da.argmin(dim='x'))
Out[3]: 
<xarray.DataArray (y: 3)>
array([0, 1, 2])
Coordinates:
    x        (y) int64 1 2 1
  * y        (y) <U1 'a' 'b' 'c'

I think your logic would be useful even if we do not track the coordinates.

I would appreciate any feedback.
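
The example above can be checked at the NumPy level (a sketch of the same isel/min invariant, using plain integer arrays in place of the xarray objects):

```python
import numpy as np

da = np.array([[0, 3, 2], [2, 1, 4]])     # dims x, y
idx = da.argmin(axis=0)                   # per-column argmin, like dim='x'
picked = da[idx, np.arange(da.shape[1])]  # value at the argmin of each column
assert np.array_equal(picked, da.min(axis=0))
print(idx, picked)                        # -> [0 1 0] [0 1 2]
```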

@shoyer
Member

shoyer commented Feb 4, 2018

I think it would be fine to add a special case to preserve coordinates corresponding to min/max values with xarray's min() and max() methods, but I don't feel strongly about this. The exact coordinates could be surprising if there are multiple min/max values.

I agree that it does not make sense to preserve coordinates along aggregated dimensions for argmin/argmax, but we can preserve other coordinates. So I like @fujiisoup's example behavior above.

I suppose we now have two candidate APIs for returning multiple indices from a method like argmin/argmax:

  1. Add an additional dimension, e.g., argmaxdim for keeping track of multiple indices.
  2. Return a dict or Dataset with multiple indices.

I think my favorite option is (2) with da.argmin_indices() returning a Dataset, which will allow da[da.argmin_indices()] to work after we finish switching __iter__ to only iterate over data variables (#884 (comment)). One downside of this approach is that it is not obvious how we would define argmin_indices() to work on a Dataset, but that's probably OK given that you cannot use a Dataset (yet) for indexing.

My concern with adding an additional dimension is that it is always a little surprising and error-prone when we invent new dimension names not supplied by the user (for example, this can lead to conflicting names). Also, consolidating indices will not work as well with idxmin(), which might put indices of different dtypes in the same array.

Either way, I would like a separate dedicated method for returning multiple indexing arrays. It's convenient (and what users expect) for argmax to return a single array if taking the max only over one dimension. However, if we switch to add an argmaxdim or return a dict/Dataset for multiple dimensions, then we will end up with an annoying inconsistency between the 1D and N-D versions. It would be better to say argmax(dim) is only for one dimension (and raise an error if this is not true) and have the separate argmax_indices(dims) that is consistently defined for any number of dimensions.

@gajomi
Contributor

gajomi commented Feb 5, 2018

@fujiisoup and @shoyer

Really enlightening comments above. I think I am starting to get the dao of xarray a bit better :)

I was thinking whether such aggregation methods (including argmin) should propagate the coordinate.

Agreed, it would be nice to have a consistent and well-reasoned rule for coordinate propagation in aggregation methods. I think a key point here, which gets brought up in your example, is that it might make sense to have different subrules depending on the semantics of the operation. Functions like argmax are explicitly concerned with the underlying "indices" (dimensions or otherwise) and so may call for different behavior from mean, which is explicitly invariant under permutations of the underlying indices. The max/min/median functions are an interesting case to think about, in that they are also invariant under permutation of the underlying indices, but can be associated with more than one index and do not "destroy" information about the value at those indices.

My concern with adding an additional dimension is that it is always a little surprising and error-prone when we invent new dimension names not supplied by the user (for example, this can lead to conflicting names)

Yeah, I felt a little dirty appending '_argmax'.

I think my favorite option is (2) with da.argmin_indices() returning a Dataset, which will allow da[da.argmin_indices()]

OK. I think I understand now why @fujiisoup proposed output a Dataset rather than an array. That's a natural syntax for getting the values from the indices.

Either way, I would like a separate dedicated method for returning multiple indexing arrays.

+1 to adding dedicated methods if needed, since even if they aren't needed, the associated docs should make sure users are aware of the analogous idx* methods if those get added.

@stale

stale bot commented Jan 6, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Jan 6, 2020
@dcherian dcherian removed the stale label Jan 6, 2020
@dcherian
Copy link
Contributor

dcherian commented Jul 9, 2020

I think this is fixed now, thanks to @johnomotani.

@dcherian dcherian closed this as completed Jul 9, 2020