Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shoudn't assert_allclose transpose datasets? #5733

Closed
jbusecke opened this issue Aug 23, 2021 · 16 comments · Fixed by #8991
Closed

Shoudn't assert_allclose transpose datasets? #5733

jbusecke opened this issue Aug 23, 2021 · 16 comments · Fixed by #8991

Comments

@jbusecke
Copy link
Contributor

I am trying to compare two datasets, one of which has possibly transposed dimensions on a data variable.

import xarray as xr
import numpy as np
data = np.random.rand(4,6)
da = xr.DataArray(data, dims=['x','y'])
ds1 = xr.Dataset({'data':da})
ds2 = xr.Dataset({'data':da}).transpose('y','x')

What happened:
In my mind this should pass

xr.testing.assert_allclose(ds1, ds2)

but instead it fails

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-7-58cd53174a1e> in <module>
----> 1 xr.testing.assert_allclose(ds1, ds2)

    [... skipping hidden 1 frame]

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/testing.py in assert_allclose(a, b, rtol, atol, decode_bytes)
    169             a.variables, b.variables, compat=compat_variable
    170         )
--> 171         assert allclose, formatting.diff_dataset_repr(a, b, compat=equiv)
    172     else:
    173         raise TypeError("{} not supported by assertion comparison".format(type(a)))

AssertionError: Left and right Dataset objects are not close


Differing data variables:
L   data     (x, y) float64 0.8589 0.09264 0.0264 ... 0.1039 0.3685 0.3983
R   data     (y, x) float64 0.8589 0.8792 0.8433 0.6952 ... 0.3664 0.2214 0.3983

Simply transposing ds2 to the same dimensions of ds1 fixes this (since the data is the same after all)

xr.testing.assert_allclose(ds1, ds2.transpose('x','y'))

Since most of the other xarray operations are 'transpose-safe' ((ds1+ds2) = (ds1 + ds2.transpose('x','y')), shouldnt this one be too?

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.109+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.2
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.7.1
cftime: 1.4.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.2.2
cfgrib: 0.9.9.0
iris: None
bottleneck: 1.3.2
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: 3.4.1
cartopy: 0.19.0
seaborn: None
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 20.3.4
conda: None
pytest: None
IPython: 7.22.0
sphinx: None

@max-sixty
Copy link
Collaborator

I agree this is another station on the journey between "byte identical" and "practically similar", which we've discovered over the past couple of years... :)

I would lean towards having it fail for assert_identical, with no view on all_close; definitely open to it coercing transpose.

@benbovy
Copy link
Member

benbovy commented Aug 24, 2021

Is there any operation in xarray where dimension order matters? If yes, I'm not sure if it's a good idea to allow transposed dimension pass assert_allclose or assert_equal. But even otherwise, I'd find it a bit weird for a "numpy equivalent" function to have:

xr.testing.assert_allclose(ds1.data, ds2.data)   # ok
np.testing.assert_allclose(ds1.data.values, ds2.data.values)   # fails

What about a check_dim_order option? Also, it would be useful if information about non-matching dimension order was shown more explicitly in the assertion error message.

@dcherian
Copy link
Contributor

What about a check_dim_order option? Also, it would be useful if information about non-matching dimension order was shown more explicitly in the assertion error message.

This sounds good to me. We should also have a check_attrs kwarg since that's another thing that only identical checks.

@dcherian
Copy link
Contributor

I have a related question

import xarray as xr

da = xr.DataArray([[1, 1, 1], [2, 2, 2]], dims=("x", "y"))

xr.testing.assert_identical(da, da.transpose())
AssertionError: Left and right DataArray objects are not identical
Differing dimensions:
    (x: 2, y: 3) != (y: 3, x: 2)
Differing values:
L
    array([[1, 1, 1],
           [2, 2, 2]])
R
    array([[1, 2],
           [1, 2],
           [1, 2]])

Strictly speaking, the values are different I guess. However I think this error would be clearer if it said that the dimension order was different but the values are equal once the dimensions are transposed.

I.e. we could

if set(a.dims) == set(b.dims):
     a = a.transpose(b.dims)
	 # check values and raise if actually different
else:
    # current behaviour

Is this a good idea?

@jbusecke
Copy link
Contributor Author

Strictly speaking, the values are different I guess. However I think this error would be clearer if it said that the dimension order was different but the values are equal once the dimensions are transposed.

I guess this comes down a bit to a philosophical question related to @benbovy s comment above. You can either make this operation be similar to the numpy equivalent (with some more xarray specific checks) or it can check whether the values at a certain combo of labels are the same/close.

The latter would be the way I think about data in xarray as a user. To me the removal of axis logic (via labels) is one of the biggest draws for myself, but importantly I also pitch this as one of the big reasons to switch to xarray for beginners.

I would argue that a 'strict' (numpy style) comparision is less practical in a scientific workflow and we do have the numpy implementation to achieve that functionality.
So I would ultimately argue that xarray should check closeness between values at certain label positions by default.

However, this might be very opinionated on my end, and a better error message would already be a massive improvement.

@TomNicholas
Copy link
Member

I think we definitely need a flag, because we have two conflicting use cases:

  1. Developers who use asserts in test suites and might well care about differences like
xr.testing.assert_allclose(ds1.data, ds2.data)   # ok
np.testing.assert_allclose(ds1.data.values, ds2.data.values)   # fails
  1. Users for whom xarray allows to forget about dimension order, and would expect transposed data to be equal.

The easiest way to cover both is to have an option to switch between them.

In my opinion the only question is what the default comparison behaviour should be. I like the idea of adding a check_dim_order kwarg to assert_equal, assert_allclose, and assert_identical, but have the flag default to False for the first two functions and True for the last one.

@dcherian
Copy link
Contributor

what the default comparison behaviour should be

I don't think we can change this because it's very backwards incompatible and affects tests in downstream packages.

But 👍 to adding a flag allowing users to opt out of dimension order checking.

@benbovy
Copy link
Member

benbovy commented Aug 30, 2021

@jbusecke I agree with your point of view that "xarray-style" comparison is more practical in a scientific workflow. Especially if dimension order is irrelevant for most (all?) xarray operations, ignoring the order for assert_allclose / assert_equal too makes sense and is consistent.

However, it might be also dangerous / harmful if the workflow includes data conversion between labeled vs. unlabelled formats. There's a risk of checking for equality with xarray, then later converting to numpy and assuming that arrays are equal without feeling the need to check again. If dimension sizes are the same this might lead to very subtle bugs.

Since it is easy to ignore or forget about default values, having a check_dim_order option that defaults to True is probably safer IMHO, although slightly less convenient. No strong views, though.

@dcherian I like your idea but I'm not sure what's best between your code snippet and checking equality of aligned dimensions datasets only if non-dimension-aligned are not equal.

@mjwillson
Copy link

+1 for a check_dim_order option to .equals, assert_equal that can be disabled. (Ideally I think the default would be not to check dim order, but that ship has sailed now).

Or failing that, it would at least be nice to have xarray.testing.assert_equal_modulo_dim_order etc. When writing tests I usually don't care about dimension order and it's frustrating to have to manually do e.g.
xarray.testing.assert_allclose(a, b.transpose(a.dims)).

As pointed out, most of the xarray API is dimension-order-invariant and so it's odd to have no supported way to do comparisons in a dimension-order-invariant way.

@ignamv
Copy link
Contributor

ignamv commented Apr 26, 2024

My data point: I was writing tests for my libraries and I was puzzled for some time until I realized xarray.assert_allclose doesn't transpose. I think transposing by default makes more sense, but at the very least it should have an option. The current error where it complains about dimension order and data was not clear to me. I switched to assert abs(expected/actual-1).max().item() < 1e-4 because of this.

There's a risk of checking for equality with xarray, then later converting to numpy and assuming that arrays are equal without feeling the need to check again

Don't all conversions to numpy have to keep dimension order in mind?

@dcherian
Copy link
Contributor

We would happily take a PR to implement a keyword-only check_dim_order and/or check_attrs options on assert_identical, assert_equals, and assert_allclose.

@ignamv
Copy link
Contributor

ignamv commented Apr 27, 2024

Awesome, I'll work on it.

@headtr1ck
Copy link
Collaborator

Would it make sense to extend this to broadcastable in general, so allow missing dimensions etc.?

@max-sixty
Copy link
Collaborator

Would it make sense to extend this to broadcastable in general, so allow missing dimensions etc.?

To be this would be too lax — it does matter if there are missing dims — they will have a different impact / repr etc...

@ignamv
Copy link
Contributor

ignamv commented May 1, 2024

My changes so far mostly keep the old behavior unless the user specifies check_dims="transpose". The difference is that, if the dimensions don't match (either strictly for "strict" or transposed for "transpose"), then the assertion message will not have the whole diff but rather just the mismatched dimensions. Is this OK?

>>> a = xarray.DataArray(np.arange(6).reshape((2,3)), dims=['x','y'])
>>> b = a.transpose()
>>> xarray.testing.assert_equal(a,b)
AssertionError: Dimensions differ: ('x', 'y') ('y', 'x')

I still need to look into Datatrees (first time I've heard about them) so I know what to do about them.

@TomNicholas
Copy link
Member

Thanks for working on this @ignamv ! I've added some comments to your PR.

the assertion message will not have the whole diff but rather just the mismatched dimensions. Is this OK?

Yes I think that is fine!

I still need to look into Datatrees (first time I've heard about them) so I know what to do about them.

You should just be able to copy the pattern you'll see in the assert functions already. Happy to help if it's not clear.

max-sixty added a commit that referenced this issue May 5, 2024
…#5733) (#8991)

* Add argument check_dims to assert_allclose to allow transposed inputs

* Update whats-new.rst

* Add `check_dims` argument to assert_equal and assert_identical + tests

* Assert that dimensions match before transposing or comparing values

* Add docstring for check_dims to assert_equal and assert_identical

* Update doc/whats-new.rst

Co-authored-by: Tom Nicholas <tom@cworthy.org>

* Undo fat finger

Co-authored-by: Tom Nicholas <tom@cworthy.org>

* Add attribution to whats-new.rst

* Replace check_dims with bool argument check_dim_order, rename align_dims to maybe_transpose_dims

* Remove left-over half-made test

* Remove check_dim_order argument from assert_identical

* assert_allclose/equal: emit full diff if dimensions don't match

* Rename check_dim_order test, test Dataset with different dim orders

* Update whats-new.rst

* Hide maybe_transpose_dims from Pytest traceback

Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>

* Ignore mypy error due to missing functools.partial.__name__

---------

Co-authored-by: Tom Nicholas <tom@cworthy.org>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
andersy005 pushed a commit that referenced this issue May 10, 2024
…#5733) (#8991)

* Add argument check_dims to assert_allclose to allow transposed inputs

* Update whats-new.rst

* Add `check_dims` argument to assert_equal and assert_identical + tests

* Assert that dimensions match before transposing or comparing values

* Add docstring for check_dims to assert_equal and assert_identical

* Update doc/whats-new.rst

Co-authored-by: Tom Nicholas <tom@cworthy.org>

* Undo fat finger

Co-authored-by: Tom Nicholas <tom@cworthy.org>

* Add attribution to whats-new.rst

* Replace check_dims with bool argument check_dim_order, rename align_dims to maybe_transpose_dims

* Remove left-over half-made test

* Remove check_dim_order argument from assert_identical

* assert_allclose/equal: emit full diff if dimensions don't match

* Rename check_dim_order test, test Dataset with different dim orders

* Update whats-new.rst

* Hide maybe_transpose_dims from Pytest traceback

Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>

* Ignore mypy error due to missing functools.partial.__name__

---------

Co-authored-by: Tom Nicholas <tom@cworthy.org>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
andersy005 added a commit that referenced this issue May 10, 2024
* main:
  Avoid auto creation of indexes in concat (#8872)
  Fix benchmark CI (#9013)
  Avoid extra read from disk when creating Pandas Index. (#8893)
  Add a benchmark to monitor performance for large dataset indexing (#9012)
  Zarr: Optimize `region="auto"` detection (#8997)
  Trigger CI only if code files are modified. (#9006)
  Fix for ruff 0.4.3 (#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (#9004)
  Speed up localize (#8536)
  Simplify fast path (#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (#5733) (#8991)
  Fix syntax error in test related to cupy (#9000)
andersy005 added a commit to hmaarrfk/xarray that referenced this issue May 10, 2024
* backend-indexing:
  Trigger CI only if code files are modified. (pydata#9006)
  Enable explicit use of key tuples (instead of *Indexer objects) in indexing adapters and explicitly indexed arrays (pydata#8870)
  add `.oindex` and `.vindex` to `BackendArray` (pydata#8885)
  temporary enable CI triggers on feature branch
  Avoid auto creation of indexes in concat (pydata#8872)
  Fix benchmark CI (pydata#9013)
  Avoid extra read from disk when creating Pandas Index. (pydata#8893)
  Add a benchmark to monitor performance for large dataset indexing (pydata#9012)
  Zarr: Optimize `region="auto"` detection (pydata#8997)
  Trigger CI only if code files are modified. (pydata#9006)
  Fix for ruff 0.4.3 (pydata#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (pydata#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (pydata#9004)
  Speed up localize (pydata#8536)
  Simplify fast path (pydata#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (pydata#5733) (pydata#8991)
  Fix syntax error in test related to cupy (pydata#9000)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants