Shoudn't `assert_allclose` transpose datasets? #5733

jbusecke · 2021-08-23T22:55:12Z

I am trying to compare two datasets, one of which has possibly transposed dimensions on a data variable.

import xarray as xr
import numpy as np
data = np.random.rand(4,6)
da = xr.DataArray(data, dims=['x','y'])
ds1 = xr.Dataset({'data':da})
ds2 = xr.Dataset({'data':da}).transpose('y','x')

What happened:
In my mind this should pass

xr.testing.assert_allclose(ds1, ds2)

but instead it fails

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-7-58cd53174a1e> in <module>
----> 1 xr.testing.assert_allclose(ds1, ds2)

    [... skipping hidden 1 frame]

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/testing.py in assert_allclose(a, b, rtol, atol, decode_bytes)
    169             a.variables, b.variables, compat=compat_variable
    170         )
--> 171         assert allclose, formatting.diff_dataset_repr(a, b, compat=equiv)
    172     else:
    173         raise TypeError("{} not supported by assertion comparison".format(type(a)))

AssertionError: Left and right Dataset objects are not close


Differing data variables:
L   data     (x, y) float64 0.8589 0.09264 0.0264 ... 0.1039 0.3685 0.3983
R   data     (y, x) float64 0.8589 0.8792 0.8433 0.6952 ... 0.3664 0.2214 0.3983

Simply transposing ds2 to the same dimensions of ds1 fixes this (since the data is the same after all)

xr.testing.assert_allclose(ds1, ds2.transpose('x','y'))

Since most of the other xarray operations are 'transpose-safe' ((ds1+ds2) = (ds1 + ds2.transpose('x','y')), shouldnt this one be too?

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.109+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.2
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.7.1
cftime: 1.4.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.2.2
cfgrib: 0.9.9.0
iris: None
bottleneck: 1.3.2
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: 3.4.1
cartopy: 0.19.0
seaborn: None
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 20.3.4
conda: None
pytest: None
IPython: 7.22.0
sphinx: None

The text was updated successfully, but these errors were encountered:

max-sixty · 2021-08-24T01:09:19Z

I agree this is another station on the journey between "byte identical" and "practically similar", which we've discovered over the past couple of years... :)

I would lean towards having it fail for assert_identical, with no view on all_close; definitely open to it coercing transpose.

benbovy · 2021-08-24T11:21:22Z

Is there any operation in xarray where dimension order matters? If yes, I'm not sure if it's a good idea to allow transposed dimension pass assert_allclose or assert_equal. But even otherwise, I'd find it a bit weird for a "numpy equivalent" function to have:

xr.testing.assert_allclose(ds1.data, ds2.data)   # ok
np.testing.assert_allclose(ds1.data.values, ds2.data.values)   # fails

What about a check_dim_order option? Also, it would be useful if information about non-matching dimension order was shown more explicitly in the assertion error message.

dcherian · 2021-08-24T14:16:45Z

What about a check_dim_order option? Also, it would be useful if information about non-matching dimension order was shown more explicitly in the assertion error message.

This sounds good to me. We should also have a check_attrs kwarg since that's another thing that only identical checks.

dcherian · 2021-08-29T23:50:59Z

I have a related question

import xarray as xr

da = xr.DataArray([[1, 1, 1], [2, 2, 2]], dims=("x", "y"))

xr.testing.assert_identical(da, da.transpose())

AssertionError: Left and right DataArray objects are not identical
Differing dimensions:
    (x: 2, y: 3) != (y: 3, x: 2)
Differing values:
L
    array([[1, 1, 1],
           [2, 2, 2]])
R
    array([[1, 2],
           [1, 2],
           [1, 2]])

Strictly speaking, the values are different I guess. However I think this error would be clearer if it said that the dimension order was different but the values are equal once the dimensions are transposed.

I.e. we could

if set(a.dims) == set(b.dims):
     a = a.transpose(b.dims)
	 # check values and raise if actually different
else:
    # current behaviour

Is this a good idea?

jbusecke · 2021-08-30T14:27:01Z

Strictly speaking, the values are different I guess. However I think this error would be clearer if it said that the dimension order was different but the values are equal once the dimensions are transposed.

I guess this comes down a bit to a philosophical question related to @benbovy s comment above. You can either make this operation be similar to the numpy equivalent (with some more xarray specific checks) or it can check whether the values at a certain combo of labels are the same/close.

The latter would be the way I think about data in xarray as a user. To me the removal of axis logic (via labels) is one of the biggest draws for myself, but importantly I also pitch this as one of the big reasons to switch to xarray for beginners.

I would argue that a 'strict' (numpy style) comparision is less practical in a scientific workflow and we do have the numpy implementation to achieve that functionality.
So I would ultimately argue that xarray should check closeness between values at certain label positions by default.

However, this might be very opinionated on my end, and a better error message would already be a massive improvement.

TomNicholas · 2021-08-30T14:48:23Z

I think we definitely need a flag, because we have two conflicting use cases:

Developers who use asserts in test suites and might well care about differences like

xr.testing.assert_allclose(ds1.data, ds2.data)   # ok
np.testing.assert_allclose(ds1.data.values, ds2.data.values)   # fails

Users for whom xarray allows to forget about dimension order, and would expect transposed data to be equal.

The easiest way to cover both is to have an option to switch between them.

In my opinion the only question is what the default comparison behaviour should be. I like the idea of adding a check_dim_order kwarg to assert_equal, assert_allclose, and assert_identical, but have the flag default to False for the first two functions and True for the last one.

dcherian · 2021-08-30T15:20:29Z

what the default comparison behaviour should be

I don't think we can change this because it's very backwards incompatible and affects tests in downstream packages.

But 👍 to adding a flag allowing users to opt out of dimension order checking.

benbovy · 2021-08-30T15:24:17Z

@jbusecke I agree with your point of view that "xarray-style" comparison is more practical in a scientific workflow. Especially if dimension order is irrelevant for most (all?) xarray operations, ignoring the order for assert_allclose / assert_equal too makes sense and is consistent.

However, it might be also dangerous / harmful if the workflow includes data conversion between labeled vs. unlabelled formats. There's a risk of checking for equality with xarray, then later converting to numpy and assuming that arrays are equal without feeling the need to check again. If dimension sizes are the same this might lead to very subtle bugs.

Since it is easy to ignore or forget about default values, having a check_dim_order option that defaults to True is probably safer IMHO, although slightly less convenient. No strong views, though.

@dcherian I like your idea but I'm not sure what's best between your code snippet and checking equality of aligned dimensions datasets only if non-dimension-aligned are not equal.

mjwillson · 2022-05-17T11:48:26Z

+1 for a check_dim_order option to .equals, assert_equal that can be disabled. (Ideally I think the default would be not to check dim order, but that ship has sailed now).

Or failing that, it would at least be nice to have xarray.testing.assert_equal_modulo_dim_order etc. When writing tests I usually don't care about dimension order and it's frustrating to have to manually do e.g.
xarray.testing.assert_allclose(a, b.transpose(a.dims)).

As pointed out, most of the xarray API is dimension-order-invariant and so it's odd to have no supported way to do comparisons in a dimension-order-invariant way.

ignamv · 2024-04-26T08:10:45Z

My data point: I was writing tests for my libraries and I was puzzled for some time until I realized xarray.assert_allclose doesn't transpose. I think transposing by default makes more sense, but at the very least it should have an option. The current error where it complains about dimension order and data was not clear to me. I switched to assert abs(expected/actual-1).max().item() < 1e-4 because of this.

There's a risk of checking for equality with xarray, then later converting to numpy and assuming that arrays are equal without feeling the need to check again

Don't all conversions to numpy have to keep dimension order in mind?

dcherian · 2024-04-26T14:10:28Z

We would happily take a PR to implement a keyword-only check_dim_order and/or check_attrs options on assert_identical, assert_equals, and assert_allclose.

ignamv · 2024-04-27T05:19:21Z

Awesome, I'll work on it.

headtr1ck · 2024-04-27T07:22:51Z

Would it make sense to extend this to broadcastable in general, so allow missing dimensions etc.?

max-sixty · 2024-04-29T16:52:32Z

Would it make sense to extend this to broadcastable in general, so allow missing dimensions etc.?

To be this would be too lax — it does matter if there are missing dims — they will have a different impact / repr etc...

ignamv · 2024-05-01T13:38:04Z

My changes so far mostly keep the old behavior unless the user specifies check_dims="transpose". The difference is that, if the dimensions don't match (either strictly for "strict" or transposed for "transpose"), then the assertion message will not have the whole diff but rather just the mismatched dimensions. Is this OK?

>>> a = xarray.DataArray(np.arange(6).reshape((2,3)), dims=['x','y'])
>>> b = a.transpose()
>>> xarray.testing.assert_equal(a,b)
AssertionError: Dimensions differ: ('x', 'y') ('y', 'x')

I still need to look into Datatrees (first time I've heard about them) so I know what to do about them.

TomNicholas · 2024-05-01T15:27:20Z

Thanks for working on this @ignamv ! I've added some comments to your PR.

the assertion message will not have the whole diff but rather just the mismatched dimensions. Is this OK?

Yes I think that is fine!

I still need to look into Datatrees (first time I've heard about them) so I know what to do about them.

You should just be able to copy the pattern you'll see in the assert functions already. Happy to help if it's not clear.

…#5733) (#8991) * Add argument check_dims to assert_allclose to allow transposed inputs * Update whats-new.rst * Add `check_dims` argument to assert_equal and assert_identical + tests * Assert that dimensions match before transposing or comparing values * Add docstring for check_dims to assert_equal and assert_identical * Update doc/whats-new.rst Co-authored-by: Tom Nicholas <tom@cworthy.org> * Undo fat finger Co-authored-by: Tom Nicholas <tom@cworthy.org> * Add attribution to whats-new.rst * Replace check_dims with bool argument check_dim_order, rename align_dims to maybe_transpose_dims * Remove left-over half-made test * Remove check_dim_order argument from assert_identical * assert_allclose/equal: emit full diff if dimensions don't match * Rename check_dim_order test, test Dataset with different dim orders * Update whats-new.rst * Hide maybe_transpose_dims from Pytest traceback Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com> * Ignore mypy error due to missing functools.partial.__name__ --------- Co-authored-by: Tom Nicholas <tom@cworthy.org> Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>

* main: Avoid auto creation of indexes in concat (#8872) Fix benchmark CI (#9013) Avoid extra read from disk when creating Pandas Index. (#8893) Add a benchmark to monitor performance for large dataset indexing (#9012) Zarr: Optimize `region="auto"` detection (#8997) Trigger CI only if code files are modified. (#9006) Fix for ruff 0.4.3 (#9007) Port negative frequency fix for `pandas.date_range` to `cftime_range` (#8999) Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (#9004) Speed up localize (#8536) Simplify fast path (#9001) Add argument check_dims to assert_allclose to allow transposed inputs (#5733) (#8991) Fix syntax error in test related to cupy (#9000)

* backend-indexing: Trigger CI only if code files are modified. (pydata#9006) Enable explicit use of key tuples (instead of *Indexer objects) in indexing adapters and explicitly indexed arrays (pydata#8870) add `.oindex` and `.vindex` to `BackendArray` (pydata#8885) temporary enable CI triggers on feature branch Avoid auto creation of indexes in concat (pydata#8872) Fix benchmark CI (pydata#9013) Avoid extra read from disk when creating Pandas Index. (pydata#8893) Add a benchmark to monitor performance for large dataset indexing (pydata#9012) Zarr: Optimize `region="auto"` detection (pydata#8997) Trigger CI only if code files are modified. (pydata#9006) Fix for ruff 0.4.3 (pydata#9007) Port negative frequency fix for `pandas.date_range` to `cftime_range` (pydata#8999) Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (pydata#9004) Speed up localize (pydata#8536) Simplify fast path (pydata#9001) Add argument check_dims to assert_allclose to allow transposed inputs (pydata#5733) (pydata#8991) Fix syntax error in test related to cupy (pydata#9000)

dcherian added the API design label Aug 24, 2021

benbovy mentioned this issue Sep 23, 2021

Check explicit indexes when comparing two xarray objects #5812

Open

TomNicholas mentioned this issue Sep 27, 2021

Grid ufunc refactor - proof of principle for 1D diffs xgcm/xgcm#368

Merged

5 tasks

dcherian added the topic-testing label Nov 2, 2023

TomNicholas mentioned this issue Apr 28, 2024

Complete deprecation of Dataset.dims returning dict #8980

Open

4 tasks

ignamv mentioned this issue May 1, 2024

Add argument check_dims to assert_allclose to allow transposed inputs (#5733) #8991

Merged

4 tasks

max-sixty closed this as completed in #8991 May 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shoudn't `assert_allclose` transpose datasets? #5733

Shoudn't `assert_allclose` transpose datasets? #5733

jbusecke commented Aug 23, 2021

INSTALLED VERSIONS

max-sixty commented Aug 24, 2021

benbovy commented Aug 24, 2021

dcherian commented Aug 24, 2021

dcherian commented Aug 29, 2021

jbusecke commented Aug 30, 2021

TomNicholas commented Aug 30, 2021

dcherian commented Aug 30, 2021

benbovy commented Aug 30, 2021

mjwillson commented May 17, 2022

ignamv commented Apr 26, 2024 •

edited

Loading

dcherian commented Apr 26, 2024

ignamv commented Apr 27, 2024

headtr1ck commented Apr 27, 2024

max-sixty commented Apr 29, 2024

ignamv commented May 1, 2024

TomNicholas commented May 1, 2024

Shoudn't assert_allclose transpose datasets? #5733

Shoudn't assert_allclose transpose datasets? #5733

Comments

jbusecke commented Aug 23, 2021

INSTALLED VERSIONS

max-sixty commented Aug 24, 2021

benbovy commented Aug 24, 2021

dcherian commented Aug 24, 2021

dcherian commented Aug 29, 2021

jbusecke commented Aug 30, 2021

TomNicholas commented Aug 30, 2021

dcherian commented Aug 30, 2021

benbovy commented Aug 30, 2021

mjwillson commented May 17, 2022

ignamv commented Apr 26, 2024 • edited Loading

dcherian commented Apr 26, 2024

ignamv commented Apr 27, 2024

headtr1ck commented Apr 27, 2024

max-sixty commented Apr 29, 2024

ignamv commented May 1, 2024

TomNicholas commented May 1, 2024

Shoudn't `assert_allclose` transpose datasets? #5733

Shoudn't `assert_allclose` transpose datasets? #5733

ignamv commented Apr 26, 2024 •

edited

Loading