Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep the original ordering of the coordinates #4409

Merged
merged 12 commits into from
Sep 18, 2020

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Sep 6, 2020

In #4408, The formatting of coords turned out to be non-deterministic. This sorts the data variables, coordinates and dimensions without coordinates sections as well as the dimensions summary of Dataset objects.

No tests, yet, because I'm not quite sure sorting using the str representation of the key is the best way to make the formatting deterministic (so I'd appreciate reviews).

It might also be good to document somewhere that these sections are now sorted, and that it only makes sense to look at the order in the dimension summary of DataArray and Variable objects.

@keewis keewis mentioned this pull request Sep 6, 2020
4 tasks
@max-sixty
Copy link
Collaborator

Could we keep them in sorted order like a normal dict (at least as-of recent python versions)?

@keewis
Copy link
Collaborator Author

keewis commented Sep 6, 2020

thanks for the hint, sorting in __init__ should be possible and might even be easier than sorting when formatting. However, 3.7 is the first version of python (the language) which includes the insertion-order preservation introduced in CPython 3.6. Do we have to bump the supported python version for that, or is it fine to keep the current behavior for a non-CPython python<3.7 (if we ever supported a language other than CPython)?

Edit: that breaks two of the iris tests, which seem to depend on the order of the coords. Maybe we should change that?

@max-sixty
Copy link
Collaborator

sorting in __init__ should be possible and might even be easier than sorting when formatting.

Though I was thinking we wouldn't sort it; we'd take the order as given

Do we have to bump the supported python version for that, or is it fine to keep the current behavior for a non-CPython python<3.7 (if we ever supported a language other than CPython)?

I reckon it's fine, it's such a corner case, and it'll only affect these doctests

@shoyer
Copy link
Member

shoyer commented Sep 11, 2020

Can you give an example of a non-deterministic coordinate order? That sounds surprising to me, given that on Python 3.6+ dictionaries preserve insertion order.

My preference would be not to sort mappings automatically, either in __init__ or when formatting. I think users find it a little more intuitive to preserve the order of dict keys.

It's true that this is only a guarantee on Python 3.7+, but both CPython 3.6 and all versions of pypy 3 preserve dict insertion order, so in practice we can pretty much always guarantee this. (And soon, Python 3.7 will be required for xarray.)

@keewis
Copy link
Collaborator Author

keewis commented Sep 11, 2020

Can you give an example of a non-deterministic coordinate order

sure, just try running python -m pytest --doctest-modules xarray --ignore xarray/tests. For me at least (with the environment created by py38.yml), xarray.core.common.DataWithCoords.pipe, xarray.core.groupby.GroupBy.quantile and xarray.core.dataset.Dataset.filter_by_attrs pass or fail at random: in one run they pass, in the next they fail because a different ordering was used for the coordinates (see also #4408).

I think users find it a little more intuitive to preserve the order of dict keys.

Not sure if that's evidence against this (maybe we need to rearrange the repr to really fix it), but in #4072 someone confused the order of the printed coordinates with the order of the dimensions. I'd think it's much easier to tell people that these are alphabetically sorted, no matter what they do, so they shouldn't try to read the dimension order from the coordinates (same with the dimension summary).

@klindsay28
Copy link

Here's another example that yields non-deterministic coordinate order, which propagates into a plot title when selection is done on the coordinates. When I run the code below, the title is sometimes x = 0.0, y = 0.0 and sometimes y = 0.0, x = 0.0.

This is in a new conda environment that I created using the command conda create -n title_order python=3.7 matplotlib xarray. Output from xr.show_versions() is below.

I think the non-determinism is coming from the command ds_subset = ds[['var']].

import numpy as np
import xarray as xr

xlen = 3
ylen = 4
zlen = 5

x = xr.DataArray(np.linspace(0.0, 1.0, xlen), dims=('x'))
y = xr.DataArray(np.linspace(0.0, 1.0, ylen), dims=('y'))
z = xr.DataArray(np.linspace(0.0, 1.0, zlen), dims=('z'))

vals = np.arange(xlen*ylen*zlen, dtype='float64').reshape((xlen, ylen, zlen))
da = xr.DataArray(vals, dims=('x', 'y', 'z'), coords={'x': x, 'y': y, 'z': z})

ds = xr.Dataset({'var': da})
print('coords for var in original Dataset')
print(ds['var'].coords)

print('**********')

ds_subset = ds[['var']]
print('coords for var after subsetting')
print(ds_subset['var'].coords)

print('**********')

p = ds_subset['var'].isel(x=0,y=0).plot()
print('title for plot() with dim selection')
print(p[0].axes.get_title())
Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1127.13.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.16.0
pandas: 1.1.2
numpy: 1.19.1
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.3.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20200814
pip: 20.2.3
conda: None
pytest: None
IPython: None
sphinx: None

@keewis
Copy link
Collaborator Author

keewis commented Sep 11, 2020

I think that's still deterministic (although I agree that the change in the title is annoying): if you run your script multiple times, it will always return the same output. The issue I'm trying to fix here is much worse: the output of the same code will return different results with each run.

Edit: but maybe I just have a messed up environment

@klindsay28
Copy link

I disagree that this is deterministic. If I run the script multiple times, the plot title varies, and I consider the plot title part of the output.

I have jupyter notebooks that create figures and use this code idiom. If I refactor code of mine that is used by these notebooks, I would like to rerun the notebooks to confirm that the notebook results don't change. Having the plot titles change at random complicates this comparison.

I think sorting the coordinates would avoid this difficulty that I encounter.

@keewis
Copy link
Collaborator Author

keewis commented Sep 11, 2020

yes, sorry, you're right, that is indeed non-deterministic, and exactly the problem I'm having with the doctests.

Interestingly, this seems to be deterministic per python session: running

import xarray as xr


ds = xr.Dataset(
    {"a": (("x", "y"), [[0, 1], [2, 3]])}, coords={"x": ["a", "b"], "y": [0, 1]}
)
print(ds)
print(ds + 1)
print(ds)
print(ds + 1)

will print the same result for ds and ds + 1, but when rerunning in a new session, the coordinate order may be changed (more coordinates increase the probability for the change).

@dcherian
Copy link
Contributor

The plotting thing could be fixed relatively easily here:

def _title_for_slice(self, truncate: int = 50) -> str:
"""
If the dataarray has 1 dimensional coordinates or comes from a slice
we can show that info in the title
Parameters
----------
truncate : int, default: 50
maximum number of characters for title
Returns
-------
title : string
Can be used for plot titles
"""
one_dims = []
for dim, coord in self.coords.items():
if coord.size == 1:
one_dims.append(
"{dim} = {v}".format(dim=dim, v=format_item(coord.values))
)
title = ", ".join(one_dims)
if len(title) > truncate:
title = title[: (truncate - 3)] + "..."
return title
and the cor5responding function in Dataset.

@shoyer
Copy link
Member

shoyer commented Sep 11, 2020

OK, my guess is that this is happening because there is someplace where we iterate over a Python set (which has a non-deterministic order based on hashing) rather than a Python dict. If we can nail down where that is happening we should remove it!

@keewis
Copy link
Collaborator Author

keewis commented Sep 11, 2020

unfortunately, that's pretty difficult: the set in question is Dataset._coord_names, and there would be a lot to rewrite if we tried to change that to a list. Just for reference, the method where the reordering happens is Dataset._copy_listed.

@shoyer
Copy link
Member

shoyer commented Sep 11, 2020

One way to fix this is to iterate over variables instead of _coord_names, e.g., instead of:

for k in self._coord_names:
    ...

use:

for k in self._variables:
    if k in self._coord_names:
        ...

I believe we already use this trick in a few places for exactly this reason.

@keewis
Copy link
Collaborator Author

keewis commented Sep 11, 2020

thanks, that almost fixed the doctests. However, concat seems to somehow also mix up the coordinates / dimensions. I tried to fix that by not iterating over a set, but I'm totally not sure if there's a better fix.

@dcherian
Copy link
Contributor

What do you mean by "mix up"? Also see #2811

@keewis
Copy link
Collaborator Author

keewis commented Sep 11, 2020

"mix up" in this case means that the order is random (i.e. a set was involved). Sorry if that was unclear.

@keewis keewis closed this Sep 11, 2020
@keewis keewis reopened this Sep 11, 2020
@keewis
Copy link
Collaborator Author

keewis commented Sep 11, 2020

the tests pass so this should be ready for review

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @keewis

xarray/core/dataset.py Show resolved Hide resolved
xarray/core/concat.py Show resolved Hide resolved
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
xarray/core/dataset.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

xarray/core/coordinates.py Outdated Show resolved Hide resolved
@kmuehlbauer
Copy link
Contributor

@keewis @dcherian FYI: Regarding #4072 and #2811 with concat I successfully tested a fix locally yesterday. I'll ping you when I have the PR out.

@keewis
Copy link
Collaborator Author

keewis commented Sep 12, 2020

@kmuehlbauer, I modified concat so #4072 and #2811 might already be fixed. Can you confirm?

Edit: at least #4072 does not seem to be fixed yet. Good news is that this doesn't seem to depend on a set (i.e. the result does not change with different interpreter sessions)

@kmuehlbauer
Copy link
Contributor

@keewis I'm traveling currently, will check on Monday and come back to you.

@keewis keewis changed the title Sort mappings before formatting Keep the original ordering of the coordinates Sep 12, 2020
@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Sep 14, 2020

@keewis Unfortunately your changes did not help solving #2811. I'll have a pull request ready today with a proposal how to fix both issues.

@keewis
Copy link
Collaborator Author

keewis commented Sep 14, 2020

thanks for checking!

@kmuehlbauer
Copy link
Contributor

@keewis See #4419

@keewis
Copy link
Collaborator Author

keewis commented Sep 18, 2020

should we merge this before releasing 0.16.1?

@dcherian
Copy link
Contributor

LGTM 👍

@dcherian dcherian merged commit b2c1550 into pydata:master Sep 18, 2020
@keewis keewis deleted the sort-coords-in-repr branch September 18, 2020 21:05
@keewis keewis mentioned this pull request Sep 18, 2020
1 task
dcherian added a commit to dcherian/xarray that referenced this pull request Oct 9, 2020
…pagate-attrs

* 'propagate-attrs' of github.com:dcherian/xarray: (22 commits)
  silence sphinx warnings about broken rst (pydata#4448)
  Xarray open_mfdataset with engine Zarr (pydata#4187)
  Fix release notes formatting (pydata#4443)
  fix typo in io.rst (pydata#4250)
  Fix typo (pydata#4181)
  Fix release notes typo
  New whatsnew section
  Add notes re doctests (pydata#4440)
  Fixed dask.optimize on datasets (pydata#4438)
  Release notes for 0.16.1 (pydata#4435)
  Small updates to How-to-release + lint (pydata#4436)
  Fix doctests (pydata#4439)
  add a ci for doctests (pydata#4437)
  preserve original dimension, coordinate and variable order in ``concat`` (pydata#4419)
  Fix for h5py deepcopy issues (pydata#4426)
  Keep the original ordering of the coordinates (pydata#4409)
  Clearer Vectorized Indexing example (pydata#4433)
  Revert "Fix optimize for chunked DataArray (pydata#4432)" (pydata#4434)
  Fix optimize for chunked DataArray (pydata#4432)
  fix doc dataarray to netcdf (pydata#4424)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] xr.concat inverts coordinates order
6 participants