Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent/confusing behaviour when concatenating dimension coords #2975

Open
TomNicholas opened this issue May 20, 2019 · 2 comments
Open
Labels
topic-combine combine/concat/merge

Comments

@TomNicholas
Copy link
Member

TomNicholas commented May 20, 2019

I noticed that with multiple conflicting dimension coords then concat can give pretty weird/counterintuitive results, at least compared to what the documentation suggests they should give:

# Create two datasets with conflicting coordinates
objs = [Dataset({'x': [0], 'y': [1]}), Dataset({'y': [0], 'x': [1]})]

[<xarray.Dataset>
 Dimensions:  (x: 1, y: 1)
 Coordinates:
   * x        (x) int64 0
   * y        (y) int64 1
 Data variables:
     *empty*, 
<xarray.Dataset>
 Dimensions:  (x: 1, y: 1)
 Coordinates:
   * y        (y) int64 0
   * x        (x) int64 1
 Data variables:
     *empty*]
# Try to join along only 'x',
# coords='minimal' so concatenate "Only coordinates in which the dimension already appears"
concat(objs, dim='x', coords='minimal') 

<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * y        (y) int64 0 1
  * x        (x) int64 0 1
Data variables:
    *empty*

# It's joined along x and y! 

Based on my reading of the docstring for concat, I would have expected this to not attempt to concatenate y, because coords='minimal', and instead to throw an error because 'y' is a "non-concatenated variable" whose values are not the same across datasets.

Now let's try to get concat to broadcast 'y' across 'x':

# Try to join along only 'x' by setting coords='different'
concat(objs, dim='x', coords='different') 

Now as "Data variables which are not equal (ignoring attributes) across all datasets are also concatenated" then I would have expected 'y' to be concatenated across 'x', i.e. to add the 'x' dimension to the 'y' coord, i.e:

<xarray.Dataset>
Dimensions:  (x: 2, y: 1)
Coordinates:
  * y        (y, x) int64 1 0
  * x        (x)    int64 0 1
Data variables:
    *empty*

But that's not what we get!:

<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * y        (y) int64 0 1
  * x        (x) int64 0 1
Data variables:
    *empty*

Same again but without dimension coords

If we create the same sort of objects but the variables are data vars not coords, then everything behaves exactly as expected:

objs2 = [Dataset({'a': ('x', [0]), 'b': ('y', [1])}), Dataset({'a': ('x', [1]), 'b': ('y', [0])})]

[<xarray.Dataset>
 Dimensions:  (x: 1, y: 1)
 Dimensions without coordinates: x, y
 Data variables:
     a        (x) int64 0
     b        (y) int64 1, 
<xarray.Dataset>
 Dimensions:  (x: 1, y: 1)
 Dimensions without coordinates: x, y
 Data variables:
     a        (x) int64 1
     b        (y) int64 0]

concat(objs2, dim='x', data_vars='minimal')

ValueError: variable b not equal across datasets

concat(objs2, dim='x', data_vars='different')

<xarray.Dataset>
Dimensions:  (x: 2, y: 1)
Dimensions without coordinates: x, y
Data variables:
    a        (x) int64 0 1
    b        (x, y) int64 1 0

Also if you do the same again but with coordinates which are not dimension coords, i.e:

objs3 = [Dataset(coords={'a': ('x', [0]), 'b': ('y', [1])}), Dataset(coords={'a': ('x', [1]), 'b': ('y', [0])})]

[<xarray.Dataset>
 Dimensions:  (x: 1, y: 1)
 Coordinates:
     a        (x) int64 0
     b        (y) int64 1
 Dimensions without coordinates: x, y
 Data variables:
     *empty*, 
<xarray.Dataset>
 Dimensions:  (x: 1, y: 1)
 Coordinates:
     a        (x) int64 1
     b        (y) int64 0
 Dimensions without coordinates: x, y
 Data variables:
     *empty*]

then this again gives the expected concatenation behaviour.

So this implies that the compatibility checks that are being done on the data vars are not being done on the coords, but only if they are dimension coordinates!

Either this is not the desired behaviour or the concat docstring needs to be a lot clearer. If we agree that this is not the desired behaviour then I will have a look inside concat to work out why it's happening.

EDIT: Presumably this has something to do with the ToDo in the code for concat: # TODO: support concatenating scalar coordinates even if the concatenated dimension already exists...

@shoyer
Copy link
Member

shoyer commented May 28, 2019

I'm not entirely sure what's going on here, but I suspect this has something to do with how we do automatic alignment inside concat:

datasets = align(*datasets, join='outer', copy=False, exclude=[dim],
fill_value=fill_value)

@shoyer
Copy link
Member

shoyer commented May 28, 2019

It might be worth testing it this behavior goes away when you comment out that line.

shoyer pushed a commit that referenced this issue Jun 25, 2019
* concatenates along a single dimension

* Wrote function to find correct tile_IDs from nested list of datasets

* Wrote function to check that combined_tile_ids structure is valid

* Added test of 2d-concatenation

* Tests now check that dataset ordering is correct

* Test concatentation along a new dimension

* Started generalising auto_combine to N-D by integrating the N-D concatentation algorithm

* All unit tests now passing

* Fixed a failing test which I didn't notice because I don't have pseudoNetCDF

* Began updating open_mfdataset to handle N-D input

* Refactored to remove duplicate logic in open_mfdataset & auto_combine

* Implemented Shoyers suggestion in #2553 to rewrite the recursive nested list traverser as an iterator

* --amend

* Now raises ValueError if input not ordered correctly before concatenation

* Added some more prototype tests defining desired behaviour more clearly

* Now raises informative errors on invalid forms of input

* Refactoring to alos merge along each dimension

* Refactored to literally just apply the old auto_combine along each dimension

* Added unit tests for open_mfdatset

* Removed TODOs

* Removed format strings

* test_get_new_tile_ids now doesn't assume dicts are ordered

* Fixed failing tests on python3.5 caused by accidentally assuming dict was ordered

* Test for getting new tile id

* Fixed itertoolz import so that it's compatible with older versions

* Increased test coverage

* Added toolz as an explicit dependency to pass tests on python2.7

* Updated 'what's new'

* No longer attempts to shortcut all concatenation at once if concat_dims=None

* Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to remove hidden dependency on toolz

* Fixed erroneous removal of utils import

* Updated docstrings to include an example of multidimensional concatenation

* Clarified auto_combine docstring for N-D behaviour

* Added unit test for nested list of Datasets with different variables

* Minor spelling and pep8 fixes

* Started working on a new api with both auto_combine and manual_combine

* Wrote basic function to infer concatenation order from coords.

Needs better error handling though.

* Attempt at finalised version of public-facing API.

All the internals still need to be redone to match though.

* No longer uses entire old auto_combine internally, only concat or merge

* Updated what's new

* Removed uneeded addition to what's new for old release

* Fixed incomplete merge in docstring for open_mfdataset

* Tests for manual combine passing

* Tests for auto_combine now passing

* xfailed weird behaviour with manual_combine trying to determine concat_dim

* Add auto_combine and manual_combine to API page of docs

* Tests now passing for open_mfdataset

* Completed merge so that #2648 is respected, and added tests.

Also moved concat to it's own file to avoid a circular dependency

* Separated the tests for concat and both combines

* Some PEP8 fixes

* Pre-empting a test which will fail with opening uamiv format

* Satisfy pep8speaks bot

* Python 3.5 compatibile after changing some error string formatting

* Order coords using pandas.Index objects

* Fixed performance bug from GH #2662

* Removed ToDos about natural sorting of string coords

* Generalized auto_combine to handle monotonically-decreasing coords too

* Added more examples to docstring for manual_combine

* Added note about globbing aspect of open_mfdataset

* Removed auto-inferring of concatenation dimension in manual_combine

* Added example to docstring for auto_combine

* Minor correction to docstring

* Another very minor docstring correction

* Added test to guard against issue #2777

* Started deprecation cycle for auto_combine

* Fully reverted open_mfdataset tests

* Updated what's new to match deprecation cycle

* Reverted uamiv test

* Removed dependency on itertools

* Deprecation tests fixed

* Satisfy pycodestyle

* Started deprecation cycle of auto_combine

* Added specific error for edge case combine_manual can't handle

* Check that global coordinates are monotonic

* Highlighted weird behaviour when concatenating with no data variables

* Added test for impossible-to-auto-combine coordinates

* Removed uneeded test

* Satisfy linter

* Added airspeedvelocity benchmark for combining functions

* Benchmark will take longer now

* Updated version numbers in deprecation warnings to fit with recent release of 0.12

* Updated api docs for new function names

* Fixed docs build failure

* Revert "Fixed docs build failure"

This reverts commit ddfc6dd.

* Updated documentation with section explaining new functions

* Suppressed deprecation warnings in test suite

* Resolved ToDo by pointing to issue with concat, see #2975

* Various docs fixes

* Slightly renamed tests to match new name of tested function

* Included minor suggestions from shoyer

* Removed trailing whitespace

* Simplified error message for case combine_manual can't handle

* Removed filter for deprecation warnings, and added test for if user doesn't supply concat_dim

* Simple fixes suggested by shoyer

* Change deprecation warning behaviour

* linting
@dcherian dcherian added the topic-combine combine/concat/merge label Jul 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-combine combine/concat/merge
Projects
None yet
Development

No branches or pull requests

3 participants