
Improved default behavior when concatenating DataArrays #2777

Closed · wanted to merge 3 commits

Conversation

@Zac-HD (Contributor) commented on Feb 19, 2019

This is really nice to have when producing faceted plots of satellite observations in various bands, and should be somewhere between useful and harmless in other cases.

Example code:

import numpy as np
import xarray as xr

ds = xr.Dataset({
    k: xr.DataArray(np.random.random((2, 2)), dims="x y".split(), name=k)
    for k in "blue green red".split()
})
xr.concat([ds.blue, ds.green, ds.red], dim="band").plot.imshow(col="band")

Before - facets have an index, colorbar has misleading label:

[screenshot]

After - facets have meaningful labels, colorbar has no label:

[screenshot]

@Zac-HD force-pushed the concat-arrays branch 2 times, most recently from 280ce92 to a2df249, on February 19, 2019.
@shoyer (Member) left a comment:

This seems like a nice usability improvement!

@Zac-HD (Contributor, Author) commented on Feb 19, 2019

Thanks for the support and quick review @shoyer!

Any idea when Xarray 0.12 might be out? I'm teaching some remote sensing workshops in mid-March and would love to have this merged, as a colleague's review of those notebooks prompted this PR 😄

@Zac-HD mentioned this pull request on Feb 20, 2019.
@Zac-HD (Contributor, Author) commented on Feb 20, 2019

The docs build failed due to a (transient) HTTP error when loading the tutorial data, so I've also finalised the planned conversion from xarray.tutorial.load_dataset to xarray.tutorial.open_dataset.
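For reference (this is not part of the PR's diff), a minimal example of the call the docs now use; it needs a network connection the first time it runs:

import xarray as xr

# xr.tutorial.open_dataset downloads (on first use) and opens one of the
# example datasets; the docs now use it instead of the deprecated
# xr.tutorial.load_dataset.
ds = xr.tutorial.open_dataset("air_temperature")
print(ds)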

@Zac-HD (Contributor, Author) commented on Feb 21, 2019

Hmm, it looks like the failure to download the naturalearth coastlines.zip wasn't so transient after all - but it does work on my machine!

@Zac-HD closed this on Feb 22, 2019.
@Zac-HD reopened this on Feb 22, 2019.
@Zac-HD (Contributor, Author) commented on Feb 22, 2019

OK! @shoyer, I've got everything passing and it's ready for review.

Even the accidental tutorial/docs fixes 😄

@shoyer (Member) commented on Feb 22, 2019 via email

@Zac-HD (Contributor, Author) commented on Feb 22, 2019

@shoyer - don't worry about the docs build, I'm pretty sure that was just a flaky network from Travis and it's working now in any case.

I've left tutorial.load_dataset in, just changed "removed in 0.12" to "removed in a future version".

@shoyer (Member) left a comment:

This looks good to me now. I'll merge in a day or two unless anyone else has review comments to add.

@pep8speaks commented on Feb 26, 2019

Hello @Zac-HD! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 27, 2019 at 00:51 UTC

@shoyer (Member) commented on Feb 26, 2019

@pep8speaks seems to have gone haywire -- maybe you have a syntax error?

Thinking about this a little more, one hazard of converting names into index labels is that we lose the invariant that you get the same result regardless of the order in which you call concat, e.g., something like these expressions could now give different results:

xarray.concat([a, b], dim='x')

vs

xarray.concat([xarray.concat([a], dim='x'), xarray.concat([b], dim='x')], dim='x')

If a and b have the same name, then you'd get an index with two duplicate entries in the second case but not the first.

I'm not entirely sure this is a deal-breaker, but it makes me a little nervous. In particular, it might break some of the invariants we're relying upon for the next version of open_mfdataset (#2616, cc @TomNicholas).
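A minimal sketch of the invariant in question, using only plain xr.concat (the names and values here are made up for illustration). With today's behaviour the assertion holds; under the proposed name-to-label inference the nested call could gain a duplicated index that the flat call does not have:

import numpy as np
import xarray as xr

a = xr.DataArray(np.zeros((1, 2)), dims=["x", "y"], name="a")
b = xr.DataArray(np.ones((1, 2)), dims=["x", "y"], name="a")  # same name as `a`

# Flat concatenation vs. nested concatenation along the same dimension.
flat = xr.concat([a, b], dim="x")
nested = xr.concat([xr.concat([a], dim="x"), xr.concat([b], dim="x")], dim="x")

assert flat.identical(nested)  # the invariant the inference could break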

@Zac-HD (Contributor, Author) commented on Feb 27, 2019

> maybe you have a syntax error?

...yep, an unmatched paren. 😥

> If a and b have the same name, then you'd get an index with two duplicate entries in the second case but not the first. [which would be bad]

I think it's impossible to avoid this when using inference in the general case. Two options I think would be decent-if-unsatisfying:

  1. Explicitly manage this in the new combining functions, e.g. clear the concat dim coords if they are not unique and the input arrays did not have coords in that dimension (a rough sketch follows this list).
  2. Add an argument to xr.concat to enable or disable this, e.g. infer_coords=True, and disable it when calling xr.concat from other combining functions.
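A rough, hypothetical sketch of option 1 (the helper name is made up and this is not part of xarray's API; it uses the modern DataArray.drop_vars method): concatenate, then drop the coordinate on the concat dim if no input carried one and the resulting labels are not unique.

import numpy as np
import xarray as xr


def concat_drop_nonunique_labels(arrays, dim):
    # Hypothetical helper illustrating option 1 above; not an xarray API.
    had_coord = any(dim in da.coords for da in arrays)
    result = xr.concat(arrays, dim=dim)
    if dim in result.coords and not had_coord:
        labels = list(result.coords[dim].values)
        if len(set(labels)) != len(labels):
            # Inferred labels clash, so drop the coordinate entirely.
            result = result.drop_vars(dim)
    return result


# Under the proposed inference, two arrays with the same name would produce
# non-unique labels and the coordinate would be dropped; with current xarray
# no "band" coordinate is created at all, so this is a no-op safeguard.
blue = xr.DataArray(np.random.random((2, 2)), dims=["x", "y"], name="a")
green = xr.DataArray(np.random.random((2, 2)), dims=["x", "y"], name="a")
stacked = concat_drop_nonunique_labels([blue, green], dim="band")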

Zac-HD and others added 2 commits on February 27, 2019:

* This is really nice to have when using concat to produce faceted plots of various kinds, and harmless when it's useless.
* it's still deprecated, but we'll leave it for a bit longer before removal.
@TomNicholas (Member) commented on Feb 27, 2019

@Zac-HD forgive me for this but I think this PR is unnecessary because what you need basically already exists in the API.

Going back to your original example, you could have got the same indexing by creating a DataArray to use as a coordinate to concatenate over:

colors = "blue green red".split()
ds = xr.Dataset({
    k: xr.DataArray(np.random.random((2, 2)), dims="x y".split(), name=k)
    for k in colors
})

band = xr.DataArray(colors, name="band", dims=["band"])
xr.concat([ds.blue, ds.green, ds.red], dim=band).plot.imshow(col="band")

[screenshot]

This still leaves the wrong label on the colorbar, but that could be fixed separately; it comes from concat using the attrs of the first dataset in the list for the final dataset (a similar problem to #2382). I think it would be easier to change that behaviour instead (perhaps: if all the names are the same, use that name, otherwise set the result's name to None; but this also relates to #1614).
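A minimal sketch of that suggested naming rule (a hypothetical helper, not how xarray currently picks the result name): keep a name only when every input shares it.

import xarray as xr


def combined_name(arrays):
    # Suggested rule: keep the name only if every input DataArray shares it;
    # otherwise leave the combined result unnamed.
    names = {da.name for da in arrays}
    return names.pop() if len(names) == 1 else None


da1 = xr.DataArray([0], dims="x", name="blue")
da2 = xr.DataArray([1], dims="x", name="blue")
print(combined_name([da1, da2]))                # "blue"
print(combined_name([da1, da2.rename("red")]))  # None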

Creating a new coordinate using a DataArray is mentioned in the docstring for xr.concat:

> If dimension is provided as a DataArray or Index, its name is used as the dimension to concatenate along and the values are added as a coordinate.

but I think there should be an example there too. (Also I think this is relevant to #1646)

> I'm not entirely sure this is a deal-breaker but it makes me a little nervous

@shoyer I agree; although I like the idea, I think this could introduce all sorts of complex concatenation edge cases.

At the very least the new API should have symmetry properties something like:

da1 = xr.DataArray(name='a', data=[[0]], dims=['x', 'y'])
da2 = xr.DataArray(name='b', data=[[1]], dims=['x', 'y'])
da3 = xr.DataArray(name='a', data=[[2]], dims=['x', 'y'])
da4 = xr.DataArray(name='b', data=[[3]], dims=['x', 'y'])

xr.manual_combine([[da1, da2], [da3, da4]], concat_dim=['x', 'y'])
# should give the same result as
xr.manual_combine([[da1, da3], [da2, da4]], concat_dim=['y', 'x'])

but with this PR I don't think it would. In the first case the x coord would be created with values ['a', 'b'] and no y coord would be created, while in the second case no y coord would be created and the intermediate Dataset would be nameless, so then no x coord would be created either.

I think my suggestion for naming would pass this test because the result would be nameless and have no coords either way.

I might have got that wrong but I definitely think this kind of change should be carefully considered 😕

(EDIT: I just added this example as a test to #2616)

@TomNicholas mentioned this pull request on Feb 27, 2019.
@TomNicholas (Member) commented on Feb 27, 2019

I've just submitted a PR which solves this issue in the way I suggested instead: #2792.

@TomNicholas (Member) commented:
@Zac-HD there's actually another way to get the indexing behaviour you wanted with the current API:

colors = "blue green red".split()
das = [xr.DataArray(np.random.random((2, 2)), dims="x y".split(),
                    coords={"band": k})
       for k in colors]
xr.concat(das, dim="band").plot.imshow(col="band")

Here instead of using the name attribute to label each band I've used a scalar coordinate called "band", so that when you concat along "band" it will just stack along that coordinate.

This never touches the names so actually gives the desired output without the need for #2792:
[screenshot]

@shoyer (Member) commented on Mar 3, 2019

I guess we should probably roll back the "name to scalar coordinates" part of this change.

@Zac-HD do you want to do that here or should we go with @TomNicholas's PR?

@Zac-HD (Contributor, Author) commented on Mar 3, 2019

No objection to going with #2792; I'm just happy to have the change merged 😄

It would be nice for someone to cherry-pick 63da214 before releasing 0.12 though, just to fix that warning.

@Zac-HD closed this on Mar 3, 2019.
@shoyer mentioned this pull request on Mar 12, 2019.
shoyer pushed a commit that referenced this pull request on Jun 25, 2019:
* concatenates along a single dimension

* Wrote function to find correct tile_IDs from nested list of datasets

* Wrote function to check that combined_tile_ids structure is valid

* Added test of 2d-concatenation

* Tests now check that dataset ordering is correct

* Test concatenation along a new dimension

* Started generalising auto_combine to N-D by integrating the N-D concatenation algorithm

* All unit tests now passing

* Fixed a failing test which I didn't notice because I don't have pseudoNetCDF

* Began updating open_mfdataset to handle N-D input

* Refactored to remove duplicate logic in open_mfdataset & auto_combine

* Implemented Shoyer's suggestion in #2553 to rewrite the recursive nested list traverser as an iterator

* --amend

* Now raises ValueError if input not ordered correctly before concatenation

* Added some more prototype tests defining desired behaviour more clearly

* Now raises informative errors on invalid forms of input

* Refactoring to also merge along each dimension

* Refactored to literally just apply the old auto_combine along each dimension

* Added unit tests for open_mfdataset

* Removed TODOs

* Removed format strings

* test_get_new_tile_ids now doesn't assume dicts are ordered

* Fixed failing tests on python3.5 caused by accidentally assuming dict was ordered

* Test for getting new tile id

* Fixed itertoolz import so that it's compatible with older versions

* Increased test coverage

* Added toolz as an explicit dependency to pass tests on python2.7

* Updated 'what's new'

* No longer attempts to shortcut all concatenation at once if concat_dims=None

* Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to remove hidden dependency on toolz

* Fixed erroneous removal of utils import

* Updated docstrings to include an example of multidimensional concatenation

* Clarified auto_combine docstring for N-D behaviour

* Added unit test for nested list of Datasets with different variables

* Minor spelling and pep8 fixes

* Started working on a new api with both auto_combine and manual_combine

* Wrote basic function to infer concatenation order from coords.

Needs better error handling though.

* Attempt at finalised version of public-facing API.

All the internals still need to be redone to match though.

* No longer uses entire old auto_combine internally, only concat or merge

* Updated what's new

* Removed unneeded addition to what's new for old release

* Fixed incomplete merge in docstring for open_mfdataset

* Tests for manual combine passing

* Tests for auto_combine now passing

* xfailed weird behaviour with manual_combine trying to determine concat_dim

* Add auto_combine and manual_combine to API page of docs

* Tests now passing for open_mfdataset

* Completed merge so that #2648 is respected, and added tests.

Also moved concat to its own file to avoid a circular dependency

* Separated the tests for concat and both combines

* Some PEP8 fixes

* Pre-empting a test which will fail with opening uamiv format

* Satisfy pep8speaks bot

* Python 3.5 compatible after changing some error string formatting

* Order coords using pandas.Index objects

* Fixed performance bug from GH #2662

* Removed ToDos about natural sorting of string coords

* Generalized auto_combine to handle monotonically-decreasing coords too

* Added more examples to docstring for manual_combine

* Added note about globbing aspect of open_mfdataset

* Removed auto-inferring of concatenation dimension in manual_combine

* Added example to docstring for auto_combine

* Minor correction to docstring

* Another very minor docstring correction

* Added test to guard against issue #2777

* Started deprecation cycle for auto_combine

* Fully reverted open_mfdataset tests

* Updated what's new to match deprecation cycle

* Reverted uamiv test

* Removed dependency on itertools

* Deprecation tests fixed

* Satisfy pycodestyle

* Started deprecation cycle of auto_combine

* Added specific error for edge case combine_manual can't handle

* Check that global coordinates are monotonic

* Highlighted weird behaviour when concatenating with no data variables

* Added test for impossible-to-auto-combine coordinates

* Removed unneeded test

* Satisfy linter

* Added airspeedvelocity benchmark for combining functions

* Benchmark will take longer now

* Updated version numbers in deprecation warnings to fit with recent release of 0.12

* Updated api docs for new function names

* Fixed docs build failure

* Revert "Fixed docs build failure"

This reverts commit ddfc6dd.

* Updated documentation with section explaining new functions

* Suppressed deprecation warnings in test suite

* Resolved ToDo by pointing to issue with concat, see #2975

* Various docs fixes

* Slightly renamed tests to match new name of tested function

* Included minor suggestions from shoyer

* Removed trailing whitespace

* Simplified error message for case combine_manual can't handle

* Removed filter for deprecation warnings, and added test for if user doesn't supply concat_dim

* Simple fixes suggested by shoyer

* Change deprecation warning behaviour

* linting