Feature: N-dimensional auto_combine #2553
Conversation
Hello @TomNicholas! Thanks for updating the PR.
Comment last updated on December 12, 2018 at 00:54 Hours UTC
a few high level comments
I should have suggested this before, but the internal implementation of
This is basically done now - I've implemented everything I wanted to, and included unit tests for the new functionality. @shoyer I haven't changed the way the combine function is applied repeatedly to match the implementation in
I appreciate that this is pretty complicated; perhaps it should have its own section in the docs (I can do another PR?)
Do you think it would make sense to try to get rid of "both merging and concatenating" in favor of requiring another dimension in the grid? I guess we do sort of need this for
I think that would mean there might be some situations that 1D
I think this is a question of whether you think that: a) or b)
I personally think a), but I expect users who haven't read the source code for
I think multiple levels
There are probably some edge cases where the existing
Yes, this is my concern. The way this API handles nested lists makes sense from an implementation perspective, but not really from a user perspective. For users, I think it makes sense to have:
The original version of
If you like, we could also merge this work (which is excellent progress towards user-facing APIs) but keep the changes internal to xarray for now until we figure out the public APIs.
I think that's probably the case, but I also think that those edge cases will be so specific that maybe we don't have to explicitly support them. We could just say that anyone who has a combination of datasets that is that funky can just concatenate them themselves?
I agree, two separate functions are a lot more intuitive than having
The only problem with that idea is that both of these functions should be options for
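For context, a rough sketch of what such a two-function API could look like, using the auto_combine / manual_combine names that appear later in this work (the call signatures here are illustrative guesses, not the final xarray API):

```python
# 1) Automatic: the order is inferred from coordinate values, so the user can
#    pass a flat, unordered list of datasets.
combined = auto_combine([ds2, ds0, ds3, ds1])

# 2) Manual: the order is specified entirely by the nesting of the input list,
#    with one concatenation dimension per level of nesting.
combined = manual_combine([[ds0, ds1],
                           [ds2, ds3]],
                          concat_dim=["x", "y"])
```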
That would be great. Then I could start using the master branch of xarray again in my code, while we redo the public API. If I set
This sounds pretty good to me.
OK, do you want to go ahead and revert the documentation/public API for now? I would even be OK supporting nested lists temporarily in xarray via APIs like
Okay good. So with this API then
Yes, I'll do that.
If I revert the documentation and you merge this PR then that's exactly what we will have, which would be useful for me until we do the public API. (Also it seems that whatever was wrong with cftime has been fixed now, as the CI tests are passing.)
Yes, but throwing a warning should probably only be a temporary solution. Long term we should pick a default value (e.g.,
Okay, I've reverted the API, so it basically converts a
If we merge this then should I start a separate Pull Request for further discussion about the API? One of the Travis CI builds failed, but again I don't think that was me.
you can indeed ignore the test failure on dask-dev -- that looks like a dask issue (dask/dask#4291)
Removed the unnecessary argument.
thanks @TomNicholas! I'm looking forward to the follow-ups :)
* upstream/master:
  * Feature: N-dimensional auto_combine (pydata#2553)
  * Support HighLevelGraphs (pydata#2603)
  * Bump cftime version in doc environment (pydata#2604)
  * use keep_attrs in binary operations II (pydata#2590)
  * Temporarily mark dask-dev build as an allowed failure (pydata#2602)
  * Fix wrong error message in interp() (pydata#2598)
  * Add dayofyear and dayofweek accessors (pydata#2599)
  * Fix h5netcdf saving scalars with filters or chunks (pydata#2591)
  * Minor update to PR template (pydata#2596)
  * Zarr consolidated (pydata#2559)
  * fix examples (pydata#2581)
  * Fix typo (pydata#2578)
  * Concat docstring typo (pydata#2577)
  * DOC: remove example using Dataset.T (pydata#2572)
  * python setup.py test now works by default (pydata#2573)
  * Return slices when possible from CFTimeIndex.get_loc() (pydata#2569)
  * DOC: fix computation.rst (pydata#2567)
* concatenates along a single dimension
* Wrote function to find correct tile_IDs from nested list of datasets
* Wrote function to check that combined_tile_ids structure is valid
* Added test of 2d-concatenation
* Tests now check that dataset ordering is correct
* Test concatenation along a new dimension
* Started generalising auto_combine to N-D by integrating the N-D concatenation algorithm
* All unit tests now passing
* Fixed a failing test which I didn't notice because I don't have pseudoNetCDF
* Began updating open_mfdataset to handle N-D input
* Refactored to remove duplicate logic in open_mfdataset & auto_combine
* Implemented Shoyer's suggestion in #2553 to rewrite the recursive nested list traverser as an iterator
* --amend
* Now raises ValueError if input not ordered correctly before concatenation
* Added some more prototype tests defining desired behaviour more clearly
* Now raises informative errors on invalid forms of input
* Refactoring to also merge along each dimension
* Refactored to literally just apply the old auto_combine along each dimension
* Added unit tests for open_mfdataset
* Removed TODOs
* Removed format strings
* test_get_new_tile_ids now doesn't assume dicts are ordered
* Fixed failing tests on python3.5 caused by accidentally assuming dict was ordered
* Test for getting new tile id
* Fixed itertoolz import so that it's compatible with older versions
* Increased test coverage
* Added toolz as an explicit dependency to pass tests on python2.7
* Updated 'what's new'
* No longer attempts to shortcut all concatenation at once if concat_dims=None
* Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to remove hidden dependency on toolz
* Fixed erroneous removal of utils import
* Updated docstrings to include an example of multidimensional concatenation
* Clarified auto_combine docstring for N-D behaviour
* Added unit test for nested list of Datasets with different variables
* Minor spelling and pep8 fixes
* Started working on a new API with both auto_combine and manual_combine
* Wrote basic function to infer concatenation order from coords. Needs better error handling though.
* Attempt at finalised version of public-facing API. All the internals still need to be redone to match though.
* No longer uses entire old auto_combine internally, only concat or merge
* Updated what's new
* Removed unneeded addition to what's new for old release
* Fixed incomplete merge in docstring for open_mfdataset
* Tests for manual combine passing
* Tests for auto_combine now passing
* xfailed weird behaviour with manual_combine trying to determine concat_dim
* Add auto_combine and manual_combine to API page of docs
* Tests now passing for open_mfdataset
* Completed merge so that #2648 is respected, and added tests. Also moved concat to its own file to avoid a circular dependency
* Separated the tests for concat and both combines
* Some PEP8 fixes
* Pre-empting a test which will fail with opening uamiv format
* Satisfy pep8speaks bot
* Python 3.5 compatible after changing some error string formatting
* Order coords using pandas.Index objects
* Fixed performance bug from GH #2662
* Removed ToDos about natural sorting of string coords
* Generalized auto_combine to handle monotonically-decreasing coords too
* Added more examples to docstring for manual_combine
* Added note about globbing aspect of open_mfdataset
* Removed auto-inferring of concatenation dimension in manual_combine
* Added example to docstring for auto_combine
* Minor correction to docstring
* Another very minor docstring correction
* Added test to guard against issue #2777
* Started deprecation cycle for auto_combine
* Fully reverted open_mfdataset tests
* Updated what's new to match deprecation cycle
* Reverted uamiv test
* Removed dependency on itertools
* Deprecation tests fixed
* Satisfy pycodestyle
* Started deprecation cycle of auto_combine
* Added specific error for edge case combine_manual can't handle
* Check that global coordinates are monotonic
* Highlighted weird behaviour when concatenating with no data variables
* Added test for impossible-to-auto-combine coordinates
* Removed unneeded test
* Satisfy linter
* Added airspeedvelocity benchmark for combining functions
* Benchmark will take longer now
* Updated version numbers in deprecation warnings to fit with recent release of 0.12
* Updated API docs for new function names
* Fixed docs build failure
* Revert "Fixed docs build failure" This reverts commit ddfc6dd.
* Updated documentation with section explaining new functions
* Suppressed deprecation warnings in test suite
* Resolved ToDo by pointing to issue with concat, see #2975
* Various docs fixes
* Slightly renamed tests to match new name of tested function
* Included minor suggestions from shoyer
* Removed trailing whitespace
* Simplified error message for case combine_manual can't handle
* Removed filter for deprecation warnings, and added test for if user doesn't supply concat_dim
* Simple fixes suggested by shoyer
* Change deprecation warning behaviour
* linting
What I did
Generalised the auto_combine() function to be able to concatenate and merge datasets along any number of dimensions, instead of just one. Provides one solution to #2159, and is relevant for the discussion in #2039.
Currently it cannot deduce the order in which datasets should be concatenated along any one dimension from the coordinates, so it just concatenates them in the order they are supplied. This means that for an N-D concatenation the datasets have to be supplied as a list of lists, which is nested as many times as there are dimensions to be concatenated along; a sketch of this input is given below.
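As a rough illustration (this is not a snippet from the PR itself; the dataset contents, dimension names, and the exact keyword handling of auto_combine at this stage are assumptions), combining a 2x2 grid of datasets along two dimensions might look like:

```python
import xarray as xr

# Four made-up datasets forming a 2x2 grid, e.g. output from a simulation
# domain decomposed along the dimensions "x" and "y".
ds00 = xr.Dataset({"temp": (("x", "y"), [[1.0, 2.0]])})
ds01 = xr.Dataset({"temp": (("x", "y"), [[3.0, 4.0]])})
ds10 = xr.Dataset({"temp": (("x", "y"), [[5.0, 6.0]])})
ds11 = xr.Dataset({"temp": (("x", "y"), [[7.0, 8.0]])})

# The nesting mirrors the grid: the outer list runs along the first
# concatenation dimension, the inner lists along the second.
datasets = [[ds00, ds01],
            [ds10, ds11]]

# Conceptually, the generalised auto_combine is then told which dimension
# corresponds to each level of nesting.
combined = xr.auto_combine(datasets, concat_dim=["x", "y"])
```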
How it works
In _infer_concat_order_from_nested_list() the nested list of datasets is recursively traversed in order to create a dictionary of datasets, where the keys are the corresponding "tile IDs". These tile IDs are tuples serving as multidimensional indexes for the position of each dataset within the hypercube of all the datasets which are to be combined. For example, four datasets which are to be combined along two dimensions would be supplied as a doubly-nested list, and given tile IDs to be stored roughly as in the sketch below.
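The original code snippets were lost from this description; the following is an illustrative reconstruction of the structure being described, reusing the made-up 2x2 example from above:

```python
# Input: a doubly-nested list, one level of nesting per concatenation dimension.
datasets = [[ds00, ds01],
            [ds10, ds11]]

# _infer_concat_order_from_nested_list() then keys each dataset by a "tile ID"
# tuple giving its position within the 2D grid of datasets:
combined_tile_ids = {
    (0, 0): ds00,
    (0, 1): ds01,
    (1, 0): ds10,
    (1, 1): ds11,
}
```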
Using this unambiguous intermediate structure means that another method could be used to organise the datasets for concatenation (i.e. reading the values of their coordinates), with a new keyword argument infer_order_from_coords used to choose the method. The _combine_nd() function concatenates along one dimension at a time, reducing the length of each tile ID tuple by one each time _combine_along_first_dim() is called. After each concatenation the different variables are merged, so the new auto_combine() is essentially like calling the old one once for each dimension in concat_dims. A rough sketch of this reduction is given below.
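The following is only an illustration of that reduction, not the actual xarray implementation (the helper name combine_nd_sketch and its grouping logic are assumptions; the real code also merges variables rather than only concatenating):

```python
from itertools import groupby

import xarray as xr

def combine_nd_sketch(combined_tile_ids, concat_dims):
    # combined_tile_ids maps tile IDs such as (0, 1) to datasets.
    for dim in concat_dims:
        def remaining(item):
            # The tile ID with its first index dropped.
            return item[0][1:]

        items = sorted(combined_tile_ids.items(), key=remaining)
        new_tile_ids = {}
        for remaining_id, group in groupby(items, key=remaining):
            # Order the tiles in this group by their first index and
            # concatenate them along the current dimension; the tile IDs
            # of the results are one element shorter.
            ordered = [ds for _, ds in sorted(group, key=lambda item: item[0][0])]
            new_tile_ids[remaining_id] = xr.concat(ordered, dim=dim)
        combined_tile_ids = new_tile_ids
    # Once every dimension has been consumed, only the empty tile ID remains.
    return combined_tile_ids[()]
```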
Still to do
I would like people's opinions on the method I've chosen to do this, and any feedback on the code quality would be appreciated. Assuming we're happy with the method used here, then the remaining tasks include:
* More tests of the final auto_combine() function
* Add option to deduce concatenation order from coords (or this could be a separate PR)
* Integrate this all the way up to open_mfdataset()
* Unit tests for open_mfdataset()
* More tests that the user has supplied a valid structure of datasets
* Possibly parallelize the concatenation step?
* A few other small TODOs which are in combine.py
* Proper documentation showing how the input should be structured
* Fix failing unit tests on Python 2.7 (though support for 2.7 is being dropped at the end of 2018?)
* Fix failing unit tests on Python 3.5
* Update what's new
This PR was intended to solve the common use case of collecting output from a simulation which was parallelized in multiple dimensions. I would like to write a tutorial about how to use xarray to do this, including examples of how to preprocess the data and discard processor ghost cells.
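A minimal sketch of the kind of workflow meant here, assuming one output file per process on a 2D processor grid; the file names and the trim_ghost_cells helper are made up, and the combine="nested" spelling is how later xarray versions expose this rather than necessarily this PR:

```python
import xarray as xr

def trim_ghost_cells(ds, n_ghost=2):
    # Hypothetical preprocessing step: drop the ghost (halo) cells that each
    # process wrote at the edges of its local domain.
    return ds.isel(x=slice(n_ghost, -n_ghost), y=slice(n_ghost, -n_ghost))

# One file per process, nested so the outer list runs along "x" and the
# inner lists along "y".
files = [["run_x0_y0.nc", "run_x0_y1.nc"],
         ["run_x1_y0.nc", "run_x1_y1.nc"]]

combined = xr.open_mfdataset(files, combine="nested", concat_dim=["x", "y"],
                             preprocess=trim_ghost_cells)
```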