concat prealigned objects #1413
Conversation
Let me expand on what this does. Many netCDF datasets consist of multiple files with identical coordinates, except for one (e.g. time). With xarray we can open these datasets with

This PR is a draft in progress. I still need to propagate the

An alternative API would be to add another option to the

Feedback welcome.
```python
if not prealigned:
    datasets = align(*datasets, join='outer', copy=False, exclude=[dim])
else:
    coords = 'minimal'
```
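To make the intended fast path concrete, here is a pure-Python sketch of what skipping `align` buys. The mock datasets are plain dicts of dimension lengths, and `concat_dims` is a hypothetical name for illustration, not xarray's API:

```python
def concat_dims(datasets, dim, prealigned=False):
    """Return the dimension sizes of the concatenation result.

    `datasets` are plain dicts mapping dimension name -> length,
    standing in for real Datasets; `dim` is the concat dimension.
    """
    first = datasets[0]
    shared = [d for d in first if d != dim]
    if prealigned:
        # Fast path: only verify that shared dimension lengths agree,
        # never reading or comparing the actual index values.
        for ds in datasets[1:]:
            for d in shared:
                if ds.get(d) != first[d]:
                    raise ValueError("dimension %r does not match" % d)
    # (The normal path would instead call
    #  align(*datasets, join='outer', copy=False, exclude=[dim]),
    #  which loads and compares every index -- the expensive step.)
    result = dict(first)
    result[dim] = sum(ds[dim] for ds in datasets)
    return result
```

The point of the fast path is that only metadata (lengths) is touched; no coordinate arrays ever need to be read from disk.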
It's bad form to unilaterally override an argument with another value -- it's better to raise an error (or maybe a warning). The only value of `coords` that really breaks here is `'different'`, and even that value could conceivably make sense.
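Following that suggestion, a sketch of validating rather than overriding (`validate_coords` is a hypothetical helper name, not code from this PR):

```python
def validate_coords(coords, prealigned):
    # Hypothetical validation: instead of silently replacing the user's
    # `coords` argument with 'minimal', reject the one value that cannot
    # work without comparing coordinate values.
    if prealigned and coords == 'different':
        raise ValueError(
            "coords='different' requires comparing coordinate values, "
            "which prealigned mode skips")
    return coords
```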
What about just adding the option `coords='prealigned'`?
My initial thought was that, for prealigned data, all coords should just be drawn from the first object. But on second thought, what if there are other coords in the later datasets that do need to be concatenated, e.g. concat over `time` with an auxiliary coordinate `iteration_number` with dimension `time`.
It definitely doesn't work with `coords='different'`. I have not tried all the other options. I have a hard time conceptualizing what the different `coords` options do. Some guidance would be very welcome. I don't really understand what the function `_calc_concat_over` does.
This enhancement makes a lot of sense to me. Two things worth considering:
I guess we would want to check that (a) the necessary variables and dimensions exist in all datasets and (b) the dimensions have the same length. We would want to bypass the actual reading of the indices.

I agree it would be nicer to subsume this logic into

What is
I can add more careful checks once we sort out the align question.
It verifies that all dimensions have the same length, and coordinates along all dimensions (used for indexing) also match. Unlike the normal version of

It does not check that the necessary dimensions and variables exist in all datasets. But we should do that as part of the logic in
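The check described above might look like this in outline. The mock datasets are dicts of lengths and index labels, and `check_prealigned` is a hypothetical name, not the PR's actual function:

```python
def check_prealigned(datasets, concat_dim):
    """Verify dimension lengths and index coordinates match exactly.

    Each mock dataset is a dict like
    {'dims': {'time': 2, 'lat': 3}, 'indexes': {'lat': [10, 20, 30]}}.
    """
    ref = datasets[0]
    for ds in datasets[1:]:
        for name, size in ref['dims'].items():
            if name == concat_dim:
                continue
            if ds['dims'].get(name) != size:
                raise ValueError("length of dimension %r differs" % name)
            # Unlike align(), no reindexing is attempted: index labels
            # must already be identical across all inputs.
            if ds['indexes'].get(name) != ref['indexes'].get(name):
                raise ValueError("index along %r differs" % name)
```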
As I think about this further, I realize it might be futile to avoid reading the dimensions from all the files. This is a basic part of how `open_dataset` works.
Well, we could potentially write a fast path constructor for loading multiple netcdf files that avoids `open_dataset`. We just need another way to specify the schema, e.g., using NCML.
> On Fri, May 19, 2017 at 10:53 AM Ryan Abernathey wrote: As I think about this further, I realize it might be futile to avoid reading the dimensions from all the files. This is a basic part of how `open_dataset` works.
Since the expensive part (for me) is actually reading all the coordinates, I'm not sure that this PR makes sense any more. The same thing I am going for here could probably be accomplished by allowing the user to pass kwargs to `concat` via `open_mfdataset`.

For really big datasets, I think we will want to go the NCML approach, generating the xarray metadata as a pre-processing step. Then we could add a function like
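The pass-through being proposed is simple plumbing; a generic sketch (all names here are hypothetical stand-ins, not xarray's actual signatures):

```python
def open_mfdataset_sketch(paths, open_one, concat, dim, **concat_kwargs):
    # Hypothetical plumbing: open each file, then forward any extra keyword
    # arguments (e.g. a prealigned flag or a coords option) straight to
    # concat, so no dedicated fast-path logic is needed at this layer.
    datasets = [open_one(p) for p in paths]
    return concat(datasets, dim, **concat_kwargs)
```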
Sounds good to me!
@rabernat - I'm just catching up on this issue. Is your last comment indicating that we should close this PR?
Yes, I think it should be closed. There are better ways to accomplish the desired goals. Specifically, allowing the user to pass kwargs to `concat` via `open_mfdataset` would be useful.
Okay thanks, closing now. We can always reopen this if necessary.
- Passes `git diff upstream/master | flake8 --diff`
- `whats-new.rst` for all changes and `api.rst` for new API

This is an initial PR to bypass index alignment and coordinate checking when concatenating datasets.