Slow performance of open_mfdataset for actual data #252

Open
davbyr opened this issue Feb 17, 2021 · 1 comment

davbyr (Collaborator) commented Feb 17, 2021

xarray.open_mfdataset can be unusably slow when loading actual model data rather than the example data. This is a known issue with plenty of discussion online; a central long-running thread is:

pydata/xarray#1385

I think getting multi-file loading working well and quickly in COAsT is essential. There are some potential solutions here:

https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets

The suggested use of compat="override" significantly improves performance for me: instead of checking every file for equality, each variable is taken from the first dataset, and data_vars="minimal"/coords="minimal" only concatenate variables in which the concatenation dimension actually appears. For example:

# combine="nested" is required when passing concat_dim explicitly
nemo = xr.open_mfdataset(fn_nemo_data, combine="nested",
                         concat_dim="time_counter",
                         data_vars="minimal", coords="minimal",
                         compat="override")

It looks like we had this implemented at some point, but it has been commented out. There is also this dask setting to apply before calling open_mfdataset:

with dask.config.set(**{'array.slicing.split_large_chunks': True}):

It seems to improve performance (or at least silences the large-chunk warnings). Putting it all together:

with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    nemo = xr.open_mfdataset(fn_nemo_data, combine="nested",
                             concat_dim="time_counter", chunks={},
                             data_vars="minimal", coords="minimal",
                             compat="override", parallel=True)

Currently, the corresponding piece of code in COAsT is:

self.dataset = xr.open_mfdataset(
            directory_to_files, chunks=chunks, parallel=True, 
            combine="by_coords") #, compat='override')

AC:
Find and implement the best selection of arguments for using open_mfdataset with multiple NEMO data files.
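
As a starting point, here is a minimal sketch of what the loader could look like with the arguments above. The method name load_multiple is hypothetical; self.dataset, directory_to_files and chunks are taken from the current code, and the exact combination of flags still needs benchmarking against real NEMO output:

import dask
import xarray as xr

def load_multiple(self, directory_to_files, chunks=None):
    # Split oversized chunks produced by slicing, then lazily open all
    # files, concatenating along time_counter without per-file checks.
    with dask.config.set(**{'array.slicing.split_large_chunks': True}):
        self.dataset = xr.open_mfdataset(
            directory_to_files, chunks=chunks, parallel=True,
            combine="nested", concat_dim="time_counter",
            data_vars="minimal", coords="minimal", compat="override")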

davbyr linked a pull request on Feb 17, 2021 that will close this issue

davbyr (Collaborator) commented Feb 18, 2021

Managed to solve my problem, so this may not be necessary. The first file of the model output had an extra variable (pot_density) that was inconsistent with the rest of the files, so open_mfdataset was having a hard time (it looks for a common file structure across files and does not handle differences well). The arguments above still fix that problem in this use case. I imagine this situation could be relatively common though, so it is still something to keep in mind.
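
If inconsistent variables like this turn out to be common, xarray's preprocess hook could guard against them, since it is applied to each file before concatenation. A minimal sketch, assuming we only need to drop the stray pot_density seen here (drop_inconsistent is a hypothetical helper):

import xarray as xr

def drop_inconsistent(ds):
    # Drop the extra variable that appears only in the first file;
    # errors="ignore" makes this a no-op for files without it.
    return ds.drop_vars("pot_density", errors="ignore")

nemo = xr.open_mfdataset(fn_nemo_data, combine="nested",
                         concat_dim="time_counter", parallel=True,
                         data_vars="minimal", coords="minimal",
                         compat="override", preprocess=drop_inconsistent)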
