
slow performance with open_mfdataset #1385

Open
rabernat opened this issue Apr 26, 2017 · 52 comments
@rabernat
Contributor

We have a dataset stored across multiple netCDF files. We are getting very slow performance with open_mfdataset, and I would like to improve this.

Each individual netCDF file looks like this:

%time ds_single = xr.open_dataset('float_trajectories.0000000000.nc')
ds_single
CPU times: user 14.9 ms, sys: 48.4 ms, total: 63.4 ms
Wall time: 60.8 ms

<xarray.Dataset>
Dimensions:  (npart: 8192000, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 1993-01-01
  * npart    (npart) int32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    z        (time, npart) float32 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
    vort     (time, npart) float32 -9.71733e-10 -9.72858e-10 -9.73001e-10 ...
    u        (time, npart) float32 0.000545563 0.000544884 0.000544204 ...
    v        (time, npart) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    x        (time, npart) float32 180.016 180.047 180.078 180.109 180.141 ...
    y        (time, npart) float32 -79.9844 -79.9844 -79.9844 -79.9844 ...

As shown above, a single data file opens in ~60 ms.

When I call open_mfdataset on 49 files (each with a different time dimension but the same npart), here is what happens:

%time ds = xr.open_mfdataset('*.nc')
ds
CPU times: user 1min 31s, sys: 25.4 s, total: 1min 57s
Wall time: 2min 4s

<xarray.Dataset>
Dimensions:  (npart: 8192000, time: 49)
Coordinates:
  * npart    (npart) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * time     (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
    z        (time, npart) float64 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
    vort     (time, npart) float64 -9.717e-10 -9.729e-10 -9.73e-10 -9.73e-10 ...
    u        (time, npart) float64 0.0005456 0.0005449 0.0005442 0.0005437 ...
    v        (time, npart) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    x        (time, npart) float64 180.0 180.0 180.1 180.1 180.1 180.2 180.2 ...
    y        (time, npart) float64 -79.98 -79.98 -79.98 -79.98 -79.98 -79.98 ...

It takes over 2 minutes to open the dataset. Specifying concat_dim='time' does not improve performance.

Here is %prun of the open_mfdataset command.

         748994 function calls (724222 primitive calls) in 142.160 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       49   62.455    1.275   62.458    1.275 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
       49   47.207    0.963   47.209    0.963 base.py:1067(is_unique)
      196    7.198    0.037    7.267    0.037 {operator.getitem}
       49    4.632    0.095    4.687    0.096 netCDF4_.py:182(_open_netcdf4_group)
      240    3.189    0.013    3.426    0.014 numeric.py:2476(array_equal)
       98    1.937    0.020    1.937    0.020 {numpy.core.multiarray.arange}
4175/3146    1.867    0.000    9.296    0.003 {numpy.core.multiarray.array}
       49    1.525    0.031  119.144    2.432 alignment.py:251(reindex_variables)
       24    1.065    0.044    1.065    0.044 {method 'cumsum' of 'numpy.ndarray' objects}
       12    1.010    0.084    1.010    0.084 {method 'sort' of 'numpy.ndarray' objects}
5227/4035    0.660    0.000    1.688    0.000 collections.py:50(__init__)
       12    0.600    0.050    3.238    0.270 core.py:2761(insert)
12691/7497    0.473    0.000    0.875    0.000 indexing.py:363(shape)
   110728    0.425    0.000    0.663    0.000 {isinstance}
       12    0.413    0.034    0.413    0.034 {method 'flatten' of 'numpy.ndarray' objects}
       12    0.341    0.028    0.341    0.028 {numpy.core.multiarray.where}
        2    0.333    0.166    0.333    0.166 {pandas._join.outer_join_indexer_int64}
        1    0.331    0.331  142.164  142.164 <string>:1(<module>)

It looks like most of the time is being spent in reindex_variables. I understand why this happens: xarray needs to verify that the indexes match in order to concatenate the files.

Is there any obvious way I could improve the load time? For example, can I give a hint to xarray that this reindex_variables step is not necessary, since I know that all the npart dimensions are the same in each file?

Possibly related to #1301 and #1340.
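(In modern xarray, one way to act on such a hint is join="override", which takes the indexes from the first object instead of reindexing every file. A minimal sketch with tiny in-memory stand-ins for the files; names and sizes are illustrative, not the real 8192000-particle data:)

```python
import numpy as np
import xarray as xr

# Two in-memory stand-ins for the per-timestep files, sharing an
# identical npart index just like the real dataset.
def make_snapshot(day):
    return xr.Dataset(
        {"z": (("time", "npart"), np.full((1, 5), -0.5, dtype="float32"))},
        coords={"time": [np.datetime64("1993-01-0%d" % day)],
                "npart": np.arange(1, 6, dtype="int32")},
    )

snapshots = [make_snapshot(d) for d in (1, 2)]

# join="override" takes the npart index from the first object instead of
# reindexing every file -- exactly the expensive step that dominates the
# profile above.
ds = xr.concat(snapshots, dim="time", join="override")
```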

@rabernat
Contributor Author

cc: @geosciz, who is helping with this project.

@shoyer
Member

shoyer commented Apr 26, 2017

For example, can I give a hint to xarray that this reindex_variables step is not necessary

Yes, adding a boolean argument prealigned (defaulting to False) to concat seems like a very reasonable optimization here.

But more generally, I am a little surprised by how slow pandas.Index.get_indexer and pandas.Index.is_unique are. This suggests we should add a fast-path optimization to skip these steps in reindex_variables:

if not index.is_unique:
    raise ValueError(
        'cannot reindex or align along dimension %r because the '
        'index has duplicate values' % name)

indexer = get_indexer(index, target, method, tolerance)

Basically, if index.equals(target), we should just set indexer = np.arange(target.size). Although, if we have duplicate values in the index, the operation should arguably fail for correctness.
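(A sketch of that fast path as a hypothetical standalone helper -- the real logic lives inside xarray's reindex_variables -- with the equality check placed before the uniqueness check, mirroring the ordering described above:)

```python
import numpy as np
import pandas as pd

def reindex_indexer(index, target, name="npart"):
    # Hypothetical helper, not xarray's actual implementation.
    if index.equals(target):
        # Fast path: identical indexes, so every position maps to
        # itself and is_unique / get_indexer can be skipped entirely.
        return np.arange(target.size)
    if not index.is_unique:
        raise ValueError(
            'cannot reindex or align along dimension %r because the '
            'index has duplicate values' % name)
    # Slow path: positions of target labels within index (-1 = missing).
    return index.get_indexer(target)
```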

@rabernat
Contributor Author

rabernat commented Mar 2, 2018

An update on this long-standing issue.

I have learned that open_mfdataset can be blazingly fast if decode_cf=False but extremely slow with decode_cf=True.

As an example, I am loading a POP dataset on Cheyenne. Anyone with access can try this example.

import os

base_dir = '/glade/scratch/rpa/'
prefix = 'BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001'
code = 'pop.h.nday1.SST'
glob_pattern = os.path.join(base_dir, prefix, '%s.%s.*.nc' % (prefix, code))

def non_time_coords(ds):
    return [v for v in ds.data_vars
            if 'time' not in ds[v].dims]

def drop_non_essential_vars_pop(ds):
    return ds.drop(non_time_coords(ds))   

# this runs almost instantly
ds = xr.open_mfdataset(glob_pattern, decode_times=False, chunks={'time': 1},
                       preprocess=drop_non_essential_vars_pop, decode_cf=False)

And returns this

<xarray.Dataset>
Dimensions:     (d2: 2, nlat: 2400, nlon: 3600, time: 16401, z_t: 62, z_t_150m: 15, z_w: 62, z_w_bot: 62, z_w_top: 62)
Coordinates:
  * z_w_top     (z_w_top) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 ...
  * z_t         (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w         (z_w) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * z_t_150m    (z_t_150m) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w_bot     (z_w_bot) float32 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * time        (time) float64 7.322e+05 7.322e+05 7.322e+05 7.322e+05 ...
Dimensions without coordinates: d2, nlat, nlon
Data variables:
    time_bound  (time, d2) float64 dask.array<shape=(16401, 2), chunksize=(1, 2)>
    SST         (time, nlat, nlon) float32 dask.array<shape=(16401, 2400, 3600), chunksize=(1, 2400, 3600)>
Attributes:
    nsteps_total:  480
    tavg_sum:      64800.0
    title:         BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001
    start_time:    This dataset was created on 2016-03-14 at 05:32:30.3
    Conventions:   CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netcdf/CF-curren...
    source:        CCSM POP2, the CCSM Ocean Component
    cell_methods:  cell_methods = time: mean ==> the variable values are aver...
    calendar:      All years have exactly  365 days.
    history:       none
    contents:      Diagnostic and Prognostic Variables
    revision:      $Id: tavg.F90 56176 2013-12-20 18:35:46Z mlevy@ucar.edu $

This is roughly 45 years of daily data, one file per year.

Instead, if I just change decode_cf=True (the default), it takes forever. I can monitor what is happening via the distributed dashboard. It looks like this:
[screenshot of the dask distributed dashboard omitted]

There are more of these open_dataset tasks than there are files (45), so I can only presume there are 16401 individual tasks (one for each timestep), each of which takes about 1 s in serial.

This is a real failure of lazy decoding. Maybe it can be fixed by #1725, possibly related to #1372.

cc Pangeo folks: @jhamman, @mrocklin

@shoyer
Member

shoyer commented Mar 2, 2018

@rabernat How does performance compare if you call xarray.decode_cf() on the opened dataset? The adjustments I recently did to lazy decoding should only help once the data is already loaded into dask.

@rabernat
Contributor Author

rabernat commented Mar 9, 2018

Calling ds = xr.decode_cf(ds, decode_times=False) on the dataset returns instantly. However, the variable data is wrapped in the lazy-indexing adaptors, effectively destroying the chunks:

>>> ds.SST.variable._data
LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<_apply_mask, shape=(16401, 2400, 3600), dtype=float32, chunksize=(1, 2400, 3600)>), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None))))

Calling getitem on this array triggers the whole dask array to be computed, which would take forever and would completely blow out the notebook memory. This is because of #1372, which would be fixed by #1725.

This has actually become a major showstopper for me. I need to work with this dataset in decoded form.

Versions

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.12.62-60.64.8-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.1
pandas: 0.22.0
numpy: 1.13.3
scipy: 1.0.0
netCDF4: 1.3.1
h5netcdf: 0.5.0
h5py: 2.7.1
Nio: None
zarr: 2.2.0a2.dev176
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.1
distributed: 1.21.3
matplotlib: 2.1.2
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 38.4.0
pip: 9.0.1
conda: None
pytest: 3.3.2
IPython: 6.2.1

@shoyer
Member

shoyer commented Mar 9, 2018

OK, so it seems that we need a change to disable wrapping dask arrays with LazilyIndexedArray. Dask arrays are already lazy!

@shoyer
Member

shoyer commented Nov 10, 2018

Was this fixed by #2047?

@chuaxr

chuaxr commented Nov 15, 2018

I can confirm that

ds = xr.open_mfdataset(data_fnames,chunks={'lat':20,'time':50,'lon':24,'pfull':11},\
                      decode_cf=False)
ds = xr.decode_cf(ds)

is much faster (seconds vs minutes) than

ds = xr.open_mfdataset(data_fnames,chunks={'lat':20,'time':50,'lon':24,'pfull':11})

For reference, data_fnames is a list of 5 files, each of which is ~75 GB.

@shoyer
Member

shoyer commented Nov 15, 2018

@chuaxr I assume you're testing this with xarray 0.11?

It would be good to do some profiling to figure out what is going wrong here.

@chuaxr

chuaxr commented Nov 15, 2018

Yes, I'm on 0.11.

Nothing displays on the task stream/ progress bar when using open_mfdataset, although I can monitor progress when, say, computing the mean.

The output from %time using decode_cf = False is

CPU times: user 4.42 s, sys: 392 ms, total: 4.82 s
Wall time: 4.74 s

and for decode_cf = True:

CPU times: user 11.6 s, sys: 1.61 s, total: 13.2 s
Wall time: 3min 28s

Using xr.set_options(file_cache_maxsize=1) doesn't make any noticeable difference.

If I repeat the open_mfdataset for another 5 files (after opening the first 5), I occasionally get this warning:
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)

I only began using the dashboard recently; please let me know if there's something basic I'm missing.

@shoyer
Member

shoyer commented Nov 16, 2018

@chuaxr What do you see when you use %prun when opening the dataset? This might point to the bottleneck.

One way to fix this would be to move our call to decode_cf() in open_dataset() to after applying chunking, i.e., to switch up the order of operations on these lines:

def maybe_decode_store(store, lock=False):
    ds = conventions.decode_cf(
        store, mask_and_scale=mask_and_scale, decode_times=decode_times,
        concat_characters=concat_characters, decode_coords=decode_coords,
        drop_variables=drop_variables)

    _protect_dataset_variables_inplace(ds, cache)

    if chunks is not None:
        from dask.base import tokenize
        # if passed an actual file path, augment the token with
        # the file modification time
        if (isinstance(filename_or_obj, basestring) and
                not is_remote_uri(filename_or_obj)):
            mtime = os.path.getmtime(filename_or_obj)
        else:
            mtime = None
        token = tokenize(filename_or_obj, mtime, group, decode_cf,
                         mask_and_scale, decode_times, concat_characters,
                         decode_coords, engine, chunks, drop_variables)
        name_prefix = 'open_dataset-%s' % token
        ds2 = ds.chunk(chunks, name_prefix=name_prefix, token=token)
        ds2._file_obj = ds._file_obj
    else:
        ds2 = ds

    return ds2

In practice, this is the difference between using xarray's internal lazy array classes for decoding and using dask for decoding. I would expect to see small differences in performance between these approaches (especially when actually computing data), but for constructing the computation graph I would expect them to have similar performance. It is puzzling that dask is orders of magnitude faster -- that suggests that something else is going wrong in the normal code path for decode_cf(). It would certainly be good to understand this before trying to apply any fixes.

@chuaxr

chuaxr commented Nov 16, 2018

Sorry, I think the speedup had to do with accessing a file that had previously been loaded rather than due to decode_cf. Here's the output of prun using two different files of approximately the same size (~75 GB), run from a notebook without using distributed (which doesn't lead to any speedup):

Output of
%prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.1999010100-2000123123.sphum.nc',chunks={'lat':20,'time':50,'lon':12,'pfull':11})


          780980 function calls (780741 primitive calls) in 55.374 seconds
 
    Ordered by: internal time
 
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         7   54.448    7.778   54.448    7.778 {built-in method _operator.getitem}
    764838    0.473    0.000    0.473    0.000 core.py:169(<genexpr>)
         3    0.285    0.095    0.758    0.253 core.py:169(<listcomp>)
         2    0.041    0.020    0.041    0.020 {cftime._cftime.num2date}
         3    0.040    0.013    0.821    0.274 core.py:173(getem)
         1    0.027    0.027   55.374   55.374 <string>:1(<module>)


Output of
%prun ds = xr.open_mfdataset('/work/xrc/AM4_skc/atmos_level.2001010100-2002123123.temp.nc',chunks={'lat':20,'time':50,'lon':12,'pfull':11},
decode_cf=False)

 
          772212 function calls (772026 primitive calls) in 56.000 seconds
 
    Ordered by: internal time
 
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         5   55.213   11.043   55.214   11.043 {built-in method _operator.getitem}
    764838    0.486    0.000    0.486    0.000 core.py:169(<genexpr>)
         3    0.185    0.062    0.671    0.224 core.py:169(<listcomp>)
         3    0.041    0.014    0.735    0.245 core.py:173(getem)
         1    0.027    0.027   56.001   56.001 <string>:1(<module>)

/work isn't a remote archive, so it surprises me that this should happen.

@shoyer
Member

shoyer commented Nov 16, 2018 via email

@chuaxr

chuaxr commented Nov 16, 2018

h5netcdf fails with the following error (presumably the file is not compatible):

/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
     97         if swmr and swmr_support:
     98             flags |= h5f.ACC_SWMR_READ
---> 99         fid = h5f.open(name, flags, fapl=fapl)
    100     elif mode == 'r+':
    101         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

OSError: Unable to open file (file signature not found)

Using scipy:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    65/42   80.448    1.238   80.489    1.916 {built-in method numpy.core.multiarray.array}
   764838    0.548    0.000    0.548    0.000 core.py:169(<genexpr>)
        3    0.169    0.056    0.717    0.239 core.py:169(<listcomp>)
        2    0.041    0.021    0.041    0.021 {cftime._cftime.num2date}
        3    0.038    0.013    0.775    0.258 core.py:173(getem)
        1    0.024    0.024   81.313   81.313 <string>:1(<module>)

@sbiner

sbiner commented Feb 7, 2019

I have the same problem. open_mfdataset is 10X slower than nc.MFDataset. I used the following code to get some timing on opening 456 local netCDF files located in a nc_local directory (total size 532 MB):

import glob
import time

import netCDF4 as nc
import xarray as xr

clef = 'nc_local/*.nc'
t00 = time.time()
l_fichiers_nc = sorted(glob.glob(clef))
print ('timing glob: {:6.2f}s'.format(time.time()-t00))

# netcdf4
t00 = time.time()
ds1 = nc.MFDataset(l_fichiers_nc)
#dates1 = ouralib.netcdf.calcule_dates(ds1)
print ('timing netcdf4: {:6.2f}s'.format(time.time()-t00))

# xarray
t00 = time.time()
ds2 = xr.open_mfdataset(l_fichiers_nc)
print ('timing xarray: {:6.2f}s'.format(time.time()-t00))

# xarray tune
t00 = time.time()
ds3 = xr.open_mfdataset(l_fichiers_nc, decode_cf=False, concat_dim='time')
ds3 = xr.decode_cf(ds3)
print ('timing xarray tune: {:6.2f}s'.format(time.time()-t00))

The output I get is :

timing glob: 0.00s
timing netcdf4: 3.80s
timing xarray: 44.60s
timing xarray tune: 15.61s

I made tests on a centOS server using python2.7 and 3.6, and on mac OS as well with python3.6. The timing changes but the ratios are similar between netCDF4 and xarray.

Is there any way of making open_mfdataset go faster?

In case it helps, here are the outputs of xr.show_versions and %prun xr.open_mfdataset(l_fichiers_nc). I do not know much about the output of %prun, but I have noticed that its first two lines differ depending on whether I'm using Python 2.7 or Python 3.6. I made those tests on CentOS and macOS with anaconda environments.

for python 2.7:

        13996351 function calls (13773659 primitive calls) in 42.133 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2664   16.290    0.006   16.290    0.006 {time.sleep}
      912    6.330    0.007    6.623    0.007 netCDF4_.py:244(_open_netcdf4_group)

for python 3.6:

       9663408 function calls (9499759 primitive calls) in 31.934 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5472   15.140    0.003   15.140    0.003 {method 'acquire' of '_thread.lock' objects}
      912    5.661    0.006    5.718    0.006 netCDF4_.py:244(_open_netcdf4_group)

longer output of %prun with python3.6:

         9663408 function calls (9499759 primitive calls) in 31.934 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5472   15.140    0.003   15.140    0.003 {method 'acquire' of '_thread.lock' objects}
      912    5.661    0.006    5.718    0.006 netCDF4_.py:244(_open_netcdf4_group)
     4104    0.564    0.000    0.757    0.000 {built-in method _operator.getitem}
133152/129960    0.477    0.000    0.660    0.000 indexing.py:496(shape)
1554550/1554153    0.414    0.000    0.711    0.000 {built-in method builtins.isinstance}
      912    0.260    0.000    0.260    0.000 {method 'close' of 'netCDF4._netCDF4.Dataset' objects}
     6384    0.244    0.000    0.953    0.000 netCDF4_.py:361(open_store_variable)
      910    0.241    0.000    0.595    0.001 duck_array_ops.py:141(array_equiv)
    20990    0.235    0.000    0.343    0.000 {pandas._libs.lib.is_scalar}
37483/36567    0.228    0.000    0.230    0.000 {built-in method builtins.iter}
    93986    0.219    0.000    1.607    0.000 variable.py:239(__init__)
    93982    0.194    0.000    0.194    0.000 variable.py:706(attrs)
    33744    0.189    0.000    0.189    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}
    15511    0.175    0.000    0.638    0.000 core.py:1776(normalize_chunks)
     5930    0.162    0.000    0.350    0.000 missing.py:183(_isna_ndarraylike)
297391/296926    0.159    0.000    0.380    0.000 {built-in method builtins.getattr}
   134230    0.155    0.000    0.269    0.000 abc.py:180(__instancecheck__)
     6384    0.142    0.000    0.199    0.000 netCDF4_.py:34(__init__)
    93986    0.126    0.000    0.671    0.000 variable.py:414(_parse_dimensions)
   156545    0.119    0.000    0.811    0.000 utils.py:450(ndim)
    12768    0.119    0.000    0.203    0.000 core.py:747(blockdims_from_blockshape)
     6384    0.117    0.000    2.526    0.000 conventions.py:245(decode_cf_variable)
741183/696380    0.116    0.000    0.134    0.000 {built-in method builtins.len}
41957/23717    0.110    0.000    4.395    0.000 {built-in method numpy.core.multiarray.array}
    93978    0.110    0.000    0.110    0.000 variable.py:718(encoding)
   219940    0.109    0.000    0.109    0.000 _weakrefset.py:70(__contains__)
    99458    0.100    0.000    0.440    0.000 variable.py:137(as_compatible_data)
    53882    0.085    0.000    0.095    0.000 core.py:891(shape)
   140604    0.084    0.000    0.628    0.000 variable.py:272(shape)
     3192    0.084    0.000    0.170    0.000 utils.py:88(_StartCountStride)
    10494    0.081    0.000    0.081    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    44688    0.077    0.000    0.157    0.000 variables.py:102(unpack_for_decoding)

output of xr.show_versions()

xr.show_versions()                                                                                                    

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

xarray: 0.11.0
pandas: 0.24.1
numpy: 1.15.4
scipy: None
netCDF4: 1.4.2
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 1.1.1
distributed: 1.25.3
matplotlib: 3.0.2
cartopy: None
seaborn: None
setuptools: 40.7.3
pip: 19.0.1
conda: None
pytest: None
IPython: 7.2.0
sphinx: None

@TomNicholas
Member

TomNicholas commented Feb 7, 2019 via email

@sbiner

sbiner commented Feb 7, 2019

I just tried and it did not help ...

In [5]: run test_ouverture_fichier_nc_vs_xr.py
timing glob:   0.00s
timing netcdf4:   3.36s
timing xarray:  44.82s
timing xarray tune:  14.47s

In [6]: xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 19:04:19) 
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.11.3
pandas: 0.24.0
numpy: 1.13.3
scipy: 1.2.0
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: 1.0.0
distributed: 1.25.2
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 40.5.0
pip: 19.0.1
conda: None
pytest: None
IPython: 5.8.0
sphinx: 1.8.2

@jameshalgren

setting parallel=True seg faults... I'm betting that is some quirk of my python environment, though.

This is important! Otherwise that timing scales with number of files.
If you get that to work, then you can convert to a dask dataframe and keep things lazy.

Indeed @dcherian -- it took some experimentation to get the right engine to support parallel execution and even then, results are still mixed, which, to me, means further work is needed to isolate the issue.

Along the lines of suggestions here (thanks @jmccreight for pointing this out), we've introduced a very practical pre-processing step to rewrite the datasets so that the read is not striped across the file system, effectively isolating the performance bottleneck to a position where it can be dealt with independently. Of course, such an asynchronous workflow is not possible in all situations, so we're still looking at improving the direct performance.

Two notes as we keep working:

  • The preprocessor. Reading and re-manipulating an individual dataset is lightning fast. We saw that a small change or adjustment in the individual files, made with a preprocessor, made the multi-file read massively faster.
  • The "more sophisticated example" referenced here has proven to be very useful.

@jtomfarrar

@rabernat wrote:

An update on this long-standing issue.

I have learned that open_mfdataset can be blazingly fast if decode_cf=False but extremely slow with decode_cf=True.

I seem to be experiencing a similar (same?) issue with open_dataset:
https://stackoverflow.com/questions/71147712/can-i-force-xarray-open-dataset-to-do-a-lazy-load?stw=2

@rabernat
Contributor Author

Hi Tom! 👋

So much has evolved about xarray since this original issue was posted. However, we continue to use it as a catchall for people looking to speed up open_mfdataset. I saw your stackoverflow post. Any chance you could post a link to the actual file in question?

@jtomfarrar

Thanks, Ryan! Sure-- here's a link to the file:
https://drive.google.com/file/d/1-05bG2kF8wbvldYtDpZ3LYLyqXnvZyw1/view?usp=sharing

(I could post to a web server if there's any reason to prefer that.)

@rabernat
Contributor Author

rabernat commented Feb 17, 2022

(I could post to a web server if there's any reason to prefer that.)

In general that would be a little more convenient than google drive, because then we could download the file from python (rather than having a manual step). This would allow us to share a fully copy-pasteable code snippet to reproduce the issue. But don't worry about that for now.

First, I'd note that your issue is not really related to open_mfdataset at all, since it is reproduced just using open_dataset. The core problem is that you have ~15M timesteps, and it is taking forever to decode the times out of them. It's fast when you do decode_times=False because the data aren't actually being read. I'm going to make a post over in discussions to dig a bit deeper into this. StackOverflow isn't monitored too regularly by this community.

@jtomfarrar

Thank you, Ryan. I will post the file to a server with a stable URL and replace the google drive link in the other post. My original issue was that I wanted to not read the data (yet), only to have a look at the metadata.

@rabernat
Contributor Author

Ah ok so if that is your goal, decode_times=False should be enough to solve it.

There is a problem with the time encoding in this file. The units (days since 1950-01-01T00:00:00Z) are not compatible with the values (738457.04166667, etc.). That would place your measurements sometime in the year 3971. This is part of the problem, but not the whole story.
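(The arithmetic can be checked with stdlib dates; the "ordinal days" reading below is only a plausible guess at how the values were produced, not something confirmed by the file:)

```python
import datetime

# The raw time values (~738457.04166667) decoded with the stated units,
# "days since 1950-01-01", land far in the future:
decoded = datetime.date(1950, 1, 1) + datetime.timedelta(days=738457)
assert decoded.year == 3971

# One plausible guess: the values are proleptic-Gregorian ordinal days
# (days since 0001-01-01, as produced by datetime.date.toordinal());
# read that way, the timestamps are recent and plausible.
as_ordinal = datetime.date.fromordinal(738457)
assert as_ordinal.year == 2022
```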

@jtomfarrar

Thank you. A member of my research group made the netcdf file, so we will make a second file with the time encoding fixed.

@rabernat
Contributor Author

See deeper dive in #6284

@fraserwg

I've recently been trying to run open_mfdataset on a large list of large files. When using more than ~100 files the function became so slow that I gave up trying to run it. I then came upon this thread and discovered that if I passed the argument decode_cf=False the function would run in a matter of seconds. Applying decode_cf to the returned dataset after opening then ran in seconds and I ended up with the same dataset following this two step process as I did before. Would it be possible to change one of:

  1. where decode_cf is called in open_mfdataset — essentially, open the individual datasets with decode_cf=False and then apply decode_cf to the merged dataset before it is returned;
  2. change decode_cf=False to be the default in open_mfdataset?

To me the first solution feels better and I can make a pull request to do this.

From reading this thread I'm under the impression that there's probably something else going on under the hood that's causing the slowness of open_mfdataset at present. Obviously it would be best to address this; however, given that the problem was first raised in 2017 and a solution to the underlying problem doesn't seem to be forthcoming I'd be very pleased to see a "fix" that addresses the symptoms (the slowness) rather than the illness (whatever's going wrong behind the scenes).

@shoyer
Member

shoyer commented Aug 28, 2023

@fraserwg do you know what the performance bottleneck is in your case? (i.e., if you interrupt the computation, what is being computed?)

@dcherian
Contributor

It is very common for different netCDF files in a "dataset" (a folder) to be encoded differently so we can't set decode_cf=False by default.

there's probably something else going on under the hood that's causing the slowness of open_mfdataset at present.

There's

  1. slowness in reading coordinate information from every file. We have parallel to help a bit here.
  2. slowness in combining each file to form a single Xarray dataset. By default, we do lots of consistency checking by reading data. Xarray allows you to control this, data_vars='minimal', coords='minimal', compat='override' is a common choice.

What you're describing sounds like a failure of lazy decoding or a cftime slowdown (example), which should be fixed. If you can provide a reproducible example, that would help.

@fraserwg

fraserwg commented Sep 7, 2023

@shoyer I will double check what the bottleneck is and report back. @dcherian interestingly, the parallel option doesn't seem to do anything when decode_cf=True. From looking at the dask dashboard it seems to load each file sequentially, with each opening being carried out by a different worker but not concurrently. I will see what I can do minimal-example-wise!

@ashjbarnes

I'm another person who stumbled across this thread, and found that decode_cf = False fix to work really well.

I appreciate that we can't set this by default, but maybe this could be put into the docstring of open_mfdataset directly? It appears to be passed as a kwarg, so is hard to find despite it being such a helpful fix for so many people in this thread!

It also seems like the decode_cf step is done in serial. My dask cluster has plenty of workers but when decode_cf is set to True it only processes one of my (many) files at a time. Switching to decode_cf = False and the task stream shows my entire cluster being utilised. Perhaps this is part of the issue?

xarray version: 2024.2.0

@dcherian
Contributor

@ashjbarnes Are you able to share two small files that illustrate the issue?

@ashjbarnes

@dcherian Thanks, the files are shared in this google drive folder.

This is a spatial subsample of the same files. Perhaps it's a bad idea to store static data like the lat/lon coordinates in each file? The overall size of this is so tiny compared to the 4D data that I left the coordinates there for convenience but I'm not sure whether this has broader implications.

In my tests, running:

xr.open_mfdataset(PATH,decode_times = False,parallel = True,decode_cf = False)

on ~3000 files of 300 MB each gave an order-of-magnitude speedup over the same command with decode_cf = True on xarray 2024.2.0. The real files are chunked on disk along time and the yb dimension.

@dcherian
Contributor

dcherian commented Feb 22, 2024

Wow, thank you!

This is an amazing bug. The defaults say data_vars="all", coords="different" which means always concatenate all data_vars along the concat dimensions (here inferred to be "time") but only concatenate coords if they differ in the different files.

When decode_cf=False, lat and lon are data_vars and get concatenated without any checking or reading.
When decode_cf=True, lat and lon are promoted to coords, then get checked for equality across all files. The two variables get read sequentially from all files. This is the slowdown you see.

Once again, this is a consequence of bad defaults for concat and open_mfdataset.

I would follow https://docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets and use data_vars="minimal", coords="minimal", compat="override" which will only concatenate those variables with the time dimension, and skip any checking for variables that don't have a time dimension (simply pick the variable from the first file).
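(A self-contained sketch of why those options avoid the reads, using xr.combine_by_coords -- which open_mfdataset dispatches to -- on small in-memory stand-ins for two of the files; names and sizes are illustrative:)

```python
import numpy as np
import xarray as xr

# Two stand-ins for yearly files: a time-dependent variable plus a static
# coordinate ("lat") repeated identically in every file.
def make_file(t0):
    return xr.Dataset(
        {"SST": (("time", "lat"), np.zeros((2, 4)))},
        coords={"time": [t0, t0 + 1], "lat": np.linspace(-80, 80, 4)},
    )

parts = [make_file(0), make_file(2)]

# data_vars="minimal": only variables containing the concat dimension
# ("time") are concatenated. coords="minimal" + compat="override": the
# static lat coord is taken from the first object and never read and
# compared across every file.
ds = xr.combine_by_coords(parts, data_vars="minimal",
                          coords="minimal", compat="override")
```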

@ashjbarnes

Well done @dcherian great find! Changing the defaults does seem like a really good idea in this case

@hannahzafar

@dcherian I am yet another person stumbling on this problem. Unfortunately decode_cf = False seems to override decode_times=True (https://stackoverflow.com/questions/77243075/xarray-wont-decode-times-from-netcdf-file-with-decode-cf-false-even-if-decode) so you cannot use that fix if you want to maintain datetime objects.

@dcherian
Contributor

A simpler fix is probably decode_coords=False.

@rabernat
Contributor Author

This issue is almost seven years old! It has been "fixed" many times since my original post, but people keep finding new ways to make it reappear. 😆

It seems like having better diagnostics / logging of what is happening under the hood with open_mfdataset is what people really need. Maybe even some sort of utility to pre-scan the files and figure out if they are easily openable or not.
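(One shape such a pre-scan utility could take -- a sketch only; prescan is a made-up name, not an existing xarray function:)

```python
import glob
import xarray as xr

def prescan(pattern):
    """Open each matching file cheaply, without CF decoding, and collect
    the variable dtypes and units that open_mfdataset will have to
    reconcile, so a file with a mismatched encoding stands out before
    the expensive combine step."""
    report = {}
    for path in sorted(glob.glob(pattern)):
        with xr.open_dataset(path, decode_cf=False) as ds:
            report[path] = {
                name: (str(ds[name].dtype), ds[name].attrs.get("units"))
                for name in ds.variables
            }
    return report
```

Comparing the per-file entries (e.g. with a set of the value dicts) would flag the files whose time units or coordinate dtypes disagree with the rest.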

@jtomfarrar

This issue is almost seven years old! It has been "fixed" many times since my original post, but people keep finding new ways to make it reappear. 😆

It seems like having better diagnostics / logging of what is happening under the hood with open_mfdataset is what people really need. Maybe even some sort of utility to pre-scan the files and figure out if they are easily openable or not.

Both of those seem like great ideas. Maybe there could be a verbose or logging mode to help users identify what is wrong with the files (e.g., where the time is being spent and whether that seems suspicious). It is probably true that people (like me) will keep finding new ways to generate problematic netcdf files. (I'm sure we can think of something even worse than 20 Hz data referenced to a time origin 75 years ago).
