speed up opening multiple files with changing data variables #1845

Closed
jbusecke opened this issue Jan 19, 2018 · 1 comment

Comments

@jbusecke
Contributor

jbusecke commented Jan 19, 2018

Code Sample, a copy-pastable example if possible

I am trying to open several ocean model data files. Partway through the model run, additional variables started being written to the files, so the first file looks like this:

<xarray.Dataset>
Dimensions:         (st_edges_ocean: 51, st_ocean: 50, time: 1, xt_ocean: 3600, yt_ocean: 2700)
Coordinates:
  * xt_ocean        (xt_ocean) float64 -279.9 -279.8 -279.7 -279.6 -279.5 ...
  * yt_ocean        (yt_ocean) float64 -81.11 -81.07 -81.02 -80.98 -80.94 ...
  * time            (time) float64 4.401e+04
  * st_ocean        (st_ocean) float64 5.034 15.1 25.22 35.36 45.58 55.85 ...
  * st_edges_ocean  (st_edges_ocean) float64 0.0 10.07 20.16 30.29 40.47 ...
Data variables:
    jp_recycle      (time, st_ocean, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    jp_reminp       (time, st_ocean, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    jp_uptake       (time, st_ocean, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    jo2             (time, st_ocean, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    dic_stf         (time, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 2700, 3600), chunksize=(1, 2700, 3600)>
    o2_stf          (time, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 2700, 3600), chunksize=(1, 2700, 3600)>
Attributes:
    filename:   01210101.ocean_minibling_term_src.nc
    title:      CM2.6_miniBling
    grid_type:  mosaic
    grid_tile:  1

and the last file looks like this (with the additional data variables o2_btf, dic_btf, and po4_btf):


<xarray.Dataset>
Dimensions:         (st_edges_ocean: 51, st_ocean: 50, time: 1, xt_ocean: 3600, yt_ocean: 2700)
Coordinates:
  * xt_ocean        (xt_ocean) float64 -279.9 -279.8 -279.7 -279.6 -279.5 ...
  * yt_ocean        (yt_ocean) float64 -81.11 -81.07 -81.02 -80.98 -80.94 ...
  * st_ocean        (st_ocean) float64 5.034 15.1 25.22 35.36 45.58 55.85 ...
  * st_edges_ocean  (st_edges_ocean) float64 0.0 10.07 20.16 30.29 40.47 ...
  * time            (time) float64 7.25e+04
Data variables:
    jp_recycle      (time, st_ocean, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    jp_reminp       (time, st_ocean, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    jp_uptake       (time, st_ocean, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    jo2             (time, st_ocean, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    dic_stf         (time, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 2700, 3600), chunksize=(1, 2700, 3600)>
    dic_btf         (time, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 2700, 3600), chunksize=(1, 2700, 3600)>
    o2_stf          (time, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 2700, 3600), chunksize=(1, 2700, 3600)>
    o2_btf          (time, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 2700, 3600), chunksize=(1, 2700, 3600)>
    po4_btf         (time, yt_ocean, xt_ocean) float64 dask.array<shape=(1, 2700, 3600), chunksize=(1, 2700, 3600)>
Attributes:
    date:       created 2014-01-08
    program:    time_average_netcdf.rb
    history:    Perform time-means on all variables in 01990101.ocean_minibli...
    filename:   01990101.ocean_minibling_term_src.nc
    title:      CM2.6_miniBling
    grid_type:  mosaic
    grid_tile:  1

If I specify the additional variables to be dropped, reading all files with xarray.open_mfdataset works like a charm.
But without specifying the variables to be dropped it takes an excruciating amount of time to load.
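For readers hitting the same problem, here is a minimal sketch of how the list of variables to drop could be worked out automatically instead of by hand. The helper names `variables_not_in_every_file` and `open_consistent` are hypothetical and not part of xarray; the sketch assumes that scanning each file's metadata once is acceptable and that `drop_variables` is forwarded by `xarray.open_mfdataset` to `xarray.open_dataset`.

```python
def variables_not_in_every_file(names_per_file):
    """Return, sorted, the variable names absent from at least one file.

    names_per_file maps a file path to the data variable names found in it.
    """
    name_sets = [set(names) for names in names_per_file.values()]
    if not name_sets:
        return []
    return sorted(set.union(*name_sets) - set.intersection(*name_sets))


def open_consistent(paths, **kwargs):
    """Open paths with xarray.open_mfdataset, dropping inconsistent variables.

    Hypothetical helper, not part of xarray. Needs xarray, dask and a netCDF
    backend installed; extra keyword arguments go to open_mfdataset.
    """
    import xarray as xr  # imported here so the helper above stays dependency-free

    names_per_file = {}
    for path in paths:
        # open_dataset loads lazily, so listing data_vars does not read the arrays
        with xr.open_dataset(path) as ds:
            names_per_file[path] = list(ds.data_vars)
    to_drop = variables_not_in_every_file(names_per_file)
    return xr.open_mfdataset(paths, drop_variables=to_drop, **kwargs)
```

For the two files shown above, the first helper would return dic_btf, o2_btf and po4_btf, which are the variables that had to be dropped manually.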

First of all, I was wondering whether xarray could display a warning when this situation occurs, suggesting that these variables be passed via the drop_variables keyword. That would have saved me a ton of digging time.
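To make the request concrete, the kind of check meant here could look roughly like the sketch below. It is only an illustration in plain Python: `warn_about_inconsistent_variables` is a hypothetical name, the warning text is made up, and nothing here reflects how xarray is actually implemented.

```python
import warnings


def warn_about_inconsistent_variables(names_per_file):
    """Warn if some variables are missing from some files; return what was found.

    names_per_file maps a file path to the data variable names found in it.
    The return value maps each inconsistent variable to the files lacking it.
    """
    all_names = set()
    for names in names_per_file.values():
        all_names.update(names)

    missing = {}
    for name in sorted(all_names):
        lacking = [path for path, names in names_per_file.items() if name not in names]
        if lacking:
            missing[name] = lacking

    if missing:
        warnings.warn(
            "These variables are not present in every file, which can make "
            "combining the files slow: "
            + ", ".join(missing)
            + ". Consider passing them as drop_variables.",
            UserWarning,
            stacklevel=2,
        )
    return missing
```

Returning the mapping as well as warning lets a caller feed the keys straight into drop_variables if that is the desired behaviour.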

Even better would be some way to read such datasets quickly. If we could specify a fastpath option (as suggested in #1823), perhaps that could speed this task up (given that all dimensions stay the same)?

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-642.15.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US
LOCALE: en_US.ISO8859-1

xarray: 0.10.0rc2-2-g1a01208
pandas: 0.20.3
numpy: 1.13.3
scipy: 0.19.1
netCDF4: 1.3.0
h5netcdf: 0.4.2
Nio: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.16.0
matplotlib: 2.1.0
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 36.3.0
pip: 9.0.1
conda: None
pytest: 3.2.3
IPython: 6.2.1
sphinx: None

@jbusecke
Contributor Author

Wondering if this is still an issue. I don't have the data to check it, but in my experience these kinds of operations have been much better in recent versions. I'll close this for now.
