
HDF5/Tensorflow compatibility issues with NetCDF4 file saving #242

Closed
angelolab opened this issue Oct 21, 2019 · 5 comments

@angelolab

angelolab commented Oct 21, 2019

The current set of requirements in the docker image is not compatible with the newest scipy backend for saving data in xarray. In particular, if I add xarray==0.12.1 and netcdf4==1.4.2 as requirements, rebuild the image, and run the code below, the legacy format (NETCDF3_64BIT) works, but the newest version (NETCDF4) does not.

import numpy as np
import xarray as xr

data = np.zeros((1024, 1024))

data_xr = xr.DataArray(data, coords=[range(1024), range(1024)], dims=["rows", "cols"])
data_xr.to_netcdf('example_CDF4.nc', format="NETCDF4")
data_xr.to_netcdf('example_CDF3.nc', format="NETCDF3_64BIT")

This is an issue because NETCDF3_64BIT only supports saving files that are 4 GB or less. The segmentation channels for a typical MIBI cohort take up more space than this.

A current workaround is to save the cohort into multiple distinct data files and run each of them through deepcell separately.
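The splitting workaround can be sketched as below. This is a minimal illustration, not code from the repo: `chunk_slices` and `max_per_file` are hypothetical names, and the `to_netcdf` usage in the comment assumes a cohort stored as a single xarray with images along the leading axis.

```python
def chunk_slices(n_items, max_per_file):
    """Split n_items along the leading axis into contiguous (start, stop)
    ranges with at most max_per_file items each."""
    if max_per_file < 1:
        raise ValueError("max_per_file must be >= 1")
    return [(start, min(start + max_per_file, n_items))
            for start in range(0, n_items, max_per_file)]

# Hypothetical usage against a cohort xarray (names are illustrative):
#   for i, (lo, hi) in enumerate(chunk_slices(cohort_xr.shape[0], 5)):
#       cohort_xr[lo:hi].to_netcdf("cohort_part_%d.nc" % i,
#                                  format="NETCDF3_64BIT")
```

Each part then stays under the NETCDF3_64BIT size limit and can be run through deepcell on its own.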

The error message I get is below:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/xarray/backends/file_manager.py in acquire(self, needs_lock)
    166             try:
--> 167                 file = self._cache[self._key]
    168             except KeyError:

/usr/local/lib/python3.5/dist-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     41         with self._lock:
---> 42             value = self._cache[key]
     43             self._cache.move_to_end(key)

KeyError: [<function _open_netcdf4_group at 0x7fda1b93ae18>, ('/data/example_CDF4.nc', CombinedLock([<unlocked _thread.lock object at 0x7fda1b91c8c8>, <unlocked _thread.lock object at 0x7fda1b91c8f0>])), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('group', None), ('persist', False))]

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-23-9e2eb803f1ea> in <module>()
----> 1 xr_firstload = xr.open_dataarray(os.path.join('/data/' + "example_CDF4.nc"))

/usr/local/lib/python3.5/dist-packages/xarray/backends/api.py in open_dataarray(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime)
    507                            drop_variables=drop_variables,
    508                            backend_kwargs=backend_kwargs,
--> 509                            use_cftime=use_cftime)
    510 
    511     if len(dataset.data_vars) != 1:

/usr/local/lib/python3.5/dist-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime)
    361         if engine == 'netcdf4':
    362             store = backends.NetCDF4DataStore.open(
--> 363                 filename_or_obj, group=group, lock=lock, **backend_kwargs)
    364         elif engine == 'scipy':
    365             store = backends.ScipyDataStore(filename_or_obj, **backend_kwargs)

/usr/local/lib/python3.5/dist-packages/xarray/backends/netCDF4_.py in open(cls, filename, mode, format, group, clobber, diskless, persist, lock, lock_maker, autoclose)
    350             kwargs=dict(group=group, clobber=clobber, diskless=diskless,
    351                         persist=persist, format=format))
--> 352         return cls(manager, lock=lock, autoclose=autoclose)
    353 
    354     @property

/usr/local/lib/python3.5/dist-packages/xarray/backends/netCDF4_.py in __init__(self, manager, lock, autoclose)
    309 
    310         self._manager = manager
--> 311         self.format = self.ds.data_model
    312         self._filename = self.ds.filepath()
    313         self.is_remote = is_remote_uri(self._filename)

/usr/local/lib/python3.5/dist-packages/xarray/backends/netCDF4_.py in ds(self)
    354     @property
    355     def ds(self):
--> 356         return self._manager.acquire().value
    357 
    358     def open_store_variable(self, name, var):

/usr/local/lib/python3.5/dist-packages/xarray/backends/file_manager.py in acquire(self, needs_lock)
    171                     kwargs = kwargs.copy()
    172                     kwargs['mode'] = self._mode
--> 173                 file = self._opener(*self._args, **kwargs)
    174                 if self._mode == 'w':
    175                     # ensure file doesn't get overriden when opened again

/usr/local/lib/python3.5/dist-packages/xarray/backends/netCDF4_.py in _open_netcdf4_group(filename, lock, mode, group, **kwargs)
    242     import netCDF4 as nc4
    243 
--> 244     ds = nc4.Dataset(filename, mode=mode, **kwargs)
    245 
    246     with close_on_error(ds):

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

OSError: [Errno -101] NetCDF: HDF error: b'/data/example_CDF4.nc
@ngreenwald
Collaborator

Hey @willgraf, update on this. I was going to submit an issue to the netCDF4 package maintainers, but I can't recreate the issue outside of Docker.

If I set up a virtualenv, pip install from the requirements.txt file used in the current docker build, and run the code, it doesn't generate the above error. However, running the same code from a jupyter notebook inside docker does trigger the error.

Is there something docker specific that could be causing this, or a hidden dependency that isn't captured in the requirements.txt file that's leading to the issue?

Here are the steps to reproduce the error-free version outside of docker:

# set up venv
$ python3 -m venv test_env
$ source test_env/bin/activate
$ pip install -r requirements.txt

# generate example xarray file
$ python3
>>> import numpy as np
>>> import xarray as xr
>>> temp_data = np.zeros((50, 50))
>>> temp_xr = xr.DataArray(temp_data, coords=[range(50), range(50)], dims=["rows", "cols"])

# both save formats work inside the venv
>>> temp_xr.to_netcdf("test_output_NETCDF3_64BIT.nc", format="NETCDF3_64BIT")
>>> temp_xr.to_netcdf("test_output_NETCDF4.nc", format="NETCDF4")

I'm using the following requirements.txt file:
pandas>=0.23.3,<1
numpy>=1.16.4,<2
scipy>=1.1.0,<2
scikit-image>=0.14.1,<1
scikit-learn>=0.19.1,<1
tensorflow-gpu==1.14
jupyter>=1.0.0,<2
nbformat>=4.4.0,<5
keras-applications==1.0.8
networkx>=2.1
opencv-python>=3.4.2.17,<4
cython>=0.28
pathlib==1.0.1
xarray
netcdf4
h5py
deepcell-tracking>=0.2.3

@ngreenwald
Collaborator

Hi Will, I think I figured out the issue.
The issue only comes up if I load deepcell before calling any of the xarray functions.

However, if I load xarray first, I don't get the error, even if I then load deepcell afterwards.

I think which library gets used as the backend depends on which package is initialized first.

I'm going to close this, since the workaround of importing in a different order is fine for now.
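Since the fix depends on import order, a notebook could check for the bad ordering up front. This is a hedged sketch using only the standard library; the function name is made up, and the `modules` parameter exists only so the check can be exercised without either library installed.

```python
import sys

def tensorflow_imported_before_xarray(modules=None):
    """Return True if tensorflow is already loaded but xarray is not.

    Inside this Docker image, that ordering is the state in which a
    subsequent NETCDF4 save fails with "NetCDF: HDF error".
    """
    modules = sys.modules if modules is None else modules
    return "tensorflow" in modules and "xarray" not in modules

# Hypothetical guard at the top of a notebook:
#   if tensorflow_imported_before_xarray():
#       print("warning: import xarray before deepcell/tensorflow, "
#             "or NETCDF4 saves may fail")
```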

@ngreenwald
Collaborator

So now that I'm adding a multiplexed applications model, this issue has come up again.

Here's the summary from what I discovered so far:
The xarray package uses the netCDF4 library (which writes through HDF5) to save large files. This doesn't play well with Tensorflow.
For example:

import tensorflow
import numpy as np
import xarray as xr

# create a 10x10 xarray with labeled dimensions
first_xr = xr.DataArray(np.zeros((10, 10)), 
                        coords=[range(10), range(10)], 
                        dims=['rows', 'cols'])

# save it to disk
first_xr.to_netcdf('test_save.xr')

The above save command fails with an HDF error.

The workaround I discovered was that if the save command comes before tensorflow is imported, it works, and so do subsequent save/load commands:

import numpy as np
import xarray as xr

# create a 10x10 xarray with labeled dimensions
first_xr = xr.DataArray(np.zeros((10, 10)), 
                        coords=[range(10), range(10)], 
                        dims=['rows', 'cols'])

# save it to disk
first_xr.to_netcdf('test_save.xr')

import tensorflow

# create an 11x11 xarray with labeled dimensions
second_xr = xr.DataArray(np.zeros((11, 11)),
                         coords=[range(11), range(11)],
                         dims=['rows', 'cols'])

second_xr.to_netcdf('test_save_2.xr')

However, the really strange thing is that both versions work outside docker! Specifically, if I make a virtualenv and pip install all the same requirements, I don't have any issues executing the first example.

This is with
xarray==0.12.1
netCDF4==1.5.3

added to the requirements.txt file.

@ngreenwald ngreenwald reopened this Apr 23, 2020
@willgraf
Contributor

willgraf commented Apr 23, 2020

Seems like others have run into this issue as well. It could be a scipy engine vs netcdf engine issue?

This other issue makes it sound like h5py and netcdf do not play well together.


Which format are you using? It sounds like NETCDF3_64BIT supports 2+ GB files and is also supported by the scipy backend.
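One way to act on this is to pick xarray's backend engine from the target format, since the scipy engine avoids HDF5 entirely but only handles NETCDF3 formats. A minimal sketch; `pick_engine` is an illustrative helper, not part of either codebase:

```python
def pick_engine(fmt):
    """Choose an xarray backend engine for a netCDF format string.

    The scipy engine is pure Python and never touches HDF5, so it
    sidesteps the HDF5/Tensorflow clash, but it can only read and
    write the NETCDF3 formats.
    """
    netcdf3_formats = {"NETCDF3_CLASSIC", "NETCDF3_64BIT"}
    if fmt in netcdf3_formats:
        return "scipy"      # no HDF5 involved
    return "netcdf4"        # NETCDF4/HDF5-based formats need the netCDF4 lib

# Hypothetical usage (data_xr as in the examples above):
#   data_xr.to_netcdf("out.nc", format="NETCDF3_64BIT",
#                     engine=pick_engine("NETCDF3_64BIT"))
```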

@ngreenwald
Copy link
Collaborator

So I had been using NETCDF3_64BIT before, but it would fail on files that were around 3 GB. It now appears to work on files up to 10 GB. Not sure why that's the case, but that's more than enough for our typical use case.

I'll close this (again) for now. Thanks for the help! If it crops back up again, I may have to investigate not installing with pip, which appears to be what triggers these issues.
