enable loading remote hdf5 files #2782

Merged 12 commits on Mar 16, 2019
43 changes: 24 additions & 19 deletions xarray/backends/api.py
@@ -170,8 +170,8 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
         Strings and Path objects are interpreted as a path to a netCDF file
         or an OpenDAP URL and opened with python-netCDF4, unless the filename
         ends with .gz, in which case the file is gunzipped and opened with
-        scipy.io.netcdf (only netCDF3 supported). File-like objects are opened
-        with scipy.io.netcdf (only netCDF3 supported).
+        scipy.io.netcdf (only netCDF3 supported). Byte-strings or file-like
+        objects are opened by scipy.io.netcdf (netCDF3) or h5py (netCDF4/HDF).
     group : str, optional
         Path to the netCDF4 group in the given file to open (only works for
         netCDF4 files).
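Concretely, the updated docstring covers call patterns like the following sketch ('example.nc' is a placeholder filename, not part of the diff):

import xarray as xr

# A string path (or OpenDAP URL) dispatches on the filename, as before.
ds = xr.open_dataset('example.nc')

# Raw bytes or a file-like object: with this PR, the first bytes of the
# stream pick the engine -- scipy for netCDF3, h5netcdf for netCDF4/HDF5.
with open('example.nc', 'rb') as f:
    ds_from_bytes = xr.open_dataset(f.read())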
@@ -310,17 +310,9 @@ def maybe_decode_store(store, lock=False):
     if isinstance(filename_or_obj, backends.AbstractDataStore):
         store = filename_or_obj
         ds = maybe_decode_store(store)
-    elif isinstance(filename_or_obj, str):
-
-        if (isinstance(filename_or_obj, bytes) and
-                filename_or_obj.startswith(b'\x89HDF')):
-            raise ValueError('cannot read netCDF4/HDF5 file images')
-        elif (isinstance(filename_or_obj, bytes) and
-                filename_or_obj.startswith(b'CDF')):
-            # netCDF3 file images are handled by scipy
-            pass
-        elif isinstance(filename_or_obj, str):
-            filename_or_obj = _normalize_path(filename_or_obj)
+    elif isinstance(filename_or_obj, str):
+        filename_or_obj = _normalize_path(filename_or_obj)
 
         if engine is None:
             engine = _get_default_engine(filename_or_obj,
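The block removed here was unreachable: the inner bytes checks sat inside an isinstance(filename_or_obj, str) branch, most likely a leftover from the Python 2 basestring days. Handling of bytes input moves to the file-like branch below.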
@@ -352,11 +344,24 @@ def maybe_decode_store(store, lock=False):
         with close_on_error(store):
             ds = maybe_decode_store(store)
     else:
-        if engine is not None and engine != 'scipy':
-            raise ValueError('can only read file-like objects with '
-                             "default engine or engine='scipy'")
-        # assume filename_or_obj is a file-like object
-        store = backends.ScipyDataStore(filename_or_obj)
+        if engine not in [None, 'scipy', 'h5netcdf']:
+            raise ValueError('can only read bytes or file-like objects with '
+                             "engine = None, 'scipy', or 'h5netcdf'")
+        else:
+            if isinstance(filename_or_obj, bytes):
+                filename_or_obj = BytesIO(filename_or_obj)
+            # read first bytes of file-like object to determine engine
+            magic_number = filename_or_obj.read(8)
+            filename_or_obj.seek(0)
+            if magic_number.startswith(b'CDF'):
+                store = backends.ScipyDataStore(filename_or_obj,
+                                                **backend_kwargs)
+            elif magic_number.startswith(b'\211HDF\r\n\032\n'):
+                store = backends.H5NetCDFStore(filename_or_obj, group=group,
+                                               lock=lock, **backend_kwargs)
+            else:
+                raise ValueError("byte header doesn't match netCDF3 or "
+                                 "netCDF4/HDF5: {}".format(magic_number))
Member: I suspect this is one of those rare cases where it's best not to report all the details -- most users probably don't know about magic numbers. Maybe something like:

  • "file-like object is not a netCDF file: {}".format(filename_or_obj), or
  • "bytes do not represent in-memory netCDF file: {}. (Pass a string or pathlib.Path object to read a filename from disk.)".format(filename_or_obj[:80] + b'...' if len(filename_or_obj) > 80 else b'')

Contributor Author: went with the first, more concise option

             ds = maybe_decode_store(store)
 
     # Ensure source filename always stored in dataset object (GH issue #2550)
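The dispatch above is easy to factor out and test in isolation. A standalone sketch of the same sniffing logic follows (guess_engine is an illustrative name, not xarray API):

from io import BytesIO

def guess_engine(obj):
    # Wrap raw bytes so both input kinds expose read()/seek().
    if isinstance(obj, bytes):
        obj = BytesIO(obj)
    magic_number = obj.read(8)
    obj.seek(0)
    # netCDF3 streams begin with b'CDF'; HDF5 streams begin with the
    # 8-byte signature b'\x89HDF\r\n\x1a\n' (b'\211HDF\r\n\032\n' in octal).
    if magic_number.startswith(b'CDF'):
        return 'scipy'
    if magic_number.startswith(b'\x89HDF\r\n\x1a\n'):
        return 'h5netcdf'
    raise ValueError('byte header does not match netCDF3 or netCDF4/HDF5')

For example, guess_engine(open('example.nc', 'rb')) would return 'h5netcdf' for a netCDF4 file.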
@@ -383,8 +388,8 @@ def open_dataarray(filename_or_obj, group=None, decode_cf=True,
         Strings and Paths are interpreted as a path to a netCDF file or an
         OpenDAP URL and opened with python-netCDF4, unless the filename ends
         with .gz, in which case the file is gunzipped and opened with
-        scipy.io.netcdf (only netCDF3 supported). File-like objects are opened
-        with scipy.io.netcdf (only netCDF3 supported).
+        scipy.io.netcdf (only netCDF3 supported). Byte-strings or file-like
+        objects are opened by scipy.io.netcdf (netCDF3) or h5py (netCDF4/HDF).
     group : str, optional
         Path to the netCDF4 group in the given file to open (only works for
         netCDF4 files).
32 changes: 32 additions & 0 deletions xarray/tests/test_backends.py
@@ -1955,6 +1955,38 @@ def test_dump_encodings_h5py(self):
assert actual.x.encoding['compression_opts'] is None


# Requires h5py>2.9.0
Member: Can you add a pytest.mark.skipif based on the version number? (The test on Travis-CI is failing on Python 3.5 because it has an old version of h5py installed.)

Contributor Author: I think I did this correctly (added some lines to tests/__init__.py).
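A version gate along the lines requested might look like the sketch below; the helper name and placement are illustrative, since the actual lines landed in tests/__init__.py:

import pytest
from distutils.version import LooseVersion

try:
    import h5py
    has_h5py_29 = LooseVersion(h5py.__version__) >= LooseVersion('2.9.0')
except ImportError:
    has_h5py_29 = False

# Decorate tests that need h5py's file-like object support (h5py >= 2.9).
requires_h5py_29 = pytest.mark.skipif(
    not has_h5py_29, reason='requires h5py >= 2.9.0')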

@requires_h5netcdf
class TestH5NetCDFFileObject(TestH5NetCDFData):
    engine = 'h5netcdf'

    @network
    def test_h5remote(self):
        # alternative: http://era5-pds.s3.amazonaws.com/2008/01/main.nc
        import requests
        url = ('https://www.unidata.ucar.edu/'
               'software/netcdf/examples/test_hgroups.nc')
        print(url)
        bytes = requests.get(url).content
        with xr.open_dataset(bytes) as ds:
            assert len(ds['UTC_time']) == 74
            assert ds['UTC_time'].attrs['name'] == 'time'

    def test_h5bytes(self):
        import h5py
        bio = BytesIO()
        with h5py.File(bio) as ds:
Member: Wouldn't it be nice if we supported writing to file-like objects, too? :)

(But don't do that now, this PR is a nice logical size already.)

Contributor Author: Agreed. Hopefully someone else could pick that up!

            v = np.array(2.0)
            ds['scalar'] = v
        bio.seek(0)
        with xr.open_dataset(bio) as ds:
            v = ds['scalar']
Member: prefer using assert_identical and comparing to another expected dataset object.

Contributor Author (@scottyhq, Mar 6, 2019): I've changed that test to use assert_identical, and am using with raises_regex() to make sure the new error exceptions are hit.

            assert v == np.array(2.0)
            assert v.dtype == 'float64'
            assert v.ndim == 0
            assert list(v.attrs) == []
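For reference, the assert_identical style the reviewer suggested could look roughly like this, reusing bio from the test above (the expected dataset is inferred from the values written there):

from xarray.testing import assert_identical

expected = xr.Dataset({'scalar': ((), 2.0)})  # 0-d float64, no attrs
bio.seek(0)
with xr.open_dataset(bio) as actual:
    assert_identical(expected, actual)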


@requires_h5netcdf
@requires_dask
@pytest.mark.filterwarnings('ignore:deallocating CachingFileManager')