
Opening fsspec s3 file twice results in invalid start byte #6813

Closed
4 tasks done
wroberts4 opened this issue Jul 19, 2022 · 9 comments · Fixed by #7304
Comments

wroberts4 (Contributor) commented Jul 19, 2022

What happened?

When I open an fsspec s3 file twice, it results in an error, "file-like object read/write pointer not at the start of the file".

Here's a Dockerfile I used for the environment:

FROM condaforge/mambaforge:4.12.0-0
RUN mamba install -y --strict-channel-priority -c conda-forge python=3.10 dask h5netcdf xarray fsspec s3fs

Input1:

import fsspec
import xarray as xr
fs = fsspec.filesystem('s3', anon=True)
fp = 'noaa-goes16/ABI-L1b-RadF/2019/079/14/OR_ABI-L1b-RadF-M3C03_G16_s20190791400366_e20190791411133_c20190791411180.nc'
data = fs.open(fp)
xr.open_dataset(data, engine='h5netcdf', chunks={})
xr.open_dataset(data, engine='h5netcdf', chunks={})

Output1:

Traceback (most recent call last):
  File "//example.py", line 26, in <module>
    xr.open_dataset(data, engine='h5netcdf', chunks={})
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/api.py", line 531, in open_dataset
    backend_ds = backend.open_dataset(
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 389, in open_dataset
    store = H5NetCDFStore.open(
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 157, in open
    magic_number = read_magic_number_from_file(filename)
  File "/opt/conda/lib/python3.10/site-packages/xarray/core/utils.py", line 645, in read_magic_number_from_file
    raise ValueError(
ValueError: cannot guess the engine, file-like object read/write pointer not at the start of the file, please close and reopen, or use a context manager

Input2:

import fsspec
import xarray as xr
fs = fsspec.filesystem('s3', anon=True)
fp = 'noaa-goes16/ABI-L1b-RadF/2019/079/14/OR_ABI-L1b-RadF-M3C03_G16_s20190791400366_e20190791411133_c20190791411180.nc'
data = fs.open(fp, mode='r')
xr.open_dataset(data, engine='h5netcdf', chunks={})
xr.open_dataset(data, engine='h5netcdf', chunks={})

Output2:

Traceback (most recent call last):
  File "//example.py", line 25, in <module>
    xr.open_dataset(data, engine='h5netcdf', chunks={})
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/api.py", line 531, in open_dataset
    backend_ds = backend.open_dataset(
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 389, in open_dataset
    store = H5NetCDFStore.open(
  File "/opt/conda/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 157, in open
    magic_number = read_magic_number_from_file(filename)
  File "/opt/conda/lib/python3.10/site-packages/xarray/core/utils.py", line 650, in read_magic_number_from_file
    magic_number = filename_or_obj.read(count)
  File "/opt/conda/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte


What did you expect to happen?

I expect both calls to open_dataset to yield the same result and not error. The following runs without errors:

import fsspec
import xarray as xr
fs = fsspec.filesystem('s3', anon=True)
fp = 'noaa-goes16/ABI-L1b-RadF/2019/079/14/OR_ABI-L1b-RadF-M3C03_G16_s20190791400366_e20190791411133_c20190791411180.nc'
data = fs.open(fp)
xr.open_dataset(data, engine='h5netcdf', chunks={})
data = fs.open(fp)
xr.open_dataset(data, engine='h5netcdf', chunks={})

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

I see the same error mentioned in other issues like #3991, but it was determined to be a problem with the input data.

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-348.20.1.el8_5.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: None

xarray: 2022.6.0rc0
pandas: 1.4.3
numpy: 1.23.1
scipy: None
netCDF4: None
pydap: None
h5netcdf: 1.0.1
h5py: 3.7.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.7.0
distributed: 2022.7.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 63.2.0
pip: 22.0.4
conda: 4.13.0
pytest: None
IPython: None
sphinx: None

@wroberts4 wroberts4 added bug needs triage Issue that has not been reviewed by xarray team member labels Jul 19, 2022
wroberts4 (Contributor, Author) commented Jul 19, 2022

Note that the second example fails on the first open_dataset, which I assume is expected since the file is opened in text mode rather than mode='rb'.
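The UnicodeDecodeError in Output2 can be reproduced without S3 at all. A minimal stdlib sketch (the byte string is a hypothetical stand-in for the netCDF file's contents) of what happens when binary HDF5 bytes go through a text-mode reader:

```python
import io

# Stand-in for the netCDF file on S3: HDF5 files begin with the magic
# bytes b"\x89HDF\r\n\x1a\n", and 0x89 is not a valid UTF-8 start byte.
binary_payload = io.BytesIO(b"\x89HDF\r\n\x1a\n")

# Wrapping it in a text-mode reader (which is roughly what mode='r'
# amounts to) fails on the very first byte, matching the traceback above.
text_view = io.TextIOWrapper(binary_payload, encoding="utf-8")
try:
    text_view.read(8)
    error = None
except UnicodeDecodeError as err:
    error = err

print(error)  # 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
```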

wroberts4 (Contributor, Author):
Adding filename_or_obj.seek(0) fixes this, but I am not sure whether it poses a security risk:

def read_magic_number_from_file(filename_or_obj, count=8) -> bytes:
    # check byte header to determine file type
    if isinstance(filename_or_obj, bytes):
        magic_number = filename_or_obj[:count]
    elif isinstance(filename_or_obj, io.IOBase):
        filename_or_obj.seek(0)
        if filename_or_obj.tell() != 0:
            raise ValueError(
                "cannot guess the engine, "
                "file-like object read/write pointer not at the start of the file, "
                "please close and reopen, or use a context manager"
            )
        magic_number = filename_or_obj.read(count)
        filename_or_obj.seek(0)
    else:
        raise TypeError(f"cannot read the magic number from {type(filename_or_obj)}")
    return magic_number

djhoese (Contributor) commented Aug 3, 2022

I talked with @wroberts4 about this in person, and if we're not missing some reason not to .seek(0) on a data source, then this seems like a simple convenience and user experience improvement. We were thinking it might make more sense to change the function to look like:

def read_magic_number_from_file(filename_or_obj, count=8) -> bytes:
    # check byte header to determine file type
    if isinstance(filename_or_obj, bytes):
        magic_number = filename_or_obj[:count]
    elif isinstance(filename_or_obj, io.IOBase):
        if filename_or_obj.tell() != 0:
            filename_or_obj.seek(0)
            # warn about re-seeking?
        magic_number = filename_or_obj.read(count)
        filename_or_obj.seek(0)
    else:
        raise TypeError(f"cannot read the magic number from {type(filename_or_obj)}")
    return magic_number

Additionally, the isinstance check is for io.IOBase but that base class isn't guaranteed to have a .read method. The check should probably be for RawIOBase:

https://docs.python.org/3/library/io.html#class-hierarchy

@kmuehlbauer @lamorton I saw you commented on the almost related #3991, do you have any thoughts on this? Should we put a PR together to continue the discussion? Maybe the fsspec folks (@martindurant?) have an opinion on this?

martindurant (Contributor):
Yes, it is reasonable to always seek(0) or to copy the file. I am not certain why/where xarray is caching the open file, though - I would have thought that a new file instance is made for each open_dataset(). I am not certain whether seeking/reading from the same file in multiple places might have unforeseen consequences, such as when doing open_dataset in multiple threads.
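The "copy the file" option could look something like the sketch below: read the bytes once and hand each consumer its own in-memory buffer with an independent read pointer. This is not xarray or fsspec API, just a stdlib illustration of the idea (fresh_buffer is a hypothetical helper, and the byte string stands in for the remote file):

```python
import io

def fresh_buffer(file_like):
    """Return an independent in-memory copy of a binary file-like object.

    Each caller gets its own read/write pointer, so multiple consumers
    cannot disturb one another (at the cost of holding the bytes in RAM).
    """
    file_like.seek(0)
    return io.BytesIO(file_like.read())

# Demo with a stand-in for the S3 file handle:
shared = io.BytesIO(b"\x89HDF\r\n\x1a\npayload")
a = fresh_buffer(shared)
b = fresh_buffer(shared)
assert a.read(4) == b.read(4) == b"\x89HDF"
```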

I am mildly against subclassing from RawIOBase, since some file-likes might choose to implement text mode right in the class (as opposed to a text wrapper layered on top). Pretty surprised that it doesn't have read()/write(), though, since all the derived classes do.

djhoese (Contributor) commented Aug 3, 2022

I am not certain whether seeking/reading from the same file in multiple places might have unforeseen consequences, such as when doing open_dataset in multiple threads.

Oh duh, that's a good point. So it might be fine dask-wise if the assumption is that open_dataset is called in the main thread and then dask is used to do computations on the arrays later on. If we're talking regular Python Threads or dask delayed functions that are calling open_dataset on the same file-like object (that was passed to the worker function) then it would cause issues. Possibly rare case, but still probably something that xarray wants to support.
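For what it's worth, the thread-safe pattern being described is "one handle per thread" rather than seeking a shared handle from many threads. A small stdlib sketch of that pattern (with fsspec this would be a fresh fs.open(fp) per thread; BytesIO stands in for the remote file here):

```python
import io
import threading

raw = b"\x89HDF\r\n\x1a\n" + bytes(64)  # stand-in for the remote file's bytes
results = []
lock = threading.Lock()

def read_magic():
    # Each thread opens its OWN file-like object, so seek/read races
    # on a shared pointer cannot occur.
    handle = io.BytesIO(raw)
    magic = handle.read(8)
    with lock:
        results.append(magic)

threads = [threading.Thread(target=read_magic) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert all(m == b"\x89HDF\r\n\x1a\n" for m in results)
```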

Yeah I thought the .read/.write omission from the IOBase class was odd too. Just wanted to point out that the if block is using .read but IOBase is not guaranteed to have .read.

wroberts4 (Contributor, Author):
@djhoese Is that not already an existing problem since filename_or_obj.seek(0) is called in the existing code after reading the magic number?

djhoese (Contributor) commented Aug 3, 2022

Good point. My initial answer was going to be that it isn't a problem because in the second usage of the file we would get the exception about .tell() not being at 0, but after the .seek(0) that would be true and we wouldn't get that exception. So...I guess maybe it should be documented that xarray doesn't support opening the same file-like object from different threads. In which case, making the changes suggested here would only add usability/functionality and not cause any additional issues...unless we're missing something.

djhoese (Contributor) commented Aug 4, 2022

@wroberts4 I'd say maybe make a pull request and we'll see what (if any) tests fail and what the people in charge of merging think about it. I think we've gone through the various possibilities and I think if there were any thread-safety issues trying to be protected against with the exception as it was, they weren't actually being protected against (later reading of the file could have caused an issue).

@dcherian dcherian added topic-backends and removed needs triage Issue that has not been reviewed by xarray team member labels Sep 12, 2022
weiji14 (Contributor) commented Nov 21, 2022

Just hit this same issue, mentioned downstream at xarray-contrib/datatree#130, while trying to read ICESat-2 HDF5 files from S3, but realized that the fix should happen in xarray, so I've started a PR at #7304 to fix this (thanks @djhoese for the code snippet at #6813 (comment))! Will look into the unit test failures and fix them one by one as they pop up on CI.
