
Reset file pointer to 0 when reading file stream #7304

Merged 8 commits into pydata:main from weiji14:fix_s3_start_byte on Dec 1, 2022

Conversation

@weiji14 (Contributor) commented Nov 21, 2022

Instead of raising a ValueError about the file pointer not being at the start of the file, reset the file pointer automatically to zero, and warn that the pointer has been reset.
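
A minimal sketch of the behavior change described above (the helper name is hypothetical; the actual change lives in xarray's handling of file-like objects):

import io
import warnings

def ensure_pointer_at_start(filelike) -> None:
    # Hypothetical helper illustrating the change: instead of raising a
    # ValueError when the read/write pointer is not at byte 0, seek back
    # to the start and warn.
    if filelike.tell() != 0:
        warnings.warn(
            "file-like object read/write pointer not at the start of the file, "
            "so resetting file pointer to zero"
        )
        filelike.seek(0)  # previously: raise ValueError(...)

# Usage: a buffer that has already been read from is rewound instead of rejected.
buf = io.BytesIO(b"\x89HDF\r\n\x1a\n" + b"\x00" * 8)
buf.read(4)
ensure_pointer_at_start(buf)
assert buf.tell() == 0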
@weiji14 (Contributor, Author) commented Nov 21, 2022

Traceback from the 2 test failures at https://github.com/pydata/xarray/actions/runs/3516849430/jobs/5893926099#step:9:252

=================================== FAILURES ===================================
____________________ TestH5NetCDFFileObject.test_open_twice ____________________
[gw2] linux -- Python 3.10.7 /home/runner/micromamba-root/envs/xarray-tests/bin/python

self = <xarray.tests.test_backends.TestH5NetCDFFileObject object at 0x7f211de81e40>

    def test_open_twice(self) -> None:
        expected = create_test_data()
        expected.attrs["foo"] = "bar"
>       with pytest.raises(ValueError, match=r"read/write pointer not at the start"):
E       Failed: DID NOT RAISE <class 'ValueError'>

/home/runner/work/xarray/xarray/xarray/tests/test_backends.py:3034: Failed
___________________ TestH5NetCDFFileObject.test_open_fileobj ___________________
[gw2] linux -- Python 3.10.7 /home/runner/micromamba-root/envs/xarray-tests/bin/python

self = <xarray.tests.test_backends.TestH5NetCDFFileObject object at 0x7f211de82530>

    @requires_scipy
    def test_open_fileobj(self) -> None:
        # open in-memory datasets instead of local file paths
        expected = create_test_data().drop_vars("dim3")
        expected.attrs["foo"] = "bar"
        with create_tmp_file() as tmp_file:
            expected.to_netcdf(tmp_file, engine="h5netcdf")
    
            with open(tmp_file, "rb") as f:
                with open_dataset(f, engine="h5netcdf") as actual:
                    assert_identical(expected, actual)
    
                f.seek(0)
                with open_dataset(f) as actual:
                    assert_identical(expected, actual)
    
                f.seek(0)
                with BytesIO(f.read()) as bio:
                    with open_dataset(bio, engine="h5netcdf") as actual:
                        assert_identical(expected, actual)
    
                f.seek(0)
                with pytest.raises(TypeError, match="not a valid NetCDF 3"):
                    open_dataset(f, engine="scipy")
    
            # TODO: this additional open is required since scipy seems to close the file
            # when it fails on the TypeError (though didn't when we used
            # `raises_regex`?). Ref https://github.com/pydata/xarray/pull/5191
            with open(tmp_file, "rb") as f:
                f.seek(8)
                with pytest.raises(
                    ValueError,
                    match="match in any of xarray's currently installed IO",
                ):
>                   with pytest.warns(
                        RuntimeWarning,
                        match=re.escape("'h5netcdf' fails while guessing"),
                    ):
E                   Failed: DID NOT WARN. No warnings of type (<class 'RuntimeWarning'>,) matching the regex were emitted.
E                    Regex: 'h5netcdf'\ fails\ while\ guessing
E                    Emitted warnings: [ UserWarning('cannot guess the engine, file-like object read/write pointer not at the start of the file, so resetting file pointer to zero. If this does not work, please close and reopen, or use a context manager'),
E                     RuntimeWarning("deallocating CachingFileManager(<class 'h5netcdf.core.File'>, <_io.BufferedReader name='/tmp/tmpoxdfl12i/temp-720.nc'>, mode='r', kwargs={'invalid_netcdf': None, 'decode_vlen_strings': True}, manager_id='b62ec6c8-b328-409c-bc5d-bbab265bea51'), but file is not already closed. This may indicate a bug.")]

/home/runner/work/xarray/xarray/xarray/tests/test_backends.py:3076: Failed
=========================== short test summary info ============================
FAILED xarray/tests/test_backends.py::TestH5NetCDFFileObject::test_open_twice - Failed: DID NOT RAISE <class 'ValueError'>
FAILED xarray/tests/test_backends.py::TestH5NetCDFFileObject::test_open_fileobj - Failed: DID NOT WARN. No warnings of type (<class 'RuntimeWarning'>,) matching the regex were emitted.
 Regex: 'h5netcdf'\ fails\ while\ guessing
 Emitted warnings: [ UserWarning('cannot guess the engine, file-like object read/write pointer not at the start of the file, so resetting file pointer to zero. If this does not work, please close and reopen, or use a context manager'),
  RuntimeWarning("deallocating CachingFileManager(<class 'h5netcdf.core.File'>, <_io.BufferedReader name='/tmp/tmpoxdfl12i/temp-720.nc'>, mode='r', kwargs={'invalid_netcdf': None, 'decode_vlen_strings': True}, manager_id='b62ec6c8-b328-409c-bc5d-bbab265bea51'), but file is not already closed. This may indicate a bug.")]
= 2 failed, 14608 passed, 1190 skipped, 203 xfailed, 73 xpassed, 54 warnings in 581.98s (0:09:41) =

Fixes the `Failed: DID NOT RAISE <class 'ValueError'>` test failure.
The ValueError and RuntimeWarning aren't raised anymore.
Comment on lines 658 to 663
warnings.warn(
    "cannot guess the engine, "
    "file-like object read/write pointer not at the start of the file, "
    "so resetting file pointer to zero. If this does not work, "
    "please close and reopen, or use a context manager"
)
@weiji14 (Contributor, Author) commented Nov 21, 2022

Is a warning actually necessary here, or can we remove the warning and let the file pointer reset to zero silently?

Edit: warning has been removed in 929cb62
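
For context, a hedged illustration of the kind of situation that reaches this code path (assumes h5netcdf is installed; this is not an excerpt from xarray's tests). Writing a dataset to an in-memory buffer leaves the pointer past the start, so a follow-up open used to fail:

import io
import xarray as xr

ds = xr.Dataset({"x": ("t", [1, 2, 3])})
buf = io.BytesIO()
ds.to_netcdf(buf, engine="h5netcdf")  # after writing, buf.tell() is typically no longer 0

# Before this PR: ValueError about the read/write pointer not being at the start.
# After this PR: the pointer is reset to zero (silently, once 929cb62 removes the warning).
with xr.open_dataset(buf, engine="h5netcdf") as opened:
    print(opened)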

@weiji14 marked this pull request as ready for review November 21, 2022 18:54
@dcherian (Contributor) commented

@martindurant do you have time to take a look here please?

@martindurant (Contributor) commented

It looks reasonable to me. I'm not sure if the warning is needed or not - we don't expect anyone to see it, or if they do, necessarily do anything about it. It's not unusual for code interacting with a file-like object to move the file pointer.

@weiji14 (Contributor, Author) commented Nov 29, 2022

It looks reasonable to me. I'm not sure if the warning is needed or not - we don't expect anyone to see it, or if they do, necessarily do anything about it. It's not unusual for code interacting with a file-like object to move the file pointer.

Hmm, in that case, I'm leaning towards removing the warning. The file pointer is reset anyway after reading the magic byte number, and that hasn't caused any issues (as mentioned in #6813 (comment)), so it should be more or less safe. Let me push another commit. Edit: done at 929cb62.

File pointer is reset to zero after reading the magic byte number anyway, so should be ok not to warn about this.
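
A sketch of the magic-number peek that the comment above refers to, with illustrative names (xarray's actual helper differs in its details):

def read_magic_number_sketch(filelike, count: int = 8) -> bytes:
    # Illustrative only: engine guessing peeks at the first few bytes to
    # identify the file format, then rewinds so the backend can read the
    # file from the start.
    if filelike.tell() != 0:
        filelike.seek(0)  # reset silently instead of raising, per this PR
    magic_number = filelike.read(count)  # e.g. b"\x89HDF\r\n\x1a\n" for HDF5-based netCDF4
    filelike.seek(0)  # the pointer ends up back at zero either way
    return magic_number

Since the pointer is rewound to zero after the peek in any case, resetting it silently beforehand does not change what the backend ultimately reads.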
@dcherian (Contributor) commented Dec 1, 2022

Thanks @weiji14 !

@dcherian merged commit ea9b3a5 into pydata:main Dec 1, 2022
@weiji14 deleted the fix_s3_start_byte branch December 1, 2022 16:28
dcherian added a commit to dcherian/xarray that referenced this pull request Dec 2, 2022
* upstream/main: (39 commits)
  Support the new compression argument in netCDF4 > 1.6.0 (pydata#6981)
  Remove setuptools-scm-git-archive, require setuptools-scm>=7 (pydata#7253)
  Fix mypy failures (pydata#7343)
  Docs: add example of writing and reading groups to netcdf (pydata#7338)
  Reset file pointer to 0 when reading file stream (pydata#7304)
  Enable mypy warn unused ignores (pydata#7335)
  Optimize some copying (pydata#7209)
  Add parse_dims func (pydata#7051)
  Fix coordinate attr handling in `xr.where(..., keep_attrs=True)` (pydata#7229)
  Remove code used to support h5py<2.10.0 (pydata#7334)
  [pre-commit.ci] pre-commit autoupdate (pydata#7330)
  Fix PR number in what’s new (pydata#7331)
  Enable `origin` and `offset` arguments in `resample` (pydata#7284)
  fix doctests: supress urllib3 warning (pydata#7326)
  fix flake8 config (pydata#7321)
  implement Zarr v3 spec support (pydata#6475)
  Fix polyval overloads (pydata#7315)
  deprecate pynio backend (pydata#7301)
  mypy - Remove some ignored packages and modules (pydata#7319)
  Switch to T_DataArray in .coords (pydata#7285)
  ...
weiji14 added a commit to CryoInTheCloud/hub-image that referenced this pull request Dec 5, 2022
Bump xarray from 2022.11.0 to 2022.12.0 which contains a
[bugfix](pydata/xarray#7304) useful for reading
multiple groups from ICESat-2 HDF5 files in AWS S3 buckets.

Also taking the opportunity to sort packages alphabetically in the
`environment.yml` and reorganize some sections originally added in #2.
Hopefully this will make it clearer on where new conda packages can be
added in the future!
Successfully merging this pull request may close these issues.

Opening fsspec s3 file twice results in invalid start byte