
Avoid loading entire dataset by getting the nbytes in an array #7356

Merged
7 commits merged into pydata:main on Dec 12, 2022

Conversation

hmaarrfk (Contributor) commented Dec 5, 2022

Using `.data` accidentally tries to load whole lazy arrays into memory.

Sad.

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
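
For context, a minimal sketch of the failure mode (the file name is hypothetical; any NetCDF file larger than RAM will do). Opening a file without `chunks` gives lazily indexed arrays, and before this change `nbytes` went through `.data`, which materializes them:

```python
import xarray as xr

# Hypothetical path to a dataset larger than RAM.
ds = xr.open_dataset("large_file.nc")  # lazy: no values are read yet

# Before this PR, nbytes was computed via .data, which converts the
# lazily indexed arrays to numpy, so this single line could start
# swapping on a big enough file:
print(ds.nbytes)
```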

hmaarrfk (Contributor, Author) commented Dec 5, 2022

I personally do not even think the `hasattr` check is really that useful; you might as well use `size` and `itemsize`.

@hmaarrfk marked this pull request as ready for review December 5, 2022 03:31
hmaarrfk (Contributor, Author) commented Dec 5, 2022

Looking into the history a little more, I seem to be proposing to revert 60f8c3d.

I think this is important since many users have arrays that are larger than memory. I found this bug when trying to access the number of bytes of a 16 GB dataset that I was loading on my wimpy laptop; it is not fun to start swapping. I feel like others might be hitting this too.

xref:
#6797
#4842

hmaarrfk (Contributor, Author) commented Dec 5, 2022

I think that, at the very least, the current implementation works as well as the old one for arrays defined by the sparse package.
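
To make the sparse case concrete, a small example (assuming the `sparse` package is installed): `sparse.COO` exposes `nbytes` itself, so the `hasattr` branch reports it without densifying the array.

```python
import numpy as np
import sparse
import xarray as xr

# A sparse duck array; its nbytes reflects only the stored
# coordinates and values, not the full dense shape.
s = sparse.COO.from_numpy(np.eye(1000))
da = xr.DataArray(s)

print(da.nbytes)  # taken from the sparse array itself, no densification
```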

hmaarrfk (Contributor, Author) commented Dec 5, 2022

It seems that checking `hasattr` on the `_data` variable achieves both purposes.
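
In code, the approach looks roughly like this sketch of the `nbytes` property (paraphrased, not necessarily the exact merged diff); the fallback branch also covers the `size`/`itemsize` suggestion above:

```python
@property
def nbytes(self) -> int:
    """Total bytes consumed by the elements of the array."""
    # Ask the wrapped _data object directly, so lazily indexed arrays
    # are never materialized just to measure their size.
    if hasattr(self._data, "nbytes"):
        return self._data.nbytes
    else:
        # size and dtype are metadata, available without loading values.
        return self.size * self.dtype.itemsize
```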

Illviljan (Contributor)

Is that test targeting your issue with RAM crashing the laptop? Shouldn't there be some check of whether the values were loaded?

How did you import your data? `self.data` looks like this:

```python
@property
def data(self) -> Any:
    """
    The Variable's data as an array. The underlying array type
    (e.g. dask, sparse, pint) is preserved.

    See Also
    --------
    Variable.to_numpy
    Variable.as_numpy
    Variable.values
    """
    if is_duck_array(self._data):
        return self._data
    else:
        return self.values
```

I was expecting your data to be a duck array?

hmaarrfk (Contributor, Author) commented Dec 6, 2022

No explicit test was added to ensure that the data isn't loaded. I just experienced this bug enough (we would accidentally load 100 GB files in our code base) that I knew exactly how to fix it.

If you want, I can add a test to ensure that future optimizations to `nbytes` do not trigger a data load.

I was hoping the one-line fix would be a shoo-in.

hmaarrfk (Contributor, Author) commented Dec 6, 2022

The data is loaded from a NetCDF store through `open_dataset`.

Illviljan (Contributor)

I'm not really opposed to this change; `shape` and `dtype` use `self._data` as well.

Without using `chunks={}` in `open_dataset`? I just find it a little odd that it's not a duck array. What type is `self._data`?

This test just looked so similar to the tests in #6797. I think you can do a similar lazy test, taking inspiration from:

```python
def test_lazy_array_wont_compute() -> None:
    from xarray.core.indexing import LazilyIndexedArray

    class LazilyIndexedArrayNotComputable(LazilyIndexedArray):
        def __array__(self, dtype=None):
            raise NotImplementedError("Computing this array is not possible.")

    arr = LazilyIndexedArrayNotComputable(np.array([1, 2]))
    var = xr.DataArray(arr)

    # These will crash if var.data are converted to numpy arrays:
    var.__repr__()
    var._repr_html_()
```
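
A possible adaptation for this PR (the test name and assertion are a hedged sketch, not the exact test that was merged): reuse the non-computable array and read `nbytes`, which raises if anything calls `__array__`.

```python
import numpy as np
import xarray as xr
from xarray.core.indexing import LazilyIndexedArray

def test_getting_nbytes_does_not_load() -> None:
    class LazilyIndexedArrayNotComputable(LazilyIndexedArray):
        def __array__(self, dtype=None):
            raise NotImplementedError("Computing this array is not possible.")

    arr = LazilyIndexedArrayNotComputable(np.array([1, 2]))
    var = xr.DataArray(arr)

    # Raises NotImplementedError if nbytes converts the data to numpy:
    _ = var.nbytes
```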

hmaarrfk (Contributor, Author) commented Dec 6, 2022

Very smart test!

hmaarrfk (Contributor, Author) commented Dec 6, 2022

Yes, without `chunks` or anything.

dcherian (Contributor) left a comment:

LGTM. thanks!

@dcherian added the "plan to merge" (final call for comments) label Dec 7, 2022
@dcherian changed the title from "Avoid instantiating entire dataset by getting the nbytes in an array" to "Avoid loading entire dataset by getting the nbytes in an array" Dec 12, 2022
@dcherian enabled auto-merge (squash) December 12, 2022 16:27
@dcherian merged commit 021c73e into pydata:main Dec 12, 2022
hmaarrfk (Contributor, Author)

👍🏾

hmaarrfk (Contributor, Author)

Any chance of a release? This is quite breaking for large datasets that cannot fit in memory.

dcherian added a commit to dcherian/xarray that referenced this pull request Jan 18, 2023
* main: (41 commits)
  v2023.01.0 whats-new (pydata#7440)
  explain keep_attrs in docstring of apply_ufunc (pydata#7445)
  Add sentence to open_dataset docstring (pydata#7438)
  pin scipy version in doc environment (pydata#7436)
  Improve performance for backend datetime handling (pydata#7374)
  fix typo (pydata#7433)
  Add lazy backend ASV test (pydata#7426)
  Pull Request Labeler - Workaround sync-labels bug (pydata#7431)
  see also : groupby in resample doc and vice-versa (pydata#7425)
  Some alignment optimizations (pydata#7382)
  Make `broadcast` and `concat` work with the Array API (pydata#7387)
  remove `numbagg` and `numba` from the upstream-dev CI (pydata#7416)
  [pre-commit.ci] pre-commit autoupdate (pydata#7402)
  Preserve original dtype when accessing MultiIndex levels (pydata#7393)
  [pre-commit.ci] pre-commit autoupdate (pydata#7389)
  [pre-commit.ci] pre-commit autoupdate (pydata#7360)
  COMPAT: Adjust CFTimeIndex.get_loc for pandas 2.0 deprecation enforcement (pydata#7361)
  Avoid loading entire dataset by getting the nbytes in an array (pydata#7356)
  `keep_attrs` for pad (pydata#7267)
  Bump pypa/gh-action-pypi-publish from 1.5.1 to 1.6.4 (pydata#7375)
  ...
TomNicholas (Member)

This came up in the xarray office hours today, and I'm confused about why this PR made any difference to the behavior at all. The `.data` property just points to `._data`, so why would it matter which one we check?

dcherian (Contributor) commented Mar 17, 2023

Because we have lazy data-reading functionality:

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
var = ds.air.variable

print(type(var._data))              # memory cached array
print(type(var._data.array.array))  # ah, that's wrapping a lazy array, no data read in yet
print(var._data.size)               # can access size
print(type(var._data.array.array))  # still a lazy array

# .data forces a disk load
print(type(var.data))               # oops, disk load
print(type(var._data))              # "still memory cached array"
print(type(var._data.array.array))  # but that's wrapping numpy data in memory
```

Output:

```
<class 'xarray.core.indexing.MemoryCachedArray'>
<class 'xarray.core.indexing.LazilyIndexedArray'>
3869000
<class 'xarray.core.indexing.LazilyIndexedArray'>
<class 'numpy.ndarray'>
<class 'xarray.core.indexing.MemoryCachedArray'>
<class 'numpy.ndarray'>
```
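
With this PR applied, a fresh session stays lazy when only the size is requested, since `nbytes` now consults `_data` (a sketch based on the session above):

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
var = ds.air.variable

print(var.nbytes)                   # derived from _data / metadata, no disk read
print(type(var._data.array.array))  # still a LazilyIndexedArray
```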
