
Avoid loading entire dataset by getting the nbytes in an array #7356

Merged
7 commits merged into pydata:main on Dec 12, 2022

Conversation

hmaarrfk (Contributor) commented Dec 5, 2022

Using `.data` accidentally tries to load whole lazy arrays into memory.

Sad.

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
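
For context, a minimal sketch of the failure mode (the file name is hypothetical; any NetCDF file larger than RAM will do). Opening a file without `chunks` gives lazily indexed arrays, and before this change `nbytes` went through `.data`, which materializes them:

```python
import xarray as xr

# Hypothetical path to a dataset larger than RAM.
ds = xr.open_dataset("large_file.nc")  # lazy: no values are read yet

# Before this PR, nbytes was computed via .data, which converts the
# lazily indexed arrays to numpy, so this single line could start
# swapping on a big enough file:
print(ds.nbytes)
```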

hmaarrfk (Contributor, Author) commented Dec 5, 2022

I personally do not even think the `hasattr` check is really that useful; you might as well use `size` and `itemsize`.

@hmaarrfk marked this pull request as ready for review December 5, 2022 03:31
hmaarrfk (Contributor, Author) commented Dec 5, 2022

Looking into the history a little more, I seem to be proposing to revert 60f8c3d.

I think this is important since many users have arrays that are larger than memory. I found this bug when trying to access the number of bytes of a 16 GB dataset that I was loading on my wimpy laptop; it is not fun to start swapping. I feel like others might be hitting this too.

xref:
#6797
#4842

hmaarrfk (Contributor, Author) commented Dec 5, 2022

I think that, at the very least, the current implementation works as well as the old one for arrays defined by the sparse package.
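
To make the sparse case concrete, a small example (assuming the `sparse` package is installed): `sparse.COO` exposes `nbytes` itself, so the `hasattr` branch reports it without densifying the array.

```python
import numpy as np
import sparse
import xarray as xr

# A sparse duck array; its nbytes reflects only the stored
# coordinates and values, not the full dense shape.
s = sparse.COO.from_numpy(np.eye(1000))
da = xr.DataArray(s)

print(da.nbytes)  # taken from the sparse array itself, no densification
```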

hmaarrfk (Contributor, Author) commented Dec 5, 2022

It seems that checking `hasattr` on the `_data` variable achieves both purposes.
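
In code, the approach looks roughly like this sketch of the `nbytes` property (paraphrased, not necessarily the exact merged diff); the fallback branch also covers the `size`/`itemsize` suggestion above:

```python
@property
def nbytes(self) -> int:
    """Total bytes consumed by the elements of the array."""
    # Ask the wrapped _data object directly, so lazily indexed arrays
    # are never materialized just to measure their size.
    if hasattr(self._data, "nbytes"):
        return self._data.nbytes
    else:
        # size and dtype are metadata, available without loading values.
        return self.size * self.dtype.itemsize
```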

Illviljan (Contributor)

Is that test targeting your issue with RAM crashing the laptop? Shouldn't there be some check of whether the values were loaded?

How did you import your data? `self.data` looks like this:

```python
@property
def data(self) -> Any:
    """
    The Variable's data as an array. The underlying array type
    (e.g. dask, sparse, pint) is preserved.

    See Also
    --------
    Variable.to_numpy
    Variable.as_numpy
    Variable.values
    """
    if is_duck_array(self._data):
        return self._data
    else:
        return self.values
```

I was expecting your data to be a duck array?

hmaarrfk (Contributor, Author) commented Dec 6, 2022

No explicit test was added to ensure that the data isn't loaded. I just experienced this bug enough (we would accidentally load 100 GB files in our code base) that I knew exactly how to fix it.

If you want, I can add a test to ensure that future optimizations to `nbytes` do not trigger a data load.

I was hoping the one-line fix would be a shoo-in.

hmaarrfk (Contributor, Author) commented Dec 6, 2022

The data is loaded from a NetCDF store through `open_dataset`.

Illviljan (Contributor)

I'm not really opposed to this change; `shape` and `dtype` use `self._data` as well.

Without using `chunks={}` in `open_dataset`? I just find it a little odd that it's not a duck array. What type is `self._data`?

This test just looked so similar to the tests in #6797. I think you can do a similar lazy test, taking inspiration from:

```python
def test_lazy_array_wont_compute() -> None:
    from xarray.core.indexing import LazilyIndexedArray

    class LazilyIndexedArrayNotComputable(LazilyIndexedArray):
        def __array__(self, dtype=None):
            raise NotImplementedError("Computing this array is not possible.")

    arr = LazilyIndexedArrayNotComputable(np.array([1, 2]))
    var = xr.DataArray(arr)

    # These will crash if var.data are converted to numpy arrays:
    var.__repr__()
    var._repr_html_()
```
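
A possible adaptation for this PR (the test name and assertion are a hedged sketch, not the exact test that was merged): reuse the non-computable array and read `nbytes`, which raises if anything calls `__array__`.

```python
import numpy as np
import xarray as xr
from xarray.core.indexing import LazilyIndexedArray

def test_getting_nbytes_does_not_load() -> None:
    class LazilyIndexedArrayNotComputable(LazilyIndexedArray):
        def __array__(self, dtype=None):
            raise NotImplementedError("Computing this array is not possible.")

    arr = LazilyIndexedArrayNotComputable(np.array([1, 2]))
    var = xr.DataArray(arr)

    # Raises NotImplementedError if nbytes converts the data to numpy:
    _ = var.nbytes
```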

hmaarrfk (Contributor, Author) commented Dec 6, 2022

Very smart test!

hmaarrfk (Contributor, Author) commented Dec 6, 2022

Yes, without `chunks` or anything.

dcherian (Contributor) left a comment:

LGTM. thanks!

@dcherian added the "plan to merge" (final call for comments) label Dec 7, 2022
@dcherian changed the title from "Avoid instantiating entire dataset by getting the nbytes in an array" to "Avoid loading entire dataset by getting the nbytes in an array" Dec 12, 2022
@dcherian enabled auto-merge (squash) December 12, 2022 16:27
@dcherian merged commit 021c73e into pydata:main Dec 12, 2022
hmaarrfk (Contributor, Author)

👍🏾

hmaarrfk (Contributor, Author)

Any chance of a release? This is quite breaking for large datasets that cannot fit in memory.

dcherian added a commit to dcherian/xarray that referenced this pull request Jan 18, 2023
* main: (41 commits)
  v2023.01.0 whats-new (pydata#7440)
  explain keep_attrs in docstring of apply_ufunc (pydata#7445)
  Add sentence to open_dataset docstring (pydata#7438)
  pin scipy version in doc environment (pydata#7436)
  Improve performance for backend datetime handling (pydata#7374)
  fix typo (pydata#7433)
  Add lazy backend ASV test (pydata#7426)
  Pull Request Labeler - Workaround sync-labels bug (pydata#7431)
  see also : groupby in resample doc and vice-versa (pydata#7425)
  Some alignment optimizations (pydata#7382)
  Make `broadcast` and `concat` work with the Array API (pydata#7387)
  remove `numbagg` and `numba` from the upstream-dev CI (pydata#7416)
  [pre-commit.ci] pre-commit autoupdate (pydata#7402)
  Preserve original dtype when accessing MultiIndex levels (pydata#7393)
  [pre-commit.ci] pre-commit autoupdate (pydata#7389)
  [pre-commit.ci] pre-commit autoupdate (pydata#7360)
  COMPAT: Adjust CFTimeIndex.get_loc for pandas 2.0 deprecation enforcement (pydata#7361)
  Avoid loading entire dataset by getting the nbytes in an array (pydata#7356)
  `keep_attrs` for pad (pydata#7267)
  Bump pypa/gh-action-pypi-publish from 1.5.1 to 1.6.4 (pydata#7375)
  ...
TomNicholas (Member)

This came up in the xarray office hours today, and I'm confused about why this PR made any difference to the behavior at all. The `.data` property just points to `._data`, so why would it matter which one we check?

dcherian (Contributor) commented Mar 17, 2023

Because we have lazy data-reading functionality:

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
var = ds.air.variable

print(type(var._data))              # memory cached array
print(type(var._data.array.array))  # ah, that's wrapping a lazy array, no data read in yet
print(var._data.size)               # can access size
print(type(var._data.array.array))  # still a lazy array

# .data forces a disk load
print(type(var.data))               # oops, disk load
print(type(var._data))              # "still memory cached array"
print(type(var._data.array.array))  # but that's wrapping numpy data in memory
```

Output:

```
<class 'xarray.core.indexing.MemoryCachedArray'>
<class 'xarray.core.indexing.LazilyIndexedArray'>
3869000
<class 'xarray.core.indexing.LazilyIndexedArray'>
<class 'numpy.ndarray'>
<class 'xarray.core.indexing.MemoryCachedArray'>
<class 'numpy.ndarray'>
```
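
With this PR applied, a fresh session stays lazy when only the size is requested, since `nbytes` now consults `_data` (a sketch based on the session above):

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
var = ds.air.variable

print(var.nbytes)                   # derived from _data / metadata, no disk read
print(type(var._data.array.array))  # still a LazilyIndexedArray
```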
