Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update contains_cftime_datetimes to avoid loading entire variable array #7494

Merged
merged 23 commits into from
Mar 7, 2023

Conversation

agoodm
Copy link
Contributor

@agoodm agoodm commented Jan 30, 2023

This PR greatly improves the performance for opening datasets with large arrays of object type (typically string arrays) since contains_cftime_datetimes was triggering the entire array to be read from the file just to check the very first element in the entire array.

@Illviljan continuing our discussion from the issue thread, I did try to pass in var._data to _contains_cftime_datetimes, but I had a lot of trouble finding a way to generalize how to index the first array element. The best I could do was var._data.array.get_array(), but I don't think get_array is implemented for every backend. So for now I am leaving my original proposed solution.

@mathause
Copy link
Collaborator

Thanks for the PR. However, does that actually make a difference? To me it looks like _contains_cftime_datetimes also only considers one element of the array.

xarray/xarray/core/common.py

Lines 1779 to 1780 in b451558

if array.dtype == np.dtype("O") and array.size > 0:
sample = np.asarray(array).flat[0]

@agoodm
Copy link
Contributor Author

agoodm commented Jan 31, 2023

Thanks for the PR. However, does that actually make a difference? To me it looks like _contains_cftime_datetimes also only considers one element of the array.

xarray/xarray/core/common.py

Lines 1779 to 1780 in b451558

if array.dtype == np.dtype("O") and array.size > 0:
sample = np.asarray(array).flat[0]

This isn't actually the line of code that's causing the performance bottleneck, it's the access to var.data in the function call that is actually problematic as I explained in the issue thread. You can verify this yourself running this simple example before and after applying the changes in this PR:

import numpy as np
import xarray as xr

str_array = np.arange(100000000).astype(str)
ds = xr.DataArray(dims=('x',), data=str_array).to_dataset(name='str_array')
ds = ds.chunk(x=10000)
ds['str_array'] = ds.str_array.astype('O') # Needs to actually be object dtype to show the problem
ds.to_zarr('str_array.zarr')

%time xr.open_zarr('str_array.zarr')

@Illviljan
Copy link
Contributor

@agoodm, what you think of this version? Using xr.Variable directly seems a little easier to work with than trying to guess which type of array (cupy, dask, pint, backendarray, etc) is in the variable.

@agoodm
Copy link
Contributor Author

agoodm commented Jan 31, 2023

@Illviljan I gave your update a quick test, it seems to work well enough and still maintains the performance improvement. It looks fine to me though I guess it looks like you still need to fix this failing mypy stuff now?

@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Jan 31, 2023
Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought da.data passes the array along as is - but you can learn something everyday. Thanks @Illviljan for taking over and sorry @agoodm for not properly reading the issue...

xarray/core/common.py Outdated Show resolved Hide resolved
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
xarray/core/common.py Outdated Show resolved Hide resolved
Comment on lines 1792 to 1795
if var.dtype == np.dtype("O") and var.size > 0:
first_idx = (0,) * var.ndim
sample = var[first_idx]
return isinstance(sample.to_numpy().item(), cftime.datetime)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very clean. It'd be nice to add some sort of test like DuckBackendArrayWrapper in https://github.com/pydata/xarray/pull/6874/files . __getitem__ should raise if it will return more than one value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look at this. I am a little confused for what you are suggesting here. Are you looking for a simple test in test_variable.py that applies the same logic in this block to extract the very first element via Variable.__getitem__ here and check that it returns one value, a more general contains_cftime_datetimes test, or both?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sotry that was a bit complicated and intended for IIlviljan.

I pushed a commit with a test. I also changed the code to account for those lazily indexed backend arrays explicitly.

Copy link
Collaborator

@headtr1ck headtr1ck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Some simple test of the new functionality would be nice.

If you are really motivated, you can think about adding an asv benchmark here: https://github.com/pydata/xarray/blob/main/asv_bench/benchmarks/dataset_io.py

doc/whats-new.rst Outdated Show resolved Hide resolved
@dcherian
Copy link
Contributor

dcherian commented Mar 1, 2023

Seems like we're passing a DataArray instead of a Variable somewhere.

@mathause
Copy link
Collaborator

mathause commented Mar 1, 2023

I think that's in the tests themselves

@pytest.fixture()

xarray/core/common.py Outdated Show resolved Hide resolved
@dcherian dcherian added the plan to merge Final call for comments label Mar 2, 2023
@dcherian dcherian removed the plan to merge Final call for comments label Mar 2, 2023
@dcherian dcherian closed this Mar 3, 2023
@dcherian dcherian reopened this Mar 3, 2023
@dcherian dcherian requested a review from Illviljan March 5, 2023 05:40
@dcherian dcherian added the plan to merge Final call for comments label Mar 5, 2023
@dcherian
Copy link
Contributor

dcherian commented Mar 6, 2023

Thanks @agoodm this work prompted a bunch of internal cleanup!

@dcherian dcherian merged commit 798f4d4 into pydata:main Mar 7, 2023
@agoodm
Copy link
Contributor Author

agoodm commented Mar 7, 2023

Thanks @Illviljan and @dcherian for helping to see this through.

@agoodm agoodm deleted the improve_cftime_check_performance branch March 7, 2023 16:22
dcherian added a commit to dcherian/xarray that referenced this pull request Mar 9, 2023
* main:
  Preserve `base` and `loffset` arguments in `resample` (pydata#7444)
  ignore the `pkg_resources` deprecation warning (pydata#7594)
  Update contains_cftime_datetimes to avoid loading entire variable array (pydata#7494)
  Support first, last with dask arrays (pydata#7562)
  update the docs environment (pydata#7442)
  Add xCDAT to list of Xarray related projects (pydata#7579)
  [pre-commit.ci] pre-commit autoupdate (pydata#7565)
  fix nczarr when libnetcdf>4.8.1 (pydata#7575)
  use numpys SupportsDtype (pydata#7521)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments run-benchmark Run the ASV benchmark workflow topic-cftime
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Opening datasets with large object dtype arrays is very slow
5 participants