Lazy concatenation of arrays #4628

Open

nbren12 opened this issue Nov 30, 2020 · 6 comments
@nbren12 (Contributor) commented Nov 30, 2020

Is your feature request related to a problem? Please describe.
Concatenating xarray objects forces the data to load. I recently learned that xarray's internal lazy indexing machinery (LazilyIndexedArray) allows lazy indexing into DataArrays/Datasets without using dask. Concatenation along a single dimension is the inverse of slicing, so it seems natural to support it as well. Moreover, concatenating datasets along a new dimension (e.g. "run"/"simulation"/"ensemble") is a common merging workflow.

Describe the solution you'd like

xr.concat([a, b], dim=...) does not load any data in a or b.
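For instance (an illustration of the desired behavior, not of current xarray; the file and variable names are placeholders):

import xarray as xr

a = xr.open_dataset("a.nc")  # lazily indexed, no dask
b = xr.open_dataset("b.nc")

combined = xr.concat([a, b], dim="time")  # desired: loads nothing
combined["air_temperature"][0, 0, 0]      # desired: reads only this element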

Describe alternatives you've considered
One could rename the variables in a and b so that they can be merged (e.g. a['air_temperature'] -> 'air_temperature_a'), but it is more natural to add a new dimension.

Additional context

This is useful when not using dask for performance reasons (e.g. using another parallelism engine like Apache Beam).

@shoyer (Member) commented May 24, 2021

If you write something like xarray.concat(..., data_vars='minimal', coords='minimal'), it should be entirely lazy with dask -- the non-laziness only happens with the default value of coords='different'.
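For example (a sketch of that call; the file names are placeholders, and chunks={} just makes the variables dask-backed without choosing chunk sizes):

import xarray as xr

ds1 = xr.open_dataset("part1.nc", chunks={})
ds2 = xr.open_dataset("part2.nc", chunks={})

# "minimal" skips the compare-values-across-datasets step that forces a load
ds = xr.concat([ds1, ds2], dim="time", data_vars="minimal", coords="minimal")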

But I agree, it would be nice if Xarray's internal lazy indexing machinery supported concatenation. It currently does not.

dcherian added the topic-combine combine/concat/merge label on Jul 8, 2021
@LunarLanding commented
Any pointers regarding where to start / modules involved to implement this? I would like to have a try.

@dcherian (Contributor) commented
From @rabernat in #6588:

Right now, if I want to concatenate multiple datasets (e.g. as in open_mfdataset), I have two options:

  • Eagerly load the data as numpy arrays ➡️ xarray will dispatch to np.concatenate
  • Chunk each dataset ➡️ xarray will dispatch to dask.array.concatenate

In pseudocode:

ds1 = xr.open_dataset("some_big_lazy_source_1.nc")
ds2 = xr.open_dataset("some_big_lazy_source_2.nc")
item1 = ds1.foo[0, 0, 0]  # lazily access a single item
ds = xr.concat([ds1.chunk(), ds2.chunk()], "time")  # only way to lazily concat
# trying to access the same item will now trigger loading of all of ds1
item1 = ds.foo[0, 0, 0]
# yes I could use different chunks, but the point is that I should not have to 
# arbitrarily choose chunks to make this work

However, I am increasingly encountering scenarios where I would like to lazily concatenate datasets (without loading into memory), but also without the requirement of using dask. This would be useful, for example, for creating composite datasets that point back to an OpenDAP server, preserving the possibility of granular lazy access to any array element without the requirement of arbitrary chunking at an intermediate stage.

Describe the solution you'd like

I propose to extend our LazilyIndexedArray classes to support simple concatenation and stacking. The result of applying concat to such arrays will be a new LazilyIndexedArray that wraps the underlying arrays into a single object.

The main difficulty in implementing this will probably be with indexing: the concatenated array will need to understand how to map global indexes to the indexes of the underlying individual arrays. That is a little tricky but eminently solvable (see the sketch below).
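For instance, the trickiest part (translating a global slice into one slice per underlying piece) might look like this. This is a standalone sketch; piece_slices is a hypothetical helper, not xarray API:

def piece_slices(sizes, sl):
    """Split a global step-1 slice over pieces of the given sizes,
    yielding (piece_number, local_slice) for each piece it touches."""
    start, stop, _ = sl.indices(sum(sizes))
    offset = 0
    for i, size in enumerate(sizes):
        lo, hi = max(start - offset, 0), min(stop - offset, size)
        if lo < hi:
            yield i, slice(lo, hi)
        offset += size

# A slice spanning the boundary between a length-3 and a length-5 piece:
assert list(piece_slices((3, 5), slice(2, 6))) == [(0, slice(2, 3)), (1, slice(0, 3))]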

Describe alternatives you've considered

The alternative is to structure your code in a way that avoids needing to lazily concatenate arrays. That is what we do now. It is not optimal.

@nbren12 (Contributor, Author) commented May 10, 2022

@rabernat It seems that great minds think alike ;)

@rabernat (Contributor) commented May 10, 2022

> Any pointers regarding where to start / modules involved to implement this? I would like to have a try.

The starting point would be to look at the code in indexing.py and try to understand how lazy indexing works.

In particular, look at

class LazilyIndexedArray(ExplicitlyIndexedNDArrayMixin):
    """Wrap an array to make basic and outer indexing lazy."""

    __slots__ = ("array", "key")

    def __init__(self, array, key=None):
        ...

Then you may want to try writing a class that looks like

class LazilyConcatenatedArray:  # have to decide what to inherit from

    def __init__(self, *arrays: LazilyIndexedArray, concat_axis=0):
        # figure out what you need to keep track of
        ...

    @property
    def shape(self):
        # figure out how to determine the total shape
        ...

    def __getitem__(self, indexer) -> LazilyIndexedArray:
        # figure out how to map an indexer to the right piece of data
        ...
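For reference, here is one way those blanks might be filled in. This is a minimal sketch assuming plain NumPy arrays as pieces and only full-tuple indexers with an integer along the concatenation axis; real xarray code would have to handle ExplicitIndexer objects and slices, and inherit from the lazy-indexing mixins:

import numpy as np

class LazilyConcatenatedArray:
    """Sketch: present several arrays as one along concat_axis,
    deferring all data access to __getitem__."""

    def __init__(self, *arrays, concat_axis=0):
        self.arrays = arrays
        self.concat_axis = concat_axis
        # cumulative piece boundaries, e.g. sizes (3, 5) -> [0, 3, 8]
        sizes = [a.shape[concat_axis] for a in arrays]
        self.bounds = np.cumsum([0, *sizes])

    @property
    def shape(self):
        shape = list(self.arrays[0].shape)
        shape[self.concat_axis] = int(self.bounds[-1])
        return tuple(shape)

    def __getitem__(self, indexer):
        # map the global index along concat_axis to (piece, local index)
        idx = list(indexer)
        i = idx[self.concat_axis]
        piece = int(np.searchsorted(self.bounds, i, side="right")) - 1
        idx[self.concat_axis] = i - int(self.bounds[piece])
        return self.arrays[piece][tuple(idx)]

# Only the piece holding global row 4 is touched:
lazy = LazilyConcatenatedArray(np.zeros((3, 2)), np.ones((5, 2)))
assert lazy.shape == (8, 2)
assert lazy[4, 1] == 1.0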

@ilan-gold (Contributor) commented Sep 23, 2024

I would +1 your great proposal @dcherian. My use case would be extension arrays: being able to concat them without dask is basically a must for us.

> Can't do this without virtual concat machinery (#4628) which someone decided to implement elsewhere 🙄 ;)

From #9038 (comment) - what is this referring to?

In short, this is not implemented here: scverse/anndata#1247 (comment). It is implemented for non-lazy categoricals but not for lazy ones.
