Low memory/out-of-core index? #1650
Just to add a further thought, which is that the upper levels of the binary search tree could be cached to get faster performance for repeated searches. |
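As an aside, here is a rough sketch of that caching idea (editorial illustration, not code from this thread): for a monotonic coordinate stored in a chunked Zarr array, the first value of every chunk can be held in memory as the cached "upper level", so a repeated lookup only bisects that small in-memory array and then reads a single chunk. The file name and the find helper are hypothetical.

# Sketch only: cache per-chunk boundary values of a monotonic on-disk
# coordinate so repeated lookups touch at most one chunk.
import bisect
import numpy as np
import zarr

coord = zarr.open("coords.zarr", mode="r")   # hypothetical 1-D monotonic array
chunk = coord.chunks[0]
n_chunks = (coord.shape[0] + chunk - 1) // chunk

# "Upper level" of the search tree: first value of every chunk (small, kept in RAM).
chunk_starts = np.array([coord[i * chunk] for i in range(n_chunks)])

def find(value):
    """Return the position of the first element >= value, reading one chunk."""
    i = max(bisect.bisect_right(chunk_starts, value) - 1, 0)   # which chunk to read
    block = coord[i * chunk : (i + 1) * chunk]                 # only this chunk is loaded
    return i * chunk + int(np.searchsorted(block, value, side="left"))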
But this is already the case? See #1017. With on-disk datasets I think it is sufficient to drop_variables when opening the dataset in order not to parse the coordinates:
ds = xr.open_dataset(f, drop_variables=['lon', 'lat'])
|
This is related to the performance issue documented in #1385. |
It looks like #1017 is about having no index at all. I want indexes, but I want to avoid loading all coordinate values into memory.
|
This should be easier after the index/coordinates separation envisioned in #1603. We could potentially define a basic index API (based on what we currently use from pandas) and allow alternative index implementations. There are certainly other use cases where going beyond pandas makes sense -- a KDTree for indexing geospatial data is one obvious example. |
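To make "alternative index implementations" concrete, here is a rough sketch of what a pluggable index could look like, using a KDTree for nearest-neighbour point lookup. The class and method names are invented for illustration and do not match xarray's actual index API; the only contract imagined here is that an index turns query labels into integer positions.

# Illustration only: a hypothetical pluggable-index interface and a
# KDTree-backed implementation for geospatial nearest-neighbour selection.
from abc import ABC, abstractmethod
import numpy as np
from scipy.spatial import cKDTree

class CustomIndex(ABC):
    """Hypothetical minimal contract: map query labels to integer positions."""
    @abstractmethod
    def sel(self, labels):
        ...

class KDTreeIndex(CustomIndex):
    """Indexes (lat, lon) points; selection returns positions of nearest points."""
    def __init__(self, lat, lon):
        self._tree = cKDTree(np.column_stack([lat, lon]))

    def sel(self, labels):
        # labels: query (lat, lon) pair(s) -> positions along the indexed dimension
        _, positions = self._tree.query(np.atleast_2d(labels))
        return positions

An out-of-core monotonic index could implement the same sel contract with a bisect over on-disk values instead of an in-memory tree.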
Index API sounds good. Also I was just looking at dask.dataframe indexing, where .loc is implemented using information about index values at the boundaries of each partition (chunk). Not sure xarray should use the same strategy for chunked datasets, but it is another approach to avoid loading indexes into memory. |
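For reference, the behaviour being described looks like this small self-contained example (mine, not from the thread): a dask DataFrame with a known, sorted index keeps the partition boundary values in .divisions, and .loc uses them to read only the partitions that can contain the requested labels.

# Demonstrates dask.dataframe's partition-boundary ("divisions") indexing.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(1_000_000)})        # default sorted integer index
ddf = dd.from_pandas(pdf, npartitions=10)

# Only one boundary value per partition edge is kept in memory:
print(ddf.divisions)                               # e.g. (0, 100000, ..., 999999)

# .loc consults the divisions and reads only the partitions that can
# contain the requested labels; the other partitions are never loaded.
subset = ddf.loc[250_000:260_000].compute()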
I'd be interested in this kind of thing as well. 👍 We have long time-series data, which we would like to access via OPeNDAP or Zarr over HTTP. Currently, the |
@lsetiawan has a cool use-case of opening a time dimension with a trillion elements stored in Zarr: https://github.com/lsetiawan/ooi-indexing/blob/main/demo.ipynb cc @scottyhq |
Thanks @dcherian. I've modified the notebook after the discovery of some issues in the data; however, I was still able to try some things with the half of the data that works (~283.7 GB of total data with ~39.13 GB of |
Perhaps a dumb question - why does the |
Here is a prototype of a I also added it to the list in #7041. It has basic support for label-based selection, where query labels may be slices or scalar values. It assumes that coordinate data is monotonic, and of course queries are not nearly as efficient as with, e.g., a pandas index. But data selection is fully lazy! I guess it would still be possible to implement some sort of basic data structure and/or cache to speed up the process? It doesn't work well for alignment, but I doubt that we want automatic-alignment support for such an index anyway, as this could trigger costly re-indexing operations and load huge amounts of data. The implementation is mostly taken from another index prototype based on numpy. I guess we could merge the two and provide a generic |
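A rough sketch of how such lazy label-based selection can work (this is not the prototype referenced above, just an illustration under the same monotonicity assumption): query labels are converted to integer positions by bisecting the coordinate without materialising it, and the result is passed to .isel, which stays lazy for dask- or zarr-backed data. The lazy_sel helper and its signature are hypothetical.

# Sketch: turn a label slice into a positional slice without loading the
# whole coordinate, then use xarray's (lazy) positional indexing.
def lazy_sel(obj, dim, label_slice, coord):
    """obj: DataArray/Dataset; coord: 1-D on-disk monotonic array (e.g. zarr)."""
    def first_geq(value):               # first position with coord[pos] >= value
        lo, hi = 0, coord.shape[0]
        while lo < hi:
            mid = (lo + hi) // 2
            if coord[mid] < value:      # reads a single element (at most one chunk)
                lo = mid + 1
            else:
                hi = mid
        return lo

    start = 0 if label_slice.start is None else first_geq(label_slice.start)
    stop = coord.shape[0] if label_slice.stop is None else first_geq(label_slice.stop)
    return obj.isel({dim: slice(start, stop)})   # stays lazy for chunked backends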
I missed @lsetiawan and @dcherian's prototype (#1650 (comment)), which implements some lookup structure. Interesting!
Only the coordinates with a |
Hi @benbovy I set up a little example. Perhaps I have a mistake:
import fsspec, zarr
import xarray as xr
import dask.array as da
import numpy as np
g = zarr.open_group('./')
g['coords'] = zarr.array(np.arange(0,10000), chunks=(1000,))
g['data'] = zarr.array(np.arange(0,10000), chunks=(1000,))
class AccessTrackingStore(zarr.LRUStoreCache): # tracks first access to elements
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def __getitem__(self, key):
if key not in self._values_cache:
print(key)
return super().__getitem__(key)
mapper = fsspec.get_mapper('./')
store = AccessTrackingStore(mapper, max_size=2**28)
g = zarr.open_group(store)
data = da.from_zarr(g['data'])
coords = da.from_zarr(g['coords'])
Up until this point, only metadata should have been accessed. But then:
data_array = xr.DataArray(data, coords=[coords], dims=['dim'])
The above code will print out all of the chunks backing the |
@ilan-gold Something like this would skip the creation of those default indexes:
coords_obj = xr.Coordinates({"dim": coords}, indexes={})
data_array = xr.DataArray(data, coords=coords_obj, dims=['dim'])
Then you can set a dask (lazy) index to
EDIT: this works only with the latest Xarray release v2023.08.0 |
@benbovy The above errors out for me when appended directly to the end of my first code block:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[2], line 1
----> 1 coords_obj = xr.Coordinates({"dim": coords}, indexes={})
File ~/Projects/Theis/anndata/venv/lib/python3.10/site-packages/xarray/core/coordinates.py:235, in Coordinates.__init__(self, coords, indexes)
233 variables = {k: v.copy() for k, v in coords.variables.items()}
234 else:
--> 235 variables = {k: as_variable(v) for k, v in coords.items()}
237 if indexes is None:
238 indexes = {}
File ~/Projects/Theis/anndata/venv/lib/python3.10/site-packages/xarray/core/coordinates.py:235, in <dictcomp>(.0)
233 variables = {k: v.copy() for k, v in coords.variables.items()}
234 else:
--> 235 variables = {k: as_variable(v) for k, v in coords.items()}
237 if indexes is None:
238 indexes = {}
File ~/Projects/Theis/anndata/venv/lib/python3.10/site-packages/xarray/core/variable.py:154, in as_variable(obj, name)
152 obj = Variable(name, data, fastpath=True)
153 else:
--> 154 raise TypeError(
155 f"Variable {name!r}: unable to convert object into a variable without an "
156 f"explicit list of dimensions: {obj!r}"
157 )
159 if name is not None and name in obj.dims and obj.ndim == 1:
160 # automatically convert the Variable into an Index
161 obj = obj.to_index_variable()
TypeError: Variable None: unable to convert object into a variable without an explicit list of dimensions: dask.array<from-zarr, shape=(10000,), dtype=int64, chunksize=(1000,), chunktype=numpy.ndarray>
I looked into this a bit. The error seems to be coming from the lack of a
In [4]: xr.as_variable(coords, name='dim')
coords/0
coords/1
coords/2
coords/3
coords/4
coords/5
coords/6
coords/7
coords/8
coords/9
Out[4]:
<xarray.IndexVariable 'dim' (dim: 10000)>
array([ 0, 1, 2, ..., 9997, 9998, 9999]) |
Ah yes, sorry, it is actually a bit messy as there are multiple open pull-requests for fixing those issues. So it should work with #8094 and:
coords_obj = xr.Coordinates({"dim": ("dim", coords)}, indexes={})
data_array = xr.DataArray(data, coords=coords_obj, dims=['dim']) |
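As a possible next step after that corrected snippet (a sketch of mine, not from the thread): attaching the lazy index would go through set_xindex, xarray's hook for assigning a non-default index to a coordinate. MyLazyIndex below is a placeholder name for a custom xarray Index subclass such as the prototype discussed earlier; it is not a real class.

# Hypothetical follow-up: attach a lazy, dask-aware index to the "dim"
# coordinate created above. MyLazyIndex stands in for a custom xarray
# Index subclass (e.g. the prototype discussed earlier) and is not real.
data_array = data_array.set_xindex("dim", MyLazyIndex)

# Label-based selection would then be dispatched to MyLazyIndex without
# materialising the coordinate as a pandas.Index:
# subset = data_array.sel(dim=slice(1000, 2000))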
Has anyone considered implementing an index for monotonic data that does not require loading all values into main memory?
Motivation: We have data where the first dimension can be length ~100,000,000, and coordinates for this dimension are stored as 32-bit integers. Currently, if we used a pandas Index this would cast to 64-bit integers, and the index would require ~1GB RAM. This isn't enormous, but it isn't negligible for people working on modest computers. Our use cases are simple: typically we only ever need to locate a slice of this dimension from a pair of coordinates, i.e., we only need to do binary search (bisect) on the coordinates. In fact, to do binary search there is no need at all to load the coordinate values into memory; they could be left on disk (e.g., in an HDF5 or Zarr dataset) and still give perfectly adequate performance for our needs.
This is of course also relevant to pandas, but I thought I'd post here as I know there have been some discussions about how to handle indexes when working with larger datasets via dask.
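For what it's worth, here is a minimal illustration of that last point (editorial, not from the issue): because a Zarr array supports len() and integer item access, Python's stdlib bisect can search it directly, reading only on the order of log2(n) elements (one chunk fetch per element read) instead of the whole coordinate. The file path and label values are hypothetical, and the coordinate is assumed sorted.

# Minimal sketch: binary search a monotonic on-disk coordinate without
# ever loading it fully into memory.
import bisect
import zarr

coord = zarr.open("data.zarr/coord", mode="r")   # hypothetical 1-D sorted array

# Positional bounds of the label range [start_label, stop_label)
start_label, stop_label = 1_000_000, 2_000_000
i0 = bisect.bisect_left(coord, start_label)      # each comparison reads one element
i1 = bisect.bisect_left(coord, stop_label)

# The slice can then be read directly, or selected lazily with .isel(dim=slice(i0, i1))
values = coord[i0:i1]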