Don't materialize full or offsets arrays #247

Closed
tomwhite opened this issue Jul 5, 2023 · 0 comments · Fixed by #303

Arrays whose values are a function of the index do not need to be materialized, either in memory or on disk. These include arrays where every entry is the same (created with full), and arrays of offsets (used in map_blocks to provide block IDs to the map function).

Zarr arrays created with full are cheap to create: they specify write_empty_chunks=False, so no chunks are written, only the metadata. Offsets arrays, however, are materialized to disk: https://github.com/tomwhite/cubed/blob/e21786591d9832a85f2e492641ad2061bdb8c14a/cubed/core/ops.py#L431-L432

Although the offsets array is small, since its size is the number of chunks, it raises a scalability issue because every chunk is stored in a separate file in the Zarr store.
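
To give a rough sense of scale, the following sketch counts the blocks of a chunked array. The shape and chunk sizes are made-up illustrative values, num_blocks is a hypothetical helper (not part of cubed), and the file count assumes the offsets array is stored with one element per chunk.

```python
import math


def num_blocks(shape, chunks):
    """Number of blocks along each dimension of a regularly chunked array."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


# Illustrative values only, not taken from the issue.
shape = (100_000, 100_000)
chunks = (1_000, 1_000)

numblocks = num_blocks(shape, chunks)

# The offsets array has one entry per block of the source array.
n_offsets = math.prod(numblocks)

# Assumption: one element per chunk, so one file per entry in the store.
print(numblocks, n_offsets)
```

Under that assumption, the number of files grows with the number of blocks, even though the total data written is tiny.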

These problems could be solved by writing an array implementation that performs indexing on the fly: for full, by returning a subarray of fill values; for an array of offsets, by using np.ravel_multi_index to turn an index into an offset.
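
A minimal sketch of what such an implementation might look like follows. The class names LazyFullArray and LazyOffsetsArray and the helper _to_slices are hypothetical, chosen for illustration; this is not cubed's actual code. To stay short, it only handles keys made of integers and unit-step slices, with one index per dimension.

```python
import numpy as np


def _to_slices(key, shape):
    """Normalize a key of ints and unit-step slices.

    Returns one slice per dimension plus the shape of the result
    (integer-indexed axes are dropped, as in NumPy basic indexing).
    """
    if not isinstance(key, tuple):
        key = (key,)
    if len(key) != len(shape):
        raise IndexError("this sketch needs exactly one index per dimension")
    slices, out_shape = [], []
    for k, n in zip(key, shape):
        if isinstance(k, slice):
            start, stop, step = k.indices(n)
            if step != 1:
                raise IndexError("only unit-step slices are supported here")
            stop = max(start, stop)
            out_shape.append(stop - start)
        else:
            k = int(k)
            start = k + n if k < 0 else k
            if not 0 <= start < n:
                raise IndexError("index out of bounds")
            stop = start + 1
        slices.append(slice(start, stop))
    return slices, tuple(out_shape)


class LazyFullArray:
    """Hypothetical array where every entry equals fill_value."""

    def __init__(self, shape, fill_value, dtype=None):
        self.shape = tuple(shape)
        self.fill_value = fill_value
        self.dtype = np.dtype(dtype) if dtype is not None else np.asarray(fill_value).dtype

    def __getitem__(self, key):
        # Only the requested region is ever allocated.
        _, out_shape = _to_slices(key, self.shape)
        return np.full(out_shape, self.fill_value, dtype=self.dtype)


class LazyOffsetsArray:
    """Hypothetical array whose entry at each index is its row-major offset."""

    def __init__(self, shape, dtype=np.int64):
        self.shape = tuple(shape)
        self.dtype = np.dtype(dtype)

    def __getitem__(self, key):
        slices, out_shape = _to_slices(key, self.shape)
        # Coordinates of every element in the requested region.
        coords = np.meshgrid(*(np.arange(s.start, s.stop) for s in slices), indexing="ij")
        # Convert multi-dimensional coordinates to flat offsets.
        offsets = np.ravel_multi_index(tuple(coords), self.shape)
        return offsets.reshape(out_shape).astype(self.dtype)
```

Neither class stores any element data: each keeps only its shape and dtype (plus the fill value for LazyFullArray), so creating one writes nothing to memory or disk regardless of the number of chunks. For example, LazyOffsetsArray((3, 4))[1:3, 2:4] computes the same values as np.arange(12).reshape(3, 4)[1:3, 2:4] without building the full array.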
