Provide offset for memory mapping / contiguous layout #321

Open · maartenbreddels opened this issue Nov 2, 2018 · 9 comments
@maartenbreddels commented Nov 2, 2018

Related: #265 #149
Thanks @rabernat for pointing me to this library.

In vaex (out-of-core dataframes) I use hdf5 or Arrow files, which are memory mapped and give really good performance. Arrow supports this natively, and hdf5 can be used if a contiguous layout is specified. In that case, you can ask for the offset (and size) of the array in the file. Once the offsets, types, endianness and lengths/shapes are collected, the N-d arrays can be memory mapped, wrapped in numpy arrays, and passed around to any library, giving you lazy reading and no wasted memory out of the box.

Would it be an idea for zarr to support this layout and provide an API to get this offset? This would make it really easy for me to support zarr in vaex. In the case of chunked storage or compression options, the hdf5 library returns an offset of -1 (which h5py translates to None, I think).
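
For concreteness, here is a minimal sketch of that workflow with h5py's low-level get_offset (the file name "data.hdf5" and dataset name "x" are just placeholders; it assumes a contiguous, uncompressed dataset):

```python
import h5py
import numpy as np

# Collect offset, dtype and shape from a contiguous, uncompressed dataset.
with h5py.File("data.hdf5", "r") as f:
    ds = f["x"]
    offset = ds.id.get_offset()  # byte offset of the raw data; None for chunked/compressed layouts
    dtype, shape = ds.dtype, ds.shape

# Map the raw bytes directly: no copy, and lazy reads via the OS page cache.
arr = np.memmap("data.hdf5", dtype=dtype, mode="r", offset=offset, shape=shape)
```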

@jakirkham (Member) commented Jan 3, 2019

Thanks for stopping by @maartenbreddels.

If one created a Zarr Array with no chunks (contiguously, as you say) and disabled compression, then it should be possible to memory map the entire Zarr file and access it as you like. One can check whether an Array is a single chunk with the nchunks property (which should be 1 in this case).
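
For example (a sketch, not a supported API: it relies on zarr's DirectoryStore layout, where the single chunk of a 2-D array is stored in a file named "0.0"; the path "example.zarr" is a placeholder):

```python
import numpy as np
import zarr

# Create a contiguous (single-chunk), uncompressed array on disk.
z = zarr.open("example.zarr", mode="w", shape=(1000, 10),
              chunks=False, compressor=None, dtype="f8")
z[:] = np.random.random(z.shape)
assert z.nchunks == 1  # the whole array is one chunk

# That one chunk is plain C-ordered binary, so it can be memory mapped.
m = np.memmap("example.zarr/0.0", dtype=z.dtype, mode="r", shape=z.shape)
```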

Currently we load full chunks into memory, which it sounds like you don't want. The easiest way to fix this is the proposal in issue #265. Simpler still might be to add a flag to DirectoryStore to optionally allow the memory-mapping mentioned in this comment.

@jakirkham (Member)

I'm also not totally sure what you mean by offset here. Could you please clarify? Is this supposed to be the size of the binary file's header or something?

@maartenbreddels (Author)

I missed the notifications for this. By offset I mean: given a particular file, at what location in that file does the array data start?

@jakirkham (Member)

There are no offsets currently.

@weiji14 commented Jun 6, 2020

Hi there 👋, just gently bumping this thread to see if there's been any progress on either the vaex or zarr side. I've been using zarr for about a month now and recently found out about vaex's out-of-core processing capabilities. I'm wondering if there's anything I can contribute to make this vaex-zarr connection happen.

@alimanfoo (Member)

Hi @weiji14, I think this depends on a bit of technical discussion with @maartenbreddels.

Zarr is intended primarily for storing data in chunks, optionally with compression. It is possible to store data without chunks (effectively a single chunk, as @jakirkham suggested) and with no compression, but then there is no benefit over using Arrow or HDF5 in contiguous layout.

So I'm wondering:

(1) Is there a use case that would justify getting vaex and zarr plumbed together for just the case of no chunking and no compression?

(2) Does anyone want/need to get vaex working over zarr with chunking (and possibly also compression)?

@weiji14 commented Jun 16, 2020

Hi @alimanfoo, I suppose I'm primarily interested in the out-of-core/memmap use case, which seems to be what #265 is handling. To the best of my knowledge, Vaex currently supports 2D data tables but doesn't work so well on N-dimensional cases, which is what Zarr is built to handle.

> (1) Is there a use case that would justify getting vaex and zarr plumbed together for just the case of no chunking and no compression?

I suppose there is, if the benefit of doing out-of-core processing on memory-limited hardware outweighs that of running parallel workloads on multiple chunks, e.g. processing on a laptop rather than on a cluster.

> (2) Does anyone want/need to get vaex working over zarr with chunking (and possibly also compression)?

I'm not very well educated on the technical aspects, but is it possible to memory map individual zarr chunks (assuming no compression), since they're basically just files in the filesystem, or is it more trouble than it's worth?

@alimanfoo (Member)

Hi @weiji14

> Hi @alimanfoo, I suppose I'm primarily interested in the out-of-core/memmap use case, which seems to be what #265 is handling. To the best of my knowledge, Vaex currently supports 2D data tables but doesn't work so well on N-dimensional cases, which is what Zarr is built to handle.

> (1) Is there a use case that would justify getting vaex and zarr plumbed together for just the case of no chunking and no compression?
>
> I suppose there is, if the benefit of doing out-of-core processing on memory-limited hardware outweighs that of running parallel workloads on multiple chunks, e.g. processing on a laptop rather than on a cluster.

The next question is, in this case, what value would using zarr add over other backends (hdf5, arrow)?

> (2) Does anyone want/need to get vaex working over zarr with chunking (and possibly also compression)?
>
> I'm not very well educated on the technical aspects, but is it possible to memory map individual zarr chunks (assuming no compression), since they're basically just files in the filesystem, or is it more trouble than it's worth?

I imagine you could memory map chunks with no compression. And you could also figure out, for a given row, which chunk and what offset within that chunk you need.

Cheers,
Alistair
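
For illustration, a sketch of that chunk/offset arithmetic for a 1-D uncompressed array in a DirectoryStore (where chunk files are named "0", "1", ...; the path "rows.zarr" is a placeholder):

```python
import numpy as np
import zarr

# 10,000 rows stored as 10 uncompressed chunks of 1,000 rows each.
z = zarr.open("rows.zarr", mode="w", shape=(10000,), chunks=(1000,),
              compressor=None, dtype="f8")
z[:] = np.arange(10000, dtype="f8")

row = 4321
chunk_idx, within = divmod(row, z.chunks[0])  # which chunk file, which element in it
m = np.memmap(f"rows.zarr/{chunk_idx}", dtype=z.dtype, mode="r", shape=z.chunks)
assert m[within] == z[row]
```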

@weiji14 commented Jun 16, 2020

> The next question is, in this case, what value would using zarr add over other backends (hdf5, arrow)?

A very good point 😄 I can only speak to the HDF5 case since I don't use Arrow, but off the top of my head: 1) the ability to see some of the ndarray's metadata (via the external .zmetadata file) without necessarily opening the data file, and 2) nicer thread-safe/concurrent reads/writes by default (see also https://stackoverflow.com/questions/34906652/does-hdf5-support-concurrent-reads-or-writes-to-different-files), though I'm not sure whether this applies in the contiguous single-chunk case.
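
On point 1, a short sketch of zarr's consolidated metadata (the store path "meta_demo.zarr" is a placeholder):

```python
import numpy as np
import zarr

# Build a small hierarchy, then consolidate its metadata into ".zmetadata".
store = zarr.DirectoryStore("meta_demo.zarr")
root = zarr.group(store=store, overwrite=True)
root.create_dataset("x", data=np.arange(10), chunks=(5,))
zarr.consolidate_metadata(store)

# Opening via the consolidated key reads all metadata in one go,
# without touching the chunk files.
g = zarr.open_consolidated(store, mode="r")
print(g["x"].shape, g["x"].dtype)
```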

> I imagine you could memory map chunks with no compression. And you could also figure out, for a given row, which chunk and what offset within that chunk you need.

Ok, good to know that it's a possibility.

I should probably have mentioned too that I'm coming in as an xarray user. I've actually read the blog post at https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314 (a few times now) and have seen some of the discussion over at #556 on 'File Chunk Stores'. It might be the case that a parallelized Zarr interface over an HDF5-like storage object would suffice for my use case (specifically, avoiding the time cost of loading many hdf5/zarr objects into memory via xarray).
