Provide offset for memory mapping / contiguous layout #321

Open · maartenbreddels opened this issue Nov 2, 2018 · 9 comments
@maartenbreddels commented Nov 2, 2018

Related: #265 #149
Thanks @rabernat for pointing me to this library.

In vaex (out-of-core dataframes) I use hdf5 or Arrow files, which are memory mapped and give really good performance. Arrow supports this natively, and hdf5 can be used if a contiguous layout is specified. In that case, you can ask for the offset (and size) of the array in the file. Once the offsets, types, endianness and lengths/shapes are collected, the N-d arrays can be memory mapped, wrapped in numpy arrays, and passed around to any library, giving you lazy reading and no wasted memory out of the box.

Would it be an idea for zarr to support this layout and provide an API to get this offset? This would make it really easy for me to support zarr in vaex. In the case of chunked storage or compression options, the hdf5 library returns an offset of -1 (which h5py translates to None, I think).
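
For concreteness, here is a minimal sketch of that workflow with h5py's low-level get_offset (the file name "data.hdf5" and dataset name "x" are just placeholders; it assumes a contiguous, uncompressed dataset):

```python
import h5py
import numpy as np

# Collect offset, dtype and shape from a contiguous, uncompressed dataset.
with h5py.File("data.hdf5", "r") as f:
    ds = f["x"]
    offset = ds.id.get_offset()  # byte offset of the raw data; None for chunked/compressed layouts
    dtype, shape = ds.dtype, ds.shape

# Map the raw bytes directly: no copy, and lazy reads via the OS page cache.
arr = np.memmap("data.hdf5", dtype=dtype, mode="r", offset=offset, shape=shape)
```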

@jakirkham (Member) commented Jan 3, 2019

Thanks for stopping by @maartenbreddels.

If one created a Zarr Array with no chunks (contiguously, as you say) and disabled compression, then it should be possible to memory map the entire Zarr file and access it as you like. One can check whether an Array is a single chunk with the nchunks property (which should be 1 in this case).
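
For example (a sketch, not a supported API: it relies on zarr's DirectoryStore layout, where the single chunk of a 2-D array is stored in a file named "0.0"; the path "example.zarr" is a placeholder):

```python
import numpy as np
import zarr

# Create a contiguous (single-chunk), uncompressed array on disk.
z = zarr.open("example.zarr", mode="w", shape=(1000, 10),
              chunks=False, compressor=None, dtype="f8")
z[:] = np.random.random(z.shape)
assert z.nchunks == 1  # the whole array is one chunk

# That one chunk is plain C-ordered binary, so it can be memory mapped.
m = np.memmap("example.zarr/0.0", dtype=z.dtype, mode="r", shape=z.shape)
```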

Currently we load full chunks into memory, which it sounds like you don't want. The easiest way to fix this is the proposal in issue #265. Simpler still might be to add a flag to DirectoryStore to optionally allow the memory-mapping mentioned in this comment.

@jakirkham (Member)

I'm also not totally sure what you mean by offset here. Could you please clarify? Is this supposed to be the size of the binary file's header or something?

@maartenbreddels (Author)

I missed the notifications for this. By offset I mean: given a particular file, at what location in that file does the array data start?

@jakirkham (Member)

There are no offsets currently.

@weiji14 commented Jun 6, 2020

Hi there 👋, just gently bumping this thread to see if there's been any progress on either the vaex or zarr side. I've been using zarr for about a month now and recently found out about vaex's out-of-core processing capabilities. I'm wondering if there's anything I can contribute to make this vaex-zarr connection happen.

@alimanfoo (Member)

Hi @weiji14, I think this depends on a bit of technical discussion with @maartenbreddels.

Zarr is intended primarily for storing data in chunks, optionally with compression. It is possible to store data without chunks (effectively a single chunk, as @jakirkham suggested) and with no compression, but then there is no benefit over using Arrow or HDF5 in contiguous layout.

So I'm wondering:

(1) Is there a use case that would justify getting vaex and zarr plumbed together for just the case of no chunking and no compression?

(2) Does anyone want/need to get vaex working over zarr with chunking (and possibly also compression)?

@weiji14 commented Jun 16, 2020

Hi @alimanfoo, I suppose I'm primarily interested in the out-of-core/memmap use case, which seems to be what #265 is handling. To the best of my knowledge, Vaex currently supports 2D data tables but doesn't work so well on N-dimensional cases, which is what Zarr is built to handle.

> (1) Is there a use case that would justify getting vaex and zarr plumbed together for just the case of no chunking and no compression?

I suppose there is, if the benefit of doing out-of-core processing on memory-limited hardware outweighs that of running parallel workloads on multiple chunks, e.g. processing on a laptop rather than on a cluster.

> (2) Does anyone want/need to get vaex working over zarr with chunking (and possibly also compression)?

I'm not very well educated on the technical aspects, but is it possible to memory map individual zarr chunks (assuming no compression), since they're basically just files in the filesystem, or is it more trouble than it's worth?

@alimanfoo (Member)

Hi @weiji14

> Hi @alimanfoo, I suppose I'm primarily interested in the out-of-core/memmap use case, which seems to be what #265 is handling. To the best of my knowledge, Vaex currently supports 2D data tables but doesn't work so well on N-dimensional cases, which is what Zarr is built to handle.

> (1) Is there a use case that would justify getting vaex and zarr plumbed together for just the case of no chunking and no compression?
>
> I suppose there is, if the benefit of doing out-of-core processing on memory-limited hardware outweighs that of running parallel workloads on multiple chunks, e.g. processing on a laptop rather than on a cluster.

The next question is, in this case, what value would using zarr add over other backends (hdf5, arrow)?

> (2) Does anyone want/need to get vaex working over zarr with chunking (and possibly also compression)?
>
> I'm not very well educated on the technical aspects, but is it possible to memory map individual zarr chunks (assuming no compression), since they're basically just files in the filesystem, or is it more trouble than it's worth?

I imagine you could memory map chunks with no compression. And you could also figure out, for a given row, which chunk and what offset within that chunk you need.

Cheers,
Alistair
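
For illustration, a sketch of that chunk/offset arithmetic for a 1-D uncompressed array in a DirectoryStore (where chunk files are named "0", "1", ...; the path "rows.zarr" is a placeholder):

```python
import numpy as np
import zarr

# 10,000 rows stored as 10 uncompressed chunks of 1,000 rows each.
z = zarr.open("rows.zarr", mode="w", shape=(10000,), chunks=(1000,),
              compressor=None, dtype="f8")
z[:] = np.arange(10000, dtype="f8")

row = 4321
chunk_idx, within = divmod(row, z.chunks[0])  # which chunk file, which element in it
m = np.memmap(f"rows.zarr/{chunk_idx}", dtype=z.dtype, mode="r", shape=z.chunks)
assert m[within] == z[row]
```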

@weiji14 commented Jun 16, 2020

> The next question is, in this case, what value would using zarr add over other backends (hdf5, arrow)?

A very good point 😄 I can only speak to the HDF5 case since I don't use Arrow, but off the top of my head: 1) the ability to see some of the ndarray's metadata (via the external .zmetadata file) without necessarily opening the data file, and 2) nicer thread-safe/concurrent reads/writes by default (see also https://stackoverflow.com/questions/34906652/does-hdf5-support-concurrent-reads-or-writes-to-different-files), though I'm not sure whether this applies in the contiguous single-chunk case.
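
On point 1, a short sketch of zarr's consolidated metadata (the store path "meta_demo.zarr" is a placeholder):

```python
import numpy as np
import zarr

# Build a small hierarchy, then consolidate its metadata into ".zmetadata".
store = zarr.DirectoryStore("meta_demo.zarr")
root = zarr.group(store=store, overwrite=True)
root.create_dataset("x", data=np.arange(10), chunks=(5,))
zarr.consolidate_metadata(store)

# Opening via the consolidated key reads all metadata in one go,
# without touching the chunk files.
g = zarr.open_consolidated(store, mode="r")
print(g["x"].shape, g["x"].dtype)
```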

> I imagine you could memory map chunks with no compression. And you could also figure out, for a given row, which chunk and what offset within that chunk you need.

Ok, good to know that it's a possibility.

I should probably have mentioned too that I'm coming in as an xarray user. I've actually read the blog post at https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314 (a few times now) and have seen some of the discussion over at #556 on 'File Chunk Stores'. It might be the case that a parallelized Zarr interface over an HDF5-like storage object would suffice for my use case (specifically, avoiding the time cost of loading many hdf5/zarr objects into memory via xarray).
