Evaluate Zarr as a possible alternate array data / storage backend #230

Closed
oruebel opened this issue Nov 9, 2017 · 17 comments

@oruebel
Contributor

oruebel commented Nov 9, 2017

Possible Future Enhancement

This ticket is mainly a note for a possible future enhancement, i.e., something to possibly look at in the future. I just came across the Zarr library ( http://zarr.readthedocs.io/en/latest/index.html ). The library implements an interface that looks similar to h5py but is designed more generally for chunked, compressed, N-dimensional arrays, with the ability to serialize data to files on disk, inside a Zip file, or on S3. As such, it might be a possible candidate for implementing alternate storage backends in the future. It looks like Zarr is still in the early stages of development.

Problem/Use Case

This ticket will be mainly relevant to future discussions of possible storage backends in addition to HDF5.

@oruebel oruebel added the category: enhancement improvements of code or code behavior label Nov 9, 2017
@oruebel oruebel added this to the NWB 2.x milestone Nov 9, 2017
@oruebel oruebel self-assigned this Nov 9, 2017
@jakirkham

Thanks for raising this issue. It would be great to discuss your experiences trying Zarr. Opened issue ( https://github.com/zarr-developers/zarr/issues/333 ) to discuss any feedback and/or suggestions you may have.

@oruebel
Contributor Author

oruebel commented Nov 16, 2018

@jakirkham I tested dropping in Zarr as a backend a while ago. Getting the base primitives to work (i.e., Groups, Datasets, Attributes) is not too complicated because you can pretty much just copy the HDF5IO backend in FORM and replace h5py. The trickier part is how to deal with links and datasets of object and region references. As far as I can tell, those concepts are not supported by Zarr. That is basically where I stopped, because I didn't have the time to really work on implementing those features in a cross-platform way.

@jakirkham

Could you please explain a bit more about why those features are needed for NWB?

@oruebel
Contributor Author

oruebel commented Dec 6, 2018

NWB:N stores data from complex neurophysiology experiments. As such, there are large collections of different kinds of data and metadata that are related to each other. Links to other objects are critical to allow users to identify, e.g., related metadata and to avoid duplicate storage of the same data. Datasets of object references can serve a similar purpose but allow us to reference multiple objects, e.g., which TimeSeries belong to a sweep or epoch. Region references then allow us to transparently reference subsets of datasets and are critical to support annotation of data (e.g., epochs) and referencing of subsets of metadata (e.g., select electrodes); they also allow us to support ragged arrays, e.g., to create variable-length data vectors in tables. Ultimately, linking data allows us to make relationships between data explicit, avoid data duplication, and model complex links between data that a pure hierarchical structure can't express. For details on where links, object references, and region references are used in NWB:N, please see the format specification: https://nwb-schema.readthedocs.io/en/latest/format.html
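
As a rough illustration of the ragged-array use case described above, here is a minimal sketch in plain Python. It assumes a flat data vector plus an index vector of cumulative end offsets, so that each row is just a region of the flat data. The names `RaggedColumn`, `data`, `end_index`, and `region` are hypothetical and are not taken from the NWB:N schema or from any library API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RaggedColumn:
    """Variable-length rows stored as one flat vector plus end offsets.

    Hypothetical illustration only; not the NWB:N or HDMF implementation.
    """
    data: List[int] = field(default_factory=list)       # all values, concatenated
    end_index: List[int] = field(default_factory=list)  # exclusive end of each row

    def append_row(self, values: List[int]) -> None:
        """Append one row of any length to the flat storage."""
        self.data.extend(values)
        self.end_index.append(len(self.data))

    def region(self, row: int) -> slice:
        """Return the region (slice) of `data` that row `row` refers to."""
        if not 0 <= row < len(self.end_index):
            raise IndexError(row)
        start = self.end_index[row - 1] if row > 0 else 0
        return slice(start, self.end_index[row])

    def __getitem__(self, row: int) -> List[int]:
        return self.data[self.region(row)]


# Example: each unit references a different number of electrodes.
electrodes_per_unit = RaggedColumn()
electrodes_per_unit.append_row([0, 1, 2])
electrodes_per_unit.append_row([])
electrodes_per_unit.append_row([7, 8])
```

The point of the sketch is that a "row" never copies data: it only names a subset of an existing dataset, which is the role a region reference plays in the description above.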

@jakirkham

Thanks for the details. Have raised issue ( zarr-developers/zarr-python#389 ) to discuss how best to implement this in Zarr. Feel free to chime in over there.

@oruebel
Contributor Author

oruebel commented Jan 13, 2019

@jakirkham thanks for creating the zarr developer issue. If this can be done, then I think this would be a great opportunity to then also do a comparison of the different Zarr backends and HDF5 for different NWB:N use cases.

@oruebel
Contributor Author

oruebel commented Jan 31, 2019

See also #629

@chrisroat

I think this is a very interesting discussion. While I've become a user of many of the tools people on this thread are developing, I'm not down in the trenches, so please forgive me if my comments here are naive.

It has been pointed out that for full zarr<->NWB integration, zarr would need links and references. If I take a step back, this feels like it's putting extra pressure and complexity on the raw data format. Is the use of zarr attributes, which seems to be what is happening in hdmf-dev/hdmf#98, going to be sufficient? It seems like HDMF is a nice middle layer that coordinates things like this, though I could imagine something general like some of the proposals in zarr-developers/zarr-specs#49

@jakirkham

If working with Zarr attributes would be sufficient, that seems like a good path forward for getting something working today. Suspect even if links were added in Zarr 3.0 it would be an extension protocol (so not guaranteed to be implemented). Whereas attributes already make sense to keep in Zarr 3.0. So there's a good chance an attribute based solution will work seamlessly between Zarr 2 and 3.

@jakirkham

@oruebel @chrisroat, I wonder if you could stop by the Zarr meeting in 2 weeks? This would be an interesting use case to discuss and timely as well since we are working on the Zarr v3 spec. Details in issue ( zarr-developers/community#1 ). Please check the latest comment for agenda, call link, and meeting time.

cc @thewtex (who may be interested in this or know others who would be interested in this as well)

@chrisroat

Yeah, I could stop in. It's on the calendar. Tagging @bendichter from NWB.

I think MATLAB would be an important use case, from my brief time in neuro now.
zarr-developers/community#16 (comment)

@thewtex

thewtex commented May 7, 2020

@jakirkham I will plan on attending the next meeting, too.

@jmdelahanty

Hello! I just learned about the Zarr library and it looks pretty neat! I was wondering how I might help work on something like this. Are there any active branches looking into Zarr?

@oruebel
Contributor Author

oruebel commented Jun 13, 2021

Are there any active branches looking into Zarr?

@jmdelahanty PR hdmf-dev/hdmf#98 on HDMF implements a Zarr backend, and #1018 is the corresponding PR on PyNWB to set up the Zarr backend for NWB. The PRs are a bit stale, but since they mostly add new functionality, I don't think syncing them with the current dev branches should be too hard. The latest state (before I had to put those PRs to the side) was that the code was working and mostly functional. The main piece missing in the HDMF PR is to work out some of the details with links when converting from HDF5 to Zarr (and vice versa). That is, with the code from the PRs I was able to read/write NWB data to/from Zarr as well as convert NWB files from HDF5 to Zarr via the export function (but, as I said, there are a few corner cases with links that need to be worked out). Dealing with links is in general one of the main hurdles, as Zarr does not support links and object references natively, and so the PRs implement a custom solution for links (essentially storing definitions of links as JSON and adding reserved attributes to help distinguish between regular datasets/groups and links).
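
To make the idea of "link definitions as JSON plus reserved attributes" more concrete, here is a minimal sketch using only the Python standard library. It is not the code from the PRs: the reserved key `_links`, the helper names (`add_link`, `resolve`, `read_attrs`, `write_attrs`), and the example paths are all hypothetical. The only detail borrowed from Zarr is that a Zarr v2 directory store keeps a group's attributes in a `.zattrs` JSON file.

```python
import json
import tempfile
from pathlib import Path

LINKS_KEY = "_links"  # hypothetical reserved attribute name, not from the PRs


def read_attrs(group_dir: Path) -> dict:
    """Load a group's attributes from its .zattrs JSON file (empty if absent)."""
    attrs_file = group_dir / ".zattrs"
    return json.loads(attrs_file.read_text()) if attrs_file.exists() else {}


def write_attrs(group_dir: Path, attrs: dict) -> None:
    """Write a group's attributes as JSON, creating the group directory if needed."""
    group_dir.mkdir(parents=True, exist_ok=True)
    (group_dir / ".zattrs").write_text(json.dumps(attrs, indent=2))


def add_link(root: Path, group: str, name: str, target: str) -> None:
    """Record that `group/name` is a link to `target` (a path relative to root)."""
    group_dir = root / group
    attrs = read_attrs(group_dir)
    attrs.setdefault(LINKS_KEY, {})[name] = {"path": target}
    write_attrs(group_dir, attrs)


def resolve(root: Path, group: str, name: str) -> Path:
    """Return the directory for `group/name`, following a recorded link if any."""
    links = read_attrs(root / group).get(LINKS_KEY, {})
    if name in links:
        return root / links[name]["path"]
    return root / group / name  # regular child, no link involved


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_attrs(root / "acquisition" / "ts1", {"unit": "volts"})
    add_link(root, "processing", "raw_source", "acquisition/ts1")

    linked_attrs = read_attrs(resolve(root, "processing", "raw_source"))
    regular = resolve(root, "processing", "other").relative_to(root).as_posix()
```

Because the link lives in ordinary attributes, any reader that ignores the reserved key still sees a valid store; only a reader that knows the convention follows the link.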

In terms of publications related to this, the following papers may be of interest:

  • A. J. Tritt, O. Rübel, B. Dichter, R. Ly, D. Kang, E. F. Chang, L. M. Frank, K. E. Bouchard, “HDMF: Hierarchical Data Modeling Framework for Modern Science Data Standards,” IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, December 2019, pp. 165-179. https://ieeexplore.ieee.org/document/9005648
  • D. Kang, O. Rübel, S. Byna, and S. Blanas, "Predicting and Comparing the Performance of Array Management Libraries," 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, USA, 2020, pp. 906-915, doi: 10.1109/IPDPS47924.2020.00097.

Another approach to working with Zarr is to create JSON metadata files that allow Zarr to read from HDF5 files directly. @bendichter had experimented with this at some point and was able to stream NWB data from S3 via Zarr. However, since HDF5 now (as of h5py 3.2) also has an S3 driver, we have been focusing more on that approach for reading data from S3. Support for h5py 3.x (and S3 read support) will be part of the upcoming main HDMF/PyNWB releases.
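
The JSON-metadata approach can be pictured roughly as a sidecar file that maps chunk keys to byte ranges inside an existing file. The sketch below uses only the standard library and makes several assumptions: `container.bin` is a stand-in for a real HDF5 file, the `[file, offset, length]` layout and the `data/0`-style keys are hypothetical, and `read_chunk` is an illustrative helper rather than any library's API.

```python
import json
import tempfile
from array import array
from pathlib import Path


def read_chunk(refs: dict, key: str) -> bytes:
    """Fetch the raw bytes of one chunk using its [file, offset, length] entry."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    container = Path(tmp) / "container.bin"  # stand-in for an existing HDF5 file
    header = b"HEADER--"                      # pretend file-format metadata
    chunk0 = array("i", [1, 2, 3, 4]).tobytes()
    chunk1 = array("i", [5, 6, 7, 8]).tobytes()
    container.write_bytes(header + chunk0 + chunk1)

    # Sidecar JSON: chunk key -> [file, byte offset, byte length].
    refs_file = Path(tmp) / "refs.json"
    refs_file.write_text(json.dumps({
        "data/0": [str(container), len(header), len(chunk0)],
        "data/1": [str(container), len(header) + len(chunk0), len(chunk1)],
    }))

    # A reader needs only the JSON to locate and decode a chunk.
    refs = json.loads(refs_file.read_text())
    values = array("i")
    values.frombytes(read_chunk(refs, "data/1"))
    second_chunk = values.tolist()
```

For remote data the `seek`/`read` pair would presumably be replaced by a ranged request for the same offset and length, which is what makes this pattern attractive for object stores.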

I was wondering how I might help work on something like this.

What are your main interests and use cases for NWB+Zarr? So far this work has been mainly exploratory, to evaluate what we can do with Zarr. It has not yet reached a production-ready state, among other reasons because of 1) the overhead of supporting multiple storage backends; 2) lack of support for Zarr in other languages (e.g., supporting Zarr in MatNWB will be tricky); 3) lack of funding for Zarr integration; and 4) a lack of clear use cases that require Zarr and would justify the effort needed to make this production-ready. Happy to chat if this is something you are interested in diving into more.

@oruebel
Contributor Author

oruebel commented Nov 11, 2022

This is being addressed in https://github.com/hdmf-dev/hdmf-zarr

@oruebel oruebel closed this as completed Nov 11, 2022
@jmdelahanty

I somehow never saw your reply to me so I'm sorry I missed out on seeing this come to fruition, but congratulations!! Very cool!

