
[MISC] Update guidelines on file formats and multidimensional arrays - for derivatives #1614

Open · wants to merge 8 commits into master

Conversation

@CPernet (Collaborator) commented Sep 13, 2023

Clarifies when to keep the same format and when it is possible to use HDF5 and Zarr (although MATLAB support for Zarr is not clear).
Based on https://docs.google.com/document/d/1JtTu5u7XTkWxxnCIH6sxGajGn1qG_syJ-p14aejpk3E/edit?usp=sharing

@CPernet (Collaborator, Author) commented Sep 18, 2023

@Lestropie do you know if Zarr has MATLAB support? Also adding @mikecroucher here, just in case.

@Lestropie (Collaborator) commented:

Sorry, can't provide any insight on Zarr support; I had flagged it as a prospect at one point, but don't have any hands-on experience.

I'm not entirely familiar with all historical discussions (e.g. #197), but I wonder if it's premature to be adding "suggestions" for N-D data storage given the contention?
Or was there greater progress toward consensus on this at the derivatives meeting? In that case the document could perhaps provide some additional justification / evidence for the decision.
Otherwise, may I propose a re-attempt at #197 with a better-restricted scope? Maybe the OP could be an updated list of pros & cons, and it could go to a vote if necessary.

@CPernet (Collaborator, Author) commented Sep 18, 2023

Yes, this has been discussed a lot - see for instance bids-standard/bep021#1.
Given that there is a need for N-D arrays, and HDF5 works, maybe we should limit it to that? I cannot remember who proposed Zarr as well, or how we ended up doing so.

@satra (Collaborator) commented Sep 18, 2023

just an FYI - ome.zarr is already adopted in the BIDS standard (see microscopy). the actual container is just half of the puzzle; having a good metadata layer for that container is the second, and very important, component. regarding zarr and matlab, we have talked to people at MathWorks about either taking the lead on it or working with others to integrate it.

@CPernet (Collaborator, Author) commented Nov 11, 2023

this one has been extensively discussed on the Google doc ... even @robertoostenveld agreed ... ready to merge
@effigies @Remi-Gau @christinerogers

@sappelhoff sappelhoff changed the title guideline updates [MISC] Update guidelines on file formats and multidimensional arrays Nov 30, 2023
@sappelhoff (Member) left a comment:

@CPernet I have pushed a commit to slightly rephrase your original wording and to streamline it with the text that had existed previously.

What are your opinions @Remi-Gau @effigies ?

Commit: "clarifies when to keep format and add multidimensional arrays"
codecov bot commented Jan 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.92%. Comparing base (1afbfe8) to head (f1af753).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1614   +/-   ##
=======================================
  Coverage   87.92%   87.92%           
=======================================
  Files          16       16           
  Lines        1375     1375           
=======================================
  Hits         1209     1209           
  Misses        166      166           


Comment on lines +134 to +139
HDF5 and Zarr container format files (note that `.zarr` is typically a directory) should contain the data only (with the field `data`).
This `data` field should be treated as a "virtual directory tree" with a depth one level,
containing BIDS paths at the level of the multidimensional file
(that is, the `.zarr` directory root or the `.h5` file).
BIDS path rules MUST be applied as though these paths existed within the dataset.
Metadata about the multidimensional array SHOULD be documented in the associated JSON sidecar file.
@effigies (Collaborator) commented:

Sorry for taking so long to review this. I think this is roughly what's being proposed here (using raw data as an example):

dataset/
  sub-01/
    anat.zarr/
      .zgroup
      sub-01_T1w/
        .zarray
        .zattrs
        ...
  sub-02/
    anat.zarr/
      .zgroup
      sub-02_T1w/
        .zarray
        .zattrs
        ...

This repackaging of BIDS data inside a hierarchical data format feels very radical and will require tools to be rewritten to understand entire datasets, as opposed to specific derivative files. I suspect that this is not what was actually intended, so I think it would be very helpful to see examples of the intent.

I see basically two cases that should be addressed:

  1. Existing BIDS-supported formats are built on HDF5 (.nwb, .snirf) or Zarr (.ome.zarr). When considering options for new formats, these should be prioritized to reduce the expansion of necessary tooling.
  2. For generic multidimensional array outputs, HDF5 and Zarr can be treated as extensions of .tsv files. Where TSV files with a header row represent a collection of named 1D arrays, an HDF5/Zarr container contains named N-D arrays that are not constrained to have a common shape. For simplicity, it is encouraged to use a collection of names at the root, which are to be described in a sidecar JSON. For example, to output raw model parameters for an undefined model, one might use:
sub-<label>/<datatype>/<entities>_<suffix>.zarr/
    .zgroup
    alpha/
        .zarray
        ...
    beta/
        .zarray
        ...
sub-<label>/<datatype>/<entities>_params.json

And the JSON file would contain:

{
  "alpha": {
    "Description": "alpha parameter for XYZ model, fit using ABC estimation process",
    "Units": "arbitrary"
  },
  "beta": {
    "Description": "beta parameter for XYZ model, fit using ABC estimation process",
    "Units": "arbitrary"
  }
}

If this was the intent, I'm happy to propose alternative text.
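
A minimal sketch of how case 2 above could be produced with zarr-python (v2 API). The paths, entity string, array names, and shapes are illustrative assumptions, not part of the PR:

import json
import numpy as np
import zarr

# Hypothetical derivative path; "sub-01_task-rest" stands in for <entities>
root = zarr.open_group("sub-01/func/sub-01_task-rest_params.zarr", mode="w")  # writes .zgroup
root.create_dataset("alpha", data=np.random.rand(64, 64, 32))       # writes alpha/.zarray + chunks
root.create_dataset("beta", data=np.random.rand(64, 64, 32, 120))   # shapes need not match

# Sidecar JSON describing the named arrays, as in the example above
sidecar = {
    "alpha": {"Description": "alpha parameter for XYZ model", "Units": "arbitrary"},
    "beta": {"Description": "beta parameter for XYZ model", "Units": "arbitrary"},
}
with open("sub-01/func/sub-01_task-rest_params.json", "w") as f:
    json.dump(sidecar, f, indent=2)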

@CPernet (Collaborator, Author) commented:

I am not sure I understand your example; those arrays were meant for 'stuff' that does not fit the current formats. Why would one start allowing repacking of current data? That might be BIDS 2.0, but it seems too radical at this stage.

@effigies (Collaborator) commented:

The initial example is the best I can make of the current text. I don't know what is being described here.

@CPernet (Collaborator, Author) commented Feb 22, 2024

the PR is about the use of multidimensional arrays in derivatives only - as discussed at the meeting - and we'll use that in the other BEPs to make examples

@CPernet (Collaborator, Author) commented Feb 22, 2024

do I understand correctly that you are asking for an example?

@effigies (Collaborator) commented:

Yes, an example would be useful to understand what is being specified. At the bottom of my previous comment, I gave an example that made sense to me.

@Remi-Gau (Collaborator) commented Apr 15, 2024

Quick note (not necessarily specific to this PR): several maintainers have noticed that we have introduced new features in the past months/years (e.g. CITATION.cff files) that have not necessarily been accompanied by updates to the bids-examples, which makes it harder:

  • to ensure correct validation by the validator
  • to guide users who would like to take advantage of those features.

I suspect that an accompanying PR in the bids-examples repo would be welcomed for the present PR.

@SylvainTakerkart commented Apr 15, 2024

the PR is about the use of multidimensional arrays in derivatives only - as discussed at the meeting - and we'll use that in the other BEPs to make examples

just a note: the "in derivatives only" info might be worth mentioning in the title of the PR, so that the content and aim of the PR are more explicit for newcomers ;)

@CPernet CPernet changed the title [MISC] Update guidelines on file formats and multidimensional arrays [MISC] Update guidelines on file formats and multidimensional arrays - for derivatives Apr 16, 2024
@fangq commented Apr 18, 2024

at the heart of Zarr is the blosc1 meta-compressor. however, blosc2 has been out for a number of years and shows significant speed benefits over its predecessor, with more options for data compression/slicing, but it is not backward compatible with blosc1. if the goal of this thread is to improve I/O performance, why not leapfrog to blosc2 directly?

@effigies (Collaborator) commented:

What container format is using blosc2? The goal is wide support, not the fastest compression algorithm.

@fangq commented Apr 19, 2024

What container format is using blosc2? The goal is wide support, not the fastest compression algorithm.

I will let @FrancescAlted, the upstream author of blosc2, comment on the current availability of the codec.

in general, any container format that supports "filters" (for example, HDF5 filters, or my JSON/JData N-D array annotations) can use blosc2 in place of other common compressors such as zlib or gzip - from this document, a blosc2 filter for HDF5 appears to have been available since 2022:

https://www.hdfgroup.org/wp-content/uploads/2022/05/Blosc2-and-HDF5-European-HUG2022.pdf

also, I have supported blosc2 in my NIfTI JSON wrapper, JNIfTI, since 2022, for both MATLAB/Octave and Python.

if you want to try:

for python:

pip install jdata bjdata blosc2

download this benchmark script at https://github.com/NeuroJSON/pyjdata/blob/master/test/benchcodecs.py
and run

python3 benchcodecs.py

for matlab/octave:

install zmat and jsonlab
and run this script after git clone https://github.com/NeuroJSON/jnifti.git

https://github.com/NeuroJSON/jnifti/blob/master/samples/headct/create_headct_jnii.m

the outputs on my laptop (i7-12700H) are shown below:

# Python
jdata version:0.6.0

- Testing binary JSON (BJData) files (.jdb) ...
{'codec': 'npy', 'save': 0.3153388500213623, 'sum': 10000.0, 'load': 0.9327757358551025, 'size': 800000128}
{'codec': 'npz', 'save': 1.8001885414123535, 'sum': 10000.0, 'load': 1.3338582515716553, 'size': 813846}
{'codec': 'bjd', 'save': 0.5631542205810547, 'sum': 10000.0, 'load': 1.4387249946594238, 'size': 800000012}
{'codec': 'zlib', 'save': 1.4407787322998047, 'sum': 10000.0, 'load': 1.0786707401275635, 'size': 813721}
{'codec': 'lzma', 'save': 6.666639566421509, 'sum': 10000.0, 'load': 0.830712080001831, 'size': 113067}
{'codec': 'lz4', 'save': 0.31477975845336914, 'sum': 10000.0, 'load': 0.736771821975708, 'size': 3371487}
{'codec': 'blosc2blosclz', 'save': 0.019755840301513672, 'sum': 10000.0, 'load': 0.3465735912322998, 'size': 819576}
{'codec': 'blosc2lz4', 'save': 0.012031078338623047, 'sum': 10000.0, 'load': 0.3506603240966797, 'size': 817635}
{'codec': 'blosc2lz4hc', 'save': 0.1105949878692627, 'sum': 10000.0, 'load': 0.33371639251708984, 'size': 3236580}
{'codec': 'blosc2zlib', 'save': 0.08629131317138672, 'sum': 10000.0, 'load': 0.3578808307647705, 'size': 952705}
{'codec': 'blosc2zstd', 'save': 0.04992985725402832, 'sum': 10000.0, 'load': 0.3468794822692871, 'size': 189897}

- Testing text-based JSON files (.jdt) ...
{'codec': 'npy', 'save': 0.2528667449951172, 'sum': 10000.0, 'load': 0.5686492919921875, 'size': 800000128}
{'codec': 'npz', 'save': 1.8248929977416992, 'sum': 10000.0, 'load': 1.1545724868774414, 'size': 813846}
{'codec': 'bjd', 'save': 0.6585536003112793, 'sum': 10000.0, 'load': 1.4138269424438477, 'size': 800000012}
{'codec': 'zlib', 'save': 1.4275908470153809, 'sum': 10000.0, 'load': 1.0514166355133057, 'size': 1084942}
{'codec': 'lzma', 'save': 6.216473817825317, 'sum': 10000.0, 'load': 0.7995953559875488, 'size': 150738}
{'codec': 'lz4', 'save': 0.3157920837402344, 'sum': 10000.0, 'load': 0.7323768138885498, 'size': 4495297}
{'codec': 'blosc2blosclz', 'save': 0.014778375625610352, 'sum': 10000.0, 'load': 0.3591310977935791, 'size': 1092747}
{'codec': 'blosc2lz4', 'save': 0.015497684478759766, 'sum': 10000.0, 'load': 0.3527402877807617, 'size': 1090159}
{'codec': 'blosc2lz4hc', 'save': 0.12303662300109863, 'sum': 10000.0, 'load': 0.33176469802856445, 'size': 4315421}
{'codec': 'blosc2zlib', 'save': 0.0899665355682373, 'sum': 10000.0, 'load': 0.35532426834106445, 'size': 1270252}
{'codec': 'blosc2zstd', 'save': 0.04385828971862793, 'sum': 10000.0, 'load': 0.33220958709716797, 'size': 253176}
% MATLAB 
Saving headct_.bnii:	 Saving: t=0.087299 s	Loading: 	0.027760 s	Size: 11700.77 kB
Saving headct_zlib.jnii:	 Saving: t=0.199829 s	Loading: 	0.064823 s	Size: 3382.12 kB
Saving headct_zlib.bnii:	 Saving: t=0.187931 s	Loading: 	0.035956 s	Size: 2536.55 kB
Saving headct_gzip.jnii:	 Saving: t=0.212959 s	Loading: 	0.066499 s	Size: 3382.14 kB
Saving headct_gzip.bnii:	 Saving: t=0.206740 s	Loading: 	0.050813 s	Size: 2536.56 kB
Saving headct_lzma.jnii:	 Saving: t=1.352349 s	Loading: 	0.113200 s	Size: 2511.28 kB
Saving headct_lzma.bnii:	 Saving: t=1.348166 s	Loading: 	0.094130 s	Size: 1883.42 kB
Saving headct_lz4.jnii:	 Saving: t=0.036000 s	Loading: 	0.040914 s	Size: 4367.56 kB
Saving headct_lz4.bnii:	 Saving: t=0.026024 s	Loading: 	0.016265 s	Size: 3275.63 kB
Saving headct_lz4hc.jnii:	 Saving: t=0.295036 s	Loading: 	0.053748 s	Size: 3504.56 kB
Saving headct_lz4hc.bnii:	 Saving: t=0.289909 s	Loading: 	0.016173 s	Size: 2628.38 kB
Saving headct_blosc2lz4.jnii:	 Saving: t=0.036049 s	Loading: 	0.046936 s	Size: 4377.62 kB
Saving headct_blosc2lz4.bnii:	 Saving: t=0.026930 s	Loading: 	0.013801 s	Size: 3283.17 kB
Saving headct_blosc2zstd.jnii:	 Saving: t=0.121007 s	Loading: 	0.036588 s	Size: 3229.69 kB
Saving headct_blosc2zstd.bnii:	 Saving: t=0.115441 s	Loading: 	0.014060 s	Size: 2422.23 kB

@FrancescAlted commented:

at the heart of Zarr is the blosc1 meta-compressor. however, blosc2 has been out for a number of years and shows significant speed benefit over its predecessor, with more options for data compression/slicing, but it is not backward compatible with blosc1. if the goal of this thread is to improve IO performance, why not leap-frog to blosc2 directly?

Just a note here. blosc2 is actually backward compatible with blosc1 (i.e. blosc2 tools can read blosc1 data without problems), but it is not forward compatible (i.e. blosc2 data cannot be read by using blosc1 tools).

@FrancescAlted commented:

What container format is using blosc2? The goal is wide support, not the fastest compression algorithm.

[Upstream developer speaking here] The format for Blosc2 is documented in a series of (relatively short) documents:

Regarding adoption, besides the NIfTI wrapper by @fangq, there are other well-established formats adopting it. For example, HDF5 has supported Blosc2 via PyTables since 2022, and more recently in h5py via hdf5plugin. There is also b2h5py, a package that monkey-patches h5py for optimized reading of n-dimensional Blosc2 slices in HDF5 files. Another package that recently adopted blosc2, replacing blosc1 for lossless compression, is ADIOS2. Besides that, there are wrappers for languages like Julia, Rust, and probably others as well.
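
For concreteness, a minimal sketch of writing a Blosc2-compressed HDF5 dataset through hdf5plugin; the filename, dataset name, shape, and codec settings are illustrative assumptions, not recommendations:

import h5py
import hdf5plugin  # registers the Blosc2 filter with HDF5 on import
import numpy as np

data = np.random.rand(256, 256, 128).astype("f4")
with h5py.File("example_blosc2.h5", "w") as f:
    # chunked dataset compressed with Blosc2 + zstd; values are examples only
    f.create_dataset(
        "volume",
        data=data,
        chunks=(64, 64, 32),
        **hdf5plugin.Blosc2(cname="zstd", clevel=5,
                            filters=hdf5plugin.Blosc2.SHUFFLE),
    )

# reading back also requires importing hdf5plugin first
with h5py.File("example_blosc2.h5", "r") as f:
    vol = f["volume"][:]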

@fangq commented Apr 19, 2024

Just a note here. blosc2 is actually backward compatible with blosc1 (i.e. blosc2 tools can read blosc1 data without problems), but it is not forward compatible (i.e. blosc2 data cannot be read by using blosc1 tools).

sorry, what I meant was that the blosc2 stream format contains breaking changes relative to blosc1, so blosc1 tools cannot read blosc2 data. Zarr also has filter support - is there a plan to support blosc2 as a filter?

ideally, data formats used in BIDS should prioritize "terminal formats" that are either standardized (like NIfTI-1, TSV/CSV/JSON) or committed to both forward and backward compatibility for long-term reusability (LTS).

@FrancescAlted, is the blosc2 team more or less committed to keeping the forward compatibility of the blosc2 format, or is there still a good chance of breaking changes in the future?

@FrancescAlted commented Apr 19, 2024

@FrancescAlted, is the blosc2 team more or less committed to keeping the forward compatibility of the blosc2 format, or is there still a good chance of breaking changes in the future?

Definitely. Blosc2 now supports all the features we wanted, so our plan is to support it without breaking changes for the years to come.

@CPernet (Collaborator, Author) commented Apr 22, 2024

Hi all, thanks for contributing! Should we add to the PR some specific info about what is supported/preferred as the Zarr type - blosc1 or blosc2?

Note for myself: once this is done, we should also update the resources part (e.g. the starter kit and other areas) with @fangq's resources for reading - the only MATLAB/Octave resource I am aware of.

@satra (Collaborator) commented Apr 22, 2024

@CPernet - zarr uses blosc1 by default, but most people can and do choose optimized compression settings using numcodecs and/or imagecodecs depending on the application needs (read/write, datatype, etc.). however, technically speaking, blosc2 can be used through registered numcodecs (see someone's example here: https://github.com/openclimatefix/ocf_blosc2). it won't be efficient, as the zarr API does not have sub-chunk capabilities at the moment, which is where blosc2's benefits generally come in.

i know there were conversations back in 2021/22 about caterva/blosc2 and zarr. for the moment, i think the biggest change coming to zarr is the implementation of sharding (i.e. storing multiple chunks in a binary blob to optimize for compute, storage, and transport) through storage transformers. @martindurant and others may know more about the state of zarr/blosc2, or can point people to the conversations.

@CPernet (Collaborator, Author) commented Apr 22, 2024

I have zero knowledge about those; just asking whether you want to change the PR to add some details about what is expected/supported (@satra: blosc1, I'm guessing).

@satra (Collaborator) commented Apr 22, 2024

@CPernet - personally, i would leave that to the downstream user and stay away from a specific recommendation. for example, for light sheet microscopy we can recommend blosc1 (zstd, level 5) - slower to write but optimized for storage + reading. that's a very specific subtype, though.
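
As an illustration of what such an application-specific choice looks like in zarr-python (v2) via numcodecs - the store path, shape, chunking, and dtype below are made-up values:

import numpy as np
import zarr
from numcodecs import Blosc

# blosc1 with the zstd codec at level 5, as in the light-sheet example above;
# the shuffle mode is likewise a per-application choice
compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)
z = zarr.open_array("example.zarr", mode="w",
                    shape=(2048, 2048, 512), chunks=(256, 256, 64),
                    dtype="u2", compressor=compressor)
z[:256, :256, :64] = np.random.randint(0, 4096, (256, 256, 64), dtype="u2")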

@fangq commented Apr 24, 2024

thanks @satra and @CPernet for your comments.

because blosc1/blosc2 natively support data chunking and fast reading of data slices/hypercubes from a large data volume (or sparse-frame splitting into distributed chunk files), which sort of overlaps with the goal of zarr, I had the impression that zarr was a distributed storage interface built on top of blosc1.

after reading the docs more carefully, it appears that blosc1 is used in zarr as if it were just a regular compressor, such as zlib/gzip, applied only to chunk-level data instead of the global array. is this correct? if so, does zarr benefit from the blosc1 data format at all (aside from faster multi-threaded compression, SIMD, shuffle filters, etc.)?

my only reason for bringing up blosc2 here was to highlight the need to consider "forward compatibility" when adding new formats to BIDS - even for derivatives. When I tried to write a BIDS-to-JSON converter to convert 1000+ OpenNeuro BIDS datasets for hosting in my NoSQL database at https://neurojson.io, I had to handle the different supported data formats. For raw data, the number of formats is still manageable, and most of these formats are "terminal formats" that are perpetually unchanged.

if we allow adding more data formats to derivatives, especially formats that are still evolving (for example, I see the zarr v3 spec has breaking changes relative to v2, as does v2 relative to v1), this will complicate ecosystems that depend on BIDS, such as my project, in terms of supporting additional parsers, additional versions of those parsers, and the additional codecs of the new versions needed to fully handle the files.

if the zarr team can afford to specify a subset of the features (metadata keys, organization schemes, codecs) that are somewhat production-stable, and we only add those to BIDS, that would make the lives of BIDS-dependent projects easier. but I understand it is a fast-evolving format, and promising a stable interface is not yet feasible (the same applies to HDF5).

@martindurant commented:

if the zarr team can afford to specify a subset of the features (metadata keys, organization schemes, codecs) that are somewhat production-stable, and we only add those to BIDS, that would make the lives of BIDS-dependent projects easier. but I understand it is a fast-evolving format, and promising a stable interface is not yet feasible (the same applies to HDF5).

Hardly fast-evolving! Yes, there is a new version coming with breaking changes, but v2 is stable and will be supported far into the future. It even has partial-buffer read support for blosc1. Exactly how sharding will interact with upcoming blosc2 support, I don't know - I suspect the two are unconnected.

CPernet and others added 3 commits September 9, 2024
the issue being that HDF5 and Zarr are planned for future derivatives, so the example uses something not currently supported