
[ENH] Proposal for multidimensional array file format #197

Open
tyarkoni opened this issue Apr 5, 2019 · 106 comments
@tyarkoni

tyarkoni commented Apr 5, 2019

At the BIDS-ComputationalModels meeting, it became pretty clear that a wide range of applications require (or would benefit considerably from) the ability to read in generic n-dimensional arrays from a binary file. There are at least two major questions that should be discussed here, and then we should move to draft a PR modifying the BIDS-Raw specification:

  1. What file format should we use? This should be something generic enough that it can be easily read on all common platforms and languages. The main proposals that came up at the meeting were for numpy (.npy) or HDF5 containers (.h5). While numpy is technically a Python format, it's sufficiently simple and well-supported that there appear to be available libraries for the major languages. Please suggest other potential solutions.

  2. How and where should we represent associated metadata? The generic file format (and naming conventions, etc.) will eventually be described in the BIDS-Raw spec, alongside all of the other valid formats (.tsv, nifti, etc.). But some applications are likely to require fairly specific interpretations of the data contained in the file. There appears to be some convergence on the notion of representing the relevant metadata in relevant sections of the BIDS-Derivatives spec (or current BEPs)—i.e., every major use case would describe how data in the binary array format should be interpreted when loaded. We could also associate suffixes with use cases, so that a tool like PyBIDS can automatically detect which rules/interpretations to apply at load time. But if there are other proposals (e.g., a single document describing all use cases), we can discuss that here.

I'm probably forgetting/overlooking other relevant aspects of the discussion; feel free to add to this. Tagging everyone who expressed interest, or who I think might be interested: @JohnGriffiths @maedoc @effigies @yarikoptic @satra.

@effigies
Collaborator

effigies commented Apr 5, 2019

Note that there are two versions of npy, so compatibility levels of 1 and 2 should be assessed.

My primary concern with npy is that it is not compressed; npz is just a zip of a directory of npy files, which almost certainly won't handle random read access as well as HDF5.

My primary concern with HDF5 is that it's just a container, and we will find ourselves defining formats. Perhaps just saying it contains only a single dataset with the name /data or similar will resolve that.
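For what it's worth, the npz point above is easy to check directly: an .npz archive is literally a zip whose members are .npy files. A minimal sketch with numpy and the standard library (illustrative only):

```python
import io
import zipfile

import numpy as np

# Save two named arrays into an .npz archive held in memory.
buf = io.BytesIO()
np.savez(buf, data=np.arange(6).reshape(2, 3), mask=np.ones(4))
buf.seek(0)

# The archive is a plain zip; each member is a standalone .npy file.
with zipfile.ZipFile(buf) as zf:
    members = zf.namelist()
print(sorted(members))
```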

@satra
Collaborator

satra commented Apr 5, 2019

in a different world, but probably related to computational models: BRAIN has funded development of the NWB standard. to the extent that needs may become similar, it may be worthwhile thinking about supporting NWB in BIDS.

this will make the metadata world both easier (included in the NWB file) and harder (non-conformant with BIDS), depending on your point of view. however, the NWB folks are also considering alternatives like exdir, which is like HDF5 but with external metadata and binary blobs as numpy files.

@arokem
Collaborator

arokem commented Apr 5, 2019

Sorry: could I ask for a bit more context? What kind of data will be stored in these files? If it's large enough to justify parallel processing of its contents, allow me to throw in a plea to consider zarr compatibility. I think that HDF5 could be made to play nice with zarr.

@effigies
Collaborator

effigies commented Apr 5, 2019

@satra In principle that seems fine, but their HDF5 format looks basically like HDF5 + some mandatory metadata, so if flexibility is a potential downside, it persists.

If it's not a downside, then I have no principled objection.

@arokem The issue driving us here is less the size of the data than the dimensionality. That said, there's no reason that the files couldn't get large enough for random and parallel access to be concerns, which is why I think HDF5 is my inclination (despite my above-noted reservations). The goal is wide interoperability (in particular, C, R, MATLAB and Python) and not reinventing the wheel, so if that format fits, I for one am happy to consider it.

@satra
Collaborator

satra commented Apr 5, 2019

@arokem - the NWB folks are also considering zarr compatibility, especially with the N5 API. which would also constrain HDF5, since N5 doesn't support all aspects of it.

@arokem
Collaborator

arokem commented Apr 5, 2019

Yup. For reference: NeurodataWithoutBorders/pynwb#230

@yarikoptic
Collaborator

On one hand I am strongly in favor of reusing someone else's "schema" and possibly "tooling" on top of an HDF5 (container)! NWB might be a good choice (I do not know how well it aligns with the needs of ComputationalModels metadata). Import/export "compatibility" with other computation-oriented formats (like zarr) might be a plus.

BUT thinking in conjunction with 2. -- if we choose a "single file" container format to absorb both (data and metadata), we would step a bit away from the "human accessibility" of BIDS. We already have an issue of metadata location duality, e.g. it being present in the data file (nii.gz) headers -- "for machines" -- and some (often additional, but sometimes just duplicate) in sidecar files -- "for machines and humans" (related: the recent #196). Sure, bids-validator could assure consistency, but we are subconsciously trying to avoid such redundancy, and I wonder if that might still be the way to keep going.

Maybe there is a lightweight format (or some basic "schema" for HDF5) which would not aim to store every possible kind of metadata, but just the minimum sufficient for easy and unambiguous IO of multi-dimensional arrays (if that is the goal here). We could then pair it up with a sidecar .json file for convenient access to metadata (defined in BIDS, if there is no existing schema for "ComputationalModels" elsewhere to reuse; not duplicated in the actual data file) for easy human and machine use, without requiring tooling to open the actual data file. If we end up with a single file format containing both -- I think we might need to extract/duplicate metadata in a sidecar file anyway for easier human (and at times tool) consumption.

@tyarkoni
Author

tyarkoni commented Apr 5, 2019 via email

@yarikoptic
Collaborator

@yarikoptic sorry, I realize on re-read that I wasn't clear, but your proposed approach (putting metadata in the json sidecar and only the raw ndarray in the binary file) is exactly what we seemed to converge on at the end of the BIDS-CM meeting. (I.e., the sidecar would supply the metadata needed to interpret the common-format array appropriately for the use case specified in the suffix.)

I am delighted to hear that, similarly minded, we independently decided to contribute the XXXX-th model of the wheel to humanity!

FWIW, I ran into https://news.ycombinator.com/item?id=10858189 via https://cyrille.rossant.net/moving-away-hdf5/ (even @chrisgorgo himself commented there) -- it seems a good number of groups/projects ended up switching from HDF5 to some ad-hoc data blob + metadata files "format". Maybe it would be worth stating the desired features (I think those weren't mentioned yet)? e.g. which among the following would be most important?

  • portability and library support -- probably a must...
  • efficient random access / slicing / ... - desired or not?
    • relates to parallel processing etc. if just a "good to have" then probably not worth jumping to anything fancy
  • memory mapping - desired or not?
  • compression - desired or not? optional?

or in other words - are we aiming for processing or archival? if aiming for archival - compression is probably heavily desired... maybe it could be optional (we already have both .nii and .nii.gz supported IIRC, so it could be .blob[.gz])... this kinda boils down to .npy - which was also the choice at https://cyrille.rossant.net/moving-away-hdf5/ ;-)
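The optional-compression idea ("`.blob[.gz]`") can be sketched by treating gzip as a transparent layer over a plain .npy payload. This is a sketch, not a proposed BIDS convention; the ".npy.gz" framing is hypothetical:

```python
import gzip
import io

import numpy as np

arr = np.random.rand(4, 5, 6)

# Serialize to the plain .npy wire format in memory.
raw = io.BytesIO()
np.save(raw, arr)

# Optional archival layer: the same bytes, gzipped (the ".npy.gz" idea).
compressed = gzip.compress(raw.getvalue())

# Reading back needs only gunzip plus the standard .npy loader.
restored = np.load(io.BytesIO(gzip.decompress(compressed)))
assert np.array_equal(arr, restored)
```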

@satra
Collaborator

satra commented Apr 6, 2019

@yarikoptic - be careful of that blog post (i think it leads a lot of people astray), and do read all the threads that have emanated from it. for every such use case it's easy to point to MATLAB and say that they use it for their base data format. also there are enough posts out there saying that people who moved away ended up requiring many of the facilities of hdf5 and switching back to it. finally you should take a look at exdir and zarr as pointed out in earlier threads, and in this followup thread to cyrille's original post and its comments, including the earliest one by konrad hinsen (https://cyrille.rossant.net/should-you-use-hdf5/).

at the end of the day it's mostly about blobs and metadata. what one uses to house and link these things is indeed going to keep evolving depending on use cases. so i think the important thing is to think of the use cases, in both short term and to the extent possible longer term.

i like the questions that you have raised, and i think more than the format itself, the thought process should be around those required features, including archiving.

i'm not saying hdf5 is the answer here nor am i saying hdf5 is issue free, but i have also used it through MATLAB and Python over many years, for my use cases, without an issue. i would need to know their specific goals, applications, and use cases to make an informed judgment.

@maedoc

maedoc commented Apr 6, 2019

We've made simple use of HDF5 (often just one or two datasets) for heavy numerical data (well, MB to TBs) in TVB, a computational modeling environment, for the last 7 years without the problems cited in Rossant's blogpost, mainly by keeping usage simple and heavily vetting library usage prior to version changes. I'd expect transparent compression (lz4 has nearly no CPU overhead) and memmapping are particularly useful for BIDS CM.

@effigies
Collaborator

I've asked the participants in the computational models meeting to contribute their specific use cases, but I'll try to summarize according to my memory.

  1. Visual stimuli, which are 2D arrays of luminance/RGB (or similar) values + time. NIfTI has been used to include these, but it's somewhat an abuse of the format.

  2. Machine-learning training corpora, which will have an item dimension that will often be shuffled on multiple training runs, and other dimensions that have meaningful structure such as space or time which should be preserved.

  3. Simulation state variables. Environment states will look similar to corpora, with some spatial structure, a time dimension, and potentially many runs. Simulated states may or may not be spatially ordered, but still don't fit NIfTI well.

  4. Per-ROI covariance matrices. In the general discussion of statistical outputs, per-voxel statistics are easily represented in NIfTI, and even covariance matrices can be packed into dimensions 5 and 6 of NIfTI. For ROI-based outputs, we have the morphometry and timeseries examples to go by for packing single statistics or time series into TSVs, but multiple dimensions per entry would not work easily. We can get around it by having one file per matrix, and that would presumably be an option, but for large numbers of variables or ROIs, a multidimensional array structure would be useful.

I think there were a couple other examples, but as it became clear that some kind of multidimensional array would likely be the result, we did not compile a specific enumeration of all the needed properties, so hopefully we'll get some feedback.

Perhaps @maedoc can clarify the TVB uses that aren't suited to TSV/NIfTI, and what their minimal criteria and additional desiderata are.

@maedoc

maedoc commented Apr 11, 2019

TVB uses that aren't suited to TSV/NIfTI

Surfaces & sparse matrices come to mind; these have straightforward serializations to arrays, so I would specify conventions for the serialization (e.g. faces, triangles, 0-based; sparse CSR, 0-based) instead of worrying about a new format.
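The 0-based CSR convention mentioned above amounts to exactly three flat arrays, so it serializes into any n-d-array container without a new format. A numpy-only sketch (no scipy; purely illustrative):

```python
import numpy as np

# A small dense matrix with mostly zeros.
dense = np.array([[0., 2., 0.],
                  [1., 0., 0.],
                  [0., 0., 3.]])

# Hand-rolled CSR (0-based), matching the convention suggested above:
#   data    - nonzero values, row-major
#   indices - column index of each value
#   indptr  - row i occupies data[indptr[i]:indptr[i+1]]
data, indices, indptr = [], [], [0]
for row in dense:
    cols = np.nonzero(row)[0]
    data.extend(row[cols])
    indices.extend(cols)
    indptr.append(len(data))
data, indices, indptr = map(np.asarray, (data, indices, indptr))

# Reconstruction from the three flat arrays round-trips exactly.
rebuilt = np.zeros_like(dense)
for i in range(len(indptr) - 1):
    rebuilt[i, indices[indptr[i]:indptr[i + 1]]] = data[indptr[i]:indptr[i + 1]]
assert np.array_equal(dense, rebuilt)
```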

@effigies
Collaborator

Surfaces will be covered in GIFTI. What do you currently use HDF5 for?

@maedoc

maedoc commented Apr 11, 2019

What do you currently use HDF5 for?

We use HDF5 for just about everything except relational metadata, which is stored in an SQL DB and sidecar XML files.

@effigies
Collaborator

Okay.

To get back to @yarikoptic's desiderata:

  • portability and library support -- probably a must...

Agreed, this is most important IMO.

  • efficient random access / slicing / ... - desired or not?
    • relates to parallel processing etc. if just a "good to have" then probably not worth jumping to anything fancy
  • memory mapping - desired or not?

I see these three as basically related. Whether you want slicing for parallel access or just to avoid loading a ton of memory, if this isn't provided, the thing people are going to do is immediately convert to something that can be chunked for good performance over the desired slices and mmaped. Maybe they'll do it out of love for BIDS, but conversions are an adoption hurdle, to my mind.

  • compression - desired or not? optional?

I guess I'd say it should be an option. There are dense data that are difficult to compress where mmap access is going to be a higher priority, but there's also going to be sparse data that would be ridiculous to store without compression.


I may be prematurely pessimistic, but I don't see much hope for pleasing even a simple majority of people with any of the choices discussed here. (I may be projecting, and it may just be the case that I won't be pleased by my prediction of the majority's choice.) Another option to consider is not requiring a specific binary format, letting apps deal with the choice, and waiting for some consensus to emerge in the community. If in a few years all MD arrays are, say, .npy/.npz files, then we can just acknowledge it in BIDS 2.0.

I would then add these conditions:

  1. One MD array per file (or directory, if exdir is used)
  2. Future-proofing
    1. Open formats
    2. Optional lossless compression with an open codec
  3. Standard BIDS metadata
    1. JSON sidecars, with metadata to be defined for each data type
    2. In-file metadata must match JSON metadata where duplication occurs

@maedoc

maedoc commented Apr 16, 2019

I don't see much hope for pleasing even a simple majority of people with any of the choices discussed here

JSON is hardly ideal, but once it's chosen, use cases and implementations can get done, exploring the positives/negatives of the choice. You should just declare a fiat format (import random; random.choice(…)), with the provision that other contenders will have their chance in future iterations.

@effigies
Collaborator

effigies commented Apr 18, 2019

Well, if we can consider JSON an acceptable choice, then I would probably just push on with .npy/.npz, for the simple reasons that it doesn't depend on a decimal serialization, it's mmap-able, can only hold one MD array (and thus doesn't permit complexity), and people have written parsers for MATLAB and R.
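The mmap point in practice: numpy's loader can map a .npy file instead of reading it, so a slice touches only the pages it needs (a minimal sketch):

```python
import os
import tempfile

import numpy as np

# Write a small 3-D array to a plain .npy file.
arr = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
path = os.path.join(tempfile.mkdtemp(), "example.npy")
np.save(path, arr)

# mmap_mode="r" returns a read-only memory map; no bulk read happens here.
view = np.load(path, mmap_mode="r")
sliced = np.asarray(view[1, :, 2])  # only these elements are paged in
print(type(view).__name__, sliced)
```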

@fangq

fangq commented Jun 14, 2019

I just want to let everyone know I am currently working on a new neuroimaging data interchange format, called JNIfTI.

My current draft of the file specification can be found at

https://github.com/fangq/jnifti/

where you can also find a matlab nifti-1/2 to jnifti converter and jnii/bnii data samples:

https://github.com/fangq/jnifti/blob/master/lib/matlab/nii2jnii.m
https://github.com/fangq/jnifti/tree/master/samples

The basic idea is to use JSON and binary JSON (UBJSON) to store complex scientific data, and completely get rid of a rigid, difficult-to-extend binary header. This makes the data readable, easy to extend, and easy to mix with scientific data from other domains (like multi-modal data, physiology recordings, or computational models). There are also numerous JSON/UBJSON parsers out there, so, without writing any new code, a JNIfTI file can be readily parsed by these existing implementations.

JNIfTI is designed with a compatibility layer to 100% translate the NIFTI-1/2 header/data/extension to the new structure, but once it is moved to JSON, you gain enormous flexibility to add new metadata and header info, organize multiple datasets inside one document, etc. I'd love to hear from this community what additional information is currently lacking, and I'm happy to accept proposals on defining new "required" metadata headers in this format. My preference is to gradually shift the main metadata container from the NIFTIHeader structure to the "_DataInfo_" and "Properties" subfields in NIFTIData as the primary containers for metadata. This provides an opportunity to completely redesign the header entries.

https://github.com/fangq/jnifti/blob/master/JNIfTI_specification.md#structure-form

look forward to hearing from you.

PS: The foundation of the JNIfTI format is another specification called JData - a proposal to systematically serialize complex data structures, such as N-D arrays, trees, graphs, linked lists, etc. The JData specification is currently in Draft 1, and can be found at

https://github.com/fangq/jdata/

@CPernet
Collaborator

CPernet commented Oct 11, 2019

I'm also all for @yarikoptic's approach. Note that electrophys derivatives have the same issue, with processed data typically in a different format, and we need a common ground. I discussed HDF5 with @GaelVaroquaux, who has a strong opinion against it (maybe he can comment on that).

I'm sure @jasmainak already made a table of pros and cons of various formats - but I cannot find it?

@CPernet
Collaborator

CPernet commented Oct 11, 2019

as an additional point, I was wondering if we should state somewhere in the specification that any derived data that can be stored using the native format must do so (eg keep nii as long as possible and do not start using the 'whatever other' format we decide to support as well)

@GaelVaroquaux

GaelVaroquaux commented Oct 11, 2019 via email

@effigies
Collaborator

@CPernet I'm not hearing anybody clamoring for HDF5, and several voices at least wary of it. My inclination at this point is to push on with .npy, since there wasn't really any push-back against it.

If we do want to resume consideration of options, I can start a list of pros/cons:

HDF5

Pros:

  • Simple ndarrays without additional internal metadata (i.e., can be equally well packed in .npy) should not suffer from maintenance complexity
  • libhdf5 exists with bindings in many languages
  • Transparent compression
  • Memory mapping

Cons:

  • People can abuse and start building hierarchical structures and encoding metadata directly in the file
  • Dependency on a single reference implementation (libhdf5); implausible to write alternative parsers

The former can be addressed by the spec and easily validated. And it's possible that parsing an HDF5 file with a single data blob would not be very problematic for an independent implementation.

npy

Pros:

  • Simple ndarrays without additional internal metadata are basically all that's allowed
  • Simple structure lends itself to easy reimplementation, at need
  • Existing implementations for multiple languages
  • Compression XOR memory mapping

Cons:

  • Compression XOR memory mapping
  • Possible (somewhat justified) perception of Python-preference baked into standard

as an additional point, I was wondering if we should state somewhere in the specification that any derived data that can be stored using the native format must do so (eg keep nii as long as possible and do not start using the 'whatever other' format we decide to support as well)

I think that might be going a bit far. For instance, per-ROI time series could be encoded in NIfTI, but not very naturally. TSV would make more sense, but a strict reading of this proposed rule would lend itself to contorting to keep things in NIfTI.

But the overall sentiment seems reasonable. I think a simple statement along those lines, but with a SHOULD, such that any deviation would need to be made with good reason, would be useful guidance.

@maedoc

maedoc commented Oct 11, 2019

Pro: libhdf5 exists with bindings in many languages

This is offset by HDF5 being a single, C-based, strictly versioned API/ABI implementation, e.g. a browser-based app can't ingest these files, a JVM app has to go through JNI, Julians who want pure Julia stuff won't be happy, etc.

Compression XOR memory mapping

Is offset by the simple format; asking for simple, fast & small is greedy (have you ever listened to the clock tick while running xz?)

Possible (somewhat justified) perception of Python-preference baked into standard

You don't have to call it NumPy if you reproduce the definition as part of the standard; NumPy "compatibility" falls out as a happy side effect. If the NumPy project decides to change formats down the line, you avoid another problem.

@CPernet
Collaborator

CPernet commented Oct 11, 2019

Following @GaelVaroquaux's 'weak' opinion :-) if maintenance is an issue we should not go for HDF5.
I have nothing against numpy arrays, but you have to consider that SPM is still the most used software for fMRI, that MEEG is mostly Matlab (EEGLAB, FieldTrip, Brainstorm), and many users won't be familiar with it -- if .npy then also .mat, otherwise a language-agnostic format

@CPernet
Collaborator

CPernet commented Oct 11, 2019

as an additional point, I was wondering if we should state somewhere in the specification that any derived data that can be stored using the native format must do so (eg keep nii as long as possible and do not start using the 'whatever other' format we decide to support as well)

I think that might be going a bit far. For instance, per-ROI time series could be encoded in NIfTI, but not very naturally. TSV would make more sense, but a strict reading of this proposed rule would lend itself to contorting to keep things in NIfTI.

But the overall sentiment seems reasonable. I think a simple statement along those lines, but with a SHOULD, such that any deviation would need to be made with good reason, would be useful guidance.

Happy with having a statement and using SHOULD (I was not actually thinking of .nii that much, but .edf for electrophys)

@gllmflndn
Contributor

A few quick comments:

  • HDF5: It might have been "rejected" in the past (was it or just a lack of enthusiasm?) but I guess this can always be revisited if needs be.
  • I looked at npy/npz when it was mentioned here and added a reader in SPM (in spm_load) but I have to say I'm not a big fan of it. According to its specification:

The next HEADER_LEN bytes form the header data describing the array's format. It is an ASCII string which contains a Python literal expression of a dictionary. The dictionary contains three keys:
“descr” (dtype.descr): An object that can be passed as an argument to the numpy.dtype constructor to create the array's dtype.

and compression via a zip file. I think we should aim at something a bit better than that.

  • @fangq made a proposal in this thread based on OpenJData that should be discussed.
  • I remember hearing of another implementation of HDF5. I could now only find pyfive and jsfive - is anyone aware of something else?
  • MathWorks using HDF5 for their .mat file format is not really a poster child story. It is slower than their previous simpler binary format (they had to introduce a flag to disable compression from high level) and they haven't documented how the data are structured within the container (while the previous format has a public specification) making open implementations more difficult.
  • What are the latest thoughts of other options mentioned here, e.g. Zarr?
  • Is there anything to learn from Apache Arrow / Feather / Parquet?

@effigies
Collaborator

@maedoc Thanks for those thoughts.

This is offset by HDF5 being a single, C-based, strictly versioned API/ABI implementation, e.g. a browser-based app can't ingest these files, a JVM app has to go through JNI, Julians who want pure Julia stuff won't be happy, etc.

This is a pretty strong argument against HDF5, IMO. The Javascript validator is critical BIDS infrastructure, so specifying something it can't validate seems like a bad move. There are NodeJS bindings, so one option would be for the browser to warn on ndarrays and say "Use the CLI to fully validate." I don't really like it, but that's an option.

I'm not sure that a distaste for C bindings among some language partisans should be a significant criterion. It's obviously not ideal, but I don't think there are ideal solutions, here.

You don't have to call it NumPy if you reproduce the definition as part of the standard; NumPy "compatibility" falls out as a happy side effect. If NumPy project decides to change formats down the line, you avoid another problem

We haven't done something like this, up to this point. Referencing existing standards has been BIDS' modus operandi, and I think changing that shouldn't be done lightly. We can specify a given version of .npy format, if we aren't comfortable depending on their posture toward backwards compatibility.

@CPernet

have nothing against numpy array but you have to consider that SPM is still the most used software for fMRI, that MEEG is mostly Matlab (EEGLAB, FieldTrip, Brainstorm) and many users won't be familiar with it -- if .npy then also .mat otherwise language agnostic format

Unfortunately, there isn't really a language agnostic format for basic, typed, n-dimensional arrays. .npy is probably the closest that there is, and that's because it's so simple that reimplementing it in another language is very easy: Matlab, C++, R, Julia, Rust
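To substantiate the "very easy" claim: the entire header of a .npy file can be parsed with a few lines of standard-library code. A sketch of a reader's first step, not a full implementation (`read_npy_header` is a name invented here):

```python
import ast
import os
import struct
import tempfile

import numpy as np

def read_npy_header(path):
    """Parse just the header of a .npy file (illustrative sketch)."""
    with open(path, "rb") as f:
        magic = f.read(6)
        assert magic == b"\x93NUMPY", "not an npy file"
        major, minor = f.read(1)[0], f.read(1)[0]
        if major == 1:
            (header_len,) = struct.unpack("<H", f.read(2))  # v1.x: 2-byte length
        else:
            (header_len,) = struct.unpack("<I", f.read(4))  # v2.x: 4-byte length
        header = f.read(header_len).decode("ascii")
        # The header is a Python dict literal:
        # {'descr': ..., 'fortran_order': ..., 'shape': ...}
        return ast.literal_eval(header)

path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, np.zeros((3, 5), dtype=np.float32))
info = read_npy_header(path)
print(info["shape"], info["descr"])
```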

@jasmainak
Collaborator

jasmainak commented Oct 11, 2019

@CPernet the table in question is here although perhaps the discussion here is more sophisticated already than what the table offers. I do remember that support for .npy in Matlab was experimental at the time we wrote the table although this may have changed.

@gllmflndn
Contributor

@effigies

This is a pretty strong argument against HDF5, IMO. The Javascript validator is critical BIDS infrastructure, so specifying something it can't validate seems like a bad move. There are NodeJS bindings, so one option would be for the browser to warn on ndarrays and say "Use the CLI to fully validate." I don't really like it, but that's an option.

Not that I'm too keen on HDF5, but can't we expect this to be solved with WebAssembly? And this makes me come across yet another project...

@Tokazama
Member

Is there a reason that the Arrow format isn't being considered?

@satra
Collaborator

satra commented Jun 26, 2023

@Tokazama - are there examples of storing n-d arrays in arrow? and doing chunked (space and/or time) operations on them?

@rabernat

Arrow is fundamentally a tabular format. You can put "tensors" as items in an Arrow column, but Arrow has no way to represent these as chunks of a larger array, nor does it allow the notion of chunking across multiple dimensions.

@Tokazama
Member

are there examples of storing n-d arrays in arrow? and doing chunked (space and/or time) operations on them?

Any format that supports memory mapping (no inscrutable compression) should be fine for multi-dimensional chunking. Chunking is only an issue with table-like data, where non-uniform encodings mean the stream periodically has to change how values are encoded. I'm not sure what operations besides reading and writing you want. Once it is loaded in any language, it should be treated like any other array data.

Arrow is fundamentally a tabular format. You can put "tensors" as items in Arrow column. But Arrow has no way to represent these as chunks of a larger array, nor does it allow the notion of chunking across multiple dimensions.

Well, that's how it's most often used but I definitely use it to store non-table data.

Is the goal here to find a single format that can be used to represent multidimensional tables and arrays, on disk and on server so we don't have to mess with all the file types we have now? That seems like a tall order (perhaps unrealistic).

@CPernet
Collaborator

CPernet commented Jun 27, 2023

The goal is always the same: share data in a way that is easily understood - nobody says one has to compute with the same format... IMO focusing on anything but accessibility (ie inter-language support) is out of scope.

@effigies
Collaborator

effigies commented Jun 27, 2023

The goal is to generalize over TSV, which permits a collection of named (via the column headers) 1D arrays. Zarr/h5 permit named ND arrays. The isomorphism ensures that many languages can use a single API to access either. This is a new file type (to BIDS) and is not intended to replace other file types.

Last time arrow came up, I did explore it and was able to save/load ND arrays with its Python API. If I recall correctly, this feature was not uniformly implemented for all languages.
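The TSV/ND-array isomorphism described above can be made concrete with plain numpy (illustrative only; the column and array names are made up, and no BIDS naming is implied):

```python
import io

import numpy as np

# A TSV is a collection of named 1-D columns...
tsv = "onset\tduration\n0.0\t1.5\n2.0\t1.5\n"
lines = tsv.strip().split("\n")
names = lines[0].split("\t")
values = np.array([[float(v) for v in line.split("\t")] for line in lines[1:]])
cols = {name: values[:, i] for i, name in enumerate(names)}

# ...while an .npz generalizes the same "name -> array" access pattern to N-D.
buf = io.BytesIO()
np.savez(buf, onset=cols["onset"], corr=np.eye(3))  # a 1-D and a 2-D entry
buf.seek(0)
arrays = np.load(buf)
print(arrays["onset"].tolist(), arrays["corr"].shape)
```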

@fangq

fangq commented Jun 27, 2023

Previously, Chris (@neurolabusc) had created a benchmark project with sample data to allow comparisons of sizes and saving/loading speed of various surface-mesh-related formats. I thought that was a great idea.

would the group be interested in setting up something like that so that everyone can see/test the pros and cons of various volumetric data formats mentioned above? this also gives people a clear expectation on the type of data structures that this discussion is aimed to generalize.

@Tokazama
Member

Benchmarks are great but this discussion is a bit all over the place. Tables of metadata and multidimensional data have separate issues to be considered.

Outperforming a human-readable text-based format for tables is trivial, so comparison of CSV or TSV to anything isn't necessary. If we can agree on a well-established byte storage format for tables, then that's a no-brainer.

ND array formats are more difficult to motivate because it's another format that BIDS compliant software would need to have some level of support for and we already have a method for storing those that can be efficient.

If you're looking at mesh formats, then that should be its own venture to converge on a standard.

Perhaps I'm the odd man out here, and it's perfectly clear to everyone else. If so, my apologies. Otherwise this whole thing needs to be pinned down to a single objective in order to move forward.

@satra
Collaborator

satra commented Feb 13, 2024

it may be good to revive this discussion as i'm seeing a few upcoming use cases that will require a more sophisticated consideration for many things that are now in TSVs.

here is a temporary proposal to narrow down the conversation.

  • apache parquet for table like formats (the reason i'm separating this out is that there are significant efficiencies in not considering this a subset of n-d array).
  • zarr/n5 for n-d array like formats for spatial data (personally i would say zarr and ask the matlab community to tell mathworks to support it). however, there are practical challenges between efficiency and convenience, and there is one upcoming change (sharding) i would wait for before fully embracing it in bids as a default format. given our ent-like nature on some of these discussions, i'm sure this change will happen before we decide.
  • separate discussion on meshes and annotations that are not covered by the above. technically speaking parquet could store many things but would need libraries to do proper operations on these (e.g., smoothing on a mesh).

there are also similar approaches in the increasingly API-based and commercial offerings for the world of storage (polars - columnar storage for table like data, arraylake - version controlled zarr in the cloud)

@effigies
Copy link
Collaborator

I am +1 for parquet to be adopted for any TSV data files (physio, stim, motion, blood). It's an open spec with broad implementation and readily available command-line tools for inspection. I think it should probably be discouraged if not prohibited for metadata files (participants.tsv, samples.tsv, sessions.tsv and scans.tsv, electrodes.tsv, channels.tsv), which benefit from human readability. I think it will often be a poor choice for events.tsv, but I wouldn't rule it out.

I am not sure that there is an actual "to-do" here for N-dimensional named arrays except to adopt them in principle so that a BEP that needs this structure can use it. I do not think there is any call to allow an events.zarr file with 2D onsets or 3D durations. HDF5 and Zarr are both already present in NWB, SNIRF and OME-Zarr.

@CPernet
Copy link
Collaborator

CPernet commented Feb 13, 2024

#1614 is what we proposed for Zarr and HDF5, as we (almost all here in this thread) have discussed -- it was reviewed and passed the CI tests -- it just needs someone to merge it

no idea about parquet - what are the cases you have in mind? I think as long as we have a good case for it and a BIDS example, good open-source formats are what we need

@fangq
Copy link

fangq commented Feb 13, 2024

I don't know how much priority BIDS developers give to Octave - at least for me, for the purpose of batch/parallel processing without being limited by a license, it is quite a useful platform, given that lots of data analysis tools were written in the MATLAB language.

one thing I want to mention is that Octave still does not have full HDF5 support - I ran into this when processing SNIRF data

https://octave.discourse.group/t/saving-struct-and-arrays-to-hdf5-without-octave-new-format-and-transpose/5024

@bendichter
Copy link
Contributor

Hey all, I'm in the neurophysiology/NWB community and just recently getting deep into BIDS. This thread was a great read. It's cool to see such a thriving community here working on this.

Another +1 for the usage of parquet for tabular data, e.g. physio, stim, motion, etc. I like to call these types of data "measurements" and call e.g. participants.tsv, electrodes.tsv, etc. "records." The current TSVs have some problems that are limiting for measurements:

  • you need to truncate decimals which means you lose precision
  • they are very space-inefficient
  • you don't have direct/random access to data

Gzipping the TSVs doesn't really solve any of these issues. Parquet is more performant in read, write, and storage volume, and is an open standard with large cross-platform support.
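
The precision point above can be illustrated with nothing but the standard library: formatting a float for a text table discards digits, while a binary encoding round-trips the exact IEEE-754 value. The fixed precision of 6 digits below is only an example of a typical TSV writer setting.

```python
# Hedged illustration: text formatting loses float64 precision,
# binary encoding does not. Standard library only.
import struct

value = 0.1234567890123456789  # stored as the nearest float64

# Text route: a typical TSV writer formats with fixed precision.
recovered_text = float(f"{value:.6f}")

# Binary route: 8 bytes, exact round-trip.
(recovered_binary,) = struct.unpack("<d", struct.pack("<d", value))
```

Here `recovered_text` no longer equals `value`, while `recovered_binary` does; gzipping the TSV would shrink the file but could not bring those digits back.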

We are looking at adopting BIDS for neurophysiology applications. Without a binary-style filetype option, we would need to convert our efficient data storage solutions into TSV which is a much less efficient/performant file type than the current solution. Being able to use parquet for physio etc. would make me much more comfortable with adopting BIDS.

I also think this should be considered separately from a format for ND data. I have thoughts on that as well but will save that for another post.

@bendichter
Copy link
Contributor

For multi-dimensional arrays, @effigies' post #197 (comment) did a good job of summarizing the pros and cons. IMO it's worth using a standard that allows chunking and compression like HDF5 or Zarr. These multi-dimensional arrays can get quite big and being able to compress and have direct access facilitates certain kinds of applications and use-cases. I'm not aware of a similar standard that only allows a single dataset, but it would not be hard to impose that for HDF5 or Zarr.
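
For a concrete picture of the chunking and compression being described, here is a hedged sketch using `h5py` (HDF5); the file name, dataset name, and chunk shape are illustrative.

```python
# Hedged sketch: a chunked, gzip-compressed HDF5 dataset.
# Names and shapes are illustrative.
import numpy as np
import h5py

data = np.random.default_rng(0).standard_normal((1000, 64)).astype("f4")

with h5py.File("example.h5", "w") as f:
    # Each 100x64 block is stored and compressed independently.
    f.create_dataset("signal", data=data, chunks=(100, 64),
                     compression="gzip")

# Direct access: only the chunks overlapping the slice are read and
# decompressed, not the whole array.
with h5py.File("example.h5", "r") as f:
    window = f["signal"][250:260, :8]
```

Restricting such a file to a single dataset, as suggested above, would be a convention layered on top of this, not a change to the format.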

A good example of a use-case that leverages the compression and direct access of HDF5/Zarr is Neurosift, developed by @magland. This project also demonstrates that you can indeed read HDF5 files in JavaScript as part of web applications. Here is an example of a view that demonstrates streaming a portion of NWB data on DANDI directly from an HDF5 file on S3 into the browser and plotting it. DANDI wants large datasets to be compressed, and Neurosift would not be possible without direct access for large files.

As @satra mentioned, Zarr is better than HDF5 for cloud compute. The issue is really the C library, libhdf5, rather than the HDF5 standard itself. @rabernat mentioned kerchunk, which fixes this problem for HDF5 by indexing all the chunks and creating a JSON file that expresses them as a Zarr dataset. Then you can read the HDF5 file directly using Zarr tools, which can read chunks faster and in parallel. In the last few weeks, @magland built a kerchunk-inspired approach to pre-index HDF5 files on DANDI, which has dramatically sped up Neurosift without moving the data from the original HDF5 files.

Going straight to Zarr is fine for the cloud, but you may run into problems in certain compute environments, e.g. HPC and local machines. With Zarr, every data chunk is its own file, so you may end up with an inode issue; that is to say, you simply have too many files for your filesystem to keep track of. The Zarr community is working on solving this with sharding, as @satra mentioned, which groups chunks into larger files (see ZEP002). This has been accepted by the Zarr community for v3, but to my knowledge has not been implemented yet.

I've been experimenting a bit with an approach that is a combination of @SylvainTakerkart's suggestion and kerchunk. You can write an external file defining a small JSON index that points into any chunked or non-chunked binary data file, Zarr-style. Then you can use the Zarr API to read from Zarr, npy, HDF5, TIFF, and also lots of proprietary formats like SpikeGLX, Blackrock, and OpenEphys binary. You can also define extra metadata in this JSON to annotate dimensions, so the data can be read into Xarray with dimension labels, letting you explicitly label dimensions as well as individual rows/columns. Then you could support HDF5 and Zarr as well as many other source formats without having to copy the data into a standardized file format. I have a gist on this here.
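
A toy version of that external-index idea can be written with only the standard library and numpy: a small JSON "reference" file records where each chunk of a raw binary file lives (path, byte offset, length), and a reader uses seek() to pull out one chunk without touching the rest. Real implementations (kerchunk, the gist mentioned above) follow the Zarr reference spec; everything below — file names, key scheme, metadata fields — is illustrative.

```python
# Hedged toy sketch of a kerchunk-style JSON index into a raw binary
# file. Names, keys, and metadata fields are illustrative only.
import json
import numpy as np

# Write two 100x10 float32 chunks back to back into one raw file,
# recording each chunk's (path, offset, length) in an index.
rng = np.random.default_rng(0)
chunks = {f"{i}.0": rng.standard_normal((100, 10)).astype("f4")
          for i in range(2)}
offset, refs = 0, {}
with open("data.bin", "wb") as f:
    for key, arr in chunks.items():
        buf = arr.tobytes()
        refs[key] = ["data.bin", offset, len(buf)]
        offset += len(buf)
        f.write(buf)
with open("index.json", "w") as f:
    json.dump({"refs": refs, "dtype": "<f4",
               "chunk_shape": [100, 10]}, f)

# Read back only chunk "1.0" using the index -- no other bytes touched.
with open("index.json") as f:
    index = json.load(f)
path, off, length = index["refs"]["1.0"]
with open(path, "rb") as f:
    f.seek(off)
    chunk = np.frombuffer(f.read(length),
                          dtype=index["dtype"]).reshape(index["chunk_shape"])
```

The point of the design is that the heavy data never moves: only the lightweight JSON index is created per source format.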

@CPernet
Copy link
Collaborator

CPernet commented Apr 15, 2024

@bendichter this issue should be closed really -- there is a pull request but we need more examples
see #1614 -- if you could push one, that would be amazing (and of course edit/suggest change in the PR is fine too)

@CPernet
Copy link
Collaborator

CPernet commented Apr 15, 2024

moved to pull request 1614 after Copenhagen meeting

@CPernet CPernet closed this as completed Apr 15, 2024
@Remi-Gau Remi-Gau reopened this Apr 15, 2024
@Remi-Gau
Copy link
Collaborator

@sappelhoff
Copy link
Member

agreed with @Remi-Gau

@CPernet while I understand your motivation to "move forward" with solving this issue and directing attention to the PR that was discussed in the meeting in Copenhagen 2023, I would personally (and in my role as a BIDS maintainer) prefer if we kept the fruitful discussion in this issue open until all its aspects are resolved.

@CPernet
Copy link
Collaborator

CPernet commented Apr 15, 2024

Sure - although editing the PR seems more fruitful...

@Remi-Gau
Copy link
Collaborator

Agreed.

@bendichter @fangq

@CPernet
Copy link
Collaborator

CPernet commented Sep 9, 2024

updated the repo and added an example (with the caveat that it is for data we do not support yet) @effigies
