Support zipped subject and session folders in derivatives datasets #3151

tsalo · 2024-09-16T17:06:39Z

What would you like to see added?

The current standard practice for running Datalad-tracked processing jobs on HPCs is FairlyBig. For many reasons, FairlyBig stores processing results in zip files, which are tracked in datalad instead of the sometimes huge numbers of individual derivatives files. On our end, we’ve encountered server issues (e.g., running out of inodes) when trying to process and analyze large datasets without zipping subject- or session-level folders.
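To make the request concrete, here is a minimal sketch of the kind of per-subject zipping described above. It is not FairlyBig's actual implementation; the function name `zip_subject_folders` and the assumption that subject folders match `sub-*` at the top level of the derivatives directory are ours, for illustration only.

```python
import shutil
from pathlib import Path


def zip_subject_folders(derivatives_dir: Path) -> list[Path]:
    """Replace each sub-* folder with a sub-*.zip archive and return the archives.

    Hypothetical helper: assumes subject folders sit directly under
    derivatives_dir and are named sub-<label>.
    """
    derivatives_dir = derivatives_dir.resolve()
    archives = []
    for subject_dir in sorted(derivatives_dir.glob("sub-*")):
        if not subject_dir.is_dir():
            continue  # skip files, including archives from an earlier run
        # make_archive appends ".zip" to base_name on its own.
        archive = shutil.make_archive(
            base_name=str(subject_dir),
            format="zip",
            root_dir=str(derivatives_dir),
            base_dir=subject_dir.name,
        )
        # Remove the many small files once they are safely archived.
        shutil.rmtree(subject_dir)
        archives.append(Path(archive))
    return archives
```

With this layout, the dataset tracks one file per subject rather than every individual derivative, which is what keeps the inode count and the git index small.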

As a lab, we have processed large BIDS datasets and made the derivatives available as datalad datasets (e.g., https://github.com/ReproBrainChart/NKI_CPAC), and we have found that indexing all of the individual files slows things down to an almost unusable pace. Even storing individual subjects or sessions as subdatasets improves the situation only for the initial clone; after that, everything becomes very slow again.

We’re going to be running quite a few datasets from OpenNeuro through FairlyBig and would like to share the results on OpenNeuro. Currently, we’d need to unzip the results before upload. We were wondering whether this is something OpenNeuro’s infrastructure will handle well (our local git operations can sometimes take hours to complete on unzipped derivatives datasets) or whether you would be interested in supporting uploaded zip files full of derivatives.

We don’t know the details of how the OpenNeuro infrastructure works and would be happy to also upload unzipped data, but wanted to get a discussion started just in case it would be easier for you to store zipped results.

We were thinking this might also help with problems on OpenNeuro’s end with datasets with many files, like #3011. We don’t know what effect this would have on the online file browser or BIDS validation (i.e., could those be run on unzipped versions of the datasets?).

Pinging @mattcieslak, who basically came up with this idea. Also @mih, who might have thoughts.

Alternatives

We can continue to push unzipped derivatives to OpenNeuro, but we will need to keep zipped versions on our server (which, of course, isn’t OpenNeuro’s problem).

Or… would it be possible to datalad install the datasets with zipped folders?

Do you have any interest in helping implement the feature?

No

Additional information / screenshots

No response
