Support zipped subject and session folders in derivatives datasets #3151

Open
tsalo opened this issue Sep 16, 2024 · 0 comments

What would you like to see added?

The current standard practice for running DataLad-tracked processing jobs on HPC systems is FairlyBig. For many reasons, FairlyBig stores processing results in zip files, and DataLad tracks those archives instead of the sometimes huge number of individual derivatives files. On our end, we’ve encountered server issues (e.g., running out of inodes) when trying to process and analyze large datasets without zipping subject- or session-level folders.
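To make the zipping step concrete, here is a minimal Python sketch of packing each subject folder into a single archive, so that a dataset tracks one file per subject instead of every individual derivative. This is not FairlyBig's actual code; the function name `zip_subject_folders` and the `sub-*.zip` naming are assumptions made for illustration only.

```python
import zipfile
from pathlib import Path


def zip_subject_folders(deriv_root: Path) -> list[Path]:
    """Zip every sub-* folder under deriv_root into sub-*.zip next to it.

    Hypothetical helper for illustration; the original folders are left
    in place so the caller can decide when to remove them.
    """
    archives = []
    for subject_dir in sorted(deriv_root.glob("sub-*")):
        if not subject_dir.is_dir():
            continue  # skip files such as archives from an earlier run
        archive = deriv_root / f"{subject_dir.name}.zip"
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for path in sorted(subject_dir.rglob("*")):
                if path.is_file():
                    # Store paths relative to the dataset root so that
                    # unzipping recreates sub-XX/... in place.
                    zf.write(path, path.relative_to(deriv_root).as_posix())
        archives.append(archive)
    return archives
```

Session-level zipping would follow the same pattern with a `sub-*/ses-*` glob.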

As a lab, we have processed large BIDS datasets and made the derivatives available as datalad datasets (e.g., https://github.com/ReproBrainChart/NKI_CPAC), and we have found that indexing all of the individual files slows things down to an almost unusable pace. Storing individual subjects or sessions as subdatasets improves the situation only for the initial clone; after that, everything becomes very slow again.

We’re going to be running quite a few datasets from OpenNeuro through FairlyBig and would like to share the results on OpenNeuro. Currently, we’d need to unzip the results before upload. We were wondering whether OpenNeuro’s infrastructure will handle unzipped derivatives well (our local git operations on unzipped derivatives datasets can sometimes take hours to complete) or whether you would be interested in supporting uploaded zip files full of derivatives.

We don’t know the details of how the OpenNeuro infrastructure works and would be happy to also upload unzipped data, but wanted to get a discussion started just in case it would be easier for you to store zipped results.

We were thinking this might also help with problems on OpenNeuro’s end with datasets with many files, like #3011. We don’t know what effect this would have on the online file browser or BIDS validation (i.e., could those be run on unzipped versions of the datasets?).
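On the file browser question, one relevant point is that a zip archive's file listing can be read without extracting anything. The sketch below shows that with Python's standard `zipfile` module. It says nothing about how OpenNeuro or the BIDS validator actually work, and the function name `list_zipped_files` is hypothetical.

```python
import zipfile
from pathlib import Path


def list_zipped_files(archive: Path) -> list[str]:
    """Return the file paths stored in a zip archive without extracting it.

    Hypothetical helper for illustration; directory entries (names ending
    in "/") are skipped so only files are listed.
    """
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist() if not name.endswith("/")]
```

A listing like this could in principle feed a browser view or a filename-level check, though anything that needs file contents would still have to read the individual members.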

Pinging @mattcieslak, who basically came up with this idea. Also @mih, who might have thoughts.

Alternatives

We can continue to push unzipped derivatives to OpenNeuro, but we will need to keep zipped versions on our server (which, of course, isn’t OpenNeuro’s problem).

Or… would it be possible to datalad install the datasets with zipped folders?

Do you have any interest in helping implement the feature?

No

Additional information / screenshots

No response
