Restructuring organization for participant level grouping #43

Open
satra opened this issue Aug 17, 2020 · 12 comments
Labels
folder-structure Proposals to reorganize files in the specification. impact: high Estimated high impact change

Comments

@satra

satra commented Aug 17, 2020

In BIDS thus far, the notion of source data versus derived data is a little contrived/vague. For example, a multi-echo T1-weighted recon that comes out of the scanner from a MEMPRAGE sequence is considered source data, while the FA image that also comes out of the scanner is not.

As scanners and other instruments get more advanced and start generating what we traditionally call derivatives (think GPU based processing on the scanner), this will lead to questions of where data goes.

To simplify consideration, the possibility I would like the BIDS community to consider is to separate data not by source vs derivatives, but by participant vs aggregate non-individual. As examples:

Participant

  1. source dicoms
  2. freesurfer recon
  3. fmriprep output
  4. meg windows around individual stimuli
  5. average ERP response
    ...

Aggregate Non-individual

  1. Templates
  2. group statistical maps
  3. (partial) correlations
    ...

This makes it, in my opinion, simpler to consider with regard to both metadata and with respect to provenance.

Would love to hear thoughts on this potential reframing.

@tsalo
Member

tsalo commented Aug 17, 2020

Since this proposal is for 2.0, would this issue perhaps be a better fit for bids-standard/bids-2-devel? BTW, I know that there are a couple of issues there that also propose massive restructuring (e.g., #28, #37).

@satra
Author

satra commented Aug 17, 2020

i don't have the authorization to transfer, but i think it would be a good place for this to go.

@tsalo tsalo transferred this issue from bids-standard/bids-specification Aug 17, 2020
@tsalo tsalo added folder-structure Proposals to reorganize files in the specification. impact: high Estimated high impact change labels Aug 17, 2020
@poldrack

poldrack commented Aug 17, 2020 via email

@satra
Author

satra commented Aug 17, 2020

@poldrack - this phrase is exactly the reason i wrote this.

"source data" i.e. data that came directly from the measurement instrument

we don't treat these things consistently (see the MEMPRAGE and FA example above). and with new tools developing that do significant processing in the scanner itself (e.g., label regions and compute volumes), we would have to as part of the source processing make determinations as to where things would go.

for example, we aggregate across individual images to analyze a timeseries, aggregate across runs or sessions within participant, etc.

but these are still individual-specific. perhaps aggregate is not the right term i meant to use. individual vs non-individual is what i wanted to convey.

@tsalo
Member

tsalo commented Aug 17, 2020

I believe that BEP001 does propose symlinking scanner-computed "derivatives" (like FA maps) from the "raw" dataset to the derivatives folder. This isn't a complete solution, but it does explicitly support derivatives coming directly from the scanner.

@tyarkoni

I agree with @poldrack that any organization we try to impose is going to be intuitive for some applications and problematic for others. I don't feel I have a good sense of which of these two schemes would be preferable, and I'd suggest that we stack these kinds of proposals and then at some point do a UX survey/study asking people what they (think they) prefer.

That said, as a practical matter, I think we should try to maintain backwards compatibility with BIDS 1.0 wherever possible, unless we have a really good reason not to. So, e.g., if 80% of users say that @satra's proposal would make their life considerably easier, then sure, let's break the BIDS 1.0 structure. But if, say, 55% prefer @satra's proposal and 45% prefer the existing scheme, I'd argue that that doesn't really justify having to introduce major changes to the entire tooling ecosystem, break people's habits, etc.

@satra
Author

satra commented Aug 17, 2020

@tsalo - i think using symlinks is not a good option moving forward as storage providers move more towards object stores (so won't work on s3 for example).

@tyarkoni - in general i have always seen bids as a view, and a darn useful organized view, on a more complex underlying information flow model. so yes, there is no perfect view, just a pragmatic one that addresses a large set of use cases. i really like the idea of doing some A/B testing, but in general before we even implement something like this, i would like a discussion of considerations as to how many folks would find the view useful.

so here are some use cases where the participant-centered view can be useful.

  • aggregation of individuals across datasets
  • sharing/removing individual participants
  • decisions about where to find information about an individual
  • longitudinal applications within an individual
  • privacy protection (everything in a subject folder is subject to privacy considerations)
  • provenance (most things are derived from other things within a participant object)
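
to make the "sharing/removing individual participants" case concrete, here is a rough sketch (not part of any BIDS tooling) of what it takes today to locate one participant's data when derivatives live under `derivatives/<pipeline>/`. the helper name `participant_dirs` and the toy folder names are made up for illustration; the only assumption about layout is the usual BIDS 1.x one of `sub-<label>/` at the top level and again inside each pipeline folder.

```python
from pathlib import Path
import tempfile


def participant_dirs(dataset_root, label):
    """Return every directory holding data for one participant in a BIDS 1.x layout."""
    root = Path(dataset_root)
    name = f"sub-{label}"
    candidates = [root / name]
    # Each pipeline under derivatives/ keeps its own sub-<label> tree.
    candidates += sorted((root / "derivatives").glob(f"*/{name}"))
    return [p for p in candidates if p.is_dir()]


# Build a toy dataset to show how one participant is spread out today.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in (
        "sub-01/anat",
        "sub-02/anat",
        "derivatives/fmriprep/sub-01/func",
        "derivatives/freesurfer/sub-01/mri",
    ):
        (root / rel).mkdir(parents=True)
    found = [p.relative_to(root).as_posix() for p in participant_dirs(root, "01")]
    print(found)
```

in a participant-centered layout the same question would, in principle, be answered by a single folder, which is the simplification these use cases are pointing at.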

ps. i haven't yet commented on the hierarchy principle issue, but will do so sometime soon. it's a complex issue and relates to this proposal as well.

@yarikoptic
Contributor

yarikoptic commented Aug 26, 2020

Sorry -- my reply came out long, but I think this issue touches on many largely orthogonal issues and should be broken into separate ones. So I added some sectioning.

raw-vs-derived -- everything is derived!

In BIDS thus far the notion of source data and derived data is a little contrived/vague.

I can only repeat an idiom I think BIDS should just accept and promote: any BIDS data(set) is derived data(set). Accepting it would IMHO resolve aforementioned contradiction.
It is exemplified by many already existing provisions in BIDS mentioned above and a simple fact that BIDS provisions for sourcedata/ -- in my view anything which has "source (data)" it came from is "derived (data)".
I think such idiom is not in conflict with a notion that BIDS 1.x dataset to contain "raw" data (as close to the origin of the data, just merely harmonized to conform BIDS). Taking it further, common derivatives dataset is just an enhancement on top of BIDS 1.x "raw" - it is a possible overlay on top of it (i.e. can be original "raw" + processed files where necessary).
I think a possible way forward is to provision in dataset_description.json a field listing the "tiers" (or "features" or ... ?) of the dataset: "raw", "common-derivatives" (or simply "processed"), and just provide guidance on when to augment "raw" BIDS with derived data (anonymization, close-to-raw preprocessing, etc.), and when to produce "separate" derivatives (big pipelines output).
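
As a rough illustration only: such a field could look like the sketch below. `Name`, `BIDSVersion` and `DatasetType` are existing dataset_description.json fields; the key `Tiers`, the allowed vocabulary, and the helper `unknown_tiers` are hypothetical names invented here, not anything BIDS defines.

```python
import json

# "Tiers" is a hypothetical field name used only to illustrate the suggestion.
description = {
    "Name": "Example study",
    "BIDSVersion": "1.4.0",
    "DatasetType": "raw",
    "Tiers": ["raw", "common-derivatives"],
}

# Assumed vocabulary, taken from the tier names floated in this comment.
ALLOWED_TIERS = {"raw", "common-derivatives"}


def unknown_tiers(desc):
    """Return declared tiers that are not in the assumed vocabulary."""
    return sorted(set(desc.get("Tiers", [])) - ALLOWED_TIERS)


print(json.dumps(description, indent=2))
print(unknown_tiers(description))
```

A validator could then warn on anything `unknown_tiers` returns, while datasets without the field would keep behaving as they do now.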

participant-vs-aggregate -- orthogonal issue, can be BIDS 1.x compatible

... to consider is to separate data not by source vs derivatives, but by participant vs aggregate non-individual.

I think it is largely an orthogonal aspect to raw-vs-derived (again -- everything in BIDS is derived IMHO ;)). Even though hardware ATM does not produce "aggregates", I do not see why it hypothetically couldn't, and my wild prediction would be that at some point it might produce population templates per study etc. So I would have added it as an additional "feature" either explicitly (again annotated in dataset_description.json with e.g. "levels": ["subject", "subject/session", "session", "study"] - ATM there are just the implicit "subject" and "subject/session" levels) or implicitly (just by virtue of being present under a standardized location, e.g. agg-<label>/ folders accompanied by aggregates.json describing aggregates, or ses- for aggregates for sessions across subjects; sub-*/ without ses-/ subfolder for aggregates across sessions within subject, with all "derivatives" annotation). BIDS then would standardize layout/naming in those folders to follow the overall BIDS naming approach (which would largely be "drop sub- and/or ses- prefix depending on the group level" + introduce missing entities to standardize composition annotation). And most likely it could be worked out in a "backward compatible" way with BIDS 1.x and thus even be introduced prior to BIDS 2.
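
A minimal sketch of the "implicit" variant, assuming the level is read purely from which of the sub-/ses- folders appear in a file's path. The function name `infer_level` and the example paths (including the `agg-template` label) are hypothetical; nothing here is defined by BIDS.

```python
from pathlib import PurePosixPath


def infer_level(relpath):
    """Guess the grouping level of a file from the folders it sits in."""
    # Look only at directory components, not at the file name itself.
    folders = PurePosixPath(relpath).parts[:-1]
    has_sub = any(p.startswith("sub-") for p in folders)
    has_ses = any(p.startswith("ses-") for p in folders)
    if has_sub and has_ses:
        return "subject/session"
    if has_sub:
        return "subject"      # aggregate across sessions within a subject
    if has_ses:
        return "session"      # aggregate for a session across subjects
    return "study"            # e.g. agg-<label>/ content


for path in (
    "sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz",
    "sub-01/anat/sub-01_T1w.nii.gz",
    "ses-01/anat/ses-01_T1w.nii.gz",
    "agg-template/anat/T1w.nii.gz",
):
    print(path, "->", infer_level(path))
```

Because existing datasets only ever hit the first two branches, a rule like this would not change how BIDS 1.x trees are interpreted.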

Composition -- yet for BIDS to standardize a bit more

Another aspect which I think is discussed above without giving it an explicit name is "composition": we have not reached an ultimate agreement and thus have not provided definite guidance on how BIDS datasets are composed together. Yes -- it was improved significantly with common derivatives adding a 2nd "alternative" composition in common-principles. But IMHO my_dataset there should be promoted (or at least described as typically corresponding to a "study level") to the explicit scope of a "study", which makes sense since there could be multiple ways to combine/process etc. "raw collected data" for any study. IMHO it is largely due to this absent "study" level standardization (just an "alternative" now) that "raw" BIDS originally provisioned having derivatives/ only as a subfolder within raw BIDS. BIDS itself has sourcedata/ and it "scales" to the derivatives as well: any derivative dataset can have sourcedata/ pointing to one (or many -- see below on SourceDatasets) source (possibly BIDS) datasets, thus allowing them to be "also" instantiated (installed/uninstalled in DataLad land) under corresponding sourcedata/ within rawdata/ and derivatives/.

Note that a "study" even emerged naturally while preparing fmriprep Nature protocols paper, where there was $STUDY/{ds000003/,derivatives/} thus not sticking all derivatives within BIDS dataset. One approach to general (non-BIDS specific really) composition is YODA principles (see e.g. reused YODA figure in ReproNim/containers).

Linking+Provenance -- platform specific features should be avoided in BIDS but "acknowledged"

Decision on how to "compose" would affect "provenance" and thus possibility/fragility to any type of "linking" across datasets/modules. E.g. under YODA principles, all necessary components for dataset generation should be reachable "under" that dataset boundary/directory. So you could make a cut at $STUDY level and have everything to produce that study. You could take a derivative/* dataset and have everything to produce that derivative (by it having source BIDS datasets "referenced" and "instantiatable" under e.g. sourcedata/). IMHO BIDS is almost there (see above on composition and SourceDatasets) but it should embrace and promote such idiom more (instead of having it merely an alternative).

Aforementioned composition talks about "dataset(s)" level. Discussions on "symlinking" (e.g. relevant non-completed discussion in BEP001) probably could be addressed by

  1. allowing derived data be placed alongside with "raw" (see above on "overlay"):
  • either that a file is a regular file or a symlink (where file system allows) on a particular dataset instance should not matter to BIDS!
  • tools reading BIDS datasets must not dereference and follow symlinks (that is something to annotate for in BIDS, ref: never finished PR)
  • if distribution/archival platform and receiving file system allows for symlinks -- they could be used/preserved, if not -- de-referenced (either by distribution or by the receiving tool, needs investigation).
  • the main point is that symlinks, if chosen to be used for "internal" to a dataset deployment structuring, should not cross boundaries of the dataset ("module" in YODA terms) in its distribution (tarball, git repo, etc).
  2. provenance annotation on how any particular file was produced (generic provenance, applicable also to >90% of "raw" files)
  3. referencing "subdatasets". common-derivatives already introduced SourceDatasets but it should be polished a bit more: IMHO it should not be just a list but an association for a folder (or subfolder(s) under sourcedata/).
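
To illustrate the "symlinks should not cross the dataset boundary" point above, here is a rough sketch of a check a tool could run before packaging a dataset. It assumes a POSIX file system where symlinks can be created; the function name `escaping_symlinks` and the toy file names are hypothetical, and this is not how any existing validator works.

```python
import os
import tempfile
from pathlib import Path


def escaping_symlinks(dataset_root):
    """List symlinks (relative to the root) whose targets resolve outside the dataset."""
    root = Path(dataset_root).resolve()
    offenders = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                target = path.resolve()
                if target != root and root not in target.parents:
                    offenders.append(path.relative_to(root).as_posix())
    return sorted(offenders)


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset"
    outside = Path(tmp) / "outside"
    anat = dataset / "sub-01" / "anat"
    anat.mkdir(parents=True)
    outside.mkdir()
    (anat / "sub-01_T1w.nii.gz").touch()
    (outside / "fa.nii.gz").touch()
    # Internal link: stays inside the dataset, so it is acceptable.
    (anat / "link_inside.nii.gz").symlink_to("sub-01_T1w.nii.gz")
    # External link: crosses the dataset boundary.
    (anat / "link_outside.nii.gz").symlink_to(outside / "fa.nii.gz")
    result = escaping_symlinks(dataset)
    print(result)
```

Whether offending links get rejected or de-referenced on export is the part that, as noted above, still needs investigation.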

Many of aforementioned aspects do not even need to wait for 2.0 IMHO, i.e. could be introduced in backward compatible way.

@robertoostenveld

+1 for "everything is derived"!

The others are more subtle to simply give a positive or negative vote.

@poldrack

poldrack commented Aug 27, 2020 via email

@satra satra changed the title from "Consideration for a BIDS 2.0+ restructuring" to "Restructuring organization for participant level grouping" Sep 28, 2020
@yarikoptic
Contributor

@satra
Given that we formalized operational definition of "derivative" to be a "BIDS dataset derived from other BIDS dataset(s)", could/should we consider this issue overall addressed? Note that we also have specific issues which IMHO relate such as

and others slated for BIDS 2.0 in https://github.com/orgs/bids-standard/projects/10 .

If not resolved/sufficiently covered by other issues -- what specific changes would you propose?

@satra
Author

satra commented May 18, 2024

@yarikoptic - i think the intent of this issue was primarily to ask whether some aspects of organization are participant/session/cohort/group specific. some of it would indeed benefit from simply having the provenance, but others would need some notion of separating groupings of derivatives, e.g. something like a group average connectome would be different from individual connectomes. i think you note all of these in your response above, but i'm not sure they are mapped to specific other issues.


6 participants