Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Move rawdata/ into sourcedata/raw in alternative structure example, clarify on naming of datasets themselves #1741

Merged
merged 10 commits into from
Apr 25, 2024

Conversation

yarikoptic
Copy link
Collaborator

@yarikoptic yarikoptic commented Mar 21, 2024

…dataset to be BIDS dataset

This is my take on an extended discussion about ambiguity of
`rawdata/` example:
https://github.com/bids-standard/bids-specification/pull/1734/files#r1534475631
Prior one bundled naming aspect under the same MUST. I separated into
separate sentences, added explicit statement that BIDS does not
prescribe a particular naming scheme for source data. And added
explicit RECOMMENDED on the example how to organize/name files there.
…mpliant'

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "sed -i -e s,rawdata,noncompliant,g tools/schemacode/bidsschematools/validator.py tools/schemacode/bidsschematools/tests/test_validator.py tools/schemacode/bidsschematools/tests/data/expected_bids_validator_xs_write.log",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
Copy link

codecov bot commented Mar 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.79%. Comparing base (4c642bd) to head (1c029c6).
Report is 53 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1741      +/-   ##
==========================================
- Coverage   87.93%   87.79%   -0.15%     
==========================================
  Files          16       16              
  Lines        1351     1360       +9     
==========================================
+ Hits         1188     1194       +6     
- Misses        163      166       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Remi-Gau
Copy link
Collaborator

What the example looks like

image

Copy link
Collaborator

@Remi-Gau Remi-Gau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am good with this approach. Hopefully this should prevent confusion.

@yarikoptic
Copy link
Collaborator Author

What the example looks like

anal me notices now that dataset_description.json in that example is "inconsistent" it its ordering with directories...

apparently we do not have full consistency but mostly list it AFTER the folders... but then logically it might need ... before if there are sub- folders, and after to signal that more files are possible...
src/appendices/qmri.md-   {
src/appendices/qmri.md-    "ds-example": {
src/appendices/qmri.md-        "derivatives": {
src/appendices/qmri.md-            "qMRLab": {
src/appendices/qmri.md:                "dataset_description.json": "",
src/appendices/qmri.md-                "sub-01": {
src/appendices/qmri.md-                    "anat": {
src/appendices/qmri.md-                        "sub-01_T1map.nii.gz": "",
src/appendices/qmri.md-                        "sub-01_T1map.json": "",
--
src/common-principles.md-    "my_dataset-1": {
src/common-principles.md-        "sourcedata": {
src/common-principles.md-            "dicoms": {},
src/common-principles.md-            "raw": {
src/common-principles.md:                "dataset_description.json": "",
src/common-principles.md-                "sub-01": {},
src/common-principles.md-                "sub-02": {},
src/common-principles.md-                "...": "",
src/common-principles.md-            },
--
src/common-principles.md-            "pipeline_1": {},
src/common-principles.md-            "pipeline_2": {},
src/common-principles.md-            "...": "",
src/common-principles.md-        },
src/common-principles.md:        "dataset_description.json": "",
src/common-principles.md-        "...": "",
src/common-principles.md-    }
src/common-principles.md-   }
src/common-principles.md-) }}
--
src/common-principles.md-            },
src/common-principles.md-            "sub-01": {},
src/common-principles.md-            "sub-02": {},
src/common-principles.md-            "...": "",
src/common-principles.md:            "dataset_description.json": "",
src/common-principles.md-            }
src/common-principles.md-        }
src/common-principles.md-    ) }}
src/common-principles.md-
--
src/common-principles.md-        },
src/common-principles.md-    "derivatives": {},
src/common-principles.md-    "README": "",
src/common-principles.md-    "participants.tsv": "",
src/common-principles.md:    "dataset_description.json": "",
src/common-principles.md-    "CHANGES": "",
src/common-principles.md-    }
src/common-principles.md-) }}
src/common-principles.md-
--
src/derivatives/common-data-types.md-    "raw/": {
src/derivatives/common-data-types.md-        "CHANGES": "",
src/derivatives/common-data-types.md-        "README": "",
src/derivatives/common-data-types.md-        "channels.tsv": "",
src/derivatives/common-data-types.md:        "dataset_description.json": "",
src/derivatives/common-data-types.md-        "participants.tsv": "",
src/derivatives/common-data-types.md-            "sub-001": {
src/derivatives/common-data-types.md-                "eeg": {
src/derivatives/common-data-types.md-                    "sub-001_task-listening_events.tsv": "",
--
src/longitudinal-and-multi-site-studies.md-            },
src/longitudinal-and-multi-site-studies.md-        "sub-control01_sessions.tsv": "",
src/longitudinal-and-multi-site-studies.md-        },
src/longitudinal-and-multi-site-studies.md-    "participants.tsv": "",
src/longitudinal-and-multi-site-studies.md:    "dataset_description.json": "",
src/longitudinal-and-multi-site-studies.md-    "README": "",
src/longitudinal-and-multi-site-studies.md-    "CHANGES": "",
src/longitudinal-and-multi-site-studies.md-    }
src/longitudinal-and-multi-site-studies.md-) }}

I will at least unify for that now in a separate commit (easy to drop if we decide not worth it)

@yarikoptic
Copy link
Collaborator Author

note: failed to upload coverage, I restarted that pipeline to get it hopefully green.

@yarikoptic
Copy link
Collaborator Author

yes, it is all green. Waiting for more blessings. @effigies @tsalo -- WDYT?

@effigies
Copy link
Collaborator

This feels more confusing to me.

└─ my_dataset-1/
   ├─ sourcedata/
   │  ├─ dicoms/
   │  ├─ raw/
   │  │  ├─ sub-01/
   │  │  ├─ sub-02/
   │  │  ├─ ... 
   │  │  └─ dataset_description.json 
   │  └─ ... 
   ├─ derivatives/
   │  ├─ pipeline_1/
   │  ├─ pipeline_2/
   │  └─ ... 
   ├─ dataset_description.json 
   └─ ... 

Why is there now a dataset_description.json? What are the contents of this outer dataset apart from sourcedata/ and derivatives/?

I still don't understand why what is present is confusing, but maybe it would be better just to remove it. I would like to move to a Diataxis approach, where we clearly separate out the helpful hints and how-tos from the reference material. Maybe it could be revived in a how-to guide in a way that causes fewer problems, if there's any value to it.

Copy link
Collaborator

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nit-picks, if the consensus is that this is an improvement and gets cleared to merge. I don't think we want to imply validation of these new bits.

src/common-principles.md Outdated Show resolved Hide resolved
src/common-principles.md Outdated Show resolved Hide resolved
@yarikoptic
Copy link
Collaborator Author

└─ my_dataset-1/
...
Why is there now a dataset_description.json? What are the contents of this outer dataset apart from sourcedata/ and derivatives/?

IMHO in BIDS specification we SHOULD only talk about BIDS datasets. We SHOULD NOT talk about arbitrary conventions which could exist outside of BIDS specification. Hence IMHO this example SHOULD be a BIDS dataset as well -- AFAIK nothing in BIDS specification would state that it is an illegitimate BIDS dataset, or am I wrong?

I still don't understand why what is present is confusing, but maybe it would be better just to remove it. I would like to move to a Diataxis approach, where we clearly separate out the helpful hints and how-tos from the reference material. Maybe it could be revived in a how-to guide in a way that causes fewer problems, if there's any value to it.

IMHO it is useful to give users guidance. My problem with current formulation is that it is misleading. This PR IMHO clarifies it, unless it is indeed "not BIDS compliant".

@effigies
Copy link
Collaborator

Hence IMHO this example SHOULD be a BIDS dataset as well -- AFAIK nothing in BIDS specification would state that it is an illegitimate BIDS dataset, or am I wrong?

I believe the validator would currently complain that there are no validatable files. You can argue that's wrong, but from an OpenNeuro perspective, I would want to reject such a dataset as it's effectively a blanket .bidsignore. But we would also reject what you're replacing, since it wasn't a BIDS dataset, but a layout demonstrating one way to include BIDS datasets alongside their sources and derivatives.

IMHO it is useful to give users guidance. My problem with current formulation is that it is misleading. This PR IMHO clarifies it, unless it is indeed "not BIDS compliant".

I find it unhelpful at best, as it drops everything but a dataset_description.json and two opaque directories that could potentially hold nested datasets. This is why I think that a how-to guide would be better, where it discusses these options in detail, and not normatively.

@yarikoptic
Copy link
Collaborator Author

there are no validatable files

wrong. There is dataset_description.json. Might be a bug in bids-validator then ;-)

You can argue that's wrong, but from an OpenNeuro perspective, I would want to reject such a dataset as it's effectively a blanket .bidsignore.

as we discussed -- this sounds like an archive specific desire/behavior.

But we would also reject what you're replacing, since it wasn't a BIDS dataset, but a layout demonstrating one way to include BIDS datasets alongside their sources and derivatives.

ok, let's meet half-way: what if I remove dataset_specification.json for now , and mostly revert that added statement that this is a valid bids dataset? and then propose separate follow up PR on that particular aspect since I think we would want also some explicit DatasetType = "study" or alike (e.g. "project" to possibly avoid conflict with BEP035) for such cases.

Then we stay with "convention"

@effigies
Copy link
Collaborator

Sure, if we drop dataset_description.json and ... and do not say or imply that the outer dataset is a BIDS dataset, then this is basically the same as before. I suppose that's fine.

@yarikoptic
Copy link
Collaborator Author

To expedite and ease re-reviewing -- I committed that final form of suggestions .

src/common-principles.md Outdated Show resolved Hide resolved
Co-authored-by: Chris Markiewicz <effigies@gmail.com>
@sappelhoff
Copy link
Member

This is what the example now looks like:

https://bids-specification--1741.org.readthedocs.build/en/1741/common-principles.html#source-vs-raw-vs-derived-data

grafik

I personally find it confusing that the BIDS raw data is now listed under sourcedata, because the way "people" usually think about the directories is:

  1. sourcedata (what ever was recorded)
  2. BIDS formatted raw data (converted from sourcedata)
  3. derivatives (obtained from BIDS formatted raw data)

nesting "2" under "1" is technically possible, but I find it unhelpful/weird.

Having that said, @Remi-Gau and @effigies seem to have discussed this at length / thought about it, so I will defer to them.

@yarikoptic
Copy link
Collaborator Author

there is no 'ideal' setup per se, but if datalad used, then 2. (sourcedata/raw) could also have (possibly subset of only relevant ones) of sourcedata/ (like sourcedata/dicoms) be installed into its own sourcedata/raw/sourcedata . The idea here is just to make it all "flat" and separated only to sourcedata/ (bids or not) and derivatives/ (what computed from those).

@yarikoptic
Copy link
Collaborator Author

@effigies @sappelhoff so are we go on this one or more tune ups?

@effigies effigies changed the title [FIX] Move rawdata/ into sourcedata/raw, ensure example is a BIDS-compliant dataset, clarify on naming of datasets themselves [FIX] Move rawdata/ into sourcedata/raw in alternative structure example, clarify on naming of datasets themselves Apr 25, 2024
@effigies effigies merged commit 90ec07f into bids-standard:master Apr 25, 2024
25 of 26 checks passed
yarikoptic added a commit to yarikoptic/bids-specification that referenced this pull request Jun 20, 2024
This reverts commit a3c12f8 where I have tried
to introduce it in
bids-standard#1741 but it required a
little more of further detailing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
4 participants