-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] clarify which file to list in scans.tsv for file formats with multiple files #1178
[FIX] clarify which file to list in scans.tsv for file formats with multiple files #1178
Conversation
src/03-modality-agnostic-files.md
Outdated
Recordings with multiple parts MUST be documented with one row per file. | ||
It is RECOMMENDED that if the multiple parts of a recording stem solely from the file format, | ||
that file is listed, which would also be passed to software functions for | ||
reading (for example `.vhdr` for BrainVision and `.set` for EEGLAB). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the wording needs to be adjusted. In the previous sentence, file formats with multiple files are treated the same as file collections, in which each file MUST be documented separately in the scans.tsv. However, in the next sentence, it is RECOMMENDED to not do that for these file formats specifically.
I think that you need to distinguish between file collections and multi-file file formats from the beginning, so that it's clear that multi-file format files shouldn't be listed separately in the scans.tsv file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually wonder whether it's not better to be consistent and also document multi-file file formats with one row per file then. That is, BrainVision recordings would have 3 rows per recording in scans.tsv
, and EEGLAB recordings that make use of their optional .fdt
file get 2 rows per recording in scans.tsv
.
It's less pretty than just listing the header file there (.vhdr
for BrainVision, .set
for EEGLAB), but not sure if it's worth introducing a special rule then 🤔 As far as I know we don't have other multi-file file formats (excluding those that are directories, .. here the whole directory gets listed)
What do you think @hoechenberger ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My proposal would be to allow filename
to be an array of strings (pity that it's scans.tsv
and not scans.json
!!), this would ensure we still have only one row per scan. Still – which file should we try to load? And what about split files?
-> Considering this, I'm leaning toward listing only the VHDR file for BV and SET file for EEGLAB.
But there shouldn't be a RECOMMENDED, it should be a MUST, and we shouldn't just propose, but enforce which file to list. I don't like all this "well yeah umm you could but you don't have do, do what you like" stuff in the standard, it makes software implementations such an unnecessary pain to develop!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still – which file should we try to load?
That is less of a problem than it seems because all data formats and their extensions are documented in BIDS and it's easy to include a logic like:
"if VMRK found, this is BrainVision, so look for VHDR, if it doesn't exist, raise error"
And what about split files?
Here we actually recommend that people write ALL split files, see:
- https://bids-specification.readthedocs.io/en/latest/99-appendices/14-glossary.html#split-entities
- implemented in [FIX] split MEG files should be listed separately in scans.tsv #735
- (after mne-bids discussion in scans.tsv for split files mne-tools/mne-bids#692)
So that further makes me think that listing all neural recordings in scans.tsv makes sense ... also for the two exceptions we have: EEGLAB (set+fdt) and BrainVision (vhdr, vmrk, eeg) files 🤔
But there shouldn't be a RECOMMENDED, it should be a MUST, and we shouldn't just propose, but enforce which file to list.
right, but this is the age-old discussion about backward compatibility and the balance we are treading ... I think things will get better in the long run with the move to the BIDS schema, which will eventually allow validation of BIDS datasets against particular BIDS schema versions (something we cannot currently do with bids-validator) --> and I think that may make us all more comfortable with breaking changes and including rules that are more strict.
(as you know I do in principle agree with you that we need more MUST and less SHOULD)
TL;DR: So I think I'll revamp this PR so that all BrainVision files and all EEGLAB files SHOULD be listed in scans.tsv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR: So I think I'll revamp this PR so that all BrainVision files and all EEGLAB files SHOULD be listed in scans.tsv
I really don't think this is a good idea. It effectively creates duplicates for those scans. If you really want to list all files, we should find a way to squeeze ONE recording into ONE row, and not spread it out across several rows just because several files belong to this recording.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It effectively creates duplicates for those scans.
not really duplicates (the extension is different) ... one could argue that it shows that several files belong to that recording. After all, we document that each file belonging to a recording gets an entry in the table.
Also, we are already doing this for split files, and for qMRI file collections
I think it'd be the consistent thing to do ... and I think it's kind of implicit in the present rules already: just not obviously documented (which I'd like to fix).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really duplicates (the extension is different)
Yes but that is exactly my point :) the purpose of scans.tsv is, and I state the spec:
The purpose of this file is to describe timing and other properties of each imaging acquisition sequence (each run file) within one session. Each neural recording file SHOULD be described by exactly one row.
So the intention was, originally, to have precisely one row per acquisition run, from where you could extract all the relevant metadata for that specific run, such as recording time and yes, also the name of file used to store the data.
Later on we extended the standard, and I believe I contributed or at least +1'd this change as well:
Some recordings consist of multiple parts, that span several files, for example through echo-, part-, or split- entities. Such recordings MUST be documented with one row per file.
And I believe this was a mistake. As users are free to add additional columns to the file, we end up with duplicated metadata in cases of split, ... files.
If the focus and purpose of this file really has shifted that much since its initial inception, it should be renamed to files.tsv, or an entirely new file should be introduced -- instead of squeezing more and more new stuff into a thing that was never designed to store that kind of info.
I know there's basically no chance to make this happen, but I wanted to voice that I'm very unhappy with the direction some things are developing. 😓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.... but since things are a total mess already anyway ... yeah, it's probably more consistent to add those additional files than just omitting them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know there's basically no chance to make this happen, but I wanted to voice that I'm very unhappy with the direction some things are developing. sweat
knowing your opinions to some extent I don't think this statement does the current developments justice ... the present issue is more "cleaning up" / "documenting" rather than developing anyhow. But maybe we should discuss that in person some other time :-)
8d84248
to
454fc7b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robertoostenveld how do you write entries to scans.tsv
in data2bids in fieldtrip?
In this PR I want to clarify that each of the three files should be written (eeg, vhdr, vmrk), because that is consistent with what we do for FIF "split files" (write all of them), or entity-linked file collections (write all of them).
I think this is already implicit in the rules, but not obvious and I'd like to improve that with this PR.
Codecov Report
@@ Coverage Diff @@
## master #1178 +/- ##
=======================================
Coverage 88.33% 88.33%
=======================================
Files 6 6
Lines 1020 1020
=======================================
Hits 901 901
Misses 119 119 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
src/04-modality-specific-files/04-intracranial-electroencephalography.md
Outdated
Show resolved
Hide resolved
I write one row per single "recording" (aka run), also if that "recording" happens to be represented over multiple files. For BrainVision I use point to the In fieldtrip we refer to such a conceptual recording construct as the "dataset" and we use https://github.com/fieldtrip/fieldtrip/blob/master/private/dataset2files.m to map between the dataset (the construct) and the actual files in the filesystem. But I honestly don't think that all 3 files (for BrainVision), 2 files (for EEGLAB) or ~15 files (for CTF) or ~10-100 files (for MEF3) should be listed in the scans file. Just consider: if you were to use BIDS with old ANALYZE data, would you also specify both the [edit: only after typing this reply, I read the whole thread. I now realize that I repeated some things that were already mentioned earlier. Still, I am in favor of one row per run in the scans.tsv, and to use a string (not a list, or a tuple, or a object, or a dict) consistent with the strings that we now specify in the different analysis software packages. E.g. in BrainVision analyzer you click only on the vhdr file to open the recording, not on the triplet.] |
How do you proceed for split FIF files that use the What you do in data2bids and what we do in mne-bids corresponds to the first commit in this PR (dba9dd8), which I have meanwhile changed based on the fact that:
But perhaps me drawing on entity-linked file collections as another argument to write each file in scans.tsv was misguided, as these are in truth separate (but related) recordings -- unlike the EEGLAB, BrainVision, and directory based (mef3, ctf, ...) file formats (is that correct @tsalo? your initial comment triggered this whole train of thought). I am happy to reconsider again, but it bothers me that for split files we write each file 😕 https://github.com/sccn/bids-matlab-tools does not seem to write (the optional) |
Technically some entity-linked file collections come from the same recording (e.g., complex-valued data broken into magnitude and phase |
I have not encountered/considered (i.e., thought about) that yet in the context of FT data2bids. Let me read up on that... |
Following @tsalo in #1178 (comment), the split FIF files contain the same type of data, at consecutive time windows that all are part of the same run (same acq params, same task, no deliberate subject movement in between). That is different from the eeg/vhdr/vmrk triplet which contain data/header/markers, all from the same experimental time window. Imagine that we would represent BOLD data as a series of 3D nii files (like in SPM), rather than one 4D nii file. Would you in that case represent that in a single row of the scans.tsv, or in different rows? I feel inclined to accept those ending up on different rows (as they are explicitly split over time), just like split FIF files. The analysis software in that case needs to be smart enough to sort them and glue the data together in memory, but - albeit less practical - could also analyze one at a time (like 2x ft_preprocessing followed by ft_appenddata followed by the rest of the analysis). |
In https://bids-specification.readthedocs.io/en/latest/99-appendices/14-glossary.html#split-entities it is written "Each of these files has an internal pointer to the next file.". I am no fif expert. Is that also the case if the split files are written by the neuromag software, e.g. during acquisition or after maxfilter? Or is that only for MNE-Python written files? edit: FieldTrip does not make use of this internal pointer in the fif file, it expects the user to specify the series and glue them together, either by specifying cfg.dataset = {fif1, fif2, etc.} during |
…many rows as files
Update src/04-modality-specific-files/04-intracranial-electroencephalography.md
67bf58b
to
57914fa
Compare
I think that's how the files are written by the neuromag software. Is that correct @larsoner? |
Thanks for the discussion so far @hoechenberger @robertoostenveld @tsalo I have considered your feedback and made a new attempt to document the situation. I'd be happy for another round of reviews. |
I looked at the revised text, it now looks good to me. |
To my knowledge, yes, this is the case. It sometimes leads to problems when users manually rename the files after retrieving them from the acquisition device / computer, as the pointer(s) still contain(s) the old filenames. |
@hoechenberger would you be able to tell me which fif tag it is, and possibly point to the mne-matlab code that deals with it (if any)? As far as I am aware, the mne-matlab reader does not deal with a tag like this and requires the user (and the FieldTrip code) to deal with the files one by one. edit: I searched myself, from open.py I guess it is |
@larsoner may be able to help; he's much more competent than I on that front :) |
I believe it's actually but never used. But it's used in MNE-CPP when writing split files: |
Thanks. I now see that teh files can be stitched together using FIFF_REF_FILE_NAME which together with FIFF_REF_ROLE (which can have the value FIFFV_ROLE_NEXT_FILE and FIFFV_ROLE_PREV_FILE) allows to construct the linked-list. |
@hoechenberger @robertoostenveld if you are both happy with the present changes, please don't forget to approve the PR for formality reasons :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with the suggested changes
Do we have a validator pr to go with the new requirements? |
not yet, but we could add a warning for example when a BrainVision file is detected and anything BUT As for the file collections and split fif files, the checks are appropriately in place, if I remember correctly. |
Thanks @sappelhoff!! |
came up in:
I am adding a general recommendation that if file formats consist of several files, that file which would also be passed to I/O software should be listed in scans.tsv.