Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] clarify which file to list in scans.tsv for file formats with multiple files #1178

Merged
merged 6 commits into from
Aug 27, 2022

Conversation

sappelhoff
Copy link
Member

came up in:

I am adding a general recommendation that if file formats consist of several files, that file which would also be passed to I/O software should be listed in scans.tsv.

Comment on lines 458 to 461
Recordings with multiple parts MUST be documented with one row per file.
It is RECOMMENDED that if the multiple parts of a recording stem solely from the file format,
that file is listed, which would also be passed to software functions for
reading (for example `.vhdr` for BrainVision and `.set` for EEGLAB).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the wording needs to be adjusted. In the previous sentence, file formats with multiple files are treated the same as file collections, in which each file MUST be documented separately in the scans.tsv. However, in the next sentence, it is RECOMMENDED to not do that for these file formats specifically.

I think that you need to distinguish between file collections and multi-file file formats from the beginning, so that it's clear that multi-file format files shouldn't be listed separately in the scans.tsv file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually wonder whether it's not better to be consistent and also document multi-file file formats with one row per file then. That is, BrainVision recordings would have 3 rows per recording in scans.tsv, and EEGLAB recordings that make use of their optional .fdt file get 2 rows per recording in scans.tsv.

It's less pretty than just listing the header file there (.vhdr for BrainVision, .set for EEGLAB), but not sure if it's worth introducing a special rule then 🤔 As far as I know we don't have other multi-file file formats (excluding those that are directories, .. here the whole directory gets listed)

What do you think @hoechenberger ?

Copy link
Collaborator

@hoechenberger hoechenberger Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My proposal would be to allow filename to be an array of strings (pity that it's scans.tsv and not scans.json!!), this would ensure we still have only one row per scan. Still – which file should we try to load? And what about split files?

-> Considering this, I'm leaning toward listing only the VHDR file for BV and SET file for EEGLAB.

But there shouldn't be a RECOMMENDED, it should be a MUST, and we shouldn't just propose, but enforce which file to list. I don't like all this "well yeah umm you could but you don't have do, do what you like" stuff in the standard, it makes software implementations such an unnecessary pain to develop!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still – which file should we try to load?

That is less of a problem than it seems because all data formats and their extensions are documented in BIDS and it's easy to include a logic like:

"if VMRK found, this is BrainVision, so look for VHDR, if it doesn't exist, raise error"

And what about split files?

Here we actually recommend that people write ALL split files, see:

So that further makes me think that listing all neural recordings in scans.tsv makes sense ... also for the two exceptions we have: EEGLAB (set+fdt) and BrainVision (vhdr, vmrk, eeg) files 🤔

But there shouldn't be a RECOMMENDED, it should be a MUST, and we shouldn't just propose, but enforce which file to list.

right, but this is the age-old discussion about backward compatibility and the balance we are treading ... I think things will get better in the long run with the move to the BIDS schema, which will eventually allow validation of BIDS datasets against particular BIDS schema versions (something we cannot currently do with bids-validator) --> and I think that may make us all more comfortable with breaking changes and including rules that are more strict.

(as you know I do in principle agree with you that we need more MUST and less SHOULD)

TL;DR: So I think I'll revamp this PR so that all BrainVision files and all EEGLAB files SHOULD be listed in scans.tsv

Copy link
Collaborator

@hoechenberger hoechenberger Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: So I think I'll revamp this PR so that all BrainVision files and all EEGLAB files SHOULD be listed in scans.tsv

I really don't think this is a good idea. It effectively creates duplicates for those scans. If you really want to list all files, we should find a way to squeeze ONE recording into ONE row, and not spread it out across several rows just because several files belong to this recording.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It effectively creates duplicates for those scans.

not really duplicates (the extension is different) ... one could argue that it shows that several files belong to that recording. After all, we document that each file belonging to a recording gets an entry in the table.

Also, we are already doing this for split files, and for qMRI file collections

I think it'd be the consistent thing to do ... and I think it's kind of implicit in the present rules already: just not obviously documented (which I'd like to fix).

Copy link
Collaborator

@hoechenberger hoechenberger Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really duplicates (the extension is different)

Yes but that is exactly my point :) the purpose of scans.tsv is, and I state the spec:

The purpose of this file is to describe timing and other properties of each imaging acquisition sequence (each run file) within one session. Each neural recording file SHOULD be described by exactly one row.

So the intention was, originally, to have precisely one row per acquisition run, from where you could extract all the relevant metadata for that specific run, such as recording time and yes, also the name of file used to store the data.

Later on we extended the standard, and I believe I contributed or at least +1'd this change as well:

Some recordings consist of multiple parts, that span several files, for example through echo-, part-, or split- entities. Such recordings MUST be documented with one row per file.

And I believe this was a mistake. As users are free to add additional columns to the file, we end up with duplicated metadata in cases of split, ... files.

If the focus and purpose of this file really has shifted that much since its initial inception, it should be renamed to files.tsv, or an entirely new file should be introduced -- instead of squeezing more and more new stuff into a thing that was never designed to store that kind of info.

I know there's basically no chance to make this happen, but I wanted to voice that I'm very unhappy with the direction some things are developing. 😓

Copy link
Collaborator

@hoechenberger hoechenberger Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.... but since things are a total mess already anyway ... yeah, it's probably more consistent to add those additional files than just omitting them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know there's basically no chance to make this happen, but I wanted to voice that I'm very unhappy with the direction some things are developing. sweat

knowing your opinions to some extent I don't think this statement does the current developments justice ... the present issue is more "cleaning up" / "documenting" rather than developing anyhow. But maybe we should discuss that in person some other time :-)

Copy link
Member Author

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertoostenveld how do you write entries to scans.tsv in data2bids in fieldtrip?

In this PR I want to clarify that each of the three files should be written (eeg, vhdr, vmrk), because that is consistent with what we do for FIF "split files" (write all of them), or entity-linked file collections (write all of them).

I think this is already implicit in the rules, but not obvious and I'd like to improve that with this PR.

@codecov
Copy link

codecov bot commented Aug 25, 2022

Codecov Report

Merging #1178 (6298081) into master (7523fc3) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1178   +/-   ##
=======================================
  Coverage   88.33%   88.33%           
=======================================
  Files           6        6           
  Lines        1020     1020           
=======================================
  Hits          901      901           
  Misses        119      119           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@robertoostenveld
Copy link
Collaborator

robertoostenveld commented Aug 25, 2022

@robertoostenveld how do you write entries to scans.tsv in data2bids in fieldtrip?

I write one row per single "recording" (aka run), also if that "recording" happens to be represented over multiple files. For BrainVision I use point to the .vhdr file. For EEGLAB (which can comprise a .set plus a .fdt file) I point to the .set file. For CTF I point to the .ds directory, not to any of the files. Also for MEF3 iEEG data I would point to the directory, not to the files.

In fieldtrip we refer to such a conceptual recording construct as the "dataset" and we use https://github.com/fieldtrip/fieldtrip/blob/master/private/dataset2files.m to map between the dataset (the construct) and the actual files in the filesystem.

But I honestly don't think that all 3 files (for BrainVision), 2 files (for EEGLAB) or ~15 files (for CTF) or ~10-100 files (for MEF3) should be listed in the scans file. Just consider: if you were to use BIDS with old ANALYZE data, would you also specify both the .hdr and the .img file? Is there any processing software (i.e. machine readable) or human that would want to have the full list in scans.tsv? MNE Python also only expects the .vhdr filename and not the triplet, right?

[edit: only after typing this reply, I read the whole thread. I now realize that I repeated some things that were already mentioned earlier. Still, I am in favor of one row per run in the scans.tsv, and to use a string (not a list, or a tuple, or a object, or a dict) consistent with the strings that we now specify in the different analysis software packages. E.g. in BrainVision analyzer you click only on the vhdr file to open the recording, not on the triplet.]

@sappelhoff
Copy link
Member Author

I write one row per single "recording" (aka run), also if that "recording" happens to be represented over multiple files

How do you proceed for split FIF files that use the split entity? Because unfortunately (?) we have at some point decided that all split files should we written to scans.tsv (see: #1178 (comment))

What you do in data2bids and what we do in mne-bids corresponds to the first commit in this PR (dba9dd8), which I have meanwhile changed based on the fact that:

  1. the existing text reads like each file should be listed
  2. that's what we do for split files

But perhaps me drawing on entity-linked file collections as another argument to write each file in scans.tsv was misguided, as these are in truth separate (but related) recordings -- unlike the EEGLAB, BrainVision, and directory based (mef3, ctf, ...) file formats (is that correct @tsalo? your initial comment triggered this whole train of thought).

I am happy to reconsider again, but it bothers me that for split files we write each file 😕

https://github.com/sccn/bids-matlab-tools does not seem to write (the optional) scans.tsv files at all 🤔 can you confirm that @arnodelorme?

@tsalo
Copy link
Member

tsalo commented Aug 25, 2022

But perhaps me drawing on entity-linked file collections as another argument to write each file in scans.tsv was misguided, as these are in truth separate (but related) recordings -- unlike the EEGLAB, BrainVision, and directory based (mef3, ctf, ...) file formats (is that correct @tsalo? your initial comment triggered this whole train of thought).

Technically some entity-linked file collections come from the same recording (e.g., complex-valued data broken into magnitude and phase parts), while other collections can actually come from distinct scans (e.g., a lot of the file collections used to calculate quantitative MRI maps). However, I suppose it could be said that they all contain the same type of data (i.e., use the same extensions), but different actual data housed within the files. That seems like a meaningful difference from what you're talking about with multi-file formats.

@robertoostenveld
Copy link
Collaborator

How do you proceed for split FIF files that use the split entity? Because unfortunately (?) we have at some point decided that all split files should we written to scans.tsv

I have not encountered/considered (i.e., thought about) that yet in the context of FT data2bids. Let me read up on that...

@robertoostenveld
Copy link
Collaborator

robertoostenveld commented Aug 26, 2022

Following @tsalo in #1178 (comment), the split FIF files contain the same type of data, at consecutive time windows that all are part of the same run (same acq params, same task, no deliberate subject movement in between). That is different from the eeg/vhdr/vmrk triplet which contain data/header/markers, all from the same experimental time window.

Imagine that we would represent BOLD data as a series of 3D nii files (like in SPM), rather than one 4D nii file. Would you in that case represent that in a single row of the scans.tsv, or in different rows? I feel inclined to accept those ending up on different rows (as they are explicitly split over time), just like split FIF files. The analysis software in that case needs to be smart enough to sort them and glue the data together in memory, but - albeit less practical - could also analyze one at a time (like 2x ft_preprocessing followed by ft_appenddata followed by the rest of the analysis).

@robertoostenveld
Copy link
Collaborator

robertoostenveld commented Aug 26, 2022

In https://bids-specification.readthedocs.io/en/latest/99-appendices/14-glossary.html#split-entities it is written "Each of these files has an internal pointer to the next file.". I am no fif expert. Is that also the case if the split files are written by the neuromag software, e.g. during acquisition or after maxfilter? Or is that only for MNE-Python written files?

edit: FieldTrip does not make use of this internal pointer in the fif file, it expects the user to specify the series and glue them together, either by specifying cfg.dataset = {fif1, fif2, etc.} during ft_preprocessing, or by preprocessing each separately and then explicitly pasting them together with ft_appenddata.

@sappelhoff
Copy link
Member Author

"Each of these files has an internal pointer to the next file.". I am no fif expert. Is that also the case if the split files are written by the neuromag software, e.g. during acquisition or after maxfilter? Or is that only for MNE-Python written files?

I think that's how the files are written by the neuromag software. Is that correct @larsoner?

@sappelhoff
Copy link
Member Author

Thanks for the discussion so far @hoechenberger @robertoostenveld @tsalo I have considered your feedback and made a new attempt to document the situation. I'd be happy for another round of reviews.

@robertoostenveld
Copy link
Collaborator

I looked at the revised text, it now looks good to me.

Co-authored-by: Robert Oostenveld <r.oostenveld@gmail.com>
@hoechenberger
Copy link
Collaborator

hoechenberger commented Aug 26, 2022

@robertoostenveld

"Each of these files has an internal pointer to the next file.". I am no fif expert. Is that also the case if the split files are written by the neuromag software, e.g. during acquisition or after maxfilter?

To my knowledge, yes, this is the case. It sometimes leads to problems when users manually rename the files after retrieving them from the acquisition device / computer, as the pointer(s) still contain(s) the old filenames.

@robertoostenveld
Copy link
Collaborator

robertoostenveld commented Aug 26, 2022

@hoechenberger would you be able to tell me which fif tag it is, and possibly point to the mne-matlab code that deals with it (if any)? As far as I am aware, the mne-matlab reader does not deal with a tag like this and requires the user (and the FieldTrip code) to deal with the files one by one.

edit: I searched myself, from open.py I guess it is FIFF_REF_FILE_NAME, which is not used when reading fif files with the MATLAB implementation https://github.com/mne-tools/mne-matlab/search?q=ref_file_name, nor with the CPP implementation https://github.com/mne-tools/mne-cpp.

@hoechenberger
Copy link
Collaborator

@larsoner may be able to help; he's much more competent than I on that front :)

@hoechenberger
Copy link
Collaborator

@robertoostenveld

edit: I searched myself, from open.py I guess it is FIFF_REF_FILE_NAME, which is not used when reading fif files with the MATLAB implementation https://github.com/mne-tools/mne-matlab/search?q=ref_file_name, nor with the CPP implementation https://github.com/mne-tools/mne-cpp.

I believe it's actually FIFFV_ROLE_NEXT_FILE, which is being defined here for MNE-MATLAB
https://github.com/mne-tools/mne-matlab/blob/8fb7d04fbf1f2c15e40d14582aab6719486178f7/matlab/fiff_define_constants.m#L277-L278

but never used.

But it's used in MNE-CPP when writing split files:
https://github.com/mne-tools/mne-cpp/blob/fe015fdd225de7c080a1fad4749debcc54eb97f7/applications/mne_scan/plugins/writetofile/writetofile.cpp#L443

@robertoostenveld
Copy link
Collaborator

Thanks. I now see that teh files can be stitched together using FIFF_REF_FILE_NAME which together with FIFF_REF_ROLE (which can have the value FIFFV_ROLE_NEXT_FILE and FIFFV_ROLE_PREV_FILE) allows to construct the linked-list.
I already saw that it is used in writing with CPP, but it is not used in reading. So reading the data with mne-cpp requires the user looping over the filenames, just as with mne-matlab. That speaks in favor of having all split fif files represented in the scans.tsv, as now further clarified in this PR. Any software that does automatically read the sequence (like mne-python) will then have to take care in not reading the same split fif file multiple times.

@sappelhoff
Copy link
Member Author

@hoechenberger @robertoostenveld if you are both happy with the present changes, please don't forget to approve the PR for formality reasons :)

Copy link
Collaborator

@robertoostenveld robertoostenveld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with the suggested changes

@effigies
Copy link
Collaborator

Do we have a validator pr to go with the new requirements?

@sappelhoff
Copy link
Member Author

Do we have a validator pr to go with the new requirements?

not yet, but we could add a warning for example when a BrainVision file is detected and anything BUT .vhdr is listed in scans.tsv (same for EEGLAB). 🤔

As for the file collections and split fif files, the checks are appropriately in place, if I remember correctly.

@hoechenberger hoechenberger merged commit 594ef95 into bids-standard:master Aug 27, 2022
@hoechenberger
Copy link
Collaborator

Thanks @sappelhoff!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants