feat(schema): Add rule to check files listed in scans.tsv exist (#1881)

* feat(schema): Update README with detail on exists(), add file-relative paths
* doc: Add return types to count() and index()
* feat(schema): Add check for the filename column of scans.tsv
bids-standard · Aug 8, 2024 · cb00fd6
1 parent 1235e5b

Showing 2 changed files with 42 additions and 14 deletions.
diff --git a/src/schema/README.md b/src/schema/README.md
@@ -259,20 +259,36 @@ The following operators should be defined by an interpreter:
 
 The following functions should be defined by an interpreter:
 
-| Function                                        | Definition                                                                                                                                                 | Example                                                | Note                                                                           |
-| ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------ |
-| `count(arg: array, val: any)`                   | Number of elements in an array equal to `val`                                                                                                              | `count(columns.type, "EEG")`                           | The number of times "EEG" appears in the column "type" of the current TSV file |
-| `exists(arg: str \| array, rule: str) -> int`   | Count of files in an array that exist in the dataset. String is array with length 1. Rules include `"bids-uri"`, `"dataset"`, `"subject"` and `"stimuli"`. | `exists(sidecar.IntendedFor, "subject")`               | True if all files in `IntendedFor` exist, relative to the subject directory.   |
-| `index(arg: array, val: any)`                   | Index of first element in an array equal to `val`, `null` if not found                                                                                     | `index(["i", "j", "k"], axis)`                         | The number, from 0-2 corresponding to the string `axis`                        |
-| `intersects(a: array, b: array) -> bool`        | `true` if arguments contain any shared elements                                                                                                            | `intersects(dataset.modalities, ["pet", "mri"])`       | True if either PET or MRI data is found in dataset                             |
-| `allequal(a: array, b: array) -> bool`          | `true` if arrays have the same length and paired elements are equal                                                                                        | `intersects(dataset.modalities, ["pet", "mri"])`       | True if either PET or MRI data is found in dataset                             |
-| `length(arg: array) -> int`                     | Number of elements in an array                                                                                                                             | `length(columns.onset) > 0`                            | True if there is at least one value in the onset column                        |
-| `match(arg: str, pattern: str) -> bool`         | `true` if `arg` matches the regular expression `pattern` (anywhere in string)                                                                              | `match(extension, ".gz$")`                             | True if the file extension ends with `.gz`                                     |
-| `max(arg: array) -> number`                     | The largest non-`n/a` value in an array                                                                                                                    | `max(columns.onset)`                                   | The time of the last onset in an events.tsv file                               |
-| `min(arg: array) -> number`                     | The smallest non-`n/a` value in an array                                                                                                                   | `min(sidecar.SliceTiming) == 0`                        | A check that the onset of the first slice is 0s                                |
-| `sorted(arg: array, method: str) -> array`      | The sorted values of the input array; defaults to type-determined sort. If method is "lexical", or "numeric" use lexical or numeric sort.                  | `sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming` | True if `sidecar.VolumeTiming` is sorted                                       |
-| `substr(arg: str, start: int, end: int) -> str` | The portion of the input string spanning from start position to end position                                                                               | `substr(path, 0, length(path) - 3)`                    | `path` with the last three characters dropped                                  |
-| `type(arg: Any) -> str`                         | The name of the type, including `"array"`, `"object"`, `"null"`                                                                                            | `type(datatypes)`                                      | Returns `"array"`                                                              |
+| Function                                        | Definition                                                                                                                                | Example                                                | Note                                                                           |
+| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------ |
+| `count(arg: array, val: any) -> int`            | Number of elements in an array equal to `val`                                                                                             | `count(columns.type, "EEG")`                           | The number of times "EEG" appears in the column "type" of the current TSV file |
+| `exists(arg: str \| array, rule: str) -> int`   | Count of files in an array that exist in the dataset. String is array with length 1. See following section for the meanings of rules.     | `exists(sidecar.IntendedFor, "subject")`               | True if all files in `IntendedFor` exist, relative to the subject directory.   |
+| `index(arg: array, val: any) -> int`            | Index of first element in an array equal to `val`, `null` if not found                                                                    | `index(["i", "j", "k"], axis)`                         | The number, from 0-2 corresponding to the string `axis`                        |
+| `intersects(a: array, b: array) -> bool`        | `true` if arguments contain any shared elements                                                                                           | `intersects(dataset.modalities, ["pet", "mri"])`       | True if either PET or MRI data is found in dataset                             |
+| `allequal(a: array, b: array) -> bool`          | `true` if arrays have the same length and paired elements are equal                                                                       | `intersects(dataset.modalities, ["pet", "mri"])`       | True if either PET or MRI data is found in dataset                             |
+| `length(arg: array) -> int`                     | Number of elements in an array                                                                                                            | `length(columns.onset) > 0`                            | True if there is at least one value in the onset column                        |
+| `match(arg: str, pattern: str) -> bool`         | `true` if `arg` matches the regular expression `pattern` (anywhere in string)                                                             | `match(extension, ".gz$")`                             | True if the file extension ends with `.gz`                                     |
+| `max(arg: array) -> number`                     | The largest non-`n/a` value in an array                                                                                                   | `max(columns.onset)`                                   | The time of the last onset in an events.tsv file                               |
+| `min(arg: array) -> number`                     | The smallest non-`n/a` value in an array                                                                                                  | `min(sidecar.SliceTiming) == 0`                        | A check that the onset of the first slice is 0s                                |
+| `sorted(arg: array, method: str) -> array`      | The sorted values of the input array; defaults to type-determined sort. If method is "lexical", or "numeric" use lexical or numeric sort. | `sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming` | True if `sidecar.VolumeTiming` is sorted                                       |
+| `substr(arg: str, start: int, end: int) -> str` | The portion of the input string spanning from start position to end position                                                              | `substr(path, 0, length(path) - 3)`                    | `path` with the last three characters dropped                                  |
+| `type(arg: Any) -> str`                         | The name of the type, including `"array"`, `"object"`, `"null"`                                                                           | `type(datatypes)`                                      | Returns `"array"`                                                              |
+
+#### The `exists()` function
+
+In various places, BIDS datasets may declare links between files.
+In order to validate these links,
+the `exists()` function returns a count of files that can be found within the dataset.
+To accommodate the various ways of declaring these links,
+the following rules are defined:
+
+| `rule`       | Definition                                                                                                | Example                                                                               |
+| ------------ | --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
+| `"dataset"`  | A path relative to the root of the dataset.                                                               | `exists('participants.tsv', 'dataset')`                                               |
+| `"subject"`  | A path relative to the current subject directory.                                                         | `exists('ses-1/anat/sub-01_ses-1_T1w.nii.gz', 'subject')`                             |
+| `"stimuli"`  | A path relative to the `/stimuli` directory.                                                              | For `events.tsv`: `exists(columns.stim_file, 'stimuli') == length(columns.stim_file)` |
+| `"file"`     | A path relative to the directory containing the current file.                                             | For `scans.tsv`: `exists(columns.filename, 'file') == length(columns.filename)`       |
+| `"bids-uri"` | A URI of the form `bids:<dataset>:<relative-path>`. If `<dataset>` is empty, the current dataset is used. | `exists('bids::participants.tsv', 'bids-uri')`                                        |
 
 #### The special value `null`
 

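The five `exists()` rules documented in the README change above can be illustrated with a minimal Python sketch. This is not the validator's implementation: the `resolve` helper, the `dataset_root` and `current_file` parameters, and the way the subject directory is found (the first path component under the root) are all assumptions made for illustration. Named datasets in BIDS URIs are deliberately left unhandled.

```python
from pathlib import Path


def resolve(link: str, rule: str, dataset_root: Path, current_file: Path) -> Path:
    """Map a declared link to a filesystem path under one rule (hypothetical helper)."""
    if rule == "dataset":
        return dataset_root / link
    if rule == "subject":
        # Assumption: the subject directory is the first component below the root.
        subject_dir = current_file.relative_to(dataset_root).parts[0]
        if not subject_dir.startswith("sub-"):
            raise ValueError(f"{current_file} is not inside a subject directory")
        return dataset_root / subject_dir / link
    if rule == "stimuli":
        return dataset_root / "stimuli" / link
    if rule == "file":
        return current_file.parent / link
    if rule == "bids-uri":
        # Expected form: bids:<dataset>:<relative-path>
        scheme, dataset_name, relative_path = link.split(":", 2)
        if scheme != "bids":
            raise ValueError(f"not a BIDS URI: {link}")
        if dataset_name:
            # Resolving a named dataset is out of scope for this sketch.
            raise NotImplementedError("named datasets are not handled here")
        return dataset_root / relative_path
    raise ValueError(f"unknown rule: {rule}")


def exists(arg, rule: str, dataset_root: Path, current_file: Path) -> int:
    """Count the links that resolve to an existing path; a string is a one-element array."""
    links = [arg] if isinstance(arg, str) else list(arg)
    return sum(resolve(link, rule, dataset_root, current_file).exists() for link in links)
```

Because the function returns a count rather than a boolean, a rule that requires every link to resolve compares the result against the number of links, as the `stimuli` and `file` examples in the table do.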
diff --git a/src/schema/rules/checks/dataset.yaml b/src/schema/rules/checks/dataset.yaml
@@ -91,3 +91,15 @@ SingleSourceCitationFields:
     - '!("HowToAcknowledge" in dataset.dataset_description)'
     - '!("License" in dataset.dataset_description)'
     - '!("ReferencesAndLinks" in dataset.dataset_description)'
+
+ScansTSVScans:
+  issue:
+    code: SCANS_FILENAME_NOT_MATCH_DATASET
+    level: error
+    message: |
+      Filenames in scans.tsv file do not match what is present in the BIDS dataset.
+  selectors:
+    - suffix == 'scans'
+    - extension == '.tsv'
+  checks:
+    - exists(columns.filename, "file") == length(columns.filename)
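
The `ScansTSVScans` check above can be approximated outside the schema expression language. The following Python sketch assumes a tab-separated file with a `filename` column and resolves each entry relative to the directory containing the `scans.tsv` file, matching the `"file"` rule. The function name `missing_scans` is hypothetical, and the sketch raises `KeyError` if the column is absent.

```python
import csv
from pathlib import Path


def missing_scans(scans_tsv: Path) -> list[str]:
    """Return the `filename` entries that do not exist next to the scans file."""
    with scans_tsv.open(newline="", encoding="utf-8") as handle:
        filenames = [row["filename"] for row in csv.DictReader(handle, delimiter="\t")]
    # "file" rule: paths are relative to the directory containing scans.tsv.
    return [name for name in filenames if not (scans_tsv.parent / name).exists()]
```

An empty result corresponds to `exists(columns.filename, "file") == length(columns.filename)` holding. A non-empty result corresponds to the `SCANS_FILENAME_NOT_MATCH_DATASET` error and, unlike the count comparison, also names the offending entries.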