feat(schema): Add rule to check files listed in scans.tsv exist (#1881)
* feat(schema): Update README with detail on exists(), add file-relative paths

* doc: Add return types to count() and index()

* feat(schema): Add check for the filename column of scans.tsv
effigies authored Aug 8, 2024
1 parent 1235e5b commit cb00fd6
Showing 2 changed files with 42 additions and 14 deletions.
44 changes: 30 additions & 14 deletions src/schema/README.md
@@ -259,20 +259,36 @@ The following operators should be defined by an interpreter:

The following functions should be defined by an interpreter:

| Function | Definition | Example | Note |
| ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------ |
| `count(arg: array, val: any)` | Number of elements in an array equal to `val` | `count(columns.type, "EEG")` | The number of times "EEG" appears in the column "type" of the current TSV file |
| `exists(arg: str \| array, rule: str) -> int` | Count of files in an array that exist in the dataset. String is array with length 1. Rules include `"bids-uri"`, `"dataset"`, `"subject"` and `"stimuli"`. | `exists(sidecar.IntendedFor, "subject")` | True if all files in `IntendedFor` exist, relative to the subject directory. |
| `index(arg: array, val: any)` | Index of first element in an array equal to `val`, `null` if not found | `index(["i", "j", "k"], axis)` | The number, from 0-2 corresponding to the string `axis` |
| `intersects(a: array, b: array) -> bool` | `true` if arguments contain any shared elements | `intersects(dataset.modalities, ["pet", "mri"])` | True if either PET or MRI data is found in dataset |
| `allequal(a: array, b: array) -> bool` | `true` if arrays have the same length and paired elements are equal | `intersects(dataset.modalities, ["pet", "mri"])` | True if either PET or MRI data is found in dataset |
| `length(arg: array) -> int` | Number of elements in an array | `length(columns.onset) > 0` | True if there is at least one value in the onset column |
| `match(arg: str, pattern: str) -> bool` | `true` if `arg` matches the regular expression `pattern` (anywhere in string) | `match(extension, ".gz$")` | True if the file extension ends with `.gz` |
| `max(arg: array) -> number` | The largest non-`n/a` value in an array | `max(columns.onset)` | The time of the last onset in an events.tsv file |
| `min(arg: array) -> number` | The smallest non-`n/a` value in an array | `min(sidecar.SliceTiming) == 0` | A check that the onset of the first slice is 0s |
| `sorted(arg: array, method: str) -> array` | The sorted values of the input array; defaults to type-determined sort. If method is "lexical", or "numeric" use lexical or numeric sort. | `sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming` | True if `sidecar.VolumeTiming` is sorted |
| `substr(arg: str, start: int, end: int) -> str` | The portion of the input string spanning from start position to end position | `substr(path, 0, length(path) - 3)` | `path` with the last three characters dropped |
| `type(arg: Any) -> str` | The name of the type, including `"array"`, `"object"`, `"null"` | `type(datatypes)` | Returns `"array"` |
| Function | Definition | Example | Note |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------ |
| `count(arg: array, val: any) -> int` | Number of elements in an array equal to `val` | `count(columns.type, "EEG")` | The number of times "EEG" appears in the column "type" of the current TSV file |
| `exists(arg: str \| array, rule: str) -> int` | Count of files in an array that exist in the dataset. String is array with length 1. See following section for the meanings of rules. | `exists(sidecar.IntendedFor, "subject")` | The number of files in `IntendedFor` that exist, relative to the subject directory. |
| `index(arg: array, val: any) -> int` | Index of first element in an array equal to `val`, `null` if not found | `index(["i", "j", "k"], axis)` | The number, from 0-2 corresponding to the string `axis` |
| `intersects(a: array, b: array) -> bool` | `true` if arguments contain any shared elements | `intersects(dataset.modalities, ["pet", "mri"])` | True if either PET or MRI data is found in dataset |
| `allequal(a: array, b: array) -> bool` | `true` if arrays have the same length and paired elements are equal | `allequal(sorted(sidecar.VolumeTiming), sidecar.VolumeTiming)` | True if `sidecar.VolumeTiming` is sorted |
| `length(arg: array) -> int` | Number of elements in an array | `length(columns.onset) > 0` | True if there is at least one value in the onset column |
| `match(arg: str, pattern: str) -> bool` | `true` if `arg` matches the regular expression `pattern` (anywhere in string) | `match(extension, ".gz$")` | True if the file extension ends with `.gz` |
| `max(arg: array) -> number` | The largest non-`n/a` value in an array | `max(columns.onset)` | The time of the last onset in an events.tsv file |
| `min(arg: array) -> number` | The smallest non-`n/a` value in an array | `min(sidecar.SliceTiming) == 0` | A check that the onset of the first slice is 0s |
| `sorted(arg: array, method: str) -> array` | The sorted values of the input array; defaults to type-determined sort. If method is "lexical", or "numeric" use lexical or numeric sort. | `sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming` | True if `sidecar.VolumeTiming` is sorted |
| `substr(arg: str, start: int, end: int) -> str` | The portion of the input string spanning from start position to end position | `substr(path, 0, length(path) - 3)` | `path` with the last three characters dropped |
| `type(arg: Any) -> str` | The name of the type, including `"array"`, `"object"`, `"null"` | `type(datatypes)` | Returns `"array"` |
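
As a rough illustration of the definitions in the table, a few of these functions could be sketched in Python as follows. This is only one possible reading of the table, not the reference interpreter; the Python signatures are assumptions chosen to mirror the function names above.

```python
import re


def count(arg, val):
    """Number of elements in `arg` equal to `val`."""
    return sum(1 for item in arg if item == val)


def index(arg, val):
    """Index of the first element equal to `val`, or None (the schema's null)."""
    for position, item in enumerate(arg):
        if item == val:
            return position
    return None


def intersects(a, b):
    """True if the two arrays share at least one element."""
    return any(item in b for item in a)


def allequal(a, b):
    """True if the arrays have the same length and paired elements are equal."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def match(arg, pattern):
    """True if `pattern` matches anywhere in `arg`, hence re.search, not re.match."""
    return re.search(pattern, arg) is not None


def substr(arg, start, end):
    """Portion of `arg` from `start` up to, but not including, `end`."""
    return arg[start:end]
```

Returning `None` from `index()` is an assumption about how an interpreter written in Python might represent the schema's `null`.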

#### The `exists()` function

In various places, BIDS datasets may declare links between files.
In order to validate these links,
the `exists()` function returns a count of files that can be found within the dataset.
To accommodate the various ways of declaring these links,
the following rules are defined:

| `rule` | Definition | Example |
| ------------ | --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| `"dataset"` | A path relative to the root of the dataset. | `exists('participants.tsv', 'dataset')` |
| `"subject"` | A path relative to the current subject directory. | `exists('ses-1/anat/sub-01_ses-1_T1w.nii.gz', 'subject')` |
| `"stimuli"` | A path relative to the `/stimuli` directory. | For `events.tsv`: `exists(columns.stim_file, 'stimuli') == length(columns.stim_file)` |
| `"file"` | A path relative to the directory containing the current file. | For `scans.tsv`: `exists(columns.filename, 'file') == length(columns.filename)` |
| `"bids-uri"` | A URI of the form `bids:<dataset>:<relative-path>`. If `<dataset>` is empty, the current dataset is used. | `exists('bids::participants.tsv', 'bids-uri')` |
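
The rules above can be sketched in Python as a path-resolution step followed by a membership test. This is a minimal sketch under stated assumptions, not the validator's implementation: the dataset is modelled as a set of POSIX-style paths relative to the dataset root, the current file is assumed to live under its `sub-<label>` directory when the `"subject"` rule is used, and BIDS URIs that name another dataset are simply counted as not found. The names `resolve`, `dataset_files` and `current_file` are hypothetical.

```python
from pathlib import PurePosixPath


def resolve(path, rule, current_file):
    """Map `path` to a dataset-relative path string, or None if unresolvable."""
    current = PurePosixPath(current_file)
    if rule == "dataset":
        base = PurePosixPath("")
    elif rule == "subject":
        # Assumption: the first path component is the subject directory.
        base = PurePosixPath(current.parts[0])
    elif rule == "stimuli":
        base = PurePosixPath("stimuli")
    elif rule == "file":
        base = current.parent
    elif rule == "bids-uri":
        parts = path.split(":", 2)
        # Only `bids::<relative-path>` (current dataset) is handled in this sketch.
        if len(parts) != 3 or parts[0] != "bids" or parts[1]:
            return None
        base, path = PurePosixPath(""), parts[2]
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    return str(base / path)


def exists(arg, rule, dataset_files, current_file):
    """Count how many of the given paths are present in `dataset_files`."""
    paths = [arg] if isinstance(arg, str) else list(arg)
    count = 0
    for path in paths:
        resolved = resolve(path, rule, current_file)
        if resolved is not None and resolved in dataset_files:
            count += 1
    return count
```

Because `exists()` returns a count rather than a boolean, a check that every listed file is present compares the result with the length of the input, as the `events.tsv` and `scans.tsv` examples in the table do.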

#### The special value `null`

12 changes: 12 additions & 0 deletions src/schema/rules/checks/dataset.yaml
@@ -91,3 +91,15 @@ SingleSourceCitationFields:
- '!("HowToAcknowledge" in dataset.dataset_description)'
- '!("License" in dataset.dataset_description)'
- '!("ReferencesAndLinks" in dataset.dataset_description)'

ScansTSVScans:
  issue:
    code: SCANS_FILENAME_NOT_MATCH_DATASET
    level: error
    message: |
      Filenames in scans.tsv file do not match what is present in the BIDS dataset.
  selectors:
    - suffix == 'scans'
    - extension == '.tsv'
  checks:
    - exists(columns.filename, "file") == length(columns.filename)
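
To make the check concrete, the following Python sketch evaluates the same condition for a single `scans.tsv` file. It is an illustration under assumptions, not the validator's code: the dataset is modelled as a set of paths relative to the dataset root, the TSV is assumed to be well formed, and the helper names `read_columns` and `scans_filenames_exist` are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath


def read_columns(tsv_text):
    """Parse TSV text into a mapping of column name -> list of values."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns


def scans_filenames_exist(scans_path, tsv_text, dataset_files):
    """Mirror `exists(columns.filename, "file") == length(columns.filename)`."""
    filenames = read_columns(tsv_text).get("filename", [])
    # The "file" rule: paths are relative to the directory holding scans.tsv.
    base = PurePosixPath(scans_path).parent
    found = sum(1 for name in filenames if str(base / name) in dataset_files)
    return found == len(filenames)
```

With this reading, a single entry in the `filename` column that cannot be found next to the `scans.tsv` file makes the comparison false, which is the condition under which `SCANS_FILENAME_NOT_MATCH_DATASET` would be raised.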
