Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(schema): Add rule to check files listed in scans.tsv exist #1881

Merged
merged 3 commits into from
Aug 8, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 30 additions & 14 deletions src/schema/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -259,20 +259,36 @@ The following operators should be defined by an interpreter:

The following functions should be defined by an interpreter:

| Function | Definition | Example | Note |
| ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------ |
| `count(arg: array, val: any)` | Number of elements in an array equal to `val` | `count(columns.type, "EEG")` | The number of times "EEG" appears in the column "type" of the current TSV file |
| `exists(arg: str \| array, rule: str) -> int` | Count of files in an array that exist in the dataset. String is array with length 1. Rules include `"bids-uri"`, `"dataset"`, `"subject"` and `"stimuli"`. | `exists(sidecar.IntendedFor, "subject")` | True if all files in `IntendedFor` exist, relative to the subject directory. |
| `index(arg: array, val: any)` | Index of first element in an array equal to `val`, `null` if not found | `index(["i", "j", "k"], axis)` | The number, from 0-2 corresponding to the string `axis` |
| `intersects(a: array, b: array) -> bool` | `true` if arguments contain any shared elements | `intersects(dataset.modalities, ["pet", "mri"])` | True if either PET or MRI data is found in dataset |
| `allequal(a: array, b: array) -> bool` | `true` if arrays have the same length and paired elements are equal | `intersects(dataset.modalities, ["pet", "mri"])` | True if either PET or MRI data is found in dataset |
| `length(arg: array) -> int` | Number of elements in an array | `length(columns.onset) > 0` | True if there is at least one value in the onset column |
| `match(arg: str, pattern: str) -> bool` | `true` if `arg` matches the regular expression `pattern` (anywhere in string) | `match(extension, ".gz$")` | True if the file extension ends with `.gz` |
| `max(arg: array) -> number` | The largest non-`n/a` value in an array | `max(columns.onset)` | The time of the last onset in an events.tsv file |
| `min(arg: array) -> number` | The smallest non-`n/a` value in an array | `min(sidecar.SliceTiming) == 0` | A check that the onset of the first slice is 0s |
| `sorted(arg: array, method: str) -> array` | The sorted values of the input array; defaults to type-determined sort. If method is "lexical", or "numeric" use lexical or numeric sort. | `sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming` | True if `sidecar.VolumeTiming` is sorted |
| `substr(arg: str, start: int, end: int) -> str` | The portion of the input string spanning from start position to end position | `substr(path, 0, length(path) - 3)` | `path` with the last three characters dropped |
| `type(arg: Any) -> str` | The name of the type, including `"array"`, `"object"`, `"null"` | `type(datatypes)` | Returns `"array"` |
| Function | Definition | Example | Note |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------ |
| `count(arg: array, val: any) -> int` | Number of elements in an array equal to `val` | `count(columns.type, "EEG")` | The number of times "EEG" appears in the column "type" of the current TSV file |
| `exists(arg: str \| array, rule: str) -> int` | Count of files in an array that exist in the dataset. String is array with length 1. See following section for the meanings of rules. | `exists(sidecar.IntendedFor, "subject")` | True if all files in `IntendedFor` exist, relative to the subject directory. |
| `index(arg: array, val: any) -> int` | Index of first element in an array equal to `val`, `null` if not found | `index(["i", "j", "k"], axis)` | The number, from 0-2 corresponding to the string `axis` |
| `intersects(a: array, b: array) -> bool` | `true` if arguments contain any shared elements | `intersects(dataset.modalities, ["pet", "mri"])` | True if either PET or MRI data is found in dataset |
| `allequal(a: array, b: array) -> bool` | `true` if arrays have the same length and paired elements are equal | `intersects(dataset.modalities, ["pet", "mri"])` | True if either PET or MRI data is found in dataset |
| `length(arg: array) -> int` | Number of elements in an array | `length(columns.onset) > 0` | True if there is at least one value in the onset column |
| `match(arg: str, pattern: str) -> bool` | `true` if `arg` matches the regular expression `pattern` (anywhere in string) | `match(extension, ".gz$")` | True if the file extension ends with `.gz` |
| `max(arg: array) -> number` | The largest non-`n/a` value in an array | `max(columns.onset)` | The time of the last onset in an events.tsv file |
| `min(arg: array) -> number` | The smallest non-`n/a` value in an array | `min(sidecar.SliceTiming) == 0` | A check that the onset of the first slice is 0s |
| `sorted(arg: array, method: str) -> array` | The sorted values of the input array; defaults to type-determined sort. If method is "lexical", or "numeric" use lexical or numeric sort. | `sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming` | True if `sidecar.VolumeTiming` is sorted |
| `substr(arg: str, start: int, end: int) -> str` | The portion of the input string spanning from start position to end position | `substr(path, 0, length(path) - 3)` | `path` with the last three characters dropped |
| `type(arg: Any) -> str` | The name of the type, including `"array"`, `"object"`, `"null"` | `type(datatypes)` | Returns `"array"` |

#### The `exists()` function

In various places, BIDS datasets may declare links between files.
In order to validate these links,
the `exists()` function returns a count of files that can be found within the dataset.
To accommodate the various ways of declaring these links,
the following rules are defined:

| `rule` | Definition | Example |
| ------------ | --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| `"dataset"` | A path relative to the root of the dataset. | `exists('participants.tsv', 'dataset')` |
| `"subject"` | A path relative to the current subject directory. | `exists('ses-1/anat/sub-01_ses-1_T1w.nii.gz', 'subject')` |
| `"stimuli"` | A path relative to the `/stimuli` directory. | For `events.tsv`: `exists(columns.stim_file, 'stimuli') == length(columns.stim_file)` |
| `"file"` | A path relative to the directory containing the current file. | For `scans.tsv`: `exists(columns.filename, 'file') == length(columns.stim_file)` |
| `"bids-uri"` | A URI of the form `bids:<dataset>:<relative-path>`. If `<dataset>` is empty, the current dataset is used. | `exists('bids::participants.tsv', 'bids-uri')` |

#### The special value `null`

Expand Down
12 changes: 12 additions & 0 deletions src/schema/rules/checks/dataset.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -91,3 +91,15 @@ SingleSourceCitationFields:
- '!("HowToAcknowledge" in dataset.dataset_description)'
- '!("License" in dataset.dataset_description)'
- '!("ReferencesAndLinks" in dataset.dataset_description)'

ScansTSVScans:
issue:
code: SCANS_FILENAME_NOT_MATCH_DATASET
level: error
message: |
Filenames in scans.tsv file do not match what is present in the BIDS dataset.
selectors:
- suffix == 'scans'
- extension == '.tsv'
checks:
- exists(columns.filename, "file") == length(columns.filename)