[ENH] Clearly define "entity" in common principles #947

Merged
merged 10 commits into from
Jun 8, 2022
171 changes: 107 additions & 64 deletions src/02-common-principles.md
@@ -13,7 +13,7 @@ misunderstanding we clarify them here.
purpose of a particular study. A dataset consists of data acquired from one
or more subjects, possibly from multiple sessions.

1. **Subject** - a person or animal participating in the study. Used
1. **Subject** - a person or animal participating in the study. Used
interchangeably with term **Participant**.

1. **Session** - a logical grouping of neuroimaging and behavioral data
@@ -34,10 +34,8 @@ misunderstanding we clarify them here.

1. **Sample** - a sample pertaining to a subject such as tissue, primary cell
or cell-free sample.
The `sample-<label>` key/value pair is used to distinguish between different
samples from the same subject.
The label MUST be unique per subject and is RECOMMENDED to be unique
throughout the dataset.
Sample labels MUST be unique within a subject and it is RECOMMENDED that
they be unique throughout the dataset.

1. **Data acquisition** - a continuous uninterrupted block of time during which
a brain scanning instrument was acquiring data according to particular
@@ -106,30 +104,68 @@ misunderstanding we clarify them here.
The modality may overlap with, but should not be confused with
the **data type**.

1. **`<index>`** - a nonnegative integer, possibly prefixed with arbitrary number of
0s for consistent indentation, for example, it is `01` in `run-01` following
`run-<index>` specification.

1. **`<label>`** - an alphanumeric value, possibly prefixed with arbitrary
number of 0s for consistent indentation, for example, it is `rest` in `task-rest`
following `task-<label>` specification. Note that labels MUST NOT collide when
casing is ignored (see [Case collision intolerance](#case-collision-intolerance)).

1. **`suffix`** - an alphanumeric value, located after the `key-value_` pairs (thus after
the final `_`), right before the **File extension**, for example, it is `eeg` in
`sub-05_task-matchingpennies_eeg.vhdr`.
1. **Suffix** - an alphanumeric string that forms part of a filename, located
after all [entities](#entities) and following a final `_`, right before the
**file extension**; for example, it is `eeg` in `sub-05_task-matchingpennies_eeg.vhdr`.

1. **File extension** - a portion of the filename after the left-most
period (`.`) preceded by any other alphanumeric. For example, `.gitignore` does
not have a file extension, but the file extension of `test.nii.gz` is `.nii.gz`.
Note that the left-most period is included in the file extension.

1. **DEPRECATED** - A "deprecated" entity or metadata field SHOULD NOT be used in the
1. **DEPRECATED** - A "deprecated" [entity](#entities) or metadata field SHOULD NOT be used in the
generation of new datasets.
It remains in the standard in order to preserve the interpretability of existing datasets.
Validating software SHOULD warn when deprecated practices are detected and provide a
suggestion for updating the dataset to preserve the curator's intent.
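
The **File extension** definition above can be illustrated with a short sketch. The function name `file_extension` is hypothetical and not part of BIDS; the sketch also assumes that "preceded by any other alphanumeric" means the period is immediately preceded by an alphanumeric character.

```python
import re


def file_extension(filename: str) -> str:
    """Return the extension, including its leading period, or "" if there is none.

    Hypothetical helper; assumes the period must directly follow an
    alphanumeric character to count as the start of the extension.
    """
    # Find the left-most period that is preceded by an alphanumeric character.
    match = re.search(r"(?<=[A-Za-z0-9])\.", filename)
    return filename[match.start():] if match else ""


print(file_extension("test.nii.gz"))
print(file_extension(".gitignore"))
```

Under this reading, `test.nii.gz` yields `.nii.gz`, while `.gitignore` yields an empty string because its only period has nothing before it.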

## Entities

An "entity" is an attribute that can be associated with a file, contributing
to the identification of that file as a component of its filename in the
form of a hyphen-separated key-value pair.

Each entity has the following attributes:

1. *Name*: A comprehensive name describing the context of information
to be provided via the entity.

1. *Key*: A short string, typically a compression of the entity name,
which uniquely identifies the entity when part of a filename.

1. *Value type*: The requisite form of the value that gets specified
alongside the key whenever the entity appears in a filename.
For each entity, the value is of one of two possible types:

1. *Index*: A non-negative integer, potentially zero-padded for
consistent width.

1. *Label*: An alphanumeric string.
Note that labels MUST NOT collide when casing is ignored
(see [Case collision intolerance](#case-collision-intolerance)).

The entity *format* is a string that prescribes how the entity appears within
any given filename.
For a hypothetical entity with key "`key`", the format can be either
"`key-<index>`" or "`key-<label>`", depending on the value type of that entity.

An entity *instance* is the specific manifestation of an entity within the
name of a specific file, based on the format of the entity but with a value
that provides identifying information to the particular file in whose name
it appears.

Depending on context, any one of the entity name, key, format, or a specific
entity instance, may be referred to as simply an "entity".

"Subject", "session", "sample", "task", and "run" from the list of definitions
above are all examples of entities.
The comprehensive list of supported entities is defined in
[Appendix IX](99-appendices/09-entities.md);
further, whether each is OPTIONAL, REQUIRED, or MUST NOT be provided for
various data files, as well as their relative ordering in a filename, are
defined in the Entity Table in
[Appendix IV](99-appendices/04-entity-table.md).
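
As a rough illustration of the terms above, the following sketch splits an entity instance into its key and value and checks whether a value has the form of an index. The function names are hypothetical, and the sketch assumes keys and values consist only of ASCII letters and digits.

```python
import re


def parse_entity_instance(instance: str) -> tuple:
    """Split an entity instance such as "run-01" into (key, value).

    Hypothetical helper; assumes alphanumeric keys and values.
    """
    match = re.fullmatch(r"([A-Za-z0-9]+)-([A-Za-z0-9]+)", instance)
    if match is None:
        raise ValueError(f"not a key-value entity instance: {instance!r}")
    return match.group(1), match.group(2)


def looks_like_index(value: str) -> bool:
    """True if the value is a non-negative integer, possibly zero-padded."""
    return value.isascii() and value.isdigit()


key, value = parse_entity_instance("run-01")
print(key, value, looks_like_index(value))
```

Note that the value type belongs to the entity, not to the value: the `01` in `sub-01` is a label even though it happens to consist of digits, so a check like `looks_like_index` can only reject values that cannot be an index.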

## Compulsory, optional, and additional data and metadata

The following standard describes a way of arranging data and writing down
@@ -148,51 +184,58 @@ The solutions will change from case to case and publicly available datasets will
be reviewed to include common data types in the future releases of the BIDS
specification.

## File name structure

A filename consists of a chain of *entities*, or key-value pairs, a *suffix* and an
*extension*.
Two prominent examples of entities are `subject` and `session`.

For a data file that was collected in a given `session` from a given
`subject`, the filename MUST begin with the string `sub-<label>_ses-<label>`.
If the `session` level is omitted in the directory structure, the filename MUST begin
with the string `sub-<label>`, without `ses-<label>`.

Note that `sub-<label>` corresponds to the `subject` entity because it has
the `sub-` "key" and `<label>` "value", where `<label>` would in a real data file
correspond to a unique identifier of that subject, such as `01`.
The same holds for the `session` entity with its `ses-` key and its `<label>`
value.

The extra session layer (at least one `/ses-<label>` subdirectory) SHOULD
be added for all subjects if at least one subject in the dataset has more than
one session.
If a `/ses-<label>` subdirectory is included as part of the directory hierarchy,
then the same [`ses-<label>`](./99-appendices/09-entities.md#ses)
key/value pair MUST also be included as part of the filenames themselves.
Acquisition time of session can
be defined in the [sessions file](03-modality-agnostic-files.md#sessions-file).

A chain of entities, followed by a suffix, connected by underscores (`_`)
produces a human readable filename, such as `sub-01_task-rest_eeg.edf`.
It is evident from the filename alone that the file contains resting state
data from subject `01`.
The suffix `eeg` and the extension `.edf` depend on the imaging modality and
the data format and indicate further details of the file's contents.

Entities within a filename MUST be unique.
For example, the following filename is not valid because it uses the `acq`
entity twice:
`sub-01_acq-laser_acq-uneven_electrodes.tsv`

In cases where entities duplicate metadata,
the presence of an entity should not be used as a replacement for
## Filesystem structure

Data for each subject are placed in sub-directories named "`sub-<label>`",
where string "`<label>`" is substituted with the unique identification
label of each subject.
Additional information on each participant MAY be provided in a
[participants file](03-modality-agnostic-files.md#participants-file)
in the root directory of the dataset.

If data for the subject were acquired across multiple sessions, then within
the subject directory reside sub-directories named "`ses-<label>`",
where string "`<label>`" is substituted with a unique identification
label for each session.
In datasets where at least one subject has more than one session, this
additional sub-directory layer SHOULD be added for all subjects in the dataset.
Additional information on each session MAY be provided in a
[sessions file](03-modality-agnostic-files.md#sessions-file)
within the subject directory.

Within the session sub-directory (or the subject sub-directory if no
session sub-directories are present) are sub-directories named according to
data type as defined above.
A data type directory SHOULD NOT be defined if there are no files to be placed
in that directory.
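
The directory layout described above might be sketched as follows. The function name `data_type_directory` and its parameters are hypothetical; the sketch only assembles a relative path and performs no validation of the labels.

```python
from pathlib import PurePosixPath
from typing import Optional


def data_type_directory(
    subject: str, data_type: str, session: Optional[str] = None
) -> PurePosixPath:
    """Build the relative data type directory for a subject (hypothetical helper)."""
    parts = [f"sub-{subject}"]
    # The session layer is present only when sessions are used.
    if session is not None:
        parts.append(f"ses-{session}")
    parts.append(data_type)
    return PurePosixPath(*parts)


print(data_type_directory("01", "eeg"))
print(data_type_directory("01", "eeg", session="02"))
```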

## Filenames

A filename consists of a chain of *entity instances* and a *suffix*
all separated by underscores, and an *extension*.
This pattern forms filenames that are both human- and machine-readable.
For instance, file "`sub-01_task-rest_eeg.edf`" contains instances of the
"subject" and "task" entities, making it evident from the filename alone that it
contains resting-state data from subject `01`;
the suffix `eeg` and extension `.edf` depend on the imaging modality and the data
format, and can therefore convey further details of the file's contents.
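
To make the pattern concrete, here is a minimal sketch that decomposes a filename into entity instances, suffix, and extension. The function name `split_filename` is hypothetical, and the sketch assumes the filename does not begin with a period, so that the first period starts the extension.

```python
def split_filename(filename: str):
    """Return (entities, suffix, extension) for a filename (hypothetical helper).

    Assumes the filename does not start with a period.
    """
    # Everything from the first period onward is the extension.
    stem, dot, rest = filename.partition(".")
    extension = dot + rest
    # Underscores separate the entity instances from each other and from the suffix.
    *instances, suffix = stem.split("_")
    # Each entity instance is a hyphen-separated key-value pair.
    entities = [tuple(instance.split("-", 1)) for instance in instances]
    return entities, suffix, extension


print(split_filename("sub-01_task-rest_eeg.edf"))
```

For `sub-01_task-rest_eeg.edf`, this gives the entity pairs `("sub", "01")` and `("task", "rest")`, the suffix `eeg`, and the extension `.edf`.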

For a data file that was collected in a given session from a given
subject, the filename MUST begin with the string `sub-<label>_ses-<label>`.
Conversely, if the session level is omitted in the directory structure, the
filename MUST begin with the string `sub-<label>`, without `ses-<label>`.
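
One way to check this rule is to compare the filename against the `sub-` and `ses-` directories in its path. This is only a sketch: the function names are hypothetical, and it assumes a POSIX-style path relative to the dataset root.

```python
from pathlib import PurePosixPath


def expected_prefix(relative_path: str) -> str:
    """Derive the required filename prefix from the directories (hypothetical helper)."""
    directories = PurePosixPath(relative_path).parts[:-1]
    # Keep only the subject and session directories, in path order.
    levels = [d for d in directories if d.startswith(("sub-", "ses-"))]
    return "_".join(levels)


def has_expected_prefix(relative_path: str) -> bool:
    """True if the filename begins with the prefix implied by its directories."""
    name = PurePosixPath(relative_path).name
    return name.startswith(expected_prefix(relative_path) + "_")


print(has_expected_prefix("sub-01/ses-02/eeg/sub-01_ses-02_task-rest_eeg.edf"))
print(has_expected_prefix("sub-01/ses-02/eeg/sub-01_task-rest_eeg.edf"))
```

The second path fails the check because a session directory is present but `ses-02` is missing from the filename.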

Any given entity MUST NOT appear more than once in any filename. For example,
filename "`sub-01_acq-laser_acq-uneven_electrodes.tsv`" is invalid because
it uses the "acquisition" entity twice.
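
The uniqueness rule lends itself to a simple check. The sketch below is illustrative only: `duplicated_entity_keys` is a hypothetical name, and the code assumes the filename does not begin with a period.

```python
from collections import Counter


def duplicated_entity_keys(filename: str) -> list:
    """Return the entity keys that appear more than once (hypothetical helper)."""
    # Drop the extension, then drop the suffix that follows the final underscore.
    stem = filename.partition(".")[0]
    instances = stem.split("_")[:-1]
    keys = [instance.partition("-")[0] for instance in instances]
    return sorted(key for key, count in Counter(keys).items() if count > 1)


print(duplicated_entity_keys("sub-01_acq-laser_acq-uneven_electrodes.tsv"))
print(duplicated_entity_keys("sub-01_task-rest_eeg.edf"))
```

A filename is acceptable under this rule only when the returned list is empty; the first example reports `acq` as duplicated.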

In cases where an entity and a metadata field convey similar contextual
information, the presence of an entity should not be used as a replacement for
the corresponding metadata field.
For instance, in echo-planar imaging MRI,
the [`dir-<label>`](./99-appendices/09-entities.md#dir) entity MAY be used
For instance, in echo-planar imaging MRI, the
[`dir-<label>`](./99-appendices/09-entities.md#dir) entity MAY be used
to distinguish files with different phase-encoding directions,
but the file's `PhaseEncodingDirection` can only be specified as metadata.
but the file's `PhaseEncodingDirection` MUST be specified as metadata.

A summary of all entities in BIDS and the order in which they MUST be
specified is available in the [entity table](./99-appendices/04-entity-table.md)
@@ -514,9 +557,9 @@ Example:
}
```

### Key/value files (dictionaries)
### Key-value files (dictionaries)

JavaScript Object Notation (JSON) files MUST be used for storing key/value
JavaScript Object Notation (JSON) files MUST be used for storing key-value
pairs. JSON files MUST be in UTF-8 encoding. Extensive documentation of the
format can be found at [https://www.json.org/](https://www.json.org/),
and at [https://tools.ietf.org/html/std90](https://tools.ietf.org/html/std90).
@@ -733,7 +776,7 @@ of `<index>`es.
Please note that a given label or index is distinct from the "prefix"
it refers to. For example `sub-01` refers to the `sub` entity (a
subject) with the label `01`. The `sub-` prefix is not part of the subject
label, but must be included in filenames (similarly to other key names).
label, but must be included in filenames (similarly to other entities).

## Specification of paths
