Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(schema): Provide default JSON column definition for "conventional" columns #1838

Merged
merged 3 commits into from
Jul 25, 2024

Conversation

effigies
Copy link
Collaborator

@effigies effigies commented May 24, 2024

In BIDS we have three types of columns:

  1. Fully-defined columns, such as participant_id or the type column in electrodes.tsv. These are ones where having JSON-schema style definitions of values is convenient, as we can rely on a well-established way of validating values. There is no good use case for overriding these, since the meaning should not vary from dataset to dataset.
  2. Example columns used in the specification, what I am calling "conventional" columns, such as age or handedness. These are used in the spec and it was decided some time ago that we will let people use them without complaining about being ill-defined, as long as they use them consistently with the specification.
  3. User-defined columns are not provided in the specification, and the descriptions of the columns SHOULD go in the JSON sidecar.

For (1) there is no good use case of overriding, since the meaning should not vary from dataset to dataset. For (2) these definitions may be broader or narrower than is really useful, and we would like people to define what they mean. For convenience, tools should fall back to the BIDS spec for interpretation. For (3) the data are unlikely to be usable in an automated fashion without a JSON definition.

Currently (1) and (2) are defined in the same way in the schema, which leaves a gap because our column description object is not as powerful as JSON schema. When overriding conventional columns, the user has no obvious way to produce an equivalent representation, and tools have no way to help the user by injecting the default definition and inviting the user to refine it.

This PR moves type (2) to align with type (3), so that conventional columns are validated the same way as user-defined columns, and examples are available in the specification. We do this by adding an optional definition field to entries in objects.columns which is the JSON default. It is intentionally written as a JSON object inside YAML. This change also allows a validator to identify (1) and (2) columns purely through structure, rather than adding a new flag.

We have already implemented this in the schema validator in https://github.com/bids-standard/bids-validator/pull/1987 and https://github.com/bids-standard/bids-validator/pull/2003, so this is known to work with the existing BIDS examples.

The logic is as follows:

  1. If a column has a schema definition (type 1), validate using that schema. Warn on attempted override.
  2. If a column has a JSON definition (type 2), use JSON column definition validation. Use an overriding definition from the JSON sidecar, if found, but fall back to the in-schema JSON definition.
  3. If a column has no JSON definition, but there is a JSON definition in the sidecar, validate using JSON validation.
  4. If a column has none of the above, warn/error about missing definitions, depending on whether the file type permits undefined additional columns.
This is an edited post. Expand for original text.

Some TSV columns are very well defined, while others are conventional.

As an example, the type column of electrodes.tsv has a very explicit meaning and list of possible values; there's no reason that someone would add a description in electrodes.json, and if they did, the validator should not allow it to override the specification definition.

In contrast the handedness column of participants.tsv is conventional in that we provide suggested values and their interpretations, but we would really prefer if people provide a full definition in participants.json. There are handedness definitions that are not refinements or subsets; for example, the Edinburgh Handedness Inventory is a numeric scale, not a set of labels.

I would like to propose that for conventional columns, we provide an explicit example JSON definition for the column. Validators would be expected to look first for a sidecar definition, fall back to the schema-defined example, and use whatever definition is available to validate the values in the column. In the case where neither is available, we would raise a warning, as we currently do. If I get some agreement here, I will do the same thing for all other conventional columns I can find.

For fully-defined columns, we will continue using JSON schema-style type definitions, which are more explicit.

cc @sappelhoff who I believe was involved in helping to formalize these last time.

@effigies effigies force-pushed the rf/schema-column-definitions branch from e254ca6 to 1a62d12 Compare May 24, 2024 17:57
@effigies effigies added schema Issues related to the YAML schema representation of the specification. Patch version release. schema-code Updates or changes to the code used to parse, filter, and render the schema. exclude-from-changelog This item will not feature in the automatically generated changelog labels May 24, 2024
@effigies
Copy link
Collaborator Author

This column definition was initially defined in #827 by @tsalo which was also approved by @Remi-Gau. Feedback from any of you would be appreciated.

Copy link

codecov bot commented May 24, 2024

Codecov Report

Attention: Patch coverage is 92.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.06%. Comparing base (8367dbe) to head (2e58636).

Files Patch % Lines
tools/schemacode/bidsschematools/render/utils.py 86.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1838      +/-   ##
==========================================
+ Coverage   88.04%   88.06%   +0.02%     
==========================================
  Files          16       16              
  Lines        1380     1391      +11     
==========================================
+ Hits         1215     1225      +10     
- Misses        165      166       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sappelhoff
Copy link
Member

I would like to propose that for conventional columns, we provide an explicit example JSON definition for the column. Validators would be expected to look first for a sidecar definition, fall back to the schema-defined example, and use whatever definition is available to validate the values in the column. In the case where neither is available, we would raise a warning, as we currently do. If I get some agreement here, I will do the same thing for all other conventional columns I can find.

For fully-defined columns, we will continue using JSON schema-style type definitions, which are more explicit.

That sounds reasonable to me. Shall there be some "info" logging for these cases? 👇

fall back to the schema-defined example

rwblair added a commit to rwblair/legacy-validator that referenced this pull request Jun 7, 2024
…ified in sidecar and schema rules for that column are in the 'json schema' defined style outlined in bids-standard/bids-specification#1838
@effigies effigies force-pushed the rf/schema-column-definitions branch 2 times, most recently from 7d8f448 to ab41d81 Compare June 7, 2024 20:04
effigies added a commit to bids-standard/legacy-validator that referenced this pull request Jun 7, 2024
* Modify tsv column value type checks to use user defined types if specified in sidecar and schema rules for that column are in the 'json schema' defined style outlined in bids-standard/bids-specification#1838

* Fix merge

* Remove unused ts error suppression

* missed a merge markers

* Update bids-validator/src/schema/applyRules.ts

Co-authored-by: Chris Markiewicz <effigies@gmail.com>

---------

Co-authored-by: Chris Markiewicz <effigies@gmail.com>
@effigies
Copy link
Collaborator Author

Shall there be some "info" logging for these cases? 👇

fall back to the schema-defined example

I don't think so. debug logging perhaps, but the standing expectation is that if you use approved values for these columns, the validator won't complain.

@effigies effigies force-pushed the rf/schema-column-definitions branch from df1559a to 40ca7bb Compare July 25, 2024 16:54
Copy link
Member

@tsalo tsalo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal makes sense to me, as does the change to the schema.

@rwblair rwblair merged commit 1d465f5 into bids-standard:master Jul 25, 2024
24 of 25 checks passed
@effigies effigies deleted the rf/schema-column-definitions branch July 25, 2024 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
exclude-from-changelog This item will not feature in the automatically generated changelog needs review schema Issues related to the YAML schema representation of the specification. Patch version release. schema-code Updates or changes to the code used to parse, filter, and render the schema.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants