Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Add the ability of users to specify an explicit HED.xml schema for validation. #527

Merged
merged 59 commits into from
Aug 6, 2020

Conversation

VisLab
Copy link
Member

@VisLab VisLab commented Jul 10, 2020

closes #537

We have refactored the HED validation in BIDS so that it does not depend on a specific HED schema version. The HED schema itself is under active development by a hed-standard group. As this schema specification evolves, people will be tagging their data at a specific point in time. We do not want their tagging to become invalid because we changed the schema. They just need to specify the vocabulary they used.

This change will require some modifications of the BIDS validator. Our group is working on a draft pull request for those changes. When it is ready, we will do a regular request, but we would appreciate comments/suggestions during the process.

@effigies
Copy link
Collaborator

This seems like a good approach. I like that you can specify a version or an explicit schema. I can make more specific suggestions once this is marked ready for review, but here are some initial thoughts:

  • I would suggest HEDVersion instead of HEDSchemaVersion to be consistent with BIDSVersion.
  • Putting these values in dataset_description.json rather than _events.json would ensure that only one schema can be used within a dataset. I don't think we want to open up the possibility of multiple.
  • XML generally permits local or remote URIs for its schemata, so unless we have an explicit desire to disallow that, I would suggest permitting a URI. If a dataset wants to place it inside sourcedata, then it could use file://sourcedata/HED7.2.1.xml.

Also, does it make sense to have two separate keys HEDVersion (or HEDSchemaVersion) and HEDSchemaFile? It should be pretty easy to disambiguate a version string from a URI, if we want to combine, but I don't feel very strongly here.

@VisLab
Copy link
Member Author

VisLab commented Jul 10, 2020

Agree with all suggestions....thanks.....

Moved the specification of which HED schema to validate against to the dataset_description.json file as suggested.
This optional field is used to specify the HED version or specific XML schema file used for HED validation.
Updated dataset_description to include HEDSchema. Fixed table spacing error this introduced.
@@ -27,6 +27,7 @@ Every dataset MUST include this file with the following fields:
| EthicsApprovals | OPTIONAL. List of ethics committee approvals of the research protocols and/or protocol identifiers. |
| ReferencesAndLinks | OPTIONAL. List of references to publication that contain information on the dataset, or links. |
| DatasetDOI | OPTIONAL. The Document Object Identifier of the dataset (not the corresponding paper). |
| HEDVersion | OPTIONAL. HED version number or the name of HED XML schema file used to validate HED tags for study. |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a significant likelihood of backwards-incompatible changes, then this should probably be RECOMMENDED:

Suggested change
| HEDVersion | OPTIONAL. HED version number or the name of HED XML schema file used to validate HED tags for study. |
| HEDVersion | RECOMMENDED if HED tags are used. HED version number or the name of HED XML schema file used to validate HED tags for study. |

If it's mostly a way to test new features, then OPTIONAL seems fine.

@VisLab
Copy link
Member Author

VisLab commented Jul 15, 2020 via email

Not sure whether this should be qualified by some statement that it is recommended if you are using HED tags to annotate.
@happy5214
Copy link
Contributor

This seems like a good approach. I like that you can specify a version or an explicit schema. I can make more specific suggestions once this is marked ready for review, but here are some initial thoughts:

* I would suggest `HEDVersion` instead of `HEDSchemaVersion` to be consistent with `BIDSVersion`.

* Putting these values in `dataset_description.json` rather than `_events.json` would ensure that only one schema can be used within a dataset. I don't think we want to open up the possibility of multiple.

* XML generally permits local or remote URIs for its schemata, so unless we have an explicit desire to disallow that, I would suggest permitting a URI. If a dataset wants to place it inside `sourcedata`, then it could use `file://sourcedata/HED7.2.1.xml`.

Also, does it make sense to have two separate keys HEDVersion (or HEDSchemaVersion) and HEDSchemaFile? It should be pretty easy to disambiguate a version string from a URI, if we want to combine, but I don't feel very strongly here.

Making a local path a URI would require the BIDS validator to parse it into a local path before passing it to the HED validator, as the latter does not accept URIs (including for remote schemata). Even if the HED validator were to accept URIs, it would not be able to find the file with the URI because the proposed URI is relative to the dataset directory rather than the execution directory, so the HED validator would have the wrong origin for the relative path. Keeping the filename under sourcedata/ as a simple name is much easier to implement.

I have implemented the other proposed changes (merging the two fields in _events.json into a single HEDVersion field in dataset_description.json). The pull request is at bids-standard/bids-validator#1015.

We don't need the URI syntax. The file will always be in sourcedata.
We aren't using URIs, just filenames in the sourcedata directory.
@VisLab
Copy link
Member Author

VisLab commented Aug 2, 2020

@happy5214 and @VisLab would like to move forward with the merge. There will be additional suggestions/fixes but this merge is needed to move forward.

Copy link
Member

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, fine with me to iterate on the examples, but then I suggest to make the brief intro sentences before the examples a bit more clear, so that readers know there is more to define in the JSONS, and that the examples are not "full examples".

@@ -65,6 +65,28 @@ onset duration trial_type response_time stim_file
5.6 0.6 stop 1.739 images/blue_square.jpg
```

The accompanying JSON side car might be:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The accompanying JSON side car might be:
In the accompanying JSON sidecar, the `trial_type` column might look as follows:

@sappelhoff
Copy link
Member

sappelhoff commented Aug 2, 2020

Finally, please have a look at the "Checks" section in the GitHub user interface of this Pull Request (see screenshot):
image

You currently have once check failing: Travis CI.

Please click on "Details" for more information

image

This will lead you to an overview, where you can click on "The build" to get information on WHY the check has failed

image

That will finally show you line by line what the outstanding issues are:

image

In your case, these are related to Markdown style. Please fix these issues as well, so that we can keep a consistent style throughout the BIDS specification. If you have issues or problems with that, just let me know and I can help you.

EDIT: in the failure report, the syntax like 66:1 means that for a given file there is an issue in line 66 at character 1

@VisLab
Copy link
Member Author

VisLab commented Aug 2, 2020 via email

@sappelhoff
Copy link
Member

sure, a markdown table has fences made out of pipes | and dashes -

A good table looks like this:

| header1 | a very long header2 | header3          |
| ------- | ------------------- | ---------------- |
| val1    | val2                | a very long val3 |
| val4    | val5                | val6             |

a bad table (although technically correct syntax) looks like this (for example):

| header1 | a very long header2 | header3 |
| ------- | ------------------- | ------- |
| val1 | val2 | a very long val3 |
| val4 | val5 | val6 |

In a rendered state, both will look like this:

header1 a very long header2 header3
val1 val2 a very long val3
val4 val5 val6

@VisLab
Copy link
Member Author

VisLab commented Aug 2, 2020

Okay, I have attempted to fix this.

@sappelhoff
Copy link
Member

I see you fixed all errors, thanks! Now there remain two small outstanding issues (linked below) from my side and then a final review from @effigies (if he has time and energy):

Note, I think there was an correct duplication of text code.  I removed it, but someone should check it.
Copy link
Collaborator

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor formatting and wording suggestions.

src/03-modality-agnostic-files.md Outdated Show resolved Hide resolved
Comment on lines 48 to 49
hypothesis extensions of BIDS. Note that the `trial_type` and any additional
columns in a TSV file SHOULD be documented in an accompanying JSON sidecar file.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just adding a newline to keep future changes confined to fewer lines.

Suggested change
hypothesis extensions of BIDS. Note that the `trial_type` and any additional
columns in a TSV file SHOULD be documented in an accompanying JSON sidecar file.
hypothesis extensions of BIDS.
Note that the `trial_type` and any additional columns in a TSV file SHOULD be
documented in an accompanying JSON sidecar file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I got the correction right on this one. Were there supposed to be two spaces after BIDS.

src/04-modality-specific-files/05-task-events.md Outdated Show resolved Hide resolved
relevant HED tags will then be associated with the event instance.

The tags in the `HED` column of the `*_events.tsv` file are often specific to the individual event instances,
while the common properties are represented by categorial values appearing in other columns.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
while the common properties are represented by categorial values appearing in other columns.
while the common properties are represented by categorical values appearing in other columns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected the spelling of categorical. Not sure about the removing of the space?

Comment on lines 92 to 94
the categorical specifications, but should form the union before analysis. Further,
the normal BIDS inheritance principle applies, so the data dictionaries can
appear higher in the BIDS hierarchy.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the categorical specifications, but should form the union before analysis. Further,
the normal BIDS inheritance principle applies, so the data dictionaries can
appear higher in the BIDS hierarchy.
the categorical specifications, but should form the union before analysis.
Further, the [inheritance principle](../02-common-principles.md#the-inheritance-principle) applies,
so the data dictionaries can appear higher in the BIDS hierarchy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected, but again, I am not sure about the rules that should be followed for inserting new lines and the extra spaces after periods.

@VisLab
Copy link
Member Author

VisLab commented Aug 6, 2020

Is there anything else that needs attention on this?

@sappelhoff
Copy link
Member

I think we can merge this now 👍

Just a note for you --> when a reviewer makes "code suggestions", you see these two buttons (screenshot):

image

You can use them to commit these suggestions directly to the code, which can often be easier and sometimes cleaner for the review process.

Regarding line breaks and spaces in the spec:

  • try to start a new line after each sentence
  • try to break very long lines at reasonable spots (up to you)
  • single spaces (no double spaces)
  • no trailing whitespace

@sappelhoff sappelhoff merged commit aa98a2d into bids-standard:master Aug 6, 2020
@sappelhoff
Copy link
Member

Thanks for your work and patience @VisLab @happy5214

@VisLab
Copy link
Member Author

VisLab commented Aug 6, 2020 via email

@effigies
Copy link
Collaborator

effigies commented Aug 6, 2020

Thanks @VisLab! I appreciate your patience!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

update spec examples (HED)
4 participants