Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SCHEMA] Add patterns for format validation #885

Merged
merged 34 commits into from
Mar 24, 2022

Conversation

tsalo
Copy link
Member

@tsalo tsalo commented Sep 24, 2021

Closes #832.

This currently doesn't include any code that would use the patterns, except for tests.

Changes proposed:

  • Add formats.yaml file to schema, with regular expressions with which to validate different entity values, metadata types, and metadata string formats. Individual metadata fields should still be allowed to supply a specific "pattern" keyword.

To do:

src/schema/formats.yaml Outdated Show resolved Hide resolved
Comment on lines 47 to 66
dataset_relative:
description: |
A path to a file, relative to the dataset folder.

The validation for this format is minimal.
pattern: ^[0-9a-zA-Z/_-\.]+$
participant_relative:
description: |
A path to a file, relative to the participant's folder in the dataset.

The validation for this format is minimal.
pattern: ^[0-9a-zA-Z/_-\.]+$
uri:
description: |
A uniform resource indicator.
pattern: ^[a-zA-Z]+:[0-9a-zA-Z/_-\.]+$
bids_uri:
description: |
A BIDS uniform resource indicator.
pattern: ^bids:[0-9a-zA-Z/_-\.]+$
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patterns for all four of the path/URI formats should be revised. I don't know enough about the rules to define solid patterns just yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same goes for stimuli_relative (added after the initial comment).

src/schema/formats.yaml Outdated Show resolved Hide resolved
@tsalo tsalo marked this pull request as ready for review September 24, 2021 18:59
@tsalo tsalo requested a review from rwblair September 24, 2021 19:00
Copy link
Collaborator

@Remi-Gau Remi-Gau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will for sure be superuseful. I don't speak regex fluently so not sure I am the best to review this though.

Copy link
Member

@rwblair rwblair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structure makes sense to me. Eyeballing the regex it seems ok, but the best way to verify it is for it to be used by the validator in tests.

@tsalo
Copy link
Member Author

tsalo commented Oct 5, 2021

Thanks for taking a look. I'm thinking that I should try writing some unit tests for the schema code, which will necessitate reorganizing it into a proper package. I will open a PR to do that after #883 is merged, so that there aren't too many merge conflicts.

EDIT: I've opened #892.

Comment on lines 15 to 18
description: |
The basic string type (not a specific format).
This should allow any free-form string *except* "n/a".
pattern: ^(?!(n/a)$).*$
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to #876

This should allow any free-form string except "n/a".

I am not sure whether disallowing n/a as a string for free-form text fields would break some datasets. For example for the required EEGReference field, users could currently pass n/a according to bids-validator:

https://github.com/bids-standard/bids-validator/blob/75c54e4f7a17bf58c7cfc47c9698eb4e0841315b/bids-validator/validators/json/schemas/eeg.json#L25

Whether or not that's a good idea should be discussed in #876, this is just a note that what you introduce here has not been discussed and has not been formally part of the spec 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My hope is that any cases where "n/a" is acceptable will have "n/a" as an explicit option within the schema. I've tried to add it as an option in all of the cases that I've seen like that, but that doesn't mean we're close to having complete coverage. I can drop any fancy "n/a" handling from this PR if you'd rather wait to tackle it within #876 though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've dropped the special "n/a" handling in dfd3057, but I do think we will want to distinguish between "n/a"s and freeform strings in the long run.

@yarikoptic
Copy link
Collaborator

I was wondering on why formats.yaml and not patterns.yaml if that is what it contains really?

@tsalo
Copy link
Member Author

tsalo commented Oct 14, 2021

Currently, we have pattern in the schema when there's an idiosyncratic regular expression associated with a single field, e.g.,

pattern: ^(?:2[0-3]|[01][0-9]):[0-5][0-9]:[0-5][0-9]$

Whereas common "patterns" are consolidated as "formats." I guess you could say that this file defines the "patterns" associated with the "formats", but I feel like naming it patterns.yaml might be confusing since the existing "patterns" in the schema are not included in this file.

@yarikoptic
Copy link
Collaborator

I guess the next step for this pr is adding some tests as @tsalo was planning todo (and #833 is merged) or that would be postponed till later?

@tsalo tsalo requested a review from yarikoptic January 5, 2022 18:52
@sappelhoff sappelhoff modified the milestone: 1.7.0 Feb 1, 2022
@codecov
Copy link

codecov bot commented Feb 1, 2022

Codecov Report

Merging #885 (4a0ad3f) into master (51e9f1b) will increase coverage by 2.30%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #885      +/-   ##
==========================================
+ Coverage   34.05%   36.35%   +2.30%     
==========================================
  Files           8        8              
  Lines         834      850      +16     
==========================================
+ Hits          284      309      +25     
+ Misses        550      541       -9     
Impacted Files Coverage Δ
tools/schemacode/schemacode/tests/test_schema.py 100.00% <100.00%> (ø)
tools/schemacode/schemacode/_version.py 38.90% <0.00%> (+2.59%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e53a367...4a0ad3f. Read the comment docs.

Copy link
Collaborator

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just love testing boundary conditions I guess ;)

src/schema/objects/formats.yaml Outdated Show resolved Hide resolved
tools/schemacode/schemacode/tests/test_schema.py Outdated Show resolved Hide resolved
tools/schemacode/schemacode/tests/test_schema.py Outdated Show resolved Hide resolved
tools/schemacode/schemacode/tests/test_schema.py Outdated Show resolved Hide resolved
tools/schemacode/schemacode/tests/test_schema.py Outdated Show resolved Hide resolved
tools/schemacode/schemacode/tests/test_schema.py Outdated Show resolved Hide resolved
tsalo and others added 3 commits February 2, 2022 14:30
Thanks for the suggestions @yarikoptic. They're great!

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
@tsalo
Copy link
Member Author

tsalo commented Feb 2, 2022

It looks like ftp:// passes validation for the uri format, unfortunately. Not sure how to update the regex appropriately.

EDIT: URIs can't really be validated well with regular expressions, apparently, so I think the best solution would be to do only light testing for the regular expression, but mention in the description that folks should use an appropriate function in practice.

@yarikoptic
Copy link
Collaborator

yarikoptic commented Feb 3, 2022

It looks like ftp:// passes validation for the uri format, unfortunately. Not sure how to update the regex appropriately.

I can try later to create one following the "Syntax" of https://en.wikipedia.org/wiki/Uniform_Resource_Identifier which would demand to have a non-degenerate authority after //

edit: although not sure if it would come up stringent enough to identify errors. There is also a regex given on https://www.ietf.org/rfc/rfc3986.txt as ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? but that one is not stringent enough either:

In [3]: re.match("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", "ftp://")
Out[3]: <re.Match object; span=(0, 6), match='ftp://'>

In [4]: re.match("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", "ftp://").groups()
Out[4]: ('ftp:', 'ftp', '//', '', '', None, None, None, None)

and making authority non-degenerate shifts // somewhere deeper:

In [5]: re.match("^(([^:/?#]+):)?(//([^/?#]+))?([^?#]*)(\?([^#]*))?(#(.*))?", "ftp://").groups()
Out[5]: ('ftp:', 'ftp', None, None, '//', None, None, None, None)

didn't analyze further. Feel free to just add a note on possible need for improvement or other way to specify extra rules.

@tsalo
Copy link
Member Author

tsalo commented Mar 24, 2022

Now that I have two approvals, I'm going to merge. I think we can keep improving these patterns (i.e., #999) before the next release.

@tsalo tsalo merged commit 9dbab0e into bids-standard:master Mar 24, 2022
@tsalo tsalo deleted the format-rules branch March 24, 2022 19:46
@tsalo tsalo added the schema Issues related to the YAML schema representation of the specification. Patch version release. label Apr 11, 2022
@effigies effigies added the exclude-from-changelog This item will not feature in the automatically generated changelog label Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
exclude-from-changelog This item will not feature in the automatically generated changelog schema Issues related to the YAML schema representation of the specification. Patch version release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Define valid value formats in schema
6 participants