doi registrant code is too restrictive in the schema #910

Closed
schwehr opened this issue Oct 12, 2020 · 3 comments · Fixed by #964
Labels: json schema; prio: should-have (would be very good to have in the release)

@schwehr
Contributor

schwehr commented Oct 12, 2020

In extensions/scientific/json-schema/schema.json

"sci:doi": {
          "type": "string",
          "title": "Data DOI",
          "pattern": "^(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?![%\"#? ])\\S)+)$"
        }, 

This is too narrow: [0-9]{4,}

https://www.doi.org/overview/DOI_article_ELIS3.pdf

a unique alphanumeric string assigned to an organization
that wishes to register DOI names (four digit numeric codes
are currently used though this is not a compulsory syntax).
The registrant code is assigned through a DOI registration
agency, and a registrant may have multiple-registrant
codes. 

https://www.doi.org/doi_handbook/2_Numbering.html#2.2.2

The registrant code is a unique string assigned to a registrant.

Based on the "alphanumeric" statement in the PDF, my best guess at what the DOI regex should be is:

          "pattern": "^(10[.][0-9a-zA-Z]+(?:[.][0-9a-zA-Z]+)*/(?:(?![%\"#? ])\\S)+)$"

So this should be a valid DOI if the prefix were registered: 10.123abc.foo.bar/issn.1476-4687/this/is/nuts
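
As a quick check of the claim above, here is a minimal Python sketch that runs both patterns against the example DOI. It assumes Python's `re` module behaves like the ECMA 262 dialect used by JSON Schema for these particular patterns, and the helper name `matches` is made up for illustration. The patterns are copied from this issue with the JSON string escaping removed.

```python
import re

# Patterns from this issue, with JSON escaping (\" and \\S) undone.
OLD_PATTERN = r'^(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?![%"#? ])\S)+)$'
PROPOSED_PATTERN = r'^(10[.][0-9a-zA-Z]+(?:[.][0-9a-zA-Z]+)*/(?:(?![%"#? ])\S)+)$'


def matches(pattern: str, doi: str) -> bool:
    """Hypothetical helper: JSON Schema "pattern" is an unanchored search."""
    return re.search(pattern, doi) is not None


alphanumeric_doi = "10.123abc.foo.bar/issn.1476-4687/this/is/nuts"
numeric_doi = "10.1000/182"

for doi in (alphanumeric_doi, numeric_doi):
    print(doi, "old:", matches(OLD_PATTERN, doi), "proposed:", matches(PROPOSED_PATTERN, doi))
```

The old pattern rejects the alphanumeric prefix because `[0-9]{4,}` requires at least four digits directly after `10.`, while the proposed pattern accepts both inputs.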

I'm not sure what the suffix part of the pattern will match: (?:(?![%\"#? ])\\S)+)

https://json-schema.org/understanding-json-schema/reference/regular_expressions.html
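
On the suffix question: the group `(?:(?![%"#? ])\S)+` reads as "one or more non-whitespace characters, none of which is `%`, `"`, `#`, `?` or a space". The sketch below checks that reading in Python, assuming `re` handles the lookahead the same way as the schema validator's engine. The simpler character class `EQUIVALENT` and the helper `full` are illustrative names, not part of the schema.

```python
import re

# Suffix group from the schema pattern (JSON escaping removed).
SUFFIX = r'(?:(?![%"#? ])\S)+'
# Assumed-equivalent form: a negated character class without lookahead.
EQUIVALENT = r'[^\s%"#?]+'


def full(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


samples = [
    "issn.1476-4687/this/is/nuts",  # plain suffix with slashes and dashes
    "abc#frag",                     # contains '#'
    "a b",                          # contains a space
    "100%",                         # contains '%'
    "what?",                        # contains '?'
    'say"hi',                       # contains a double quote
]

for s in samples:
    print(repr(s), full(SUFFIX, s), full(EQUIVALENT, s))
```

Only the first sample is accepted by either form. The space inside the lookahead is redundant, since `\S` already excludes whitespace.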

@m-mohr
Collaborator

m-mohr commented Oct 20, 2020

That seems to be a valid point. The original regex we used is from here: https://www.regextester.com/93795
Maybe we can also learn a bit from this post, although it also only uses digits: https://www.crossref.org/blog/dois-and-matching-regular-expressions/
Would you be up for contributing a new version in a PR?

@schwehr
Contributor Author

schwehr commented Oct 20, 2020

I will definitely cook up a PR. I have been exploring DOIs more while working on the scientific extension in stac-utils/pystac#199, for example: https://gist.github.com/schwehr/22ce6080eb9e730ef04fccfa25072e3a

Does anyone have valid DOIs that fail to validate?

m-mohr added this to the 1.0.0-beta.3 milestone on Dec 16, 2020
m-mohr self-assigned this on Jan 4, 2021
cholmes added the "prio: should-have" (would be very good to have in the release) label on Jan 14, 2021
@m-mohr
Collaborator

m-mohr commented Feb 2, 2021

With the CrossRef article in mind, I have relaxed the DOI regex to:
^10\.[0-9a-zA-Z]{4,}/[^\s]+$
That should match almost all DOIs out there, I think. See PR #964.
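
To see what the relaxed pattern accepts, here is a small Python sketch. It tests the regex exactly as quoted in this comment (the schema merged in PR #964 may differ), and assumes Python's `re` agrees with the validator's regex engine for these inputs. The name `is_match` and the sample DOIs other than the one from the issue description are made up for illustration.

```python
import re

# Relaxed pattern as quoted in the comment above.
RELAXED = r'^10\.[0-9a-zA-Z]{4,}/[^\s]+$'


def is_match(doi: str) -> bool:
    """Hypothetical helper mirroring JSON Schema's unanchored "pattern" search."""
    return re.search(RELAXED, doi) is not None


cases = [
    "10.1000/182",                                    # numeric registrant code
    "10.123abc/issn.1476-4687",                       # alphanumeric registrant code
    "10.123abc.foo.bar/issn.1476-4687/this/is/nuts",  # dotted prefix from the issue
    "10.123/abc",                                     # registrant code shorter than 4
    "10.1000/has space",                              # whitespace in the suffix
    "10.1000/a#b?c",                                  # characters the old suffix rejected
]

for doi in cases:
    print(doi, is_match(doi))
```

Under these assumptions the pattern accepts alphanumeric registrant codes of four or more characters and any non-whitespace suffix, including `#` and `?`. It does not accept a prefix subdivided with dots, such as the `10.123abc.foo.bar/...` example from the issue description, because only `[0-9a-zA-Z]` is allowed between `10.` and the first `/`.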
