Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uri format regexp is validating invalid URI #408

Closed
vmaurin opened this issue Jun 15, 2021 · 10 comments
Closed

uri format regexp is validating invalid URI #408

vmaurin opened this issue Jun 15, 2021 · 10 comments

Comments

@vmaurin
Copy link
Contributor

vmaurin commented Jun 15, 2021

The regexp used by the "uri" format validation doesn't really validate that the string is a proper URI, i.e there is a chance passing a validated string to the java URI class will result into an exception

An example invalid URI that passes the validation is https://gitlab.com/?args=invalid|value

Should the regexp be more complex ? Should the validation be made with the java URI class (then catching the exception to validate) ?

@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 15, 2021

The regexp in question is this one

COMMON_BUILTIN_FORMATS.add(pattern("uri", "(^[a-zA-Z][a-zA-Z0-9+-.]*:[^\\s]*$)|(^//[^\\s]*$)"));

@stevehu
Copy link
Contributor

stevehu commented Jun 16, 2021

@vmaurin The regexp definitely has room to improve. The reason I didn't use the Java URI class to verify the URL is due to performance. Most users are using this library to validate the HTTP request and it needs to be as fast as possible. I would suggest improving the regexp to make it cover more validation scenarios. The other option is to provide both options but default to regexp and with a configuration flag switch to Java URL class. What do you think?

@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 16, 2021

I totally agree with the performance issue on relying on try/catch for validation, and also after a quick research, it sounds that URI java class is not perfect regarding validation either. My personal pick would be as you suggested, to enrich a bit the regexp. There are several example online, with this one as the most complete validation http://jmrware.com/articles/2009/uri_regexp/URI_regex.html but it might be also overkill. There are also some middle ground, like this one ^([a-z][a-z0-9+.-]+):(\/\/([^@]+@)?([a-z0-9.\-_~]+)(:\d+)?)?((?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])+(?:\/(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])*)*|(?:\/(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])+)*)?(\?(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@]|[/?])+)?(\#(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@]|[/?])+)?$ that could be implemented, I just don't know what should be the position of json-schema-validator on this regards

@stevehu
Copy link
Contributor

stevehu commented Jun 16, 2021

It looks good to me. Would you like to submit a PR to replace the old one? Thanks.

vmaurin pushed a commit to vmaurin/json-schema-validator that referenced this issue Jun 16, 2021
The json schema uri format should follow rfc3986 for uri validation. I
have replaced the current regex with a more restrictive one in order to
have a stronger validation according the RFC

refs networknt#408
@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 16, 2021

I just did but I ran in an other use case. So far, the protocol relative URL were considered valid, that is not the case anymore with the new regex. According json schema, we should follow the URI rfc, then I guess forbid protocol relative URL, but not sure it is something you desire

@stevehu
Copy link
Contributor

stevehu commented Jun 16, 2021

What is the protocol-relative URL? Could you please provide an example? Thanks. We want to make sure it is backward compatible if possible. Otherwise, some users will have a broken application after upgrade the library.

@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 16, 2021

It is actually a breaking change, see in my PR here https://github.com/networknt/json-schema-validator/pull/409/files#diff-cd18df339efd8831f215e37b5248aeb7dede17bac8013db8830dee30cd59b593L200 . The value asserted "valid" is not actually a valid URI, so there are two options :

  • breaking backward compatibility but being more respectful of the json schema specs
  • keeping backward compatibility but validating a URI that is not a correct one according the RFC

@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 18, 2021

@stevehu I am waiting on your call. Any more restrictive regexp won't be backward compatible by design, it will invalidate value that were considered valid before. Either you want the library to become stricter along time, ensuring good validation but no backward compatibility regarding validation, or you want to keep the library as lax as today, but give options to be more strict if needed. Depending on your decision, I will adapt the PR

@stevehu
Copy link
Contributor

stevehu commented Jun 18, 2021

We want to maintain backward compatibility; however, we cannot stop enhancing the library. Please submit a PR with your updated regexp. Thanks.

@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 18, 2021

Ok so the PR is open here #409

stevehu pushed a commit that referenced this issue Jun 18, 2021
The json schema uri format should follow rfc3986 for uri validation. I
have replaced the current regex with a more restrictive one in order to
have a stronger validation according the RFC

refs #408

Co-authored-by: Vincent Maurin <vincent.maurin@vectaury.io>
@stevehu stevehu changed the title "uri" format regexp is validating invalid URI uri format regexp is validating invalid URI Jun 18, 2021
@stevehu stevehu closed this as completed Jun 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants