How to represent regular expressions #22

Closed
akonradi opened this issue Nov 27, 2017 · 9 comments
Labels
Help Wanted (Community support requested), Question (Question about the library)

Comments

@akonradi
Contributor

The current de facto regular expression implementation is the one used by Go, which follows the re2 syntax. It isn't POSIX-compliant, nor is it directly compatible with C++'s std::basic_regex and friends. The difference is most obvious with flags that modify matching behavior ("." matches newline, case-insensitive matching, etc.): Go encodes them as part of the expression string, while C++ takes them as a separate bitmask.
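To make the flag difference concrete, here is a minimal sketch of how Go's built-in regexp package takes flags inside the pattern text. The patterns and variable names are illustrative assumptions, not taken from PGV.

```go
package main

import (
	"fmt"
	"regexp"
)

// The flags live inside the pattern text itself, which is how the
// re2 syntax expresses them. These patterns are illustrative only.
var (
	caseInsensitive = regexp.MustCompile(`(?i)^hello$`) // (?i): ignore case
	dotAll          = regexp.MustCompile(`(?s)^a.b$`)   // (?s): "." matches newline
	plainDot        = regexp.MustCompile(`^a.b$`)       // no flags set
)

func main() {
	fmt.Println(caseInsensitive.MatchString("HeLLo"))
	fmt.Println(dotAll.MatchString("a\nb"))
	fmt.Println(plainDot.MatchString("a\nb"))
}
```

A C++ std::basic_regex would instead receive something like a case-insensitivity option through its constructor's flag argument, so a pattern string written for one engine cannot simply be handed to the other.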

@htuch
Contributor

htuch commented Nov 27, 2017

JSON Schema uses ECMAScript regexes (https://spacetelescope.github.io/understanding-json-schema/reference/regular_expressions.html), which is more or less what C++ uses. So, we should probably use that. Is there a Go lib for this?

@akonradi
Contributor Author

It's definitely not supported by the built-in regexp package. ECMAScript supports backreferences and re2 doesn't. I don't know about third-party libraries, though.
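A quick way to see the backreference gap is to ask Go's regexp package to compile a pattern that uses one. This is a small sketch; the helper name compiles and both patterns are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts the pattern.
// (Hypothetical helper, used only for this illustration.)
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Plain capture groups are part of the re2 syntax.
	fmt.Println(compiles(`(\w+) (\w+)`))
	// A backreference to group 1 is valid ECMAScript, but Go rejects it
	// at compile time.
	fmt.Println(compiles(`(\w+) \1`))
}
```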

@rodaine
Member

rodaine commented Nov 27, 2017

Currently, PGV is documented to support re2. Ideally, none of the generated code (any lang) will have dependencies outside of the stdlib. So...

We could limit patterns to the POSIX ERE syntax, if that's something C++ supports out of the box?

@htuch
Contributor

htuch commented Nov 28, 2017

C++ can do something "similar" to ERE, see https://www.regular-expressions.info/stdregex.html for the caveats which mostly relate to non-ASCII and embedded line breaks. http://en.cppreference.com/w/cpp/regex/basic_regex as well.

@akonradi
Contributor Author

I suspect it won't be possible to avoid all dependencies outside of the standard libraries. UTF-8 handling, which some string validations require, is not provided by the C++ standard library. I don't think URL or IP validation is either (though I may be mistaken). Go just happens to have a standard library with substantially more breadth than C++. That being said, re2 wouldn't be the worst thing to depend on, since it seems to have bindings for a reasonable number of languages.

@rodaine added the Question label Dec 1, 2017

@akonradi
Contributor Author

Now that we're adding more languages, I think it's time to revisit this. Both Java and Python support re2, and while I like not having dependencies outside the standard libraries, this seems like a good exception to make.

@akonradi
Contributor Author

@rodaine @htuch any thoughts here?

@htuch
Contributor

htuch commented Jun 24, 2019

I think we could live with re2 as an Envoy dependency, so if we have Go/Java/Python with out-of-the-box support, let's go with that.

@stale

stale bot commented Jul 26, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale bot added the Stale label Jul 26, 2019
@rodaine added the Help Wanted label Jul 26, 2019
@stale bot removed the Stale label Jul 26, 2019
htuch pushed a commit that referenced this issue Dec 3, 2019
Envoy now uses RE2 as a safe regex engine instead of std::regex (envoyproxy/envoy#7878). Because PGV already requires patterns to use RE2 syntax, one option is to use RE2 for C++ patterns as well. This commit implements that, for use in string, bytes, repeated item, and map key/value pattern validation.

Implements #22

WIP: I ran into difficulty creating the regex, because a pattern containing a null character would get cut off. For example, the ASCII character test used the pattern ^[\x00-\x7f]+$, and consuming this as a string produced the null-terminated pattern ^[ instead of the actual pattern. I think this might be a problem across most of the C++ code? That's why there's a terrible string construction in the pattern creation.

Signed-off-by: Asra Ali <asraa@google.com>
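
The truncation described in the commit message can be reproduced outside C++. The sketch below is written in Go and only simulates null-terminated handling with a hypothetical truncateAtNUL helper; it is not the code from the commit. It contrasts a pattern whose escapes were already decoded into raw bytes with one whose escapes are left as text for the regex engine to interpret.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// truncateAtNUL mimics reading bytes as a null-terminated C string:
// everything from the first NUL byte onward is lost.
// (Hypothetical helper, used only for this illustration.)
func truncateAtNUL(s string) string {
	if i := strings.IndexByte(s, 0); i >= 0 {
		return s[:i]
	}
	return s
}

// compiles reports whether Go's regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Escapes already decoded: the pattern holds a real NUL byte.
	decoded := "^[\x00-\x7f]+$"
	// Escapes left as text: the regex engine decodes \x00 itself.
	escaped := `^[\x00-\x7f]+$`

	fmt.Printf("%q\n", truncateAtNUL(decoded))    // only the part before the NUL survives
	fmt.Println(compiles(truncateAtNUL(decoded))) // the truncated pattern is not a valid regex
	fmt.Println(compiles(truncateAtNUL(escaped))) // no NUL byte, so nothing is cut off
}
```

Under these assumptions there are two ways out: pass the pattern together with an explicit length, or keep the escape sequences as text until the regex engine parses them.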