Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regular expression evaluation should be fixed. #243

Closed
wants to merge 4 commits into from

Conversation

rhpvorderman
Copy link
Contributor

@rhpvorderman rhpvorderman commented Aug 10, 2018

See broadinstitute/cromwell#3990 .
Even within the same execution engine, not having specified how the regex should be evaluated leads to portability problems. Not to mention between execution engines.

This should be fixed to a certain standard. I opted for PERL-style here because PERL is supposed to be the king of regex. But It does not really matter which standard is chosen, as long as there is a standard. Given the discussion below POSIX ERE are now set as standard.

@EvanTheB
Copy link
Contributor

👍 , while java and python inbuilt regex both have minor differences to perl, they could both use libpcre. Alternative would be posix regex, but I don't know that the libraries are as well supported.

@geoffjentry
Copy link
Member

I would really prefer to see this standardized on POSIX basic regexes but I think this is a great thing to do in general

@cjllanwarne
Copy link
Contributor

cjllanwarne commented Aug 10, 2018

Note that examples should take heed of the universal string escaping rules in https://github.com/openwdl/wdl/blob/master/versions/development/SPEC.md#whitespace-strings-identifiers-constants which should happen before we being the regex parsing itself

EDIT:

  • this appears to be true in the "examples" section immediately below your change.

@cjllanwarne
Copy link
Contributor

Not a particularly strong opinion but I prefer perl over posix because it uses "escape to suppress metacharacters" rather than "escape to enable metacharacters" which is (a) more natural to my way of thinking and (b) closer to the current Cromwell behavior.

I realize that "what we happen to do" is a pretty weak argument, but there it is... 😄

@geoffjentry
Copy link
Member

The reason I prefer posix basic is that it is more standard and more common, i.e. less surprising to people.

And you're right that "what we happen to do [in one implementation]" is a pretty weak argument :)

@cjllanwarne
Copy link
Contributor

cjllanwarne commented Aug 10, 2018

After a little looking into this, I'd be:

  • 👍 on PCRE
  • 👍 on POSIX ERE
  • Not so much 👎 on POSIX BRE as nervous that people would hate us for choosing it... so not 👍 either

I'd push back a little on "less surprising" since I personally find it much more surprising to have to use \( \) as group delimiters and ( ) as the paren literals, for example - and if a search pattern is non-trivial then egrep is significantly more intuitive for me to use than grep:

  • BRE (ie grep):
    • (123)* matches (123))))
    • \(123\)* matches 123123123
  • ERE (ie egrep):
    • (123)* matches 123123123
    • \(123\)* matches (123))))

FWIW in case anyone else is trying to find one, while looking around I found a nice page on the POSIX ERE vs BRE differences: https://www.regular-expressions.info/posix.html

@geoffjentry
Copy link
Member

Heads up that I'm going to vote against anything other than basic POSIX, but I'm just one vote.

@rhpvorderman
Copy link
Contributor Author

I agree with @cjllanwarne that POSIX ERE would be the way to go. The syntax is most similar to the way regex works in languages I know (Python, Scala) and considering the users come from a scientific background they probably have some experience with python regexes. Just my 2 cents.

@patmagee
Copy link
Member

I also agree the POSIX ERE is the best option. Having to escape things like group delimiters seems counter-intuitive for me (although people used to using other REGEX Engines find it fine). Extending on what @rhpvorderman mentioned, other comonly used tools, like egrep, are quite similar in syntax to ERE, making it I think the best choice moving forward

@rhpvorderman
Copy link
Contributor Author

Since we seem to be gravitating towards POSIX ERE (or BRE) I have changed the PR to reflect this.

@geoffjentry
Copy link
Member

Changing my tune re my hard stance on BRE, I'm cool with ERE

@rhpvorderman
Copy link
Contributor Author

Wow, it seems everyone is in favor of POSIX ERE now. So are there any changes that need to be made to the contents of the pull request before we can start voting?

@@ -2998,7 +2998,8 @@ Varieties of the `size` function also exist for the following compound types. Th
## String sub(String, String, String)

Given 3 String parameters `input`, `pattern`, `replace`, this function will replace any occurrence matching `pattern` in `input` by `replace`.
`pattern` is expected to be a [regular expression](https://en.wikipedia.org/wiki/Regular_expression). Details of regex evaluation will depend on the execution engine running the WDL.
`pattern` is expected to be a [regular expression](https://en.wikipedia.org/wiki/Regular_expression).
The regular expression will be evaluated as a [POSIX Extended Regular Expression (ERE)](https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended).

Example 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not already the case can you make sure the examples are valid ERE. And if nothing exercises particulars of ERE, perhaps throw in an example of that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the existing examples and threw in some new ones. I hope they are up to standards.

@geoffjentry
Copy link
Member

Cool, since I was the only real dissenter anyways and that's cleared up let's open up for voting

@EvanTheB
Copy link
Contributor

👍 can anyone comment on the current cromwell compliance?

@DavyCats
Copy link
Contributor

👍

@LeeTL1220
Copy link

ERE
👍

@cjllanwarne
Copy link
Contributor

👍

2 similar comments
@patmagee
Copy link
Member

👍

@dheiman
Copy link

dheiman commented Sep 28, 2018

👍

@geoffjentry
Copy link
Member

Passes unanimously and is waiting for implementation

@patmagee patmagee added this to the v2.0 milestone Nov 20, 2019
Comment on lines +3012 to +3013
String chocoearlylate = sub(chocolike, "[^ ]late", "early") # I like chocoearly when it's late
String choco4 = sub(chocolike, " [:alpha:]{4} ", " 4444 ") # I 4444 chocolate 4444 it's late
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this has already passed voting, but I briefly looked into how this might be implemented in miniwdl and noticed that there are some typos in these examples. They should look like the following:

 String chocoearlylate = sub(chocolike, "[^ ]late", "early") # I like chocearly when it's late

The second o will also be replaced, due to the [^ ].

 String choco4 = sub(chocolike, " [[:alpha:]]{4} ", " 4444 ") # I 4444 chocolate 4444 it's late

Character classes require and additional set of [].

DavyCats added a commit to DavyCats/miniwdl that referenced this pull request Jan 7, 2020
mlin pushed a commit to chanzuckerberg/miniwdl that referenced this pull request Jan 31, 2020
@DavyCats
Copy link
Contributor

This is now implemented on miniwdl's master branch (through chanzuckerberg/miniwdl#321).

@mlin
Copy link
Member

mlin commented May 28, 2020

To pick low-hanging fruit, I think this PR can merged along with an appropriate entry in CHANGELOG. Any objections?

@illusional
Copy link
Contributor

I have no objections because it meets the criteria for merge (one implemented engine). What is the status of this in other engines like Cromwell or DnaNexus?

@rhpvorderman
Copy link
Contributor Author

Regex evaluation has been stable for quite some time in Cromwell. If it is correct according to the spec I don't know. But @cjllanwarne or @aednichols probably knows more.

@mlin
Copy link
Member

mlin commented Jun 2, 2020

I've heard (out-of-band) a concern about lack of trustworthy POSIX ERE implementation for JVM stacks. I'm not too familiar with the ecosystem myself, does anyone have a suggestion?

@rhpvorderman
Copy link
Contributor Author

If that is the case we can still change the spec 😇 ... The point of this change is to have all execution use the same regex evaluation. Which regex evaluation is of much lesser importance, as long as it is something easy to get into. If POSIX ERE is hard to implement in scala/java code we can look for an alternative that is easy in Python, Java and Scala. (I do not know of execution engines in other languages as of yet.)

@cjllanwarne
Copy link
Contributor

cjllanwarne commented Jun 3, 2020

Thanks @mlin and @patmagee. Looking back at the history of this thread, I think perl was the other main contender (and indeed, a library was mooted: libpcre). Cheat sheet: https://www.geeksforgeeks.org/perl-regex-cheat-sheet/

It's not quite as nice-looking as POSIX ERE for the reasons given further up in the thread, but I agree that "standard across all languages" is a bigger plus than specific syntax IMO (unless we want to get into the business or implementing our own regex support, which I think we really probably don't...)

@rhpvorderman
Copy link
Contributor Author

rhpvorderman commented Jun 4, 2020

@cjllanwarne @mlin @patmagee
I propose we use the re2 regex engine. It was explicitly designed with speed in mind and is now maintained by google: https://github.com/google/re2.

The speed is not necessary for WDL necessarily given the amount of regexes in the average .wdl file, but re2's speed advantage has led to bindings for this regex engine in a lot of languages:

Golang's regexp library implements re2's syntax: https://golang.org/pkg/regexp/.

It seems like re2 has got all our bases covered as the Java library can be easily implemented in the current Scala implementations and there is a Python library available that provides python bindings for re2, which works for the current python implementations. Furthermore this will create no barriers for people wanting to implement a new engine or wdl-syntax tool in C, C++, Python, Java, Scala, Golang, R, Ruby, Node.js or Perl. I think that is a wide enough language base.

EDIT: The syntax for re2 can be found here: https://github.com/google/re2/wiki/Syntax
EDIT2: The syntax document is really good. It documents all the rules + rules that were added by popular engines such as PCRE. For these extra rules it explicitly states whether these are supported or not. I feel confident that we can point WDL users to this.
EDIT3: It will mean we have to revote...

@mlin
Copy link
Member

mlin commented Jun 4, 2020

Thank you @rhpvorderman for doing the research. This sounds promising indeed. I will try to do a little due diligence on the re2 Python bindings as there seem to be numerous versions of it.

Even with appropriate language bindings I presume all users would need to install the re2 OS package (apt/yum/brew/etc.), which might present a small roadbump compared to the ideal of standard syntax implemented natively in each language. Of course that ideal might be unattainable here, making the comparison quite moot 😅

@rhpvorderman
Copy link
Contributor Author

@mlin. Damn, you are correct. Unfortunately the same holds true for PCRE bindings. But re2 is much more supported.
I guess it is possible to come a long with conda packaging. Cromwell and miniwdl are both packaged in conda and the re2 library could be installed via that channel.

@jdidion
Copy link
Collaborator

jdidion commented Nov 3, 2020

This is implemented in wdlTools

@jdidion
Copy link
Collaborator

jdidion commented Feb 1, 2021

This is in v1.1 and will be propagated to development.

@jdidion jdidion closed this Feb 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.