Regular expression evaluation should be fixed. #243

rhpvorderman · 2018-08-10T07:17:50Z

See broadinstitute/cromwell#3990 .
Leaving regex evaluation unspecified leads to portability problems even within the same execution engine, let alone between execution engines.

This should be fixed to a certain standard. ~~I opted for PERL-style here because PERL is supposed to be the king of regex. But it does not really matter which standard is chosen, as long as there is a standard.~~ Given the discussion below, POSIX ERE is now set as the standard.

EvanTheB · 2018-08-10T08:49:59Z

👍 While the built-in Java and Python regex engines both differ from Perl in minor ways, both could use libpcre. The alternative would be POSIX regex, but I don't know whether its libraries are as well supported.

geoffjentry · 2018-08-10T14:03:25Z

I would really prefer to see this standardized on POSIX basic regexes but I think this is a great thing to do in general

cjllanwarne · 2018-08-10T17:26:22Z

Note that examples should take heed of the universal string escaping rules in https://github.com/openwdl/wdl/blob/master/versions/development/SPEC.md#whitespace-strings-identifiers-constants, which should be applied before we begin the regex parsing itself.

EDIT:

this appears to be true in the "examples" section immediately below your change.
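
To illustrate the ordering, here is a minimal Python sketch. Python is only an analogy here (an assumption; WDL's own escaping rules are in the linked SPEC section): the string literal is unescaped first, and only the resulting characters reach the regex engine.

```python
import re

# Step 1: string-literal unescaping. The source text "\\." becomes the
# two-character string: backslash, dot.
pattern = "\\."
pattern_length = len(pattern)

# Step 2: the regex engine receives those two characters and reads "\."
# as a literal dot rather than "any character".
result = re.sub(pattern, "_", "a.b.c")

print(pattern_length, result)
```

Without the first layer of escaping, the engine would see a bare `.` and replace every character.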

cjllanwarne · 2018-08-10T17:49:21Z

Not a particularly strong opinion but I prefer perl over posix because it uses "escape to suppress metacharacters" rather than "escape to enable metacharacters" which is (a) more natural to my way of thinking and (b) closer to the current Cromwell behavior.

I realize that "what we happen to do" is a pretty weak argument, but there it is... 😄

geoffjentry · 2018-08-10T18:16:43Z

The reason I prefer posix basic is that it is more standard and more common, i.e. less surprising to people.

And you're right that "what we happen to do [in one implementation]" is a pretty weak argument :)

cjllanwarne · 2018-08-10T19:51:47Z

After a little looking into this, I'd be:

👍 on PCRE
👍 on POSIX ERE
Not so much 👎 on POSIX BRE as nervous that people would hate us for choosing it... so not 👍 either

I'd push back a little on "less surprising" since I personally find it much more surprising to have to use \( \) as group delimiters and ( ) as the paren literals, for example - and if a search pattern is non-trivial then egrep is significantly more intuitive for me to use than grep:

BRE (ie grep):
- (123)* matches (123))))
- \(123\)* matches 123123123
ERE (ie egrep):
- (123)* matches 123123123
- \(123\)* matches (123))))
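
The ERE half of that comparison can be sanity-checked with a short Python sketch. Python's built-in `re` module is not a POSIX ERE engine, but for these two patterns it treats grouping parentheses and escaped parentheses the same way; BRE behavior is not shown because `re` has no BRE mode.

```python
import re

# Unescaped parentheses form a group, so the whole group can repeat.
group_repeats = re.fullmatch(r"(123)*", "123123123") is not None

# Escaped parentheses are literals; the * applies only to the final "\)".
literal_parens = re.fullmatch(r"\(123\)*", "(123))))") is not None

# The grouping pattern does not accept the string with literal parentheses.
group_on_literals = re.fullmatch(r"(123)*", "(123))))") is not None

print(group_repeats, literal_parens, group_on_literals)
```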

FWIW in case anyone else is trying to find one, while looking around I found a nice page on the POSIX ERE vs BRE differences: https://www.regular-expressions.info/posix.html

geoffjentry · 2018-08-10T19:52:49Z

Heads up that I'm going to vote against anything other than basic POSIX, but I'm just one vote.

rhpvorderman · 2018-08-13T06:15:48Z

I agree with @cjllanwarne that POSIX ERE would be the way to go. The syntax is most similar to the way regex works in languages I know (Python, Scala) and considering the users come from a scientific background they probably have some experience with python regexes. Just my 2 cents.

patmagee · 2018-08-17T15:55:07Z

I also agree that POSIX ERE is the best option. Having to escape things like group delimiters seems counter-intuitive to me (although people used to other regex engines find it fine). Expanding on what @rhpvorderman mentioned, other commonly used tools, like egrep, are quite similar in syntax to ERE, which I think makes it the best choice moving forward.

rhpvorderman · 2018-08-20T06:49:17Z

Since we seem to be gravitating towards POSIX ERE (or BRE) I have changed the PR to reflect this.

geoffjentry · 2018-08-21T18:35:59Z

Changing my tune re my hard stance on BRE, I'm cool with ERE

rhpvorderman · 2018-08-24T06:37:55Z

Wow, it seems everyone is in favor of POSIX ERE now. So are there any changes that need to be made to the contents of the pull request before we can start voting?

geoffjentry · 2018-08-28T21:02:49Z

versions/development/SPEC.md

@@ -2998,7 +2998,8 @@ Varieties of the `size` function also exist for the following compound types. Th
 ## String sub(String, String, String)

 Given 3 String parameters `input`, `pattern`, `replace`, this function will replace any occurrence matching `pattern` in `input` by `replace`.
-`pattern` is expected to be a [regular expression](https://en.wikipedia.org/wiki/Regular_expression). Details of regex evaluation will depend on the execution engine running the WDL.
+`pattern` is expected to be a [regular expression](https://en.wikipedia.org/wiki/Regular_expression). 
+The regular expression will be evaluated as a [POSIX  Extended Regular Expression (ERE)](https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended).

 Example 1:


If it's not already the case, can you make sure the examples are valid ERE? And if nothing exercises the particulars of ERE, perhaps throw in an example of that?

rhpvorderman · 2018-09-13

I checked the existing examples and threw in some new ones. I hope they are up to standard.

geoffjentry · 2018-09-18T15:57:17Z

Cool, since I was the only real dissenter anyways and that's cleared up let's open up for voting

EvanTheB · 2018-09-19T00:24:43Z

👍 can anyone comment on the current cromwell compliance?

DavyCats · 2018-09-19T11:57:14Z

👍

LeeTL1220 · 2018-09-19T15:40:31Z

ERE
👍

cjllanwarne · 2018-09-19T19:48:57Z

👍

patmagee · 2018-09-20T01:27:18Z

👍

dheiman · 2018-09-28T14:17:47Z

👍

geoffjentry · 2018-10-02T07:45:44Z

Passes unanimously and is waiting for implementation

DavyCats · 2020-01-07T09:52:17Z

versions/development/SPEC.md

+  String chocoearlylate = sub(chocolike, "[^ ]late", "early") # I like chocoearly when it's late
+  String choco4 = sub(chocolike, " [:alpha:]{4} ", " 4444 ") # I 4444 chocolate 4444 it's late


I realize this has already passed voting, but I briefly looked into how this might be implemented in miniwdl and noticed that there are some typos in these examples. They should look like the following:

String chocoearlylate = sub(chocolike, "[^ ]late", "early") # I like chocearly when it's late

The second o will also be replaced, due to the [^ ].

String choco4 = sub(chocolike, " [[:alpha:]]{4} ", " 4444 ") # I 4444 chocolate 4444 it's late

Character classes require an additional set of [].
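
Both corrections can be checked with a small Python sketch. Two assumptions are made here: the value of `chocolike` is inferred from the expected-output comments above, and `[A-Za-z]` stands in for `[[:alpha:]]`, because Python's built-in `re` module does not support POSIX character classes.

```python
import re

# Assumed input, inferred from the expected outputs in the examples.
chocolike = "I like chocolate when it's late"

# "[^ ]late" requires one non-space character before "late". That character
# is part of the match, so the second "o" of "chocolate" is replaced too.
# The final word "late" is preceded by a space and is left alone.
chocoearlylate = re.sub(r"[^ ]late", "early", chocolike)

# [A-Za-z] is a stand-in for the POSIX class [[:alpha:]].
choco4 = re.sub(r" [A-Za-z]{4} ", " 4444 ", chocolike)

print(chocoearlylate)
print(choco4)
```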

DavyCats · 2020-01-31T13:18:58Z

This is now implemented on miniwdl's master branch (through chanzuckerberg/miniwdl#321).

mlin · 2020-05-28T07:36:59Z

To pick low-hanging fruit, I think this PR can be merged along with an appropriate entry in CHANGELOG. Any objections?

illusional · 2020-05-28T08:00:13Z

I have no objections because it meets the criteria for merge (one implemented engine). What is the status of this in other engines like Cromwell or DnaNexus?

rhpvorderman · 2020-05-28T08:18:46Z

Regex evaluation has been stable for quite some time in Cromwell. Whether it is correct according to the spec, I don't know, but @cjllanwarne or @aednichols probably know more.

mlin · 2020-06-02T20:05:26Z

I've heard (out-of-band) a concern about the lack of a trustworthy POSIX ERE implementation for JVM stacks. I'm not too familiar with the ecosystem myself; does anyone have a suggestion?

rhpvorderman · 2020-06-03T05:39:47Z

If that is the case we can still change the spec 😇 ... The point of this change is to have all execution engines use the same regex evaluation. Which regex evaluation is of much lesser importance, as long as it is something easy to get into. If POSIX ERE is hard to implement in Scala/Java code we can look for an alternative that is easy in Python, Java and Scala. (I do not know of execution engines in other languages as of yet.)

cjllanwarne · 2020-06-03T14:30:33Z

Thanks @mlin and @patmagee. Looking back at the history of this thread, I think perl was the other main contender (and indeed, a library was mooted: libpcre). Cheat sheet: https://www.geeksforgeeks.org/perl-regex-cheat-sheet/

It's not quite as nice-looking as POSIX ERE for the reasons given further up in the thread, but I agree that "standard across all languages" is a bigger plus than specific syntax IMO (unless we want to get into the business of implementing our own regex support, which I think we really probably don't...)

rhpvorderman · 2020-06-04T05:48:11Z

@cjllanwarne @mlin @patmagee
I propose we use the re2 regex engine. It was explicitly designed with speed in mind and is now maintained by google: https://github.com/google/re2.

The speed is not strictly necessary for WDL given the number of regexes in the average .wdl file, but re2's speed advantage has led to bindings for this regex engine in a lot of languages:

- C++ (native)
- Python (https://pypi.org/project/re2/, https://github.com/axiak/pyre2/)
- Java (https://github.com/google/re2j, also maintained by Google!, no dependency on the native library)
- The home page also mentions wrappers for C, Erlang, Node.js, Inferno, OCaml, R and Ruby

Golang's regexp library implements re2's syntax: https://golang.org/pkg/regexp/.

It seems like re2 has got all our bases covered as the Java library can be easily implemented in the current Scala implementations and there is a Python library available that provides python bindings for re2, which works for the current python implementations. Furthermore this will create no barriers for people wanting to implement a new engine or wdl-syntax tool in C, C++, Python, Java, Scala, Golang, R, Ruby, Node.js or Perl. I think that is a wide enough language base.

EDIT: The syntax for re2 can be found here: https://github.com/google/re2/wiki/Syntax
EDIT2: The syntax document is really good. It documents all the rules + rules that were added by popular engines such as PCRE. For these extra rules it explicitly states whether these are supported or not. I feel confident that we can point WDL users to this.
EDIT3: It will mean we have to revote...

mlin · 2020-06-04T07:53:30Z

Thank you @rhpvorderman for doing the research. This sounds promising indeed. I will try to do a little due diligence on the re2 Python bindings as there seem to be numerous versions of it.

Even with appropriate language bindings I presume all users would need to install the re2 OS package (apt/yum/brew/etc.), which might present a small roadbump compared to the ideal of standard syntax implemented natively in each language. Of course that ideal might be unattainable here, making the comparison quite moot 😅

rhpvorderman · 2020-06-04T08:58:18Z

@mlin. Damn, you are correct. Unfortunately the same holds true for PCRE bindings. But re2 is much more widely supported.
I guess it is possible to come a long way with conda packaging. Cromwell and miniwdl are both packaged in conda and the re2 library could be installed via that channel.

jdidion · 2020-11-03T13:48:59Z

This is implemented in wdlTools

jdidion · 2021-02-01T18:41:03Z

This is in v1.1 and will be propagated to development.

Regular expression evaluation should be fixed.

b526c86

change perl to posix

a7ba3e8

rhpvorderman mentioned this pull request Aug 24, 2018

Regression: Regex replacement in sub command broken in WDL 1.0 broadinstitute/cromwell#3990

Closed

geoffjentry reviewed Aug 28, 2018

rhpvorderman added 2 commits September 13, 2018 17:22

remove redundant escape character

1413daa

add examples

8e91963

geoffjentry added the Voting Active label Sep 18, 2018

geoffjentry added Waiting for implementation and removed Voting Active labels Oct 2, 2018

patmagee mentioned this pull request Nov 20, 2019

Grammar Remake #342

Merged

patmagee added the Spec Change label Nov 20, 2019

patmagee added this to the v2.0 milestone Nov 20, 2019

DavyCats reviewed Jan 7, 2020

DavyCats added a commit to DavyCats/miniwdl that referenced this pull request Jan 7, 2020

Use POSIX regex for sub instead of python regex (implements openwdl/w…

e8087f1

…dl#243)

DavyCats mentioned this pull request Jan 7, 2020

Use POSIX ERE for sub function chanzuckerberg/miniwdl#321

Merged

5 tasks

mlin pushed a commit to chanzuckerberg/miniwdl that referenced this pull request Jan 31, 2020

Use POSIX ERE for sub function (#321)

1e59170

implements openwdl/wdl#243

jdidion changed the base branch from master to main June 29, 2020 14:06

jdidion mentioned this pull request Nov 3, 2020

Switch to RE2 for regular expression substitution in the sub() method dnanexus/wdlTools#115

Closed

patmagee added Ready to merge and removed Waiting for implementation labels Nov 3, 2020

rhpvorderman mentioned this pull request Jan 12, 2021

escape sequences in regex biowdl/tasks#270

Open

jdidion closed this Feb 1, 2021
