POSIX submatch semantics is broken in two (rare) cases. #12

skvadrik · 2017-08-12T17:46:23Z

While working on the implementation of POSIX-compliant capturing groups in RE2C (http://re2c.org), I discovered a couple of bugs in Regex-TDFA. I found them by fuzz-testing my implementation against Regex-TDFA-1.2.2. The algorithm used in RE2C is described in detail in the following paper:

http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf

It is a slightly modified version of Laurikari's algorithm. The POSIX submatch semantics is due to Kuklewicz: https://wiki.haskell.org/index.php?title=Regular_expressions/Bounded_space_proposal&oldid=11475, but I attempted to formalize Kuklewicz's algorithm (also described in the paper). The reported bugs are rare (the fuzzer found them approximately once in 50000 runs), so they are probably caused by some mis-optimization.

The first bug can be triggered by the regular expression (((a*)|b)|b)+ and the input string ab: Regex-TDFA returns an incorrect submatch result for the second capturing group ((a*)|b) (no match instead of b at offset 1). Some alternative regular expressions that cause the same error: (((a*)|b)|b){1,2}, ((b|(a*))|b)+.

$ ghci
GHCi, version 8.0.2: http://www.haskell.org/ghc/  :? for help
Prelude> import Text.Regex.TDFA as T

The error:

Prelude T> "ab"  T.=~ "^(((a*)|b)|b)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

But not with * (the example below works correctly!):

Prelude T> "ab"  T.=~ "^(((a*)|b)|b)*" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(-1,0))]]

The same error:

Prelude T> "ab"  T.=~ "^((b|(a*))|b)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

Again, the same error:

Prelude T> "ab"  T.=~ "^(((a*)|b)|b){1,2}" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]
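
To read the arrays above: each entry maps a group index to an (offset, length) pair, and an offset of -1 means the group did not match. The following is a minimal sketch of decoding them into substrings. It assumes only the array package; the `Groups` alias and the `decode` helper are illustrative names with the same shape as `MatchArray`, not part of Regex-TDFA. The two arrays are copied from the `+` and `*` results for input ab.

```haskell
import Data.Array (Array, array, elems)

-- Hypothetical stand-in with the same shape as MatchArray:
-- group index -> (offset, length).
type Groups = Array Int (Int, Int)

-- Turn each group into Nothing (offset -1, group did not match)
-- or Just the matched substring of the input.
decode :: String -> Groups -> [Maybe String]
decode input groups = map slice (elems groups)
  where
    slice (off, len)
      | off < 0   = Nothing
      | otherwise = Just (take len (drop off input))

-- Results copied from the report for input "ab".
plusResult, starResult :: Groups
plusResult = array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]
starResult = array (0,3) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(-1,0))]

main :: IO ()
main = do
  print (decode "ab" plusResult)  -- group 2 is Nothing (the bug)
  print (decode "ab" starResult)  -- group 2 is Just "b" (expected)
```

Decoded this way, the only difference between the two results is group 2: unmatched with `+`, and b at offset 1 with `*`.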

The second bug can be triggered by the regular expression ((a?)(())*|a)+ and the input string aa. The result is incorrect for the second group (a?) (no match instead of a at offset 1), and for the third group (()) and the fourth group () (no match instead of an empty match at offset 2). An alternative variant that also fails: ((a?()?)|a)+.

The error:

Prelude T> "aa"  T.=~ "^((a?)(())*|a)+" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0)),(4,(-1,0))]]

But not with * (the example below works correctly!):

Prelude T> "aa"  T.=~ "^((a?)(())*|a)*" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(2,0)),(4,(2,0))]]

The same error:

Prelude T> "aa"  T.=~ "^((a?)(())*|a){1,2}" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(2,0)),(4,(2,0))]]

The same error:

Prelude T> "aa"  T.=~ "^((a?()?)|a)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]


neongreen · 2018-03-10T14:11:00Z

(NB. I'm not the author of the algorithm and unfortunately I don't have the time to delve into this. PRs are welcome.)

ChrisKuklewicz · 2018-03-11T19:44:07Z

Original author here. That is a very competent fuzzer! I may have time next weekend to reproduce this, and to check against my old, almost finished OCaml variant. "+" is equivalent to "{1,}", and a non-zero lower bound is significantly different from a zero lower bound (as "*" is equivalent to "{0,}"), especially around accepting zero characters. The patterns all look like (X|Y) where X is able to accept zero characters. Hmmm... too many years ago, so I have no clear guess yet.


neongreen · 2019-10-19T08:56:26Z

Moving over to haskell-hvr/regex-tdfa#2.

neongreen added the bug label Oct 1, 2019

neongreen mentioned this issue Oct 19, 2019

POSIX submatch semantics is broken in two (rare) cases haskell-hvr/regex-tdfa#2

Open

neongreen closed this as completed Oct 19, 2019
