Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POSIX submatch semantics is broken in two (rare) cases. #12

Closed
skvadrik opened this issue Aug 12, 2017 · 3 comments
Closed

POSIX submatch semantics is broken in two (rare) cases. #12

skvadrik opened this issue Aug 12, 2017 · 3 comments
Labels

Comments

@skvadrik
Copy link

As I've been working on the implementation of POSIX-compliant capturing groups in RE2C (http://re2c.org), I discovered a couple of bugs in Regex-TDFA. I found them when fuzz-testing my implementation against Regex-TDFA-1.2.2. The algorithm used in RE2C is described in detail in the following paper:

http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf

It is a slightly modified version of Laurikari algorithm. POSIX submatch semantics is due to Kuklewicz: https://wiki.haskell.org/index.php?title=Regular_expressions/Bounded_space_proposal&oldid=11475 ,
but I made an attempt on formalizing Kuklewicz algorithm (also described in the paper). The reported bugs are rare (fuzzer found them approximately once in 50000 runs), so they are probably caused by some mis-optimization.

The first bug can be triggered by regular expression (((a*)|b)|b)+ and input string ab: Regex-TDFA returns incorrect submatch result for second capturing group ((a*)|b) (no match instead of b at offset 1). Some alternative regular expressions that cause the same error: (((a*)|b)|b){1,2}, ((b|(a*))|b)+.

$ ghci
GHCi, version 8.0.2: http://www.haskell.org/ghc/  :? for help
Prelude> import Text.Regex.TDFA as T

The error:

Prelude T> "ab"  T.=~ "^(((a*)|b)|b)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

But not with * (the example below works correctly!):

Prelude T> "ab"  T.=~ "^(((a*)|b)|b)*" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(-1,0))]]

The same error:

Prelude T> "ab"  T.=~ "^((b|(a*))|b)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

Again, the same error:

Prelude T> "ab"  T.=~ "^(((a*)|b)|b){1,2}" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

The second bug can be triggered by regular expression ((a?)(())*|a)+ and input string aa. Incorrect result is for second group (a?) (no match instead of a at offset 1), third group (()) and fourth group () (no match instead of empty match at offset 2). Alternative variant that also fails: ((a?()?)|a)+.

The error:

Prelude T> "aa"  T.=~ "^((a?)(())*|a)+" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0)),(4,(-1,0))]]

But not with * (the example below works correctly!):

Prelude T> "aa"  T.=~ "^((a?)(())*|a)*" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(2,0)),(4,(2,0))]]

The same error:

Prelude T> "aa"  T.=~ "^((a?)(())*|a){1,2}" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(2,0)),(4,(2,0))]]

The same error:

Prelude T> "aa"  T.=~ "^((a?()?)|a)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]
@neongreen
Copy link
Collaborator

(NB. I'm not the author of the algorithm and unfortunately I don't have the time to delve into this. PRs are welcome.)

@ChrisKuklewicz
Copy link
Owner

ChrisKuklewicz commented Mar 11, 2018 via email

@neongreen
Copy link
Collaborator

Moving over to haskell-hvr/regex-tdfa#2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants