While working on the implementation of POSIX-compliant capturing groups in RE2C (http://re2c.org), I discovered a couple of bugs in Regex-TDFA. I found them by fuzz-testing my implementation against Regex-TDFA-1.2.2. The algorithm used in RE2C is described in detail in the following paper:
http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf
It is a slightly modified version of Laurikari's algorithm. The POSIX submatch semantics are due to Kuklewicz (https://wiki.haskell.org/index.php?title=Regular_expressions/Bounded_space_proposal&oldid=11475), but I have attempted to formalize Kuklewicz's algorithm (this is also described in the paper). The reported bugs are rare (the fuzzer found them approximately once in 50000 runs), so they are probably caused by some mis-optimization.
The first bug can be triggered by the regular expression (((a*)|b)|b)+ and the input string ab: Regex-TDFA returns an incorrect submatch result for the second capturing group ((a*)|b) (no match instead of b at offset 1). Some alternative regular expressions that cause the same error: (((a*)|b)|b){1,2} and ((b|(a*))|b)+. The error does not occur with *.
$ ghci
GHCi, version 8.0.2: http://www.haskell.org/ghc/ :? for help
Prelude> import Text.Regex.TDFA as T
The second bug can be triggered by the regular expression ((a?)(())*|a)+ and the input string aa. The results are incorrect for the second group (a?) (no match instead of a at offset 1) and for the third group (()) and the fourth group () (no match instead of an empty match at offset 2). An alternative variant that also fails: ((a?()?)|a)+. Again, the error does not occur with *.
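Both reports can be checked with a small program along the following lines. This is only a sketch: it assumes the regex-tdfa package is installed, the names cases and submatches are made up for illustration, and the printed offsets may differ between Regex-TDFA versions, so no output is shown here. According to the report above, POSIX semantics require group 2 of the first three patterns to match b at offset 1, whereas Regex-TDFA-1.2.2 reports no match for it.

```haskell
-- Sketch only: assumes the regex-tdfa package is available.
import Data.Array (assocs)
import Text.Regex.TDFA ((=~), MatchArray)

-- (pattern, input) pairs taken from the report; the list name is illustrative.
cases :: [(String, String)]
cases =
  [ ("(((a*)|b)|b)+",     "ab")
  , ("(((a*)|b)|b){1,2}", "ab")
  , ("((b|(a*))|b)+",     "ab")
  , ("((a?)(())*|a)+",    "aa")
  , ("((a?()?)|a)+",      "aa")
  ]

-- Submatch table for one case (hypothetical helper): index i holds the
-- (offset, length) pair of group i, and index 0 is the whole match.
-- An unmatched group is reported as (-1, 0).
submatches :: String -> String -> MatchArray
submatches re input = input =~ re

main :: IO ()
main = mapM_ report cases
  where
    report (re, input) = do
      putStrLn (re ++ " on " ++ show input)
      mapM_ print (assocs (submatches re input))
```

Comparing the printed (offset, length) pairs with the expected values quoted in the report shows whether a given Regex-TDFA version is affected.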
Original author here. That is a very competent fuzzer!
I may have time next weekend to reproduce this, and to check against my old
almost finished OCaml variant.
"+" is equivalent to "{1,}" and a non-zero lower bound is significantly
different from a zero lower bound (as "*" is equivalent to "{0,}"),
especially around accepting zero characters.
The patterns all look like (X|Y) where X is able to accept zero
characters. Hmmm... too many years ago so I have no clear guess yet.
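The remark about lower bounds and zero-width alternatives can be made concrete with a toy matcher. The sketch below uses Brzozowski derivatives and does recognition only, with no submatch tracking; it is not the algorithm used by Regex-TDFA or RE2C, and every name in it (Re, nullable, deriv, atLeast, body) is invented for illustration. It shows "+" and "*" as the {1,} and {0,} cases of one construction, and that the body (((a*)|b)|b) of the first failing pattern can itself accept zero characters.

```haskell
-- Toy regular expressions, recognition only (no capturing groups).
data Re
  = Empty        -- matches nothing
  | Eps          -- matches the empty string
  | Chr Char     -- matches one given character
  | Alt Re Re    -- alternation  r|s
  | Cat Re Re    -- concatenation rs
  | Star Re      -- repetition   r*
  deriving (Eq, Show)

-- Can the expression accept zero characters?
nullable :: Re -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Chr _)   = False
nullable (Alt r s) = nullable r || nullable s
nullable (Cat r s) = nullable r && nullable s
nullable (Star _)  = True

-- Brzozowski derivative: what remains to be matched after reading c.
deriv :: Char -> Re -> Re
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv c (Chr d)   = if c == d then Eps else Empty
deriv c (Alt r s) = Alt (deriv c r) (deriv c s)
deriv c (Cat r s)
  | nullable r    = Alt (Cat (deriv c r) s) (deriv c s)
  | otherwise     = Cat (deriv c r) s
deriv c (Star r)  = Cat (deriv c r) (Star r)

-- Whole-string match: take derivatives left to right, then test for emptiness.
matches :: Re -> String -> Bool
matches r = nullable . foldl (flip deriv) r

-- r{n,}: n mandatory copies of r followed by r*.
atLeast :: Int -> Re -> Re
atLeast n r = foldr Cat (Star r) (replicate n r)

plus, star :: Re -> Re
plus = atLeast 1   -- r+ is r{1,}
star = atLeast 0   -- r* is r{0,}

-- (((a*)|b)|b) without capture information.
body :: Re
body = Alt (Alt (Star (Chr 'a')) (Chr 'b')) (Chr 'b')

main :: IO ()
main = do
  print (nullable body)                                         -- True
  print (matches (plus body) "ab", matches (star body) "ab")    -- (True,True)
  print (nullable (plus (Chr 'b')), nullable (star (Chr 'b')))  -- (False,True)
```

Because body is nullable, every iteration of (X|Y)+ may consume nothing, which is the situation the patterns in the report have in common.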