partition ASCII and non-ASCII byte classes when a Unicode word boundary is used #652

BurntSushi · 2020-03-10T23:58:41Z

In BurntSushi/ripgrep#1513, it was discovered that the DFA quits when the regex contains a Unicode word boundary, even when the input is purely ASCII. It turns out that characters like |, { and } get lumped into the same byte equivalence class as non-ASCII bytes, which trips the DFA's non-ASCII circuit and causes the DFA to quit.

This should be easyish to fix. If the regex has a Unicode word boundary, then ensure that ASCII bytes are never lumped into the same equivalence class as non-ASCII bytes.
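To make the idea concrete, here is a minimal sketch of how byte equivalence classes can be derived from range boundaries, and how one extra boundary at 0x7F keeps ASCII and non-ASCII bytes apart. It is not the regex crate's actual code: the `ByteClassSet` type, its methods, the `classes_for_word_regex` helper and the ASCII ranges it registers are all assumptions made for illustration.

```rust
/// Hypothetical helper: records where byte equivalence classes must be split.
/// `boundaries[b] == true` means byte `b` is the last byte of its class.
struct ByteClassSet {
    boundaries: [bool; 256],
}

impl ByteClassSet {
    fn new() -> Self {
        ByteClassSet { boundaries: [false; 256] }
    }

    /// Record that the regex distinguishes the inclusive range `start..=end`
    /// from the bytes on either side of it.
    fn set_range(&mut self, start: u8, end: u8) {
        debug_assert!(start <= end);
        if start > 0 {
            self.boundaries[start as usize - 1] = true;
        }
        self.boundaries[end as usize] = true;
    }

    /// Map every byte to a class id. Bytes with no boundary between them
    /// share an id, so the DFA treats them identically.
    fn byte_classes(&self) -> [u8; 256] {
        let mut classes = [0u8; 256];
        let mut class = 0u8;
        for b in 0..256 {
            classes[b] = class;
            // Start a new class after each boundary (nothing follows 0xFF).
            if self.boundaries[b] && b < 255 {
                class += 1;
            }
        }
        classes
    }
}

/// Hypothetical example: the ASCII ranges a pattern built around word
/// characters might distinguish. With `unicode_word_boundary` set, one more
/// boundary is added so no class spans both ASCII and non-ASCII bytes.
fn classes_for_word_regex(unicode_word_boundary: bool) -> [u8; 256] {
    let mut set = ByteClassSet::new();
    set.set_range(b'0', b'9');
    set.set_range(b'A', b'Z');
    set.set_range(b'_', b'_');
    set.set_range(b'a', b'z');
    if unicode_word_boundary {
        // The proposed fix: split the byte space at the ASCII limit.
        set.set_range(0x00, 0x7F);
    }
    set.byte_classes()
}
```

Without the extra boundary, every byte from { (0x7B) through 0xFF lands in one class in this sketch, so a DFA that quits on the non-ASCII class would also quit on |, { and }. With the boundary at 0x7F, those ASCII bytes get a class of their own.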

BurntSushi · 2022-04-08T18:32:41Z

This appears to have been fixed in #768. (It is also fixed in regex-automata.)

BurntSushi added the bug label Mar 10, 2020

BurntSushi mentioned this issue Mar 11, 2020

Poor performance with Searcher::search_slice BurntSushi/ripgrep#1513

Closed

BurntSushi closed this as completed Apr 8, 2022
