-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
on short strings that request captures, don't run the DFA #348
Comments
If a regex is "anchored" (starting with a |
@arielb1 That's a good observation! We should add that to the heuristic as well. One possible downside I just thought of: the DFA is still useful in the non-match scenario, since it can detect a match failure much more quickly than the NFA engines can. |
For anyone interested in working on this, the relevant code is in the core implementation of the Line 513 in d813518
In particular, right now, it will always run the DFA engine if the DFA engine was selected by the compiler. The optimization in question here would be to add a conditional in the DFA case that checks if a string is below a certain length, and if it is, just skip directly to running the NFA. To pick the right length, I'd suggest writing a micro benchmark that:
You may find it useful to construct the regex and the search text such that the regex matches the entire text. In particular, there are cases where running the DFA and the NFA is actually faster because the DFA can determine "not a match" more quickly than the NFA can. An alternative/additional implementation path is to take @arielb1's advice and check if the regex is completely anchored. You can do that using |
The DFA can't produce captures, but is still faster than the Pike VM NFA, so the normal approach to finding capture groups is to look for the entire match with the DFA and then run the NFA on the substring of the input that matched. In cases where the regex in anchored, the match always starts at the beginning of the input, so there is never any point to trying the DFA first. The DFA can still be useful for rejecting inputs which are not in the language of the regular expression, but anchored regex with capture groups are most commonly used in a parsing context, so it seems like a fair trade-off. For a more in depth discussion see github issue rust-lang#348.
The DFA can't produce captures, but is still faster than the Pike VM NFA, so the normal approach to finding capture groups is to look for the entire match with the DFA and then run the NFA on the substring of the input that matched. In cases where the regex in anchored, the match always starts at the beginning of the input, so there is never any point to trying the DFA first. The DFA can still be useful for rejecting inputs which are not in the language of the regular expression, but anchored regex with capture groups are most commonly used in a parsing context, so it seems like a fair trade-off. For a more in depth discussion see github issue rust-lang#348.
The DFA can't produce captures, but is still faster than the Pike VM NFA, so the normal approach to finding capture groups is to look for the entire match with the DFA and then run the NFA on the substring of the input that matched. In cases where the regex in anchored, the match always starts at the beginning of the input, so there is never any point to trying the DFA first. The DFA can still be useful for rejecting inputs which are not in the language of the regular expression, but anchored regex with capture groups are most commonly used in a parsing context, so it seems like a fair trade-off. For a more in depth discussion see github issue rust-lang#348.
The DFA can't produce captures, but is still faster than the Pike VM NFA, so the normal approach to finding capture groups is to look for the entire match with the DFA and then run the NFA on the substring of the input that matched. In cases where the regex in anchored, the match always starts at the beginning of the input, so there is never any point to trying the DFA first. The DFA can still be useful for rejecting inputs which are not in the language of the regular expression, but anchored regex with capture groups are most commonly used in a parsing context, so it seems like a fair trade-off. Fixes #348
The DFA can't produce captures, but is still faster than the Pike VM NFA, so the normal approach to finding capture groups is to look for the entire match with the DFA and then run the NFA on the substring of the input that matched. In cases where the regex in anchored, the match always starts at the beginning of the input, so there is never any point to trying the DFA first. The DFA can still be useful for rejecting inputs which are not in the language of the regular expression, but anchored regex with capture groups are most commonly used in a parsing context, so it seems like a fair trade-off. Fixes #348
Currently, if a caller requests captures, then we always run the DFA to determine the extent of the match and then we run either the Pike VM or the bounded backtracker on only the extent of the match to fill in the capture locations. This works well when searching long strings because the DFA can save the NFA engines from doing a lot of work. But on short strings, the DFA probably doesn't pay for itself, so we should just run one of the NFA engines if the string is short enough. (More precisely, in cases where the match is roughly the same length as the entire string, then the DFA isn't helping us at all since the NFA engine will still run the length of the match. But, obviously, this case isn't possible to know up front, so we use "short strings" as a likely predictor of that case.)
There should be some experimentation to determine where this boundary lies, so that we can invent a heuristic for when the string is "short enough."
The text was updated successfully, but these errors were encountered: