Improve TOKENIZER by 23% #668

Merged (1 commit) on Jun 21, 2023
Conversation

kbrock (Contributor) commented Jun 13, 2023

From what I can see, this is done in linear time: 4*O(n)
This tokenizer change converts that to something a little quicker: 3*O(n)

It seems the big win would be not using a capture group and using something other than split; beyond that, the changes were meager.
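
A small illustration of why those two go together (the string and patterns here are just examples, not the gem's actual code): with String#split, the capture group is what keeps the matched tokens in the output, so dropping the group would also mean moving away from split.

```
str = "Hi %{name}!"

# With a capture group, split keeps the matched tokens, which is what a
# split-based tokenizer relies on:
str.split(/(%\{[^\}]+\})/)  # => ["Hi ", "%{name}", "!"]

# Without the capture group the tokens are dropped, so removing the group
# would also require replacing split (e.g. with scan):
str.split(/%\{[^\}]+\}/)    # => ["Hi ", "!"]
```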

I used https://regex101.com/ (and pcre2) to evaluate the cost of the TOKENIZER, and verified with CRuby 3.0.6 (by eyeball, nothing too extensive).

I tried a few changes to the regular expression against the example in the issue, and was able to speed things up by 23% with minimal changes to the codebase. Further savings are available, but I'd like feedback before going that route.

```
/(%%\{[^\}]+\}|%\{[^\}]+\})/ =~ ('%{{'*9999)+'}'

/(%%\{[^\}]+\}|%\{[^\}]+\})/ ==> 129,990 steps
/(%?%\{[^\}]+\})/            ==> 129,990 steps
/(%%?\{[^\}]+\})/            ==>  99,992 steps (simple savings of 25%) <===
/(%%?\{[^%}{]+\})/           ==>  89,993 steps (limiting variable contents has minimal gains)
```
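
To see roughly the same effect outside regex101, here is a quick CRuby sketch comparing the original alternation with the single-branch pattern from the table above (the constant names, sample input, and iteration count are mine; these are rough wall-clock timings, not step counts):

```
require 'benchmark'

OLD = /(%%\{[^\}]+\}|%\{[^\}]+\})/
NEW = /(%%?\{[^\}]+\})/ # single-branch variant highlighted above

input = ('%{{' * 9_999) + '}'

# Both patterns match exactly the same strings, so split-based tokenizing
# should produce identical output.
raise 'patterns disagree' unless input.split(OLD) == input.split(NEW)

Benchmark.bm(5) do |x|
  x.report('old') { 100.times { input.split(OLD) } }
  x.report('new') { 100.times { input.split(NEW) } }
end
```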

There really isn't much room for improvement overall. The null/simple cases seem to speak for themselves:

```
/x/ =~ ('%{{'*9999)+'}'

/x/                          ==>  29,998 steps
/(x)/                        ==>  59,996 steps
/%{x/                        ==>  49,998 steps
/(%%?{x)/                    ==>  89,993 steps
```

And the plain string doesn't fare much worse than the specially crafted string, which suggests that if there is a vulnerability in the regular expression, it is not exposed by this example (especially since they all appear to be linear).

```
/x/ =~ 'abb'*9999+'c'

/x/                          ==>  29,999
/(%%?{x)/                    ==>  59,998
/(%%?\{[^\}]+\})/            ==>  59,998
/(%%\{[^\}]+\}|%\{[^\}]+\})/ ==>  89,997
```
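
To sanity-check the "it all looks linear" observation in CRuby itself (again rough wall-clock timing, and the input sizes are arbitrary), something like this should show roughly proportional growth rather than a backtracking blow-up:

```
require 'benchmark'

pattern = /(%%?\{[^\}]+\})/

[10_000, 20_000, 40_000].each do |n|
  input = ('%{{' * n) + '}'
  time = Benchmark.realtime { input.scan(pattern) }
  puts format('n = %6d  %.4fs', n, time)
end
```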

per #667

radar (Collaborator) commented Jun 21, 2023

Thank you very much :)

kbrock deleted the regex branch July 13, 2023 01:30