Performance tips for grammars with case insensitive keywords? #17

jsm174 · 2019-12-10T16:47:00Z

We are using node-ebnf to transpile vbscript into javascript in an attempt to run Visual Pinball in a web browser.

Currently we go through a two pass approach. The first pass format's the original script file to remove comments, standardize keyword case, and remove extra whitespace. The second pass converts the node-ebnf ast tree into a js ast tree.

To standardize the keyword case, we end up taking a rule like:

and converting it to:

Keywords ::= ('A'|'a')('N'|'n')('D'|'d')|('B'|'b')('Y'|'y')('V'|'v')('A'|'a')('L'|'l')|('B'|'b')('Y'|'y')('R'|'r')('E'|'e')('F'|'f')|('C'|'c')('A'|'a')('S'|'s')('E'|'e')|('C'|'c')('A'|'a')('L'|'l')('L'|'l')|('C'|'c')('L'|'l')('A'|'a')('S'|'s')('S'|'s')|('C'|'c')('O'|'o')('N'|'n')('S'|'s')('T'|'t')|('D'|'d')('E'|'e')('F'|'f')('A'|'a')('U'|'u')('L'|'l')('T'|'t')|('D'|'d')('I'|'i')('M'|'m')|('D'|'d')('O'|'o')|('E'|'e')('A'|'a')('C'|'c')('H'|'h')|('E'|'e')('L'|'l')('S'|'s')('E'|'e')('I'|'i')('F'|'f')|('E'|'e')('L'|'l')('S'|'s')('E'|'e')|('E'|'e')('M'|'m')('P'|'p')('T'|'t')('Y'|'y')|('E'|'e')('N'|'n')('D'|'d')|('E'|'e')('R'|'r')('A'|'a')('S'|'s')('E'|'e')|('E'|'e')('R'|'r')('R'|'r')('O'|'o')('R'|'r')|('E'|'e')('Q'|'q')('V'|'v')|('E'|'e')('X'|'x')('I'|'i')('T'|'t')|('E'|'e')('X'|'x')('P'|'p')('L'|'l')('I'|'i')('C'|'c')('I'|'i')('T'|'t')|('F'|'f')('A'|'a')('L'|'l')('S'|'s')('E'|'e')|('F'|'f')('O'|'o')('R'|'r')|('F'|'f')('U'|'u')('N'|'n')('C'|'c')('T'|'t')('I'|'i')('O'|'o')('N'|'n')|('G'|'g')('E'|'e')('T'|'t')|('G'|'g')('O'|'o')('T'|'t')('O'|'o')|('I'|'i')('F'|'f')|('I'|'i')('N'|'n')|('I'|'i')('S'|'s')|('L'|'l')('E'|'e')('T'|'t')|('L'|'l')('O'|'o')('O'|'o')('P'|'p')|('M'|'m')('O'|'o')('D'|'d')|('N'|'n')('E'|'e')('W'|'w')|('N'|'n')('E'|'e')('X'|'x')('T'|'t')|('N'|'n')('O'|'o')('T'|'t')('H'|'h')('I'|'i')('N'|'n')('G'|'g')|('N'|'n')('O'|'o')('T'|'t')|('N'|'n')('U'|'u')('L'|'l')('L'|'l')|('O'|'o')('N'|'n')|('O'|'o')('P'|'p')('T'|'t')('I'|'i')('O'|'o')('N'|'n')|('O'|'o')('R'|'r')|('P'|'p')('R'|'r')('E'|'e')('S'|'s')('E'|'e')('R'|'r')('V'|'v')('E'|'e')|('P'|'p')('R'|'r')('I'|'i')('V'|'v')('A'|'a')('T'|'t')('E'|'e')|('P'|'p')('R'|'r')('O'|'o')('P'|'p')('E'|'e')('R'|'r')('T'|'t')('Y'|'y')|('P'|'p')('U'|'u')('B'|'b')('L'|'l')('I'|'i')('C'|'c')|('R'|'r')('E'|'e')('D'|'d')('I'|'i')('M'|'m')|('R'|'r')('E'|'e')('S'|'s')('U'|'u')('M'|'m')('E'|'e')|('S'|'s')('E'|'e')('L'|'l')('E'|'e')('C'|'c')('T'|'t')|('S'|'s')('E'|'e')('T'|'t')|('S'|'s')('T'|'t')('E'|'e')('P'|'p')|('S'|'s')('U'|'u')('B'|'b')|('T'|'t')('H'|'h')('E'|'e')('N'|'n')|('T'|'t')('O'|'o')|('T'|'t')('R'|'r')('U'|'u')('E'|'e')|('U'|'u')('N'|'n')('T'|'t')('I'|'i')('L'|'l')|('W'|'w')('H'|'h')('I'|'i')('L'|'l')('E'|'e')|('W'|'w')('E'|'e')('N'|'n')('D'|'d')|('W'|'w')('I'|'i')('T'|'t')('H'|'h')|('X'|'x')('O'|'o')('R'|'r')

If we could assume the case of the script was correct, not splitting the keywords like this gives us a 40% gain in performance.

Is there any tips or suggestions you could offer on case insensitive grammars?

The text was updated successfully, but these errors were encountered:

menduz · 2019-12-10T17:21:03Z

Hi, thank you for sharing the use case.

There are a couple of things we can do to optimize this kind of parsers:

Look ahead of the first letter to reduce the full scan

Keyword        ::= 'And'
                 | &('B') ('ByVal' | 'ByRef)
                 | &('C') ('Case' | 'Call' | 'Class' | 'Const')

https://github.com/lys-lang/lys/blob/master/src/grammar.ts#L117-L121

Use regular expressions for performance

Individual constructions are a good mechanism to build grammars, to write nominal options it may be faster to write

Keywords ::= [Aa][Nn][Dd]
          |[Bb][Yy][Vv][Aa][Ll]
          |[Bb][Yy][Rr][Ee][Ff]
          |[Cc][Aa][Ss][Ee]
          |[Cc][Aa][Ll][Ll]
          |[Cc][Ll][Aa][Ss][Ss]
          |[Cc][Oo][Nn][Ss][Tt]
          |[Dd][Ee][Ff][Aa][Uu][Ll][Tt]
          |[Dd][Ii][Mm]
          |[Dd][Oo]
          |[Ee][Aa][Cc][Hh]
...

Usually languages like Basic use a tokenizer as a first step and then the parsing is performed over tokens instead of text. This library generates PEG grammars without a tokenization phase, which in some cases can be unfortunate.

Regarding the performance of the case sensitiveness of this library, it can be added as an option of the construction like

Keywords := 'and' | 'byval' {ignoreCase=true}

There are a couple of options to base the code on those, it should be an easy thing to do since it is only a parameter i to the regular expressions. And this library relies entirely on regular expressions.

jsm174 · 2019-12-10T18:09:01Z

Thank you for responding so fast! Believe it or not, we had just tried:

Keywords ::= [Aa][Nn][Dd]
          |[Bb][Yy][Vv][Aa][Ll]
          |[Bb][Yy][Rr][Ee][Ff]
          |[Cc][Aa][Ss][Ee]

And it is extremely fast even more so than the when they keywords are explicitly defined.

Keywords ::= ('A'|'a')('N'|'n')('D'|'d')|('B'|'b')('Y'|'y')('V'|'v')('A'|'a')('L'|'l')|('B'|'b')('Y'|'y')('R'|'r')('E'|'e')('F'|'f')|('C'|'c')('A'|'a')('S'|'s')('E'|'e')|('C'|'c')('A'|'a')('L'|'l')('L'|'l')|('C'|'c')('L'|'l')('A'|'a')('S'|'s')('S'|'s')|('C'|'c')('O'|'o')('N'|'n')('S'|'s')('T'|'t')|('D'|'d')('E'|'e')('F'|'f')('A'|'a')('U'|'u')('L'|'l')('T'|'t')|('D'|'d')('I'|'i')('M'|'m')|('D'|'d')('O'|'o')|('E'|'e')('A'|'a')('C'|'c')('H'|'h')|('E'|'e')('L'|'l')('S'|'s')('E'|'e')('I'|'i')('F'|'f')|('E'|'e')('L'|'l')('S'|'s')('E'|'e')|('E'|'e')('M'|'m')('P'|'p')('T'|'t')('Y'|'y')|('E'|'e')('N'|'n')('D'|'d')|('E'|'e')('R'|'r')('A'|'a')('S'|'s')('E'|'e')|('E'|'e')('R'|'r')('R'|'r')('O'|'o')('R'|'r')|('E'|'e')('Q'|'q')('V'|'v')|('E'|'e')('X'|'x')('I'|'i')('T'|'t')|('E'|'e')('X'|'x')('P'|'p')('L'|'l')('I'|'i')('C'|'c')('I'|'i')('T'|'t')|('F'|'f')('A'|'a')('L'|'l')('S'|'s')('E'|'e')|('F'|'f')('O'|'o')('R'|'r')|('F'|'f')('U'|'u')('N'|'n')('C'|'c')('T'|'t')('I'|'i')('O'|'o')('N'|'n')|('G'|'g')('E'|'e')('T'|'t')|('G'|'g')('O'|'o')('T'|'t')('O'|'o')|('I'|'i')('F'|'f')|('I'|'i')('N'|'n')|('I'|'i')('S'|'s')|('L'|'l')('E'|'e')('T'|'t')|('L'|'l')('O'|'o')('O'|'o')('P'|'p')|('M'|'m')('O'|'o')('D'|'d')|('N'|'n')('E'|'e')('W'|'w')|('N'|'n')('E'|'e')('X'|'x')('T'|'t')|('N'|'n')('O'|'o')('T'|'t')('H'|'h')('I'|'i')('N'|'n')('G'|'g')|('N'|'n')('O'|'o')('T'|'t')|('N'|'n')('U'|'u')('L'|'l')('L'|'l')|('O'|'o')('N'|'n')|('O'|'o')('P'|'p')('T'|'t')('I'|'i')('O'|'o')('N'|'n')|('O'|'o')('R'|'r')|('P'|'p')('R'|'r')('E'|'e')('S'|'s')('E'|'e')('R'|'r')('V'|'v')('E'|'e')|('P'|'p')('R'|'r')('I'|'i')('V'|'v')('A'|'a')('T'|'t')('E'|'e')|('P'|'p')('R'|'r')('O'|'o')('P'|'p')('E'|'e')('R'|'r')('T'|'t')('Y'|'y')|('P'|'p')('U'|'u')('B'|'b')('L'|'l')('I'|'i')('C'|'c')|('R'|'r')('E'|'e')('D'|'d')('I'|'i')('M'|'m')|('R'|'r')('E'|'e')('S'|'s')('U'|'u')('M'|'m')('E'|'e')|('S'|'s')('E'|'e')('L'|'l')('E'|'e')('C'|'c')('T'|'t')|('S'|'s')('E'|'e')('T'|'t')|('S'|'s')('T'|'t')('E'|'e')('P'|'p')|('S'|'s')('U'|'u')('B'|'b')|('T'|'t')('H'|'h')('E'|'e')('N'|'n')|('T'|'t')('O'|'o')|('T'|'t')('R'|'r')('U'|'u')('E'|'e')|('U'|'u')('N'|'n')('T'|'t')('I'|'i')('L'|'l')|('W'|'w')('H'|'h')('I'|'i')('L'|'l')('E'|'e')|('W'|'w')('E'|'e')('N'|'n')('D'|'d')|('W'|'w')('I'|'i')('T'|'t')('H'|'h')|('X'|'x')('O'|'o')('R'|'r') {fragment=true}
Keyword ::= Keywords ![A-Za-z0-9_]

[Grammar.format] trim and ast time in 1301ms

Keywords ::= 'And' | 'ByVal' | 'ByRef' | 'Case' | 'Call' | 'Class' | 'Const' | 'Default' | 'Dim' | 'Do' | 'Each' | 'ElseIf' | 'Else' | 'Empty' | 'End' | 'Erase' | 'Error' | 'Eqv' | 'Exit' | 'Explicit' | 'False' | 'For' | 'Function' | 'Get' | 'GoTo' | 'If' | 'In' | 'Is' | 'Let' | 'Loop' | 'Mod' | 'New' | 'Next' | 'Nothing' | 'Not' | 'Null' | 'On' | 'Option' | 'Or' | 'Preserve' | 'Private' | 'Property' | 'Public' | 'ReDim' | 'Resume' | 'Select' | 'Set' | 'Step' | 'Sub' | 'Then' | 'To' | 'True' | 'Until' | 'While' | 'WEnd' | 'With' | 'Xor' {fragment=true}
Keyword ::= Keywords ![A-Za-z0-9_]

[Grammar.format] trim and ast time in 875ms

Keywords ::= [Aa][Nn][Dd]|[Bb][Yy][Vv][Aa][Ll]|[Bb][Yy][Rr][Ee][Ff]|[Cc][Aa][Ss][Ee]|[Cc][Aa][Ll][Ll]|[Cc][Ll][Aa][Ss][Ss]|[Cc][Oo][Nn][Ss][Tt]|[Dd][Ee][Ff][Aa][Uu][Ll][Tt]|[Dd][Ii][Mm]|[Dd][Oo]|[Ee][Aa][Cc][Hh]|[Ee][Ll][Ss][Ee][Ii][Ff]|[Ee][Ll][Ss][Ee]|[Ee][Mm][Pp][Tt][Yy]|[Ee][Nn][Dd]|[Ee][Rr][Aa][Ss][Ee]|[Ee][Rr][Rr][Oo][Rr]|[Ee][Qq][Vv]|[Ee][Xx][Ii][Tt]|[Ee][Xx][Pp][Ll][Ii][Cc][Ii][Tt]|[Ff][Aa][Ll][Ss][Ee]|[Ff][Oo][Rr]|[Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn]|[Gg][Ee][Tt]|[Gg][Oo][Tt][Oo]|[Ii][Ff]|[Ii][Nn]|[Ii][Ss]|[Ll][Ee][Tt]|[Ll][Oo][Oo][Pp]|[Mm][Oo][Dd]|[Nn][Ee][Ww]|[Nn][Ee][Xx][Tt]|[Nn][Oo][Tt][Hh][Ii][Nn][Gg]|[Nn][Oo][Tt]|[Nn][Uu][Ll][Ll]|[Oo][Nn]|[Oo][Pp][Tt][Ii][Oo][Nn]|[Oo][Rr]|[Pp][Rr][Ee][Ss][Ee][Rr][Vv][Ee]|[Pp][Rr][Ii][Vv][Aa][Tt][Ee]|[Pp][Rr][Oo][Pp][Ee][Rr][Tt][Yy]|[Pp][Uu][Bb][Ll][Ii][Cc]|[Rr][Ee][Dd][Ii][Mm]|[Rr][Ee][Ss][Uu][Mm][Ee]|[Ss][Ee][Ll][Ee][Cc][Tt]|[Ss][Ee][Tt]|[Ss][Tt][Ee][Pp]|[Ss][Uu][Bb]|[Tt][Hh][Ee][Nn]|[Tt][Oo]|[Tt][Rr][Uu][Ee]|[Uu][Nn][Tt][Ii][Ll]|[Ww][Hh][Ii][Ll][Ee]|[Ww][Ee][Nn][Dd]|[Ww][Ii][Tt][Hh]|[Xx][Oo][Rr][''][  ][{{][Ff][Rr][Aa][Gg][Mm][Ee][Nn][Tt][==][Tt][Rr][Uu][Ee]{fragment=true}
Keyword ::= Keywords ![A-Za-z0-9_]

[Grammar.format] trim and ast time in 257ms

freezy · 2019-12-10T20:41:36Z

Looking at your BNF grammar, it seems like it's already compiled. Is there a way to produce such a JavaScript array that we can load into the parser directly? That would allow us to 1) skip compiling the grammar each time we transpile, and 2) tweak the regexes so they are more performant?

menduz · 2019-12-10T21:47:47Z

Yes, there is a way to export the parsed grammar. Every grammar was scaffolded using the previous one.

BNF was hand made.
Then W3CEBNF was created using BNF.
And Custom was created using W3CEBNF.

You can cache and fine tune your grammar by executing:

ebnf Grammar.ebnf >> myFile.js

https://github.com/lys-lang/node-ebnf/blob/master/src/bin.ts

jsm174 · 2019-12-11T04:21:20Z

I just ran this utility on our grammar at:

https://github.com/vpdb/vpx-js/blob/scripting/ebnf/lib/scripting/grammar/grammar.bnf

It appears to not dump all the elements and ends early with ... 160 more items:

808     name: '%Parameter15',
809     bnf: [ [ 'Equals', 'ConstantExpression' ] ],
810     fragment: true,
811     implicitWs: false,
812     simplifyWhenOneChildren: false
813   },
814   {
815     name: 'Parameter',
816     bnf: [ [ '%Parameter14?', 'ParameterIdentifier', '%Parameter15?' ] ],
817     implicitWs: false,
818     fragment: false,
819     simplifyWhenOneChildren: false
820   },
821   ... 160 more items
822 ];

freezy · 2019-12-11T16:29:54Z

Looks like there's a util.inspect.defaultOptions.maxArrayLength = null; missing (default is 100). You could also provide it for just that call:

module.exports = ${util.inspect(RULES, { depth: 20, maxArrayLength: null })};`);

@menduz would you accept a PR for this?

menduz · 2019-12-11T18:31:26Z

Yes please!

jsm174 · 2019-12-27T23:28:00Z

Just to post an update on this, in #20 we have been seeing that regexp's are much faster than string literals.

I think it would be beneficial to have an {ignoreCase=true} option.

My only thought is from doing testing [Aa][Nn][Dd] appears to be much faster than /And/i.

I can help author this option, based on feedback on #20.

jsm174 · 2019-12-29T15:32:01Z

@menduz Now that #20 is in place (thanks, btw), we started looking at adding {ignoreCase=true}

Given the rule:

Keyword ::= ('And' | 'ByVal' | 'ByRef' | 'Case' | 'Call' | 'Class' | 'Const' | 'Default' | 'Dim' | 'Do' | 'Each' | 'ElseIf' | 'Else' | 'Empty' | 'End' | 'Erase' | 'Error' | 'Eqv' | 'Exit' | 'Explicit' | 'False' | 'For' | 'Function' | 'Get' | 'GoTo' | 'If' | 'In' | 'Is' | 'Let' | 'Loop' | 'Mod' | 'New' | 'Next' | 'Nothing' | 'Not' | 'Null' | 'On' | 'Option' | 'Or' | 'Preserve' | 'Private' | 'Property' | 'Public' | 'ReDim' | 'Resume' | 'Select' | 'Set' | 'Sub' | 'Then' | 'To' | 'True' | 'Until' | 'While' | 'WEnd' | 'With' | 'Xor') ![A-Za-z0-9_\.] {ignoreCase=true}

Would the appropriate place be the parser or the grammar?

In custom.ts we tried passing the attributes into getSubItems figuring when generating the regexps, we could just change /A/ to /[Aa]/. However since getSubItems is getting called for every keyword, we don't have the ignoreCase attribute.

We looked into placing it the parser, by adding an ignoreCase?: boolean; in the iRule, and rule.ignoreCase = attributes['ignoreCase'] == 'true'; in the grammar.

TBH, I am getting lost where these regular expressions are actually being testing.

jsm174 · 2020-01-03T02:36:11Z

@menduz Thanks for 1.7.4!

Quick question. Does ebnf-highlighter get updated with the latest lib? That website has been absolutely invaluable to vpx-js.

menduz · 2020-01-03T13:01:11Z

No, it wasn't updated since 2016 https://github.com/menduz/ebnf-highlighter

freezy mentioned this issue Dec 11, 2019

Generate entire array when compiling grammar. #18

Merged

jsm174 mentioned this issue Dec 27, 2019

Custom grammar with 'StringLiteral' much faster when using RegExps. #20

Closed

jsm174 mentioned this issue Dec 31, 2019

Add support for ignoreCase attribute for StringLiterals #22

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance tips for grammars with case insensitive keywords? #17

Performance tips for grammars with case insensitive keywords? #17

jsm174 commented Dec 10, 2019

menduz commented Dec 10, 2019

jsm174 commented Dec 10, 2019

freezy commented Dec 10, 2019

menduz commented Dec 10, 2019 •

edited

Loading

jsm174 commented Dec 11, 2019 •

edited

Loading

freezy commented Dec 11, 2019

menduz commented Dec 11, 2019

jsm174 commented Dec 27, 2019

jsm174 commented Dec 29, 2019

jsm174 commented Jan 3, 2020

menduz commented Jan 3, 2020

Performance tips for grammars with case insensitive keywords? #17

Performance tips for grammars with case insensitive keywords? #17

Comments

jsm174 commented Dec 10, 2019

menduz commented Dec 10, 2019

Look ahead of the first letter to reduce the full scan

Use regular expressions for performance

jsm174 commented Dec 10, 2019

freezy commented Dec 10, 2019

menduz commented Dec 10, 2019 • edited Loading

jsm174 commented Dec 11, 2019 • edited Loading

freezy commented Dec 11, 2019

menduz commented Dec 11, 2019

jsm174 commented Dec 27, 2019

jsm174 commented Dec 29, 2019

jsm174 commented Jan 3, 2020

menduz commented Jan 3, 2020

menduz commented Dec 10, 2019 •

edited

Loading

jsm174 commented Dec 11, 2019 •

edited

Loading