Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance tips for grammars with case insensitive keywords? #17

Open
jsm174 opened this issue Dec 10, 2019 · 11 comments
Open

Performance tips for grammars with case insensitive keywords? #17

jsm174 opened this issue Dec 10, 2019 · 11 comments

Comments

@jsm174
Copy link
Contributor

jsm174 commented Dec 10, 2019

We are using node-ebnf to transpile vbscript into javascript in an attempt to run Visual Pinball in a web browser.

Currently we go through a two pass approach. The first pass format's the original script file to remove comments, standardize keyword case, and remove extra whitespace. The second pass converts the node-ebnf ast tree into a js ast tree.

To standardize the keyword case, we end up taking a rule like:

Keywords ::= 'And' | 'ByVal' | 'ByRef' | 'Case' | 'Call' | 'Class' | 'Const' | 'Default' | 'Dim' | 'Do' | 'Each' | 'ElseIf' | 'Else' | 'Empty' | 'End' | 'Erase' | 'Error' | 'Eqv' | 'Exit' | 'Explicit' | 'False' | 'For' | 'Function' | 'Get' | 'GoTo' | 'If' | 'In' | 'Is' | 'Let' | 'Loop' | 'Mod' | 'New' | 'Next' | 'Nothing' | 'Not' | 'Null' | 'On' | 'Option' | 'Or' | 'Preserve' | 'Private' | 'Property' | 'Public' | 'ReDim' | 'Resume' | 'Select' | 'Set' | 'Step' | 'Sub' | 'Then' | 'To' | 'True' | 'Until' | 'While' | 'WEnd' | 'With' | 'Xor' Keyword ::= Keywords ![A-Za-z0-9_]

and converting it to:

Keywords ::= ('A'|'a')('N'|'n')('D'|'d')|('B'|'b')('Y'|'y')('V'|'v')('A'|'a')('L'|'l')|('B'|'b')('Y'|'y')('R'|'r')('E'|'e')('F'|'f')|('C'|'c')('A'|'a')('S'|'s')('E'|'e')|('C'|'c')('A'|'a')('L'|'l')('L'|'l')|('C'|'c')('L'|'l')('A'|'a')('S'|'s')('S'|'s')|('C'|'c')('O'|'o')('N'|'n')('S'|'s')('T'|'t')|('D'|'d')('E'|'e')('F'|'f')('A'|'a')('U'|'u')('L'|'l')('T'|'t')|('D'|'d')('I'|'i')('M'|'m')|('D'|'d')('O'|'o')|('E'|'e')('A'|'a')('C'|'c')('H'|'h')|('E'|'e')('L'|'l')('S'|'s')('E'|'e')('I'|'i')('F'|'f')|('E'|'e')('L'|'l')('S'|'s')('E'|'e')|('E'|'e')('M'|'m')('P'|'p')('T'|'t')('Y'|'y')|('E'|'e')('N'|'n')('D'|'d')|('E'|'e')('R'|'r')('A'|'a')('S'|'s')('E'|'e')|('E'|'e')('R'|'r')('R'|'r')('O'|'o')('R'|'r')|('E'|'e')('Q'|'q')('V'|'v')|('E'|'e')('X'|'x')('I'|'i')('T'|'t')|('E'|'e')('X'|'x')('P'|'p')('L'|'l')('I'|'i')('C'|'c')('I'|'i')('T'|'t')|('F'|'f')('A'|'a')('L'|'l')('S'|'s')('E'|'e')|('F'|'f')('O'|'o')('R'|'r')|('F'|'f')('U'|'u')('N'|'n')('C'|'c')('T'|'t')('I'|'i')('O'|'o')('N'|'n')|('G'|'g')('E'|'e')('T'|'t')|('G'|'g')('O'|'o')('T'|'t')('O'|'o')|('I'|'i')('F'|'f')|('I'|'i')('N'|'n')|('I'|'i')('S'|'s')|('L'|'l')('E'|'e')('T'|'t')|('L'|'l')('O'|'o')('O'|'o')('P'|'p')|('M'|'m')('O'|'o')('D'|'d')|('N'|'n')('E'|'e')('W'|'w')|('N'|'n')('E'|'e')('X'|'x')('T'|'t')|('N'|'n')('O'|'o')('T'|'t')('H'|'h')('I'|'i')('N'|'n')('G'|'g')|('N'|'n')('O'|'o')('T'|'t')|('N'|'n')('U'|'u')('L'|'l')('L'|'l')|('O'|'o')('N'|'n')|('O'|'o')('P'|'p')('T'|'t')('I'|'i')('O'|'o')('N'|'n')|('O'|'o')('R'|'r')|('P'|'p')('R'|'r')('E'|'e')('S'|'s')('E'|'e')('R'|'r')('V'|'v')('E'|'e')|('P'|'p')('R'|'r')('I'|'i')('V'|'v')('A'|'a')('T'|'t')('E'|'e')|('P'|'p')('R'|'r')('O'|'o')('P'|'p')('E'|'e')('R'|'r')('T'|'t')('Y'|'y')|('P'|'p')('U'|'u')('B'|'b')('L'|'l')('I'|'i')('C'|'c')|('R'|'r')('E'|'e')('D'|'d')('I'|'i')('M'|'m')|('R'|'r')('E'|'e')('S'|'s')('U'|'u')('M'|'m')('E'|'e')|('S'|'s')('E'|'e')('L'|'l')('E'|'e')('C'|'c')('T'|'t')|('S'|'s')('E'|'e')('T'|'t')|('S'|'s')('T'|'t')('E'|'e')('P'|'p')|('S'|'s')('U'|'u')('B'|'b')|('T'|'t')('H'|'h')('E'|'e')('N'|'n')|('T'|'t')('O'|'o')|('T'|'t')('R'|'r')('U'|'u')('E'|'e')|('U'|'u')('N'|'n')('T'|'t')('I'|'i')('L'|'l')|('W'|'w')('H'|'h')('I'|'i')('L'|'l')('E'|'e')|('W'|'w')('E'|'e')('N'|'n')('D'|'d')|('W'|'w')('I'|'i')('T'|'t')('H'|'h')|('X'|'x')('O'|'o')('R'|'r')

If we could assume the case of the script was correct, not splitting the keywords like this gives us a 40% gain in performance.

Is there any tips or suggestions you could offer on case insensitive grammars?

@menduz
Copy link
Member

menduz commented Dec 10, 2019

Hi, thank you for sharing the use case.

There are a couple of things we can do to optimize this kind of parsers:

Look ahead of the first letter to reduce the full scan

Keyword        ::= 'And'
                 | &('B') ('ByVal' | 'ByRef)
                 | &('C') ('Case' | 'Call' | 'Class' | 'Const')

https://github.com/lys-lang/lys/blob/master/src/grammar.ts#L117-L121

Use regular expressions for performance

Individual constructions are a good mechanism to build grammars, to write nominal options it may be faster to write

Keywords ::= [Aa][Nn][Dd]
          |[Bb][Yy][Vv][Aa][Ll]
          |[Bb][Yy][Rr][Ee][Ff]
          |[Cc][Aa][Ss][Ee]
          |[Cc][Aa][Ll][Ll]
          |[Cc][Ll][Aa][Ss][Ss]
          |[Cc][Oo][Nn][Ss][Tt]
          |[Dd][Ee][Ff][Aa][Uu][Ll][Tt]
          |[Dd][Ii][Mm]
          |[Dd][Oo]
          |[Ee][Aa][Cc][Hh]
...

Usually languages like Basic use a tokenizer as a first step and then the parsing is performed over tokens instead of text. This library generates PEG grammars without a tokenization phase, which in some cases can be unfortunate.

Regarding the performance of the case sensitiveness of this library, it can be added as an option of the construction like

Keywords := 'and' | 'byval' {ignoreCase=true}

There are a couple of options to base the code on those, it should be an easy thing to do since it is only a parameter i to the regular expressions. And this library relies entirely on regular expressions.

@jsm174
Copy link
Contributor Author

jsm174 commented Dec 10, 2019

Thank you for responding so fast! Believe it or not, we had just tried:

Keywords ::= [Aa][Nn][Dd]
          |[Bb][Yy][Vv][Aa][Ll]
          |[Bb][Yy][Rr][Ee][Ff]
          |[Cc][Aa][Ss][Ee]

And it is extremely fast even more so than the when they keywords are explicitly defined.

Keywords ::= ('A'|'a')('N'|'n')('D'|'d')|('B'|'b')('Y'|'y')('V'|'v')('A'|'a')('L'|'l')|('B'|'b')('Y'|'y')('R'|'r')('E'|'e')('F'|'f')|('C'|'c')('A'|'a')('S'|'s')('E'|'e')|('C'|'c')('A'|'a')('L'|'l')('L'|'l')|('C'|'c')('L'|'l')('A'|'a')('S'|'s')('S'|'s')|('C'|'c')('O'|'o')('N'|'n')('S'|'s')('T'|'t')|('D'|'d')('E'|'e')('F'|'f')('A'|'a')('U'|'u')('L'|'l')('T'|'t')|('D'|'d')('I'|'i')('M'|'m')|('D'|'d')('O'|'o')|('E'|'e')('A'|'a')('C'|'c')('H'|'h')|('E'|'e')('L'|'l')('S'|'s')('E'|'e')('I'|'i')('F'|'f')|('E'|'e')('L'|'l')('S'|'s')('E'|'e')|('E'|'e')('M'|'m')('P'|'p')('T'|'t')('Y'|'y')|('E'|'e')('N'|'n')('D'|'d')|('E'|'e')('R'|'r')('A'|'a')('S'|'s')('E'|'e')|('E'|'e')('R'|'r')('R'|'r')('O'|'o')('R'|'r')|('E'|'e')('Q'|'q')('V'|'v')|('E'|'e')('X'|'x')('I'|'i')('T'|'t')|('E'|'e')('X'|'x')('P'|'p')('L'|'l')('I'|'i')('C'|'c')('I'|'i')('T'|'t')|('F'|'f')('A'|'a')('L'|'l')('S'|'s')('E'|'e')|('F'|'f')('O'|'o')('R'|'r')|('F'|'f')('U'|'u')('N'|'n')('C'|'c')('T'|'t')('I'|'i')('O'|'o')('N'|'n')|('G'|'g')('E'|'e')('T'|'t')|('G'|'g')('O'|'o')('T'|'t')('O'|'o')|('I'|'i')('F'|'f')|('I'|'i')('N'|'n')|('I'|'i')('S'|'s')|('L'|'l')('E'|'e')('T'|'t')|('L'|'l')('O'|'o')('O'|'o')('P'|'p')|('M'|'m')('O'|'o')('D'|'d')|('N'|'n')('E'|'e')('W'|'w')|('N'|'n')('E'|'e')('X'|'x')('T'|'t')|('N'|'n')('O'|'o')('T'|'t')('H'|'h')('I'|'i')('N'|'n')('G'|'g')|('N'|'n')('O'|'o')('T'|'t')|('N'|'n')('U'|'u')('L'|'l')('L'|'l')|('O'|'o')('N'|'n')|('O'|'o')('P'|'p')('T'|'t')('I'|'i')('O'|'o')('N'|'n')|('O'|'o')('R'|'r')|('P'|'p')('R'|'r')('E'|'e')('S'|'s')('E'|'e')('R'|'r')('V'|'v')('E'|'e')|('P'|'p')('R'|'r')('I'|'i')('V'|'v')('A'|'a')('T'|'t')('E'|'e')|('P'|'p')('R'|'r')('O'|'o')('P'|'p')('E'|'e')('R'|'r')('T'|'t')('Y'|'y')|('P'|'p')('U'|'u')('B'|'b')('L'|'l')('I'|'i')('C'|'c')|('R'|'r')('E'|'e')('D'|'d')('I'|'i')('M'|'m')|('R'|'r')('E'|'e')('S'|'s')('U'|'u')('M'|'m')('E'|'e')|('S'|'s')('E'|'e')('L'|'l')('E'|'e')('C'|'c')('T'|'t')|('S'|'s')('E'|'e')('T'|'t')|('S'|'s')('T'|'t')('E'|'e')('P'|'p')|('S'|'s')('U'|'u')('B'|'b')|('T'|'t')('H'|'h')('E'|'e')('N'|'n')|('T'|'t')('O'|'o')|('T'|'t')('R'|'r')('U'|'u')('E'|'e')|('U'|'u')('N'|'n')('T'|'t')('I'|'i')('L'|'l')|('W'|'w')('H'|'h')('I'|'i')('L'|'l')('E'|'e')|('W'|'w')('E'|'e')('N'|'n')('D'|'d')|('W'|'w')('I'|'i')('T'|'t')('H'|'h')|('X'|'x')('O'|'o')('R'|'r') {fragment=true}
Keyword ::= Keywords ![A-Za-z0-9_]

[Grammar.format] trim and ast time in 1301ms
Keywords ::= 'And' | 'ByVal' | 'ByRef' | 'Case' | 'Call' | 'Class' | 'Const' | 'Default' | 'Dim' | 'Do' | 'Each' | 'ElseIf' | 'Else' | 'Empty' | 'End' | 'Erase' | 'Error' | 'Eqv' | 'Exit' | 'Explicit' | 'False' | 'For' | 'Function' | 'Get' | 'GoTo' | 'If' | 'In' | 'Is' | 'Let' | 'Loop' | 'Mod' | 'New' | 'Next' | 'Nothing' | 'Not' | 'Null' | 'On' | 'Option' | 'Or' | 'Preserve' | 'Private' | 'Property' | 'Public' | 'ReDim' | 'Resume' | 'Select' | 'Set' | 'Step' | 'Sub' | 'Then' | 'To' | 'True' | 'Until' | 'While' | 'WEnd' | 'With' | 'Xor' {fragment=true}
Keyword ::= Keywords ![A-Za-z0-9_]

[Grammar.format] trim and ast time in 875ms
Keywords ::= [Aa][Nn][Dd]|[Bb][Yy][Vv][Aa][Ll]|[Bb][Yy][Rr][Ee][Ff]|[Cc][Aa][Ss][Ee]|[Cc][Aa][Ll][Ll]|[Cc][Ll][Aa][Ss][Ss]|[Cc][Oo][Nn][Ss][Tt]|[Dd][Ee][Ff][Aa][Uu][Ll][Tt]|[Dd][Ii][Mm]|[Dd][Oo]|[Ee][Aa][Cc][Hh]|[Ee][Ll][Ss][Ee][Ii][Ff]|[Ee][Ll][Ss][Ee]|[Ee][Mm][Pp][Tt][Yy]|[Ee][Nn][Dd]|[Ee][Rr][Aa][Ss][Ee]|[Ee][Rr][Rr][Oo][Rr]|[Ee][Qq][Vv]|[Ee][Xx][Ii][Tt]|[Ee][Xx][Pp][Ll][Ii][Cc][Ii][Tt]|[Ff][Aa][Ll][Ss][Ee]|[Ff][Oo][Rr]|[Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn]|[Gg][Ee][Tt]|[Gg][Oo][Tt][Oo]|[Ii][Ff]|[Ii][Nn]|[Ii][Ss]|[Ll][Ee][Tt]|[Ll][Oo][Oo][Pp]|[Mm][Oo][Dd]|[Nn][Ee][Ww]|[Nn][Ee][Xx][Tt]|[Nn][Oo][Tt][Hh][Ii][Nn][Gg]|[Nn][Oo][Tt]|[Nn][Uu][Ll][Ll]|[Oo][Nn]|[Oo][Pp][Tt][Ii][Oo][Nn]|[Oo][Rr]|[Pp][Rr][Ee][Ss][Ee][Rr][Vv][Ee]|[Pp][Rr][Ii][Vv][Aa][Tt][Ee]|[Pp][Rr][Oo][Pp][Ee][Rr][Tt][Yy]|[Pp][Uu][Bb][Ll][Ii][Cc]|[Rr][Ee][Dd][Ii][Mm]|[Rr][Ee][Ss][Uu][Mm][Ee]|[Ss][Ee][Ll][Ee][Cc][Tt]|[Ss][Ee][Tt]|[Ss][Tt][Ee][Pp]|[Ss][Uu][Bb]|[Tt][Hh][Ee][Nn]|[Tt][Oo]|[Tt][Rr][Uu][Ee]|[Uu][Nn][Tt][Ii][Ll]|[Ww][Hh][Ii][Ll][Ee]|[Ww][Ee][Nn][Dd]|[Ww][Ii][Tt][Hh]|[Xx][Oo][Rr][''][  ][{{][Ff][Rr][Aa][Gg][Mm][Ee][Nn][Tt][==][Tt][Rr][Uu][Ee]{fragment=true}
Keyword ::= Keywords ![A-Za-z0-9_]

[Grammar.format] trim and ast time in 257ms

@freezy
Copy link
Contributor

freezy commented Dec 10, 2019

Looking at your BNF grammar, it seems like it's already compiled. Is there a way to produce such a JavaScript array that we can load into the parser directly? That would allow us to 1) skip compiling the grammar each time we transpile, and 2) tweak the regexes so they are more performant?

@menduz
Copy link
Member

menduz commented Dec 10, 2019

Yes, there is a way to export the parsed grammar. Every grammar was scaffolded using the previous one.

  1. BNF was hand made.
  2. Then W3CEBNF was created using BNF.
  3. And Custom was created using W3CEBNF.

You can cache and fine tune your grammar by executing:

ebnf Grammar.ebnf >> myFile.js

https://github.com/lys-lang/node-ebnf/blob/master/src/bin.ts

@jsm174
Copy link
Contributor Author

jsm174 commented Dec 11, 2019

I just ran this utility on our grammar at:

https://github.com/vpdb/vpx-js/blob/scripting/ebnf/lib/scripting/grammar/grammar.bnf

It appears to not dump all the elements and ends early with ... 160 more items:

808     name: '%Parameter15',
809     bnf: [ [ 'Equals', 'ConstantExpression' ] ],
810     fragment: true,
811     implicitWs: false,
812     simplifyWhenOneChildren: false
813   },
814   {
815     name: 'Parameter',
816     bnf: [ [ '%Parameter14?', 'ParameterIdentifier', '%Parameter15?' ] ],
817     implicitWs: false,
818     fragment: false,
819     simplifyWhenOneChildren: false
820   },
821   ... 160 more items
822 ];

@freezy
Copy link
Contributor

freezy commented Dec 11, 2019

Looks like there's a util.inspect.defaultOptions.maxArrayLength = null; missing (default is 100). You could also provide it for just that call:

module.exports = ${util.inspect(RULES, { depth: 20, maxArrayLength: null })};`);

@menduz would you accept a PR for this?

@menduz
Copy link
Member

menduz commented Dec 11, 2019

Yes please!

@jsm174
Copy link
Contributor Author

jsm174 commented Dec 27, 2019

Just to post an update on this, in #20 we have been seeing that regexp's are much faster than string literals.

I think it would be beneficial to have an {ignoreCase=true} option.

My only thought is from doing testing [Aa][Nn][Dd] appears to be much faster than /And/i.

I can help author this option, based on feedback on #20.

@jsm174
Copy link
Contributor Author

jsm174 commented Dec 29, 2019

@menduz Now that #20 is in place (thanks, btw), we started looking at adding {ignoreCase=true}

Given the rule:

Keyword ::= ('And' | 'ByVal' | 'ByRef' | 'Case' | 'Call' | 'Class' | 'Const' | 'Default' | 'Dim' | 'Do' | 'Each' | 'ElseIf' | 'Else' | 'Empty' | 'End' | 'Erase' | 'Error' | 'Eqv' | 'Exit' | 'Explicit' | 'False' | 'For' | 'Function' | 'Get' | 'GoTo' | 'If' | 'In' | 'Is' | 'Let' | 'Loop' | 'Mod' | 'New' | 'Next' | 'Nothing' | 'Not' | 'Null' | 'On' | 'Option' | 'Or' | 'Preserve' | 'Private' | 'Property' | 'Public' | 'ReDim' | 'Resume' | 'Select' | 'Set' | 'Sub' | 'Then' | 'To' | 'True' | 'Until' | 'While' | 'WEnd' | 'With' | 'Xor') ![A-Za-z0-9_\.] {ignoreCase=true}

Would the appropriate place be the parser or the grammar?

In custom.ts we tried passing the attributes into getSubItems figuring when generating the regexps, we could just change /A/ to /[Aa]/. However since getSubItems is getting called for every keyword, we don't have the ignoreCase attribute.

We looked into placing it the parser, by adding an ignoreCase?: boolean; in the iRule, and rule.ignoreCase = attributes['ignoreCase'] == 'true'; in the grammar.

TBH, I am getting lost where these regular expressions are actually being testing.

@jsm174
Copy link
Contributor Author

jsm174 commented Jan 3, 2020

@menduz Thanks for 1.7.4!

Quick question. Does ebnf-highlighter get updated with the latest lib? That website has been absolutely invaluable to vpx-js.

@menduz
Copy link
Member

menduz commented Jan 3, 2020

No, it wasn't updated since 2016 https://github.com/menduz/ebnf-highlighter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants