
Syntax specification for the new parser

The syntax specification for the new parser has evolved from the current SDF format, with just minor changes. The syntax is formally specified in e-kore.k.

Before starting the full description, here are the main differences between K with SDF and the new parser:

  • Tokens have been reworked. Any production can now create a constant/token by adding the token attribute. This allows for context-free tokens, instead of only regular ones.
  • noAutoReject has been replaced with autoReject, and the default behavior has been reversed. If autoReject is specified, then all terminals from all productions (taking modularity into account) are matched against the currently matched constant; if one of them matches, the parse fails. This is an easy way to implement keyword rejection.
  • reject has been reworked: it is now an attribute on a token production rather than on a sort. It takes a regular expression and can extend autoReject.
  • Token{<sdf lex>} has been replaced with r"<regex>"
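
As a sketch of what a context-free token could look like under the new scheme (the sort and patterns below are made up for illustration, not taken from an actual definition), a production that refers to other productions can itself be marked as a token:
syntax Balanced ::= "(" Balanced ")" [token]
                  | r"[a-z]*"        [token]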

Syntax specification for the new parser

The new parser takes as input only productions, which are sequences of terminals, non-terminals, and regular-expressions. Examples:
syntax Exp ::= "if" Exp "then" Exp r"(?!else)" // if_then, not followed by else
syntax Id ::= r"(?<![A-Za-z0-9\\_])[A-Za-z\\_][A-Za-z0-9\\_]*" [token, autoReject]
syntax BubbleItem ::= r"[^ \t\n\r]" [token, reject2("rule|syntax|endmodule|configuration|context")]

Terminals are delimited by double quotes and the parser will match exactly the sequence of characters enclosed.
RegexTerminals are prefixed by r and delimited by double quotes; they match strings described by the enclosed regular expression.
NonTerminals will match any production of the specified sort.

AST modifiers (token, klabel, bracket, subsort)

Taking only the productions into consideration, the parser will create parse trees, but these are quite verbose and hard to handle in a semantic environment (reminder: 1 + 2 produces the parse tree (("1"), " ", "+", " ", ("2")), whereas the AST is _+_(1, 2)).

To help build the proper AST, the following attributes can be used (a combined example is sketched after the list):

  • klabel - takes the current production's children (non-terminals), and creates a node with the label specified in the klabel attribute. All terminals and regular-expressions are ignored.
  • token - creates a node containing two fields: the sort of the production and the exact string representation of the matched input. Adding the attribute
    • autoReject will reject any string that matches a terminal defined in the syntax. This offers a quick way of rejecting keywords; for better control, users may use
    • reject2, which takes as input a regular expression.
  • bracket - allowed only on productions that have exactly one non-terminal and some terminals/regex-terminals; the node is eliminated from the AST. It is commonly used to group productions and/or override priority or associativity.
  • chain productions/subsorts - by default, productions with exactly one non-terminal are eliminated, unless the user specifies a klabel attribute.

By default, every production that is not a subsort and is not annotated with bracket or token will be automatically tagged with a generated klabel (unless an explicit klabel is given).
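
As a hedged sketch combining the attributes above (the sorts and labels are made up for the example):
syntax Exp ::= Exp "+" Exp [klabel(Plus)]   // AST node Plus(_, _); terminals and regex-terminals are dropped
             | "(" Exp ")" [bracket]        // grouping only; the node is eliminated
             | Int                          // chain production/subsort; eliminated by default
syntax Int ::= r"[0-9]+"   [token]          // token node: the sort Int plus the matched string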

The regular expression language can be found here: Regex Language, with two additions: negative lookahead (?!X) and negative lookbehind (?<!X).

Declarative disambiguation filters

Just like in the SDF implementation, we have priorities, associativity and prefer/avoid to help with disambiguation. Example:

syntax Exp ::= Exp "*" Exp [klabel(Mul), left]
             > left:
               Exp "+" Exp [klabel(Plus), left]
             | Exp "-" Exp [klabel(Minus), left]
syntax left Mul
syntax left Plus Minus
syntax priority Mul > Plus Minus
  • priority - a relation between two distinct productions that tells the parser which ASTs to eliminate. For example, 1+2*3 would produce the ambiguous parse amb(Plus(1, Mul(2, 3)), Mul(Plus(1, 2), 3)). Saying that Mul > Plus means that Plus is not allowed to be a direct child of Mul. This check is applied only to the outermost non-terminals, so productions like Exp ::= Exp "[" Exp "]" are safe to parse (no error would be produced).
  • associativity - can be specified for productions of the same type, or with the same priority. The disambiguation filter is very similar to the priority filter, except that it disallows certain terms from appearing as the leftmost or rightmost child. Available options are: left, right and non-assoc.
  • prefer/avoid - filters ambiguities at the top, after parsing has been completed. Compared to associativity and priorities, this is a safe filter, in the sense that it cannot turn a successful parse into a parse error. It can be used to solve, for example, the dangling else problem:
syntax Stmt ::= "if" Exp "then" Stmt             [klabel(if)]
              | "if" Exp "then" Stmt "else" Stmt [klabel(ife), avoid]

The following input if A then if B then C else D would produce the ambiguous tree:
amb(if(A, ife(B, C, D)), ife(A, if(B, C), D)). The avoid attribute tells the parser to eliminate the second branch, so the correct one, in which the else is associated with the innermost if, remains.
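
A presumably equivalent sketch (assuming prefer mirrors avoid, which is not spelled out above) would keep the first branch by preferring the production without else:
syntax Stmt ::= "if" Exp "then" Stmt             [klabel(if), prefer]
              | "if" Exp "then" Stmt "else" Stmt [klabel(ife)]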

There are two ways to specify priorities and associativity: one is inline, between productions, and the other is a separate sentence, where the klabels or tags attached to productions are used as references.

Sort annotations

Because we can use concrete syntax in rules, we have to separate contexts from the user-defined language. This means that K can sometimes reach ambiguities that cannot be resolved automatically:

syntax Int    ::= Int "+" Int       [klabel(PlusInt)]
syntax String ::= String "+" String [klabel(PlusString)]

So for the rule A + B => .K we get an ambiguity between PlusInt and PlusString. The sort annotations are supposed to allow the users to reintroduce the context and avoid parsing ambiguities. These are the four annotation types:

Srt    ::= Srt ":Srt" 
Srt    ::= Srt "::Srt" // equivalent to "<:>"
Srt    ::= K   ":>Srt"
KBott  ::= Srt "<:Srt"

The first one is special since it adds a runtime check on any annotated variable. For the example above, saying:
rule A:String + B:String => .K would remove the ambiguity warning.

The second one is exactly like the first, but it doesn't add the runtime check.
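
Following the same pattern as above (a sketch reusing the earlier example), the static variant would be:
rule A::String + B::String => .K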

The third one can wrap any type of term and forcefully insert it in a place where a Srt is expected.
rule 1:>String + "abc" => .K

The fourth one is the opposite: it takes a term of sort Srt and forcefully inserts it anywhere.
rule 1<:Int + "abc" => .K

The arrow points to where the type checker is going to be strict. Using the last two is not recommended, but can be useful sometimes.

Future work/questions

Whitespace
Currently, whitespace/layout is added by default and reproduces the C-style comments: //... and /*...*/. A possible way to specify a custom regular expression could be something like this: syntax #Whitespace ::= r"<regex>", in which case the non-terminal #Whitespace would be used instead of the default regular expression. There might also be languages where whitespace plays an important role in the parsing process, like Python or JavaScript, in which case the user might want more granularity in specifying where and which kind of whitespace can be parsed. One possibility would be to write:
syntax Exp ::= "if" [#Whitespace1] Exp "then" [#Whitespace2] Stmt [noDefaultWhitespace]
Specifying a non-terminal between square brackets would tell the parser to eliminate from the AST whatever was matched, and the noDefaultWhitespace attribute would tell the parser generator to not include the default whitespace.
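
For reference, here is a sketch of how the current default layout (spaces plus C-style comments) might be written down explicitly under this proposal; the exact regular expression is only an assumption:
syntax #Whitespace ::= r"([ \t\n\r]|(//[^\n\r]*)|(/\\*([^*]|(\\*+[^*/]))*\\*+/))*"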
