-
Notifications
You must be signed in to change notification settings - Fork 61
K Parsing and Unparsing
Parsers and unparsers translate between concrete and abstract syntax. The concrete syntax used to represent expressions is usually ambiguous; ambiguities are resolved using operator precedence and parentheses. Internal representations are unambiguous but not suitable as concrete syntax.
The parsing problem is to translate from concrete syntax to internal representation. The unparsing problem is to take an internal representation and to produce a concrete syntax that, when parsed, reproduces the original internal representation. That is, an unparser is correct only if it is a right inverse of its parser (parse (unparse e) = e). For example, if our internal representation is a tree representing the product of x + y with z, we cannot simply walk the tree and emit the concrete representation x + y * z; we must emit (x + y) * z instead.
The unparsing problem does not have a unique solution; in general, there are many possible concrete syntaxes that, when parsed, reproduce the original internal representation. For example, we could choose a concrete syntax that parenthesizes every expression, as in one version of our pretty printer. But this is unidiomatic and unreadable, we want to produce results without extra parentheses. Of course, it is very challenging to get a perfect unparser that omits every unnecessary parentheses, instead we can come up with a product that comes with tolerable amount of extra parentheses.
Parsing a string is polynomial time in the size of the input string. Giving a bound on the size of the shortest string required to unparse a given input would put unparsing in some nondeterministic complexity class, but I don't know how to find such a bound.
Putting a lower bound on the problem can be done taking advantage of ambiguity. Given a grammar and a term to unparse, if you can divide the grammar into an unambiguous grammar that produces your term, and remaining productions that couldn't produce the term (because of the labels assigned), then unparsing reduces to finding a string in the difference of those two context free grammars. That sounds like it might give undecidability.
For a simple example, start with basic expressions.
Start ::= Exp
Exp ::= Int > Exp "*" Exp > Exp "+" Exp | "(" Exp ")" [bracket]
The string "1*(2+3*(4+5))" is legal input.
We can prohibit that by changing the start symbol of the grammar to
Start ::= Exp | Any "))" Any
using a nonterminal Any that matches any string of characters, defined like
Any ::= Token{.*}
or maybe more explicitly as productions if we want to avoid regular expressions.
The extra production makes the string "1*(2+3*(4+5))" ambiguous.
To parse as an Exp, it could be written with an extra space "1*(2+3*(4+5) )".
Now I'll finish by describing a concrete application of this idea to show that unparsing is at least NP-hard, by writing a (family of) grammar where some of the ASTs encode 3-SAT problems, and any string which unambiguously unparses such a term encodes a solution to the 3-SAT problem. For simplicity, assume no whitespace is allowed.
We have a collection of variable names.
syntax VarId := "a" | "b" | "c" | "d" | ..
A wrapper label that serves no purpose in the AST has two alternate concrete syntaxes, which represent "True" and "False". (we can avoid this sort of overloading using a more complicated representation that encodes truth using a single kind of bracket)
syntax Var := "+" Var [klabel(var)] | "-" Var [klabel(var)]
The main part of the grammar describes a set of clauses.
syntax Lit := "+" Var[klabel(posLit)] | "-" Var [klabel(negLit)]
syntax Clause := "clause" "(" Lit "," Lit "," Lit ")"
syntax Problem := List{Clause,""}
Now Clause represents a clause of a 3-sat problem, for example 'clause('posLit('var('a)),'negLit('var('b)),'posLit('var('c))) represents (a / not b / c), and any AST in Problem represents a 3-SAT problem.
Then extra productions are added to the grammar to cause ambiguities with strings that don't represent valid solutions.
One batch of productions produces ambiguous parses of any string which doesn't consistently unparse a single variable as either true ("+") or false ("-"):
syntax InconsistentVar :=
Any "+a" Any "-a" Any | Any "-a" Any "+a" Any
| Any "+b" Any "-b" Any | Any "-b" Any "+b" Any
A string that parses as Problem and not as InconsistentVar consistently assigned each variable appearing in the problem either true or false, so the only remaining error to exclude is an unsatisfied clause.
syntax UnsatLit ::= "+-"VarId | "-+" VarId
syntax UnsatClause ::= "clause" "(" UnsatLit, UnsatLit, UnsatLit ")"
now with the start symbol
syntax Start ::= Problem | InconsistentVar | Any UnsatClause Any
unparsing certain terms is equivalent to solving a 3-SAT problem.
The algorithm follows:
flatten the term being unparsed without any parentheses or casts
parse the term to obtain a complete parse DAG including all ambiguities before applying priorities, associativities, and prefer/avoid.
perform a post-order traversal of the term being unparsed
for each term you come to:
if the term is not in every path through the parse forest:
if the term is preferred over all paths not containing the term in the parse forest (i.e. based on priority, associativity, and prefer/avoid):
trim the parse forest of all such paths
continue
else if the term is at the root of an ambiguity in the parse forest:
if all the other paths beginning at that root have a different sort than the term:
add a cast around the term
trim the parse forest of all other paths beginning at that root
continue
else:
report algorithm failure: term is not unparsable
else:
add a bracket around the term
trim the parse forest of all paths that this bracket removes from the parse forest
continue
The unparsing method uses information about precedence, associativity, and “fixity” of operators to transform an internal form into a concrete syntax. The method works from the bottom up and from the inside out. In each expression, it finds the operator, and it considers the subexpressions and their positions, left or right, relative to the operator. It decides whether to parenthesize subexpressions by comparing precedence and associativity of the current operator with the precedences and associativities of the most loosely binding operators in the subexpressions. Operators that are “covered” by parentheses do not participate in the comparison.
We do not want parentheses when: either the inner operator has higher precedence than the outer, or they have the same precedence and associativity and the associativity is to the left (right) if the inner is a left (right) child of the outer.
Second Version: General, including prefix and postfix operators, as well as non-associative infix operators of arbitrary arity
If the inner operator has higher precedence, parentheses are not needed.
If a postfix operator appears to the left of an infix operator, it always applies to the preceding expression, and parentheses are not needed. Similarly when a prefix operator appears to the right of an infix operator.
In case of left child, if left child has lower precedence than predecessor, then parentheses are needed, but if the two have the same precedence, then parentheses can be avoided, provided both are left-associative.A similar argument applies to the case of right child when both operators are right-associative.
Finally, in case of only child of a unary prefix or postfix operator, if the precedence of child is not greater than the precedence of predecessor, then parentheses can be avoided if and only if both operators have the same fixity, i.e., both are prefix operators or both are postfix operators.