Up until now we've been using parser combinators to build our parsers. Parser
combinators build top-down parsers that formally belong to the LL(k) family.
Combined with backtracking (i.e. the try combinator) this is simultaneously both an
extremely powerful and simple model to implement, as we saw before with our
simple 100 line parser library.
However, there is a family of grammars that include left-recursion, which
top-down LL(k) parsers handle poorly, if at all. We demonstrated before that we
could handle these cases using the parser combinator chainl1, and while this is
possible, it can in many cases be an inefficient use of the parser stack and
lead to ambiguous cases.
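A left-recursive rule such as expr ::= expr '+' term | term transliterates into a combinator parser that recurses on itself before consuming any input; the sketch below (assuming the Parser, char, number, and chainl1 combinators from our earlier 100 line library) shows the divergent direct translation next to the chainl1 version.

-- Direct transcription of:  expr ::= expr '+' term | term
-- This diverges: exprBad calls itself before consuming a single character.
exprBad :: Parser Int
exprBad = ((+) <$> exprBad <*> (char '+' *> term)) <|> term

-- chainl1 removes the left recursion: parse one term, then fold the
-- remaining operator/term pairs left-associatively.
exprGood :: Parser Int
exprGood = term `chainl1` ((+) <$ char '+')

term :: Parser Int
term = number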
The other major family of parsers, LR parsers, are not plagued by the same
concerns over left-recursion. The most common form of this toolchain is the
Lex/Yacc lexer and parser generator, which compiles a grammar description into
an efficient C parser for LR grammars. Haskell's Alex and Happy are roughly the
equivalent of these tools.
Our parser and lexer logic will be spread across two different modules.
- Lexer.x
- Parser.y
The code in each of these modules is a hybrid of the specific Alex/Happy grammar
syntax and arbitrary Haskell logic that is spliced in. Code delineated by braces
({}) is regular Haskell, while code outside is parser and lexer logic.
-- **Begin Haskell Syntax**
{
{-# OPTIONS_GHC -w #-}
module Lexer (
Token(..),
scanTokens
) where
import Syntax
}
-- **End Haskell Syntax**
-- **Begin Alex Syntax**
%wrapper "basic"
$digit = 0-9
$alpha = [a-zA-Z]
$eol = [\n]
-- **End Alex Syntax**
These files are used to generate the two modules Lexer and Parser. The toolchain
can be invoked in several ways: first via the command-line tools alex and happy,
which generate the resulting Haskell modules when passed the appropriate input
file.
$ alex Lexer.x # Generates Lexer.hs
$ happy Parser.y # Generates Parser.hs
Or inside of the cabal file, using the build-tools field.
Build-depends: base, array
build-tools: alex, happy
other-modules:
Parser,
Lexer
So the resulting structure of our interpreter will have the following set of files.
- Lexer.hs
- Parser.hs
- Eval.hs
- Main.hs
- Syntax.hs
Our lexer module will export our Token definition and a function scanTokens for converting an arbitrary String into a token stream, i.e. a list of Tokens.
{
module Lexer (
Token(..),
scanTokens
) where
import Syntax
}
The tokens are simply an enumeration of the unique possible tokens in our grammar.
data Token
= TokenLet
| TokenTrue
| TokenFalse
| TokenIn
| TokenLambda
| TokenNum Int
| TokenSym String
| TokenArrow
| TokenEq
| TokenAdd
| TokenSub
| TokenMul
| TokenLParen
| TokenRParen
| TokenEOF
deriving (Eq,Show)
scanTokens :: String -> [Token]
scanTokens = alexScanTokens
The token definition is a list of function definitions mapping atomic characters
and alphabetical sequences to constructors for our Token
datatype.
%wrapper "basic"
$digit = 0-9
$alpha = [a-zA-Z]
$eol = [\n]
tokens :-
-- Whitespace insensitive
$eol ;
$white+ ;
-- Comments
"#".* ;
-- Syntax
let { \s -> TokenLet }
True { \s -> TokenTrue }
False { \s -> TokenFalse }
in { \s -> TokenIn }
$digit+ { \s -> TokenNum (read s) }
"->" { \s -> TokenArrow }
\= { \s -> TokenEq }
\\ { \s -> TokenLambda }
[\+] { \s -> TokenAdd }
[\-] { \s -> TokenSub }
[\*] { \s -> TokenMul }
\( { \s -> TokenLParen }
\) { \s -> TokenRParen }
$alpha [$alpha $digit \_ \']* { \s -> TokenSym s }
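As a quick sanity check, running the generated scanTokens over a small input yields a flat list of tokens (the output shown here is illustrative):

λ> scanTokens "let f = \\x -> x + 1 in f 2"
[TokenLet,TokenSym "f",TokenEq,TokenLambda,TokenSym "x",TokenArrow,TokenSym "x",TokenAdd,TokenNum 1,TokenIn,TokenSym "f",TokenNum 2]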
Using Happy and our previously defined lexer we'll write down the production rules for our simple untyped lambda calculus.
We start by defining a Syntax
module where we define the AST we'll generate
from running over the token stream to produce the program graph structure.
module Syntax where
type Name = String
data Expr
= Lam Name Expr
| App Expr Expr
| Var Name
| Lit Lit
| Op Binop Expr Expr
deriving (Eq,Show)
data Lit
= LInt Int
| LBool Bool
deriving (Show, Eq, Ord)
data Binop = Add | Sub | Mul | Eql
deriving (Eq, Ord, Show)
The token constructors are each assigned to a name that will be used in our production rules.
-- Lexer structure
%tokentype { Token }
-- Token Names
%token
let { TokenLet }
true { TokenTrue }
false { TokenFalse }
in { TokenIn }
NUM { TokenNum $$ }
VAR { TokenSym $$ }
'\\' { TokenLambda }
'->' { TokenArrow }
'=' { TokenEq }
'+' { TokenAdd }
'-' { TokenSub }
'*' { TokenMul }
'(' { TokenLParen }
')' { TokenRParen }
The parser itself will live inside of a custom monad of our choosing. In this
case we'll add error handling with the Except
monad that will break out of
the parsing process if an invalid production or token is found and return a
Left
value which we'll handle inside of our toplevel logic.
-- Parser monad
%monad { Except String } { (>>=) } { return }
%error { parseError }
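The parseError function named in the %error directive lives in the same Except monad; a minimal handler (using throwError from Control.Monad.Except) simply reports the offending token, or the fact that input ran out:

parseError :: [Token] -> Except String a
parseError (l:ls) = throwError (show l)
parseError [] = throwError "Unexpected end of input"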
And finally our production rules: the toplevel entry point for our parser will
be the expr rule. The left hand side of a production names a Happy rule, which
may be mutually recursive with other rules, while the right hand side is a
Haskell expression in which the dollar-sign metavariables ($1, $2, ...) refer to
the nth symbol matched on the left hand side.
$1  $2  $3  $4   $5 $6
let VAR '=' Expr in Expr   { App (Lam $2 $6) $4 }
-- Entry point
%name expr
-- Operators
%left '+' '-'
%left '*'
%%
Expr : let VAR '=' Expr in Expr { App (Lam $2 $6) $4 }
| '\\' VAR '->' Expr { Lam $2 $4 }
| Form { $1 }
Form : Form '+' Form { Op Add $1 $3 }
| Form '-' Form { Op Sub $1 $3 }
| Form '*' Form { Op Mul $1 $3 }
| Fact { $1 }
Fact : Fact Atom { App $1 $2 }
| Atom { $1 }
Atom : '(' Expr ')' { $2 }
| NUM { Lit (LInt $1) }
| VAR { Var $1 }
| true { Lit (LBool True) }
| false { Lit (LBool False) }
Notice how naturally we can write a left recursive grammar for our binary infix operators.
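To tie the generated modules together, a small driver runs the lexer, hands the token stream to the generated expr entry point, and unwraps the Except monad at the end; a sketch (the parseExpr name is our own choice):

parseExpr :: String -> Either String Expr
parseExpr input = runExcept $ do
  let tokenStream = scanTokens input
  expr tokenStream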
Switching back to our Parsec-based frontend: Parsec's default error reporting leaves a bit to be desired, but it does in fact contain most of the information needed to deliver better messages, packed inside the ParseError structure.
showSyntaxError :: L.Text -> ParseError -> String
showSyntaxError s err = L.unpack $ L.unlines [
" ",
" " <> lineContents,
" " <> ((L.replicate col " ") <> "^"),
(L.pack $ show err)
]
where
lineContents = (L.lines s) !! line
pos = errorPos err
line = sourceLine pos - 1
col = fromIntegral $ sourceColumn pos - 1
Now when we enter an invalid expression the error reporting will point us directly to the adjacent lexeme that caused the problem as is common in many languages.
λ> \x -> x +
\x -> x +
^
"<interactive>" (line 1, column 11):
unexpected end of input
expecting "(", character, literal string, "[", integer, "if" or identifier
Previously, our type inference engine generated fairly typical type inference error messages. If two terms couldn't be unified, it simply told us so, along with some information about the top-level declaration where the failure occurred, leaving us with a bit of a riddle about how exactly the error came to be.
Cannot unify types:
Int
with
Bool
in the definition of 'foo'
Effective error reporting in the presence of type inference is a difficult task; our typechecker takes the frontend AST and transforms it into a large constraint problem, destroying position information in the process. Even if the position information were tracked, the nature of unification is that a cascade of several unifications can lead to unsolvability, and the two syntactic constructs that immediately gave rise to a unification failure are not necessarily the two that map back to human intuition about how the type error arose. Very little research has been done on this topic and it remains an open problem with very immediate and applicable results to programming.
To do simple provenance tracking we will use a technique of tracking the "flow" of type information through our typechecker and associate position information with the inferred types.
type Name = String
data Expr
= Var Loc Name
| App Loc Expr Expr
| Lam Loc Name Expr
| Lit Loc Int
data Loc = NoLoc | Located Int
deriving (Show, Eq, Ord)
So now inside of our parser we simply attach Parsec position information to each AST node. For example, for the variable term:
variable :: Parser Expr
variable = do
x <- identifier
l <- sourceLine <$> getPosition
return (Var (Located l) x)
Our type system will also carry location information, although by default it
will use the NoLoc value until explicit information is provided during
inference. The two functions getLoc and setLoc will be used to query and update
the position information on type terms.
data Type
= TVar Loc TVar
| TCon Loc Name
| TArr Loc Type Type
deriving (Show, Eq, Ord)
newtype TVar = TV String
deriving (Show, Eq, Ord)
typeInt :: Type
typeInt = TCon NoLoc "Int"
setLoc :: Loc -> Type -> Type
setLoc l (TVar _ a) = TVar l a
setLoc l (TCon _ a) = TCon l a
setLoc l (TArr _ a b) = TArr l a b
getLoc :: Type -> Loc
getLoc (TVar l _) = l
getLoc (TCon l _) = l
getLoc (TArr l _ _) = l
Our fresh variable supply now also takes a location field which is attached to the resulting type variable.
fresh :: Loc -> Check Type
fresh l = do
s <- get
put s{count = count s + 1}
return $ TVar l (TV (letters !! count s))
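-- The 'letters' supply used above is the usual infinite stream of names
-- ("a", "b", ..., "z", "aa", "ab", ...); one standard definition, assuming
-- replicateM is imported from Control.Monad:
letters :: [String]
letters = [1..] >>= flip replicateM ['a'..'z']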
infer :: Expr -> Check Type
infer expr = case expr of
Var l n -> do
t <- lookupVar n
return $ setLoc l t
App l a b -> do
ta <- infer a
tb <- infer b
tr <- fresh l
unify ta (TArr l tb tr)
return tr
Lam l n a -> do
tv <- fresh l
ty <- inEnv (n, tv) (infer a)
return (TArr l (setLoc l ty) tv)
Lit l _ -> return (setLoc l typeInt)
Now specifically at the call site of our unification solver, if we encounter a unification failure we simply pluck the location information off the two type terms and plug it into the type error's fields.
unifies t1 t2 | t1 == t2 = return emptyUnifer
unifies (TVar _ v) t = v `bind` t
unifies t (TVar _ v) = v `bind` t
unifies (TArr _ t1 t2) (TArr _ t3 t4) = unifyMany [t1, t2] [t3, t4]
unifies (TCon _ a) (TCon _ b) | a == b = return emptyUnifer
unifies t1 t2 = throwError $ UnificationFail t1 (getLoc t1) t2 (getLoc t2)
bind :: TVar -> Type -> Solve Unifier
bind a t
| eqLoc t a = return (emptySubst, [])
| occursCheck a t = throwError $ InfiniteType a (getLoc t) t
| otherwise = return $ (Subst $ Map.singleton a t, [])
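The error constructors thrown above now carry the extracted locations as explicit fields; reconstructing their shape from these call sites (the exact constructor list is an assumption based on this snippet) gives roughly:

data TypeError
  = UnificationFail Type Loc Type Loc  -- the two conflicting types and their origins
  | InfiniteType TVar Loc Type         -- occurs check failure
  deriving (Show)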
So now we can explicitly trace the provenance of the specific constraints that gave rise to a given type error all the way back to the source that generated them.
Cannot unify types:
Int
Introduced at line 27 column 5
f 2 3
with
Int -> c
Introduced at line 5 column 9
let f x y = x y
This is of course the simplest implementation of the tracking method and could be further extended by giving a weighted ordering to the constraints based on their likelihood of importance and proximity and then choosing which location to report based on this information. This remains an open area of work.
Haskell's syntax uses indentation blocks to delineate sections of code. This use of indentation-sensitive layout to convey the structure of logic is sometimes called the offside rule in the parsing literature. At the beginning of a "laidout" block the first declaration or definition can start in any column, and the parser marks that indentation level. Every subsequent declaration at the same logical level must have the same indentation.
-- Start of layout ( Column: 0 )
fib :: Int -> Int
fib x = truncate $ ( 1 / sqrt 5 ) * ( phi ^ x - psi ^ x ) -- (Column: > 0)
-- Start of new layout ( Column: 2 )
where
-- Indented block ( Column: > 2 )
phi = ( 1 + sqrt 5 ) / 2
psi = ( 1 - sqrt 5 ) / 2
The Parsec monad is parameterized over a user state type baked into the monad, allowing us to embed custom parser state inside of our rules. To adapt our parser to handle indentation-sensitive whitespace we will use:
-- Indentation sensitive Parsec monad.
type IParsec a = Parsec Text ParseState a
data ParseState = ParseState
{ indents :: Column
} deriving (Show)
initParseState :: ParseState
initParseState = ParseState 0
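Running a parser in this monad just means supplying the initial state explicitly; with Parsec's runParser that looks like the following sketch (runIParsec is a hypothetical helper name, and the Stream instance for Text comes from Text.Parsec.Text):

runIParsec :: IParsec a -> SourceName -> Text -> Either ParseError a
runIParsec p = runParser p initParseState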
The parser stores the internal position state (SourcePos) during its
traversal, and makes it accessible inside of rule logic via the getPosition
function.
data SourcePos = SourcePos SourceName !Line !Column
getPosition :: Monad m => ParsecT s u m SourcePos
In terms of this function we can write combinators that query the current column
and then either succeed or fail to match based on the current indentation level.
The laidout combinator captures the current indentation state, pushes the column
of the block's first token into the indents field, runs the inner parser, and
then restores the previous state.
laidout :: Parsec s ParseState a -> Parsec s ParseState a
laidout m = do
cur <- indents <$> getState
pos <- sourceColumn <$> getPosition
modifyState $ \st -> st { indents = pos }
res <- m
modifyState $ \st -> st { indents = cur }
return res
We then define a guard which compares the current column against the stored indentation level, allowing the parse to proceed only when the given comparison holds.
indentCmp
:: (Column -> Column -> Bool)
-> Parsec s ParseState ()
indentCmp cmp = do
col <- sourceColumn <$> getPosition
current <- indents <$> getState
guard (col `cmp` current)
We can then write two combinators in terms of this function which match on either further or identical indentation.
indented :: IParsec ()
indented = indentCmp (>) <?> "Block (indented)"
align :: IParsec ()
align = indentCmp (==) <?> "Block (same indentation)"
On top of these we write our two combinators for handling block syntax, which match a sequence of vertically aligned patterns as a list.
block, block1 :: Parser a -> Parser [a]
block p = laidout (many (align >> p))
block1 p = laidout (many1 (align >> p))
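For example, a where clause can now be parsed as a keyword followed by a laid-out block of declarations; a hypothetical sketch, where reserved, decl, and Decl stand in for the language's own lexer and declaration parser:

whereClause :: Parser [Decl]
whereClause = do
  reserved "where"
  block1 decl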
Haskell uses an optional layout rule for several constructs, allowing us to equivalently delimit indentation-sensitive syntax explicitly with braces. The most common use is for do-notation. So for example:
example = do { a <- m; b }
example = do
a <- m
b
To support this in Parsec style we implement a maybeBraces
function.
maybeBraces :: Parser a -> Parser [a]
maybeBraces p = braces (endBy p semi) <|> block p
maybeBraces1 :: Parser a -> Parser [a]
maybeBraces1 p = braces (endBy1 p semi) <|> block p
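A do-block parser can then accept either form by delegating to maybeBraces1; in this illustrative sketch doBlock, statement, and the Do constructor are made-up names rather than part of our AST:

doBlock :: Parser Expr
doBlock = do
  reserved "do"
  stmts <- maybeBraces1 statement
  return (Do stmts)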
Haskell famously allows the definition of custom infix operators, an extremely useful language feature, although one that poses a bit of a challenge to parse! There are two ways to do this, and both depend on two properties of the operators:
- Precedence
- Associativity
- The first, the way that GHC does it, is to parse all operators as left associative and of the same precedence, and then before desugaring go back and "fix" the parse tree given all the fixity information collected after parsing.
- The second method is a bit of a hack, and involves simply storing the collected operators inside of the Parsec state monad and then calling buildExpressionParser on the current state each time we want to parse an infix operator expression.
To do the latter method we set up the AST objects for our fixity definitions, which associate precedence and associativity annotations with a custom symbol.
data FixitySpec = FixitySpec
{ fixityFix :: Fixity
, fixityName :: String
} deriving (Eq, Show)
data Assoc
= L
| R
| N
deriving (Eq,Ord,Show)
data Fixity
= Infix Assoc Int
| Prefix Int
| Postfix Int
deriving (Eq,Ord,Show)
Our parser state monad will hold a list of the active fixity specifications and whenever a definition is encountered we will append to this list.
data ParseState = ParseState
{ indents :: Column
, fixities :: [FixitySpec]
} deriving (Show)
initParseState :: ParseState
initParseState = ParseState 0 defaultOps
addOperator :: FixitySpec -> Parsec s ParseState ()
addOperator fixdecl = do
modifyState $ \st -> st { fixities = fixdecl : (fixities st) }
The initial state will consist of a set of default arithmetic and list operators, modelled on the standard Haskell fixities.
defaultOps :: [FixitySpec]
defaultOps = [
FixitySpec (Infix L 4) ">"
, FixitySpec (Infix L 4) "<"
, FixitySpec (Infix L 4) "/="
, FixitySpec (Infix L 4) "=="
, FixitySpec (Infix R 5) ":"
, FixitySpec (Infix L 6) "+"
, FixitySpec (Infix L 6) "-"
, FixitySpec (Infix L 5) "*"
, FixitySpec (Infix L 5) "/"
]
Now in our parser we need to be able to transform the fixity specifications into Parsec operator definitions. This is a pretty straightforward sort and group operation on the list.
fixityPrec :: FixitySpec -> Int
fixityPrec (FixitySpec (Infix _ n) _) = n
fixityPrec (FixitySpec _ _) = 0
toParser (FixitySpec ass tok) = case ass of
Infix L _ -> infixOp tok (op (Name tok)) Ex.AssocLeft
Infix R _ -> infixOp tok (op (Name tok)) Ex.AssocRight
Infix N _ -> infixOp tok (op (Name tok)) Ex.AssocNone
mkTable ops =
map (map toParser) $
groupBy ((==) `on` fixityPrec) $
reverse $ sortBy (compare `on` fixityPrec) $ ops
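The infixOp helper used in toParser builds a Parsec Operator table entry from a symbol name; a sketch of it follows, where reservedOp comes from the token lexer and op and Name are the expression-building pieces assumed by toParser above:

infixOp :: String -> (a -> a -> a) -> Ex.Assoc -> Ex.Operator Text ParseState Identity a
infixOp name f = Ex.Infix (reservedOp name >> return f)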
Now when parsing an infix operator declaration we simply do a state operation and add the operator to the parser state so that all subsequent definitions can use it. This differs from Haskell slightly in that operators must be defined before their usage in a module.
fixityspec :: Parser FixitySpec
fixityspec = do
fix <- fixity
prec <- precedence
op <- operator
semi
let spec = FixitySpec (fix prec) op
addOperator spec
return spec
where
fixity = Infix L <$ reserved "infixl"
<|> Infix R <$ reserved "infixr"
<|> Infix N <$ reserved "infix"
precedence :: Parser Int
precedence = do
n <- natural
if n <= 10
then return (fromInteger n)
else empty
<?> "Invalid operator precedence"
fixitydecl :: Parser Decl
fixitydecl = do
spec <- fixityspec
return $ FixityDecl spec
<?> "operator fixity definition"
And now when we need to parse an infix expression term we simply pull our state out, build the custom operator table, and feed it to buildExpressionParser just as before.
term :: Parser Expr -> Parser Expr
term a = do
st <- getState
let customOps = mkTable (fixities st)
Ex.buildExpressionParser customOps a
The tooling and documentation for Alex and Happy is well-developed, as both are used extensively inside of GHC:
- Alex User Guide
- Happy User Guide
- A Tool for Generalized LR Parsing In Haskell
- Haskell Syntax Definition
GHC itself uses Alex and Happy for its parser infrastructure. The resulting parser is rather sophisticated.
One of the few papers ever written on type error reporting gives some techniques for presentation and tracing provenance.