![](img/titles/extended_parser.png)

Extended Parser

Up until now we've been using parser combinators to build our parsers. Parser combinators build top-down parsers that formally belong to the $\mathtt{LL}(k)$ family of parsers. The parser proceeds top-down, with a sequence of $k$ characters used to dispatch on the leftmost production rule. Combined with backtracking (i.e. the try combinator) this is simultaneously both an extremely powerful and simple model to implement as we saw before with our simple 100 line parser library.

However, there is a family of grammars that include left recursion, which $\mathtt{LL}(k)$ parsers handle inefficiently and are often incapable of parsing at all. A left-recursive rule is one in which the leftmost symbol of the rule recurses on itself. For example:

$$ e ::=\ e\ \text{op}\ \text{atom} $$

We demonstrated before that we could handle these cases with the parser combinator chainl1, and while this works, it can in many cases be an inefficient use of the parser stack and lead to ambiguity.
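For reference, here is a minimal, self-contained sketch of that combinator-style workaround, written against the parsec package rather than the hand-rolled library from the earlier chapters; the Expr type and operator set here are purely illustrative. chainl1 parses one atom and then folds any trailing operator/atom pairs to the left, simulating the left-recursive rule without ever recursing on the leftmost symbol.

import Text.Parsec
import Text.Parsec.String (Parser)

-- Illustrative AST for this sketch only.
data Expr = Lit Int | Add Expr Expr
  deriving Show

-- chainl1 parses "atom (op atom)*" and folds the results to the left,
-- giving the effect of the left-recursive rule  e ::= e '+' atom | atom.
expr :: Parser Expr
expr = atom `chainl1` addOp
  where
    addOp = Add <$ char '+'

atom :: Parser Expr
atom = Lit . read <$> many1 digit

Running parse expr "" "1+2+3" yields Right (Add (Add (Lit 1) (Lit 2)) (Lit 3)), the left-associated tree, but every such chain is threaded through an explicit fold rather than being expressed directly in the grammar.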

The other major family of parsers, $\mathtt{LR}$, is not plagued by the same concerns over left recursion. On the other hand, $\mathtt{LR}$ parsers are considerably more complicated to implement, relying on a rather sophisticated method known as Tomita's algorithm to do the heavy lifting. The tooling for constructing the production rules in a form that can be handled by the algorithm is usually a DSL that generates the code for the parser. While the tooling is fairly robust, there is a level of indirection between us and the generated code that can often be a bit brittle to extend with custom logic.

The most common form of this toolchain is the Lex/Yacc lexer and parser generator which compile into efficient C parsers for $\mathtt{LR}$ grammars. Haskell's Happy and Alex are roughly the Haskell equivalent of these tools.

Toolchain

Our parser and lexer logic will be spread across two different modules.

  • Lexer.x
  • Parser.y

The code in each of these modules is a hybrid of the specific Alex/Happy grammar syntax and arbitrary Haskell logic that is spliced in. Code delineated by braces ({}) is regular Haskell, while code outside is parser and lexer logic.

-- **Begin Haskell Syntax**
{
{-# OPTIONS_GHC -w #-}

module Lexer (
  Token(..),
  scanTokens
) where

import Syntax
}
-- **End Haskell Syntax**

-- **Begin Alex Syntax**
%wrapper "basic"

$digit = 0-9
$alpha = [a-zA-Z]
$eol   = [\n]
-- **End Alex Syntax**

These files are used during the code generation of the two modules Lexer and Parser. The toolchain is accessible in several ways: first, via the command-line tools alex and happy, which generate the resulting modules when passed the appropriate input file.

$ alex Lexer.x    # Generates Lexer.hs
$ happy Parser.y  # Generates Parser.hs

Alternatively, the tools can be invoked automatically from the cabal file using the build-tools field.

  build-depends:       base, array
  build-tools:         alex, happy
  other-modules:       
    Parser,
    Lexer

So the resulting structure of our interpreter will have the following set of files.

  • Lexer.hs
  • Parser.hs
  • Eval.hs
  • Main.hs
  • Syntax.hs

Alex

Our lexer module will export our Token definition and a function for converting an arbitrary String into a token stream, i.e. a list of Tokens.

{
module Lexer (
  Token(..),
  scanTokens
) where

import Syntax
}

The tokens are simply an enumeration of the unique possible tokens in our grammar.

data Token 
  = TokenLet
  | TokenTrue
  | TokenFalse
  | TokenIn
  | TokenLambda
  | TokenNum Int
  | TokenSym String
  | TokenArrow
  | TokenEq
  | TokenAdd
  | TokenSub
  | TokenMul
  | TokenLParen
  | TokenRParen
  | TokenEOF
  deriving (Eq,Show)

scanTokens :: String -> [Token]
scanTokens = alexScanTokens

The token definition is a list of function definitions mapping atomic characters and alphabetical sequences to constructors for our Token datatype.

%wrapper "basic"

$digit = 0-9
$alpha = [a-zA-Z]
$eol   = [\n]

tokens :-

  -- Whitespace insensitive
  $eol                          ;
  $white+                       ;

  -- Comments
  "#".*                         ;

  -- Syntax
  let                           { \s -> TokenLet }
  True                          { \s -> TokenTrue }
  False                         { \s -> TokenFalse }
  in                            { \s -> TokenIn }
  $digit+                       { \s -> TokenNum (read s) }
  "->"                          { \s -> TokenArrow }
  \=                            { \s -> TokenEq }
  \\                            { \s -> TokenLambda }
  [\+]                          { \s -> TokenAdd }
  [\-]                          { \s -> TokenSub }
  [\*]                          { \s -> TokenMul }
  \(                            { \s -> TokenLParen }
  \)                            { \s -> TokenRParen }
  $alpha [$alpha $digit \_ \']* { \s -> TokenSym s }
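With these rules in place the generated scanTokens function turns source strings directly into token lists. For instance, given the rules above we would expect a session along these lines:

λ> scanTokens "let x = 1 in x + 2"
[TokenLet,TokenSym "x",TokenEq,TokenNum 1,TokenIn,TokenSym "x",TokenAdd,TokenNum 2]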

Happy

Using Happy and our previously defined lexer we'll write down the production rules for our simple untyped lambda calculus.

We start by defining a Syntax module where we define the AST we'll generate from running over the token stream to produce the program graph structure.

module Syntax where

type Name = String

data Expr
  = Lam Name Expr
  | App Expr Expr
  | Var Name
  | Lit Lit
  | Op Binop Expr Expr
  deriving (Eq,Show)

data Lit
  = LInt Int
  | LBool Bool
  deriving (Show, Eq, Ord)

data Binop = Add | Sub | Mul | Eql
  deriving (Eq, Ord, Show)

The token constructors are each assigned to a name that will be used in our production rules.

-- Lexer structure 
%tokentype { Token }

-- Token Names
%token
    let   { TokenLet }
    true  { TokenTrue }
    false { TokenFalse }
    in    { TokenIn }
    NUM   { TokenNum $$ }
    VAR   { TokenSym $$ }
    '\\'  { TokenLambda }
    '->'  { TokenArrow }
    '='   { TokenEq }
    '+'   { TokenAdd }
    '-'   { TokenSub }
    '*'   { TokenMul }
    '('   { TokenLParen }
    ')'   { TokenRParen }

The parser itself will live inside of a custom monad of our choosing. In this case we'll add error handling with the Except monad, which will break out of the parsing process if an invalid production or token is found and return a Left value that we'll handle inside of our toplevel logic.

-- Parser monad
%monad { Except String } { (>>=) } { return }
%error { parseError }
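The parseError handler named here lives in the trailing Haskell block of the grammar file. A minimal sketch, assuming Control.Monad.Except is imported in the header block; the exact error messages are our own wording:

{
parseError :: [Token] -> Except String a
parseError (t:_) = throwError ("Syntax error at token: " ++ show t)
parseError []    = throwError "Unexpected end of input"
}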

And finally, our production rules. The toplevel entry point for our parser will be the expr rule. The left-hand side of a production is a Happy rule, which can be mutually recursive, while the right-hand side is a Haskell expression with several metavariables (the dollar-sign variables) that map to the nth symbol of the left-hand side. For example:

$1  $2  $3  $4   $5 $6
let VAR '=' Expr in Expr    { App (Lam $2 $6) $4 }
-- Entry point
%name expr

-- Operators
%left '+' '-'
%left '*'
%%

Expr : let VAR '=' Expr in Expr    { App (Lam $2 $6) $4 }
     | '\\' VAR '->' Expr          { Lam $2 $4 }
     | Form                        { $1 }

Form : Form '+' Form               { Op Add $1 $3 }
     | Form '-' Form               { Op Sub $1 $3 }
     | Form '*' Form               { Op Mul $1 $3 }
     | Fact                        { $1 }

Fact : Fact Atom                   { App $1 $2 }
     | Atom                        { $1 }

Atom : '(' Expr ')'                { $2 }
     | NUM                         { Lit (LInt $1) }
     | VAR                         { Var $1 }
     | true                        { Lit (LBool True) }
     | false                       { Lit (LBool False) }

Notice how naturally we can write a left-recursive grammar for our binary infix operators.
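To tie the lexer and parser together, the trailing Haskell block of the grammar can also expose a small driver that runs the token stream through the generated expr entry point; the parseExpr name is our own choice, not something fixed by Happy:

parseExpr :: String -> Either String Expr
parseExpr = runExcept . expr . scanTokens

With the precedence declarations above, an input like 1 + 2 * 3 should come back as Right (Op Add (Lit (LInt 1)) (Op Mul (Lit (LInt 2)) (Lit (LInt 3)))), since later %left lines bind tighter, and a chain like 1 - 2 - 3 associates to the left.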

Syntax Errors

Parsec's default error reporting leaves a bit to be desired, but does in fact contain most of the information needed to deliver better messages packed inside the ParseError structure.

showSyntaxError :: L.Text -> ParseError -> String
showSyntaxError s err = L.unpack $ L.unlines [
      "  ",
      "  " <> lineContents,
      "  " <> ((L.replicate col " ") <> "^"),
      (L.pack $ show err)
    ]
  where
    lineContents = (L.lines s) !! line
    pos  = errorPos err
    line = sourceLine pos - 1
    col  = fromIntegral $ sourceColumn pos - 1
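A sketch of how this might be wired into a toplevel loop, assuming a Parsec entry point parseExpr :: L.Text -> Either ParseError Expr (both names here are illustrative):

-- Hypothetical driver: print either the decorated syntax error or the AST.
processInput :: L.Text -> IO ()
processInput source =
  case parseExpr source of
    Left err  -> putStrLn (showSyntaxError source err)
    Right ast -> print ast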

Now when we enter an invalid expression the error reporting will point us directly to the adjacent lexeme that caused the problem as is common in many languages.

λ> \x -> x + 
  
  \x -> x + 
            ^
"<interactive>" (line 1, column 11):
unexpected end of input
expecting "(", character, literal string, "[", integer, "if" or identifier

Type Error Provenance

Previously, our type inference engine generated the somewhat typical type inference error messages. If two terms couldn't be unified it simply told us so, along with some information about the top-level declaration where the failure occurred, leaving us with a bit of a riddle about how exactly this error came to be.

Cannot unify types: 
        Int
with 
        Bool
in the definition of 'foo'

Effective error reporting in the presence of type inference is a difficult task: our typechecker takes the frontend AST and transforms it into a large constraint problem, destroying position information in the process. Even if the position information were tracked, the nature of unification is that a cascade of several unifications can lead to unsolvability, and the two syntactic constructs that immediately gave rise to a unification failure are not necessarily the two that map back to human intuition about how the type error arose. Very little research has been done on this topic, and it remains an open problem with immediate and applicable results to programming.

To do simple provenance tracking we will use a technique of tracking the "flow" of type information through our typechecker and associate position information with the inferred types.

type Name = String

data Expr
  = Var Loc Name
  | App Loc Expr Expr
  | Lam Loc Name Expr
  | Lit Loc Int

data Loc = NoLoc | Located Int
  deriving (Show, Eq, Ord)

So now inside of our parser we simply attach Parsec position information to each AST node. For example, for the variable term:

variable :: Parser Expr
variable = do
  x <- identifier
  l <- sourceLine <$> getPosition
  return (Var (Located l) x)

Our types will also carry location information, although by default they will use the NoLoc value until explicit information is provided during inference. The two functions getLoc and setLoc will be used to update and query the position information on type terms.

data Type
  = TVar Loc TVar
  | TCon Loc Name
  | TArr Loc Type Type
  deriving (Show, Eq, Ord)

newtype TVar = TV String
  deriving (Show, Eq, Ord)

typeInt :: Type
typeInt = TCon NoLoc "Int"

setLoc :: Loc -> Type -> Type
setLoc l (TVar _ a)   = TVar l a
setLoc l (TCon _ a)   = TCon l a
setLoc l (TArr _ a b) = TArr l a b

getLoc :: Type -> Loc
getLoc (TVar l _) = l
getLoc (TCon l _) = l
getLoc (TArr l _ _) = l

Our fresh variable supply now also takes a location field which is attached to the resulting type variable.

fresh :: Loc -> Check Type
fresh l = do
  s <- get
  put s{count = count s + 1}
  return $ TVar l (TV (letters !! count s))

The infer function then attaches the location of each expression to the types it produces.

infer :: Expr -> Check Type
infer expr = case expr of
  Var l n -> do
    t <- lookupVar n
    return $ setLoc l t

  App l a b -> do
    ta <- infer a
    tb <- infer b
    tr <- fresh l
    unify ta (TArr l tb tr)
    return tr

  Lam l n a -> do
    tv <- fresh l
    ty <- inEnv (n, tv) (infer a)
    return (TArr l tv (setLoc l ty))

  Lit l _ -> return (setLoc l typeInt)

Now specifically at the call site of our unification solver, if we encounter a unification failure we simply pluck the location information off the two type terms and plug it into the type error fields.

unifies t1 t2 | t1 == t2 = return emptyUnifer
unifies (TVar _ v) t = v `bind` t
unifies t (TVar _ v) = v `bind` t
unifies (TArr _ t1 t2) (TArr _ t3 t4) = unifyMany [t1, t2] [t3, t4]
unifies (TCon _ a) (TCon _ b) | a == b = return emptyUnifer
unifies t1 t2 = throwError $ UnificationFail t1 (getLoc t1) t2 (getLoc t2)

bind ::  TVar -> Type -> Solve Unifier
bind a t
  | eqLoc t a        = return (emptySubst, [])
  | occursCheck a t  = throwError $ InfiniteType a (getLoc t) t
  | otherwise        = return $ (Subst $ Map.singleton a t, [])
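The error type itself now carries the locations alongside the offending types. A sketch consistent with the throwError calls above:

-- Each type is paired with the location it was introduced at, so the
-- reporter can quote the original source line in the message.
data TypeError
  = UnificationFail Type Loc Type Loc
  | InfiniteType TVar Loc Type
  deriving (Show)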

So now we can explicitly trace the provenance of the specific constraints that gave rise to a given type error all the way back to the source that generated them.

Cannot unify types: 
        Int
        Introduced at line 27 column 5

          f 2 3

with 
        Int -> c
        Introduced at line 5 column 9

          let f x y = x y

This is, of course, the simplest implementation of the tracking method. It could be further extended by giving a weighted ordering to the constraints based on their likelihood of importance and proximity, and then choosing which location to report based on this information. This remains an open area of work.

Indentation

Haskell's syntax uses indentation blocks to delineate sections of code. This use of indentation-sensitive layout to convey the structure of logic is sometimes called the offside rule in parsing literature. At the beginning of a "laidout" block the first declaration or definition can start in any column, and the parser marks that indentation level. Every subsequent declaration at the same logical level must have the same indentation.

-- Start of layout ( Column: 0 )
fib :: Int -> Int
fib x = truncate $ ( 1 / sqrt 5 ) * ( phi ^ x - psi ^ x ) -- (Column: > 0)
  -- Start of new layout ( Column: 2 )
  where
      -- Indented block ( Column: > 2 )
      phi = ( 1 + sqrt 5 ) / 2
      psi = ( 1 - sqrt 5 ) / 2

The Parsec monad is parameterized over a type which stands for the State layer baked into the monad, allowing us to embed custom parser state inside of our rules. To adapt our parser to handle indentation-sensitive whitespace we will use:

-- Indentation sensitive Parsec monad.
type IParsec a = Parsec Text ParseState a

data ParseState = ParseState
  { indents :: Column
  } deriving (Show)

initParseState :: ParseState
initParseState = ParseState 0

The parser stores the internal position state (SourcePos) during its traversal, and makes it accessible inside of rule logic via the getPosition function.

data SourcePos = SourcePos SourceName !Line !Column
getPosition :: Monad m => ParsecT s u m SourcePos

In terms of this function we can write logic that queries the current column count and then either succeeds or fails to match based on the current indentation level. The laidout combinator will capture the current indentation state and push it into the indents field in the parser state.

laidout :: Parsec s ParseState a -> Parsec s ParseState a
laidout m = do
  cur <- indents <$> getState
  pos <- sourceColumn <$> getPosition
  modifyState $ \st -> st { indents = pos }
  res <- m
  modifyState $ \st -> st { indents = cur }
  return res

We then have specific logic which guards the parser match by comparing the current indentation level to the stored indentation level.

indentCmp
  :: (Column -> Column -> Bool)
  -> Parsec s ParseState ()
indentCmp cmp = do
  col <- sourceColumn <$> getPosition
  current <- indents <$> getState
  guard (col `cmp` current)

We can then write two combinators in terms of this function which match on either further or identical indentation.

indented :: IParsec ()
indented = indentCmp (>) <?> "Block (indented)"

align :: IParsec ()
align = indentCmp (==) <?> "Block (same indentation)"

On top of these we write our two combinators for handling block syntax, which match a sequence of vertically aligned patterns as a list.

block, block1 :: Parser a -> Parser [a]
block  p = laidout (many (align >> p))
block1 p = laidout (many1 (align >> p))
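As a usage sketch, a where-style clause can then be parsed as an aligned block of bindings. Here identifier, reserved, and reservedOp are assumed to come from the project's token parser, expr is the toplevel expression parser, and Binding is an illustrative result type:

-- Hypothetical example: parse
--   where
--     x = e1
--     y = e2
-- as a laid-out block of (name, expression) bindings.
type Binding = (String, Expr)

binding :: Parser Binding
binding = do
  name <- identifier
  reservedOp "="
  body <- expr
  return (name, body)

whereClause :: Parser [Binding]
whereClause = do
  reserved "where"
  block1 binding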

Haskell uses an optional layout rule for several constructs, allowing us to equivalently delimit indentation-sensitive syntax manually with braces. The most common use is for do-notation. For example, the following two definitions are equivalent:

example = do { a <- m; b }

example = do
  a <- m
  b

To support this in Parsec style we implement a maybeBraces function.

maybeBraces :: Parser a -> Parser [a]
maybeBraces p = braces (endBy p semi) <|> block p

maybeBraces1 :: Parser a -> Parser [a]
maybeBraces1 p = braces (endBy1 p semi) <|> block1 p

Extensible Operators

Haskell famously allows the definition of custom infix operators, an extremely useful language feature, although it poses a bit of a challenge to parse! There are two ways to do this and both depend on two properties of the operators:

  • Precedence
  • Associativity

  1. The first approach, the way that GHC does it, is to parse all operators as left-associative and of the same precedence, and then, before desugaring, go back and "fix" the parse tree using all the fixity information collected after finishing parsing.

  2. The second method is a bit of a hack, and involves storing the collected operators inside of the Parsec state monad and then calling buildExpressionParser on the current state each time we want to parse an infix operator expression.

To implement the latter method, we set up the AST objects for our fixity definitions, which associate precedence and associativity annotations with a custom symbol.

data FixitySpec = FixitySpec
  { fixityFix :: Fixity
  , fixityName :: String
  } deriving (Eq, Show)

data Assoc
  = L
  | R
  | N
  deriving (Eq,Ord,Show)

data Fixity
  = Infix Assoc Int
  | Prefix Int
  | Postfix Int
  deriving (Eq,Ord,Show)
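For example, a user-level declaration such as infixl 6 + is represented as:

FixitySpec (Infix L 6) "+"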

Our parser state monad will hold a list of the active fixity specifications and whenever a definition is encountered we will append to this list.

data ParseState = ParseState
  { indents :: Column
  , fixities :: [FixitySpec]
  } deriving (Show)

initParseState :: ParseState
initParseState = ParseState 0 defaultOps

addOperator ::  FixitySpec -> Parsec s ParseState ()
addOperator fixdecl = do
  modifyState $ \st -> st { fixities = fixdecl : (fixities st) }

The initial state will consist of the default arithmetic and list operators, defined with the same fixities as in the Haskell specification.

defaultOps :: [FixitySpec]
defaultOps = [
    FixitySpec (Infix L 4) ">"
  , FixitySpec (Infix L 4) "<"
  , FixitySpec (Infix L 4) "/="
  , FixitySpec (Infix L 4) "=="

  , FixitySpec (Infix R 5) ":"

  , FixitySpec (Infix L 6) "+"
  , FixitySpec (Infix L 6) "-"

  , FixitySpec (Infix L 5) "*"
  , FixitySpec (Infix L 5) "/"
  ]

Now in our parser we need to be able to transform the fixity specifications into Parsec operator definitions. This is a pretty straightforward sort and group operation on the list.

fixityPrec :: FixitySpec -> Int
fixityPrec (FixitySpec (Infix _ n) _) = n
fixityPrec (FixitySpec _ _) = 0

toParser (FixitySpec ass tok) = case ass of
    Infix L _ -> infixOp tok (op (Name tok)) Ex.AssocLeft
    Infix R _ -> infixOp tok (op (Name tok)) Ex.AssocRight
    Infix N _ -> infixOp tok (op (Name tok)) Ex.AssocNone

mkTable ops =
  map (map toParser) $
    groupBy ((==) `on` fixityPrec) $
      reverse $ sortBy (compare `on` fixityPrec) $ ops

Now when parsing an infix operator declaration we simply do a state operation and add the operator to the parser state so that all subsequent definitions can use it. This differs from Haskell slightly in that operators must be defined before their usage in a module.

fixityspec :: Parser FixitySpec
fixityspec = do
  fix  <- fixity
  prec <- precedence
  op   <- operator
  semi
  let spec = FixitySpec (fix prec) op
  addOperator spec
  return spec
  where
    fixity = Infix L <$ reserved "infixl"
         <|> Infix R <$ reserved "infixr"
         <|> Infix N <$ reserved "infix"

precedence :: Parser Int
precedence = do
  n <- natural
  if n <= 10
  then return (fromInteger n)
  else empty
  <?> "Invalid operator precedence"

fixitydecl :: Parser Decl
fixitydecl = do
  spec <- fixityspec
  return $ FixityDecl spec
 <?> "operator fixity definition"

And now when we need to parse an infix expression term we simply pull our state out and build the custom operator table, and feed this to buildExpressionParser just as before.

term :: Parser Expr -> Parser Expr
term a = do
  st <- getState
  let customOps = mkTable (fixities st)
  Ex.buildExpressionParser customOps a
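A plausible way to wire this into the rest of the grammar (the aexp name for the application-expression parser is our assumption) is simply to build the infix layer on top of it:

-- Infix expressions are layered over application expressions using
-- whatever operators are currently in scope.
expr :: Parser Expr
expr = term aexp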

Full Source

Resources

The tooling and documentation for Alex and Happy are well-developed, as they are used extensively inside of GHC:

GHC itself uses Alex and Happy for its parser infrastructure. The resulting parser is rather sophisticated.

One of the few papers ever written on type error reporting gives some techniques for presentation and tracing provenance:

\clearpage