Author : Jack Robbins
This project is an interpreter for an interpreted, Pascal-like programming language. Interpreted languages, such as the language in this project and better-known ones like Python and R, are executed line by line, as opposed to compiled languages, where programs are translated into machine code all at once before execution. The interpreter for this language is made up of two main components: a tokenizer/lexical analyzer and a recursive-descent parser/interpreter. To help show how this programming language is supposed to look and function, numerous example programs are provided in the tests folder. If you are interested, I would greatly encourage you to read the formal write-up below, download the code, and write some programs yourself in this programming language.
In short, with this project we can take a program like this, written in our own custom programming language:

```pascal
program circle;
var
  {Clean program testing nested if statement}
  r, a, p, b : real;
  flag : boolean := true;
  i, j : integer := 0;
  str : string := 'End of Program';
begin
  r := 6;
  p := 0;
  a := 5;
  if (a > 0 and a < 100) then
  begin
    b := 25;
    if (i = 0) then j := r + 1;
    p := 2 * 3.14 * r
  end
  else
  begin
    p := -1;
    flag := false
  end;
  {Display the results}
  writeln('The result of a = ', a, ', ', b);
  writeln('The result of p = ', p);
  writeln(str)
end.
```
And run it, giving us an output that looks like this:

```
The result of a = 5.00, 25.00
The result of p = 37.68
End of Program
```
Note
There is adequate documentation within the source code itself, but for a full theoretical understanding of this project, I strongly recommend reading the full write-up that follows this executive summary.
Every programming language must have a grammar: a ruleset that defines which statements are valid in the language. The grammars used for programming languages are context-free grammars, so called because each rule can be applied regardless of the surrounding context, generating valid sentences without regard to their meaning. EBNF is a notation for formally describing context-free grammars, which are the theoretical foundation for all programming languages. The formal language generated by a context-free grammar is the set of terminal strings that can be produced by its rules.
Mathematically, any context-free grammar $G$ is defined as the 4-tuple $G = (V, \Sigma, R, S)$, where:

- $V$ is the finite set of non-terminals, where each element $v \in V$ is a symbol that can be replaced using the rules of the grammar.
- $\Sigma$ is the finite set of terminals, with $\Sigma \cap V = \varnothing$. Terminals are the fixed symbols of the language; no rule can replace them.
- $R$ is a finite subset of the binary relation $V \times (V \cup \Sigma)^*$; each rule rewrites a non-terminal as a string $\beta \in (V \cup \Sigma)^*$ of terminals and non-terminals.
- $S$ is the start/entry symbol for the grammar. It is necessary that $S \in V$.
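To make these definitions concrete, here is a classic toy example (not this project's grammar): the grammar of balanced parentheses.

```latex
G = (V, \Sigma, R, S), \qquad
V = \{S\}, \qquad
\Sigma = \{\,(\,,\,)\,\}, \qquad
R = \{\, S \to S S \mid (\,S\,) \mid \varepsilon \,\}
```

Starting from $S$ and repeatedly applying rules from $R$, this grammar generates exactly the strings of balanced parentheses, such as $(())()$.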
Together, these four components allow the grammar to generate all valid "sentences", or in our case, programmatic statements, for our language.

We have now explored the mathematical foundation for context-free grammars. As stated above, EBNF meta-syntax is used to describe context-free grammars, which are the foundation for all programming languages, including the language implemented in this project. The table below is a handy "cheat sheet" containing everything you will need to know to understand the EBNF rules for this language.
Important
Listed below is the notation that will be used in the formal EBNF definition of this programming language in the next section. It is important that this notation is understood before reading the rules for this language.
Usage | Notation | Description |
---|---|---|
Language Terminals | Any UPPERCASE text | Terminals are the lexical foundation of the language. They are symbols that cannot be changed using the rules of the grammar. The final output after applying all grammar rules will be a string of terminals. The terminals that are in this language are further defined below |
Language Nonterminals | Any bolded text | Nonterminals, or syntactic variables, are symbols that can be replaced. |
Optional Repetition | {. . . . .} | Indicates that the sequence of terminals and nonterminals inside may be repeated zero or more times |
Optional Inclusion | [. . . . .] | Indicates that the sequence of terminals and nonterminals inside may be used zero times or once |
Grouping | (. . . . .) | Indicates that the sequence of terminals and nonterminals inside is grouped together; usually used together with the \| operator |
Concatenation | , | Indicates concatenation of items |
Alternation | \| | Indicates that the items on the left and right are used "either-or" |
Definition | ::= | The non-terminal on the left of the definition operator is defined as the sequence of terminals and nonterminals on the right of the operator |
With this brief introduction in mind, let's take a look at the EBNF ruleset for the language implemented in this project.
- Prog ::= PROGRAM IDENT ; DeclPart CompoundStmt
- DeclPart ::= VAR DeclStmt { ; DeclStmt }
- DeclStmt ::= IDENT {, IDENT } : Type [:= Expr]
- Type ::= INTEGER | REAL | BOOLEAN | STRING
- Stmt ::= SimpleStmt | StructuredStmt
- SimpleStmt ::= AssignStmt | WriteLnStmt | WriteStmt
- StructuredStmt ::= IfStmt | CompoundStmt
- CompoundStmt ::= BEGIN Stmt {; Stmt } END
- WriteLnStmt ::= WRITELN (ExprList)
- WriteStmt ::= WRITE (ExprList)
- IfStmt ::= IF Expr THEN Stmt [ ELSE Stmt ]
- AssignStmt ::= Var := Expr
- Var ::= IDENT
- ExprList ::= Expr { , Expr }
- Expr ::= LogAndExpr { OR LogAndExpr }
- LogAndExpr ::= RelExpr {AND RelExpr }
- RelExpr ::= SimpleExpr [ ( = | < | > ) SimpleExpr ]
- SimpleExpr ::= Term { ( + | - ) Term }
- Term ::= SFactor { ( * | / | DIV | MOD ) SFactor }
- SFactor ::= [( - | + | NOT )] Factor
- Factor ::= IDENT | ICONST | RCONST | SCONST | BCONST | (Expr)
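To see these rules in action, here is a sketch of how the statement `j := r + 1` from the example program can be derived from the grammar (each step applies one of the rules above):

```
Stmt
  -> SimpleStmt                 Stmt ::= SimpleStmt | StructuredStmt
  -> AssignStmt                 SimpleStmt ::= AssignStmt | ...
  -> Var := Expr                AssignStmt ::= Var := Expr
  -> IDENT := Expr              Var ::= IDENT
  -> IDENT := SimpleExpr        Expr, LogAndExpr, RelExpr collapse (no or/and/relational operators here)
  -> IDENT := Term + Term       SimpleExpr ::= Term { ( + | - ) Term }
  -> IDENT := IDENT + ICONST    Term ::= SFactor ..., Factor ::= IDENT | ICONST | ...
  =  j := r + 1                 after substituting the concrete lexemes
```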
With this description of our language in mind, let's explore the project's structure and function.

This project is made up of two main components that work in tandem to run a file, written in our coding language, that the user gives us. Additionally, there are several less important components; those are covered throughout the explanation but have not been given their own individual sections.

The two main components are the tokenizer/lexical analyzer, in lex.cpp, and the parser/interpreter, in parserInterp.cpp. Let's explore each one individually.
Tokenizing/Lexical Analysis in lex.cpp
Tokenizing/lexical analysis refers to translating a program file that a user has written into a stream of tokens that we can use to run the program. lex.cpp is responsible for exactly this task.
This is accomplished through the main function of lex.cpp:

```cpp
LexItem getNextToken(istream& in, int& linenum);
```
This function takes a reference to an istream object and a reference to an integer holding the current line number, and acts as a state machine on the characters it encounters. For example, if it observes that the next character is a letter, it automatically enters the INID (inside of identifier) state and processes the following characters as part of an identifier. It does the same for integer, real, string, and boolean constants. If at any point the lexical analyzer runs into a lexeme (word) that is not a recognized part of the language, it returns the ERR token. It is important to note that this program only tokenizes and analyzes the lexemes of the language; it does not check for syntactic or logical correctness. That is all handled by the parser/interpreter.
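To illustrate the state-machine idea, here is a hedged, self-contained miniature (the `Tok` names, `SimpleLex` struct, and `nextToken` function are simplified stand-ins, not the actual lex.cpp API):

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical, trimmed-down token kinds; the real lexer defines many more.
enum class Tok { IDENT, ICONST, RCONST, ERR, DONE };

struct SimpleLex { Tok tok; std::string lexeme; };

// Minimal sketch of the state-machine idea behind getNextToken: look at the
// next character, enter a state (identifier, number, ...), and keep consuming
// characters that belong to that state.
SimpleLex nextToken(std::istream& in, int& linenum) {
    char c;
    while (in.get(c)) {
        if (c == '\n') { ++linenum; continue; }
        if (isspace(static_cast<unsigned char>(c))) continue;

        if (isalpha(static_cast<unsigned char>(c))) {        // INID state
            std::string lexeme(1, c);
            while (in.get(c) && (isalnum(static_cast<unsigned char>(c)) || c == '_' || c == '$'))
                lexeme += c;
            if (in) in.putback(c);                           // first char past the identifier
            return {Tok::IDENT, lexeme};
        }
        if (isdigit(static_cast<unsigned char>(c))) {        // ININT / INREAL states
            std::string lexeme(1, c);
            bool real = false;
            while (in.get(c) && (isdigit(static_cast<unsigned char>(c)) || (c == '.' && !real))) {
                if (c == '.') real = true;
                lexeme += c;
            }
            if (in) in.putback(c);
            return {real ? Tok::RCONST : Tok::ICONST, lexeme};
        }
        return {Tok::ERR, std::string(1, c)};                // unrecognized lexeme
    }
    return {Tok::DONE, ""};                                  // end of input
}
```

The real getNextToken handles many more states (strings, comments, operators) in the same peek-classify-consume pattern.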
There is a special function that is used when dealing with keywords/reserved words in our language. This function is:

```cpp
LexItem id_or_kw(const string& lexeme, int linenum);
```
This function is called by getNextToken because there is a need to differentiate between identifiers (variable names) and reserved words. For example, when the program observes the lexeme "writeln", it must determine whether writeln is a valid variable name or a reserved word in our language. For this reason, when in the INID state, getNextToken calls id_or_kw(), which compares the lexeme against a map of all of our reserved words to determine whether it is an identifier or a keyword.
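The lookup itself can be sketched like this (the `idOrKw` name and the trimmed token set are illustrative; the real id_or_kw also builds a LexItem carrying the line number):

```cpp
#include <map>
#include <string>

// Hypothetical, trimmed-down token set; the real lexer has the full list.
enum class Token { IDENT, WRITELN, IF, BEGIN };

// Sketch of the id_or_kw idea: compare the lexeme against a map of reserved
// words; anything not found is an ordinary identifier.
Token idOrKw(const std::string& lexeme) {
    static const std::map<std::string, Token> kwmap = {
        {"writeln", Token::WRITELN}, {"if", Token::IF}, {"begin", Token::BEGIN},
    };
    auto it = kwmap.find(lexeme);
    return it != kwmap.end() ? it->second : Token::IDENT;
}
```

A `static const` map means the table is built once, and lookup is a single O(log n) search per lexeme.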
Once a token is returned by lex.cpp, that token is processed by parserInterp.cpp. We will discuss parserInterp.cpp soon, but before that, provided below is a comprehensive list of all of the tokens that can be generated by the tokenizer.
Identifiers and constants follow certain rules, and will only be tokenized if those rules are obeyed. These rules and some examples are shown in the table below.
Token | Description | Regular Expression Definition | Valid Examples | Invalid Examples |
---|---|---|---|---|
IDENT | An identifier is a user-defined program variable | IDENT := Letter {( Letter \| Digit \| _ \| $ )}, Letter := [a-z A-Z], Digit := [0-9] | hello$, myVar, first_name | $hello, 1st_name, _name |
SCONST | A string constant is defined as a sequence of characters enclosed in single quotes | SCONST := 'Any character string' | 'Hello $5 #., s9 my name is', 'hello' | "This is an invalid string due to the double quotes" |
ICONST | An integer constant is any valid integer | ICONST := [0-9]+ | 2, 200, 3444 | -68, 9.56 |
RCONST | A real constant is any valid real number | RCONST := ([0-9]+).([0-9]*) | 9.01, 0.2, 2. | .2, 2.3.4 |
BCONST | A boolean constant is either true or false | BCONST := (true \| false) | true, false | tRuE, FALSE |
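The table's regular expressions can be checked directly with std::regex; the sketch below transcribes the IDENT and RCONST rules (the actual tokenizer matches character by character rather than using `<regex>`):

```cpp
#include <regex>
#include <string>

// IDENT := Letter {( Letter | Digit | _ | $ )}
bool isIdent(const std::string& s) {
    static const std::regex re("[a-zA-Z][a-zA-Z0-9_$]*");
    return std::regex_match(s, re);
}

// RCONST := ([0-9]+).([0-9]*)  -- digits, a dot, then optional digits
bool isRConst(const std::string& s) {
    static const std::regex re("[0-9]+\\.[0-9]*");
    return std::regex_match(s, re);
}
```

Note how the RCONST rule accepts "2." (empty fraction) but rejects ".2" (no integer part), matching the table's valid/invalid examples.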
Furthermore, our language has certain reserved words. A reserved word is a word that may not be used as a variable name by the programmer, as the word has some syntactic function in the language. For the sake of brevity, the tokens that these reserved words are tokenized into are also included here.
Reserved Word | Token |
---|---|
and | AND |
begin | BEGIN |
boolean | BOOLEAN |
div | IDIV |
end | END |
else | ELSE |
false | FALSE |
if | IF |
integer | INTEGER |
mod | MOD |
not | NOT |
or | OR |
program | PROGRAM |
real | REAL |
string | STRING |
write | WRITE |
writeln | WRITELN |
var | VAR |
true | TRUE |
Our language also has various arithmetic/logical operators that can be used. The table below shows the operators and their functions. Just like with the reserved words, the token for each operator/keyword is included in this table.
Operator Symbol/Keyword | Token | Description |
---|---|---|
+ | PLUS | Arithmetic addition or concatenation |
- | MINUS | Arithmetic subtraction |
* | MULT | Multiplication |
/ | DIV | Division |
:= | ASSOP | Assignment operator |
= | EQ | Equality |
< | LTHAN | Less than operator |
> | GTHAN | Greater than operator |
and | AND | Logical conjunction |
or | OR | Logical disjunction |
not | NOT | Logical complement |
div | IDIV | Integer division |
mod | MOD | Modulus operator |
Additionally, there are several delimiters that the language makes syntactic use of. These are shown in the table below.
Delimiter | Token |
---|---|
, | COMMA |
; | SEMICOL |
: | COLON |
. | DOT |
( | LPAREN |
) | RPAREN |
Finally, there are a few miscellaneous tokens/rules that you should be aware of:
- Anything enclosed in {} will be ignored by the tokenizer as a comment
- Any word that is not a lexeme in the language will be returned with an ERR token
- The DONE token is returned when the tokenizer reaches the end-of-file (EOF) character
With that, we have covered everything in lex.cpp. As stated before, once a token is generated by the lexical analyzer, it is passed on to the parser/interpreter in parserInterp.cpp, which is what we will explore next.
Syntax Analysis/Interpretation in parserInterp.cpp
parserInterp.cpp is responsible for performing syntactic analysis and execution for our program. Syntax analysis refers to checking that each statement is valid within the frame of our grammar rules, as well as catching certain other cases of invalid syntax or statements. Before diving in, we should first see a full description of the language:
- There are four basic types: INTEGER, REAL, STRING and BOOLEAN
- There are operator precedence and associativity rules (see table below)
- An if-statement evaluates a logical expression and executes its body if it is true, and does not if it isn't. An else-clause is optional for an if-statement, and is only evaluated if the logical expression is false
- It is an error to use a variable before it has been defined
- WRITELN and WRITE statements print out their comma-separated insides from left-to-right
- The ASSOP ( := ) operator in the AssignStmt assigns a value to a variable. It evaluates the Expr on the right-hand side and saves its value in a memory location associated with the left-hand-side variable (Var). A left-hand-side variable of a numeric type must be assigned a numeric value; type conversion is automatically applied if the numeric type of the evaluated right-hand-side expression does not match the numeric type of the left-hand-side variable. A left-hand-side variable of string or boolean type, however, must be assigned a value of exactly that type.
- Binary operations for numeric operators may only be performed on numeric operands
- Binary operations for boolean operators may only be performed on boolean operands
- Relational operators (=, <, >) operate only on two operands of compatible types
- Unary sign operators (+/-) operate only on numeric types, whereas the unary NOT operator operates only on booleans
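The assignment-conversion rule above can be sketched as follows (the `VType` enum and `Num` struct are simplified placeholders, not the project's actual Value class):

```cpp
// Sketch of the assignment rule: a numeric right-hand side is converted to
// the numeric type of the left-hand variable. Real -> integer truncates;
// integer -> real widens. String/boolean assignments (not shown) must match
// the variable's type exactly.
enum class VType { VINT, VREAL };

struct Num { VType t; double v; };

Num assignTo(VType lhsType, Num rhs) {
    if (lhsType == VType::VINT)
        return {VType::VINT, static_cast<double>(static_cast<long>(rhs.v))};  // truncate
    return {VType::VREAL, rhs.v};                                             // widen
}
```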
Operator Precedence and Associativity
Note
Items with a lower precedence level get executed first
Precedence Level | Operator | Description | Associativity |
---|---|---|---|
1 | Unary +, - , not | Operates on a single numeric or boolean operand | Right-to-left |
2 | *, / , div, mod | Operate on two numeric operands | Left-to-right |
3 | +, - | Arithmetic addition and subtraction | Left-to-right |
4 | <, >, = | Relational comparison | No cascading |
5 | and | Logical conjunction | Left-to-right |
6 | or | Logical disjunction | Left-to-right |
parserInterp.cpp is also designed to check for valid syntax according to these rules, and to obey the operator precedence. It achieves the latter through its design as a recursive-descent parser.

The program parserInterp.cpp is designed as an implicit recursive-descent parser. That is, direct recursion is not part of its overall design; recursion is achieved through the way every function calls the functions below it in the grammar. There is no explicit parse-tree data structure either, but instead an implicit parse tree created by how the methods call each other.
Recall our EBNF Ruleset from above:
- Prog ::= PROGRAM IDENT ; DeclPart CompoundStmt
- DeclPart ::= VAR DeclStmt { ; DeclStmt }
- DeclStmt ::= IDENT {, IDENT } : Type [:= Expr]
- Type ::= INTEGER | REAL | BOOLEAN | STRING
- Stmt ::= SimpleStmt | StructuredStmt
- SimpleStmt ::= AssignStmt | WriteLnStmt | WriteStmt
- StructuredStmt ::= IfStmt | CompoundStmt
- CompoundStmt ::= BEGIN Stmt {; Stmt } END
- WriteLnStmt ::= WRITELN (ExprList)
- WriteStmt ::= WRITE (ExprList)
- IfStmt ::= IF Expr THEN Stmt [ ELSE Stmt ]
- AssignStmt ::= Var := Expr
- Var ::= IDENT
- ExprList ::= Expr { , Expr }
- Expr ::= LogAndExpr { OR LogAndExpr }
- LogAndExpr ::= RelExpr {AND RelExpr }
- RelExpr ::= SimpleExpr [ ( = | < | > ) SimpleExpr ]
- SimpleExpr ::= Term { ( + | - ) Term }
- Term ::= SFactor { ( * | / | DIV | MOD ) SFactor }
- SFactor ::= [( - | + | NOT )] Factor
- Factor ::= IDENT | ICONST | RCONST | SCONST | BCONST | (Expr)
Every bolded word has its own method defined in parserInterp.cpp, as shown in these function signatures from the header file parserInterp.h:
```cpp
extern bool Prog(istream& in, int& line);
extern bool DeclPart(istream& in, int& line);
extern bool DeclStmt(istream& in, int& line);
extern bool Stmt(istream& in, int& line);
extern bool StructuredStmt(istream& in, int& line);
extern bool CompoundStmt(istream& in, int& line);
extern bool SimpleStmt(istream& in, int& line);
extern bool WriteLnStmt(istream& in, int& line);
extern bool WriteStmt(istream& in, int& line);
extern bool IfStmt(istream& in, int& line);
extern bool AssignStmt(istream& in, int& line);
extern bool Var(istream& in, int& line, LexItem & idtok);
extern bool ExprList(istream& in, int& line);
extern bool Expr(istream& in, int& line, Value & retVal);
extern bool LogANDExpr(istream& in, int& line, Value & retVal);
extern bool RelExpr(istream& in, int& line, Value & retVal);
extern bool SimpleExpr(istream& in, int& line, Value & retVal);
extern bool Term(istream& in, int& line, Value & retVal);
extern bool SFactor(istream& in, int& line, Value & retVal);
extern bool Factor(istream& in, int& line, Value & retVal, int sign);
```
Through this, we achieve a recursive-descent parse tree.
Here is an example to make things clear: Stmt calls StructuredStmt, which then calls CompoundStmt, which then calls Stmt.

As you can see, Stmt does not directly call itself, but it triggers a series of other function calls that lead to it being called again. Through this indirect recursion, we generate a parse tree that analyzes the syntax of each respective component of the language and evaluates it accordingly.
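A toy version of this mutual recursion might look like the following, where nested begin/end blocks are handled simply because CompoundStmt calls Stmt, which may call CompoundStmt again (the word-based stream here is a stand-in for the real token stream):

```cpp
#include <sstream>
#include <string>

// Forward declaration: Stmt and CompoundStmt call each other.
bool Stmt(std::istringstream& in, std::string& word);

bool CompoundStmt(std::istringstream& in, std::string& word) {
    // Precondition: word == "begin". Consume statements until "end".
    if (!(in >> word)) return false;
    while (word != "end") {
        if (!Stmt(in, word)) return false;   // Stmt may recurse back here
        if (!(in >> word)) return false;     // missing "end" is an error
    }
    return true;
}

bool Stmt(std::istringstream& in, std::string& word) {
    if (word == "begin") return CompoundStmt(in, word);  // indirect recursion
    return word == "stmt";                               // a "simple statement"
}

// Returns true if the whitespace-separated words form properly nested blocks.
bool parse(const std::string& program) {
    std::istringstream in(program);
    std::string word;
    if (!(in >> word)) return false;
    return Stmt(in, word);
}
```

Arbitrary nesting depth falls out of the call stack itself; no explicit tree structure is ever built, which is exactly the "implicit parse tree" described above.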
Speaking of evaluation and values, there is a third file that we have not yet discussed: val.cpp. val.cpp and the header file val.h contain the definition of the Value class, which is used to store all of the values returned in our program.

The definition of the Value class is as follows:
```cpp
enum ValType { VINT, VREAL, VSTRING, VBOOL, VERR };

class Value {
    ValType T;
    bool Btemp;
    int Itemp;
    double Rtemp;
    string Stemp;
    // ... constructors, accessors, and overloaded operators follow
};
```
Additionally, val.cpp contains overloaded operators so that we can perform operations between two objects of the Value class. Their signatures are as follows:
```cpp
// numeric overloaded add this to op
Value operator+(const Value& op) const;
// numeric overloaded subtract op from this
Value operator-(const Value& op) const;
// numeric overloaded multiply this by op
Value operator*(const Value& op) const;
// numeric overloaded divide this by oper
Value operator/(const Value& op) const;
// numeric overloaded modulus of this by oper
Value operator%(const Value& oper) const;
// numeric integer division this by oper
Value div(const Value& oper) const;
// overloaded equality operator of this with op
Value operator==(const Value& op) const;
// overloaded greater than operator of this with op
Value operator>(const Value& op) const;
// overloaded less than operator of this with op
Value operator<(const Value& op) const;
// integer divide operator of this by op
Value idiv(const Value& op) const;
// Logic operations
Value operator&&(const Value& oper) const;  // overloaded ANDing operator
Value operator||(const Value& oper) const;  // overloaded ORing operator
Value operator!() const;                    // overloaded NOT/complement operator
```
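To show how such an overload can implement the numeric promotion rules, here is a hedged miniature of operator+ (the member names mirror the class definition above, but the helpers `Int`, `Real`, and `asReal` are illustrative additions, not the project's val.cpp):

```cpp
// Miniature Value with only the numeric members, to demonstrate promotion:
// int + int stays int; any real operand promotes the result to real;
// non-numeric operands yield the VERR value.
enum ValType { VINT, VREAL, VSTRING, VBOOL, VERR };

struct Value {
    ValType T = VERR;
    int Itemp = 0;
    double Rtemp = 0.0;

    static Value Int(int i)     { Value v; v.T = VINT;  v.Itemp = i; return v; }
    static Value Real(double r) { Value v; v.T = VREAL; v.Rtemp = r; return v; }
    double asReal() const { return T == VINT ? Itemp : Rtemp; }

    Value operator+(const Value& op) const {
        if (T == VINT && op.T == VINT)
            return Int(Itemp + op.Itemp);
        if ((T == VINT || T == VREAL) && (op.T == VINT || op.T == VREAL))
            return Real(asReal() + op.asReal());
        return Value{};   // type mismatch: VERR
    }
};
```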
Every time we execute an expression, and need to save the result, we wrap the result in an instance of the Value class, so that we are able to store the type of the expression's result and the actual value all in one convenient class. This also helps for type checking.
On the topic of storage, there are several map containers that are important for parserInterp.cpp's function:
```cpp
// defVar keeps track of all variables that have been defined in the program thus far
map<string, bool> defVar;
// SymTable keeps track of the type for all of our variables
map<string, Token> SymTable;
// Container of temporary locations of Value objects for results of expressions,
// variable values, and constants. Key is a variable name, value is its Value.
// Holds all variables defined by assign statements.
map<string, Value> TempsResults;
// Declare a pointer variable to a queue of Value objects
queue<Value>* ValQue;
```
The comments are pretty detailed, but one thing worth mentioning is that ValQue is only used by the write and writeln methods, as it is a queue of all the values to be printed to the console.
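As a sketch of how these containers cooperate on an assignment such as `j := 7.9` (with simplified Token and Value stand-ins, not the project's real types):

```cpp
#include <map>
#include <string>

// Simplified placeholders: the real SymTable stores the lexer's Token type,
// and the real TempsResults stores full Value objects.
enum class Token { INTEGER, REAL };
using Value = double;

std::map<std::string, bool>  defVar;       // was the variable declared?
std::map<std::string, Token> SymTable;     // declared type of each variable
std::map<std::string, Value> TempsResults; // current value of each variable

// Sketch of the AssignStmt bookkeeping: reject use-before-declaration,
// convert the evaluated result to the declared numeric type, then store it.
bool assign(const std::string& name, Value v) {
    if (!defVar[name]) return false;            // error: use before declaration
    if (SymTable[name] == Token::INTEGER)
        v = static_cast<long>(v);               // real result -> integer variable
    TempsResults[name] = v;
    return true;
}
```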
All of this comes together for the full functionality of our interpreter. The entry point to the program, and the top of our parse tree, is the Prog method. The driver program, prog3.cpp, only needs to call Prog and pass in a reference to an istream object to run the entire program contained in the file that the stream reads from. Once Prog is called, it begins the execution of the parse tree. Every method in the parse tree calls getNextToken in lex.cpp, checking for syntactic and lexical errors along the way, until the entire program file has been read and executed, or until a syntactic or lexical error is reached, in which case the program exits.
Now that you have a theoretical understanding of how these programs work, I would greatly encourage anyone interested to download the source code and try writing/running your own program in this programming language. There are various example programs in the tests folder here on GitHub. Happy coding!