Due by 11:59pm on Monday, 2/19/2018
Demo due by 11:59pm on Monday, 3/5/2018
In this assignment we'll work on a parser for a small subset of the language Python. In particular, we'll use the widely-used parser generator Bison to implement a parser that, when combined with the scanner we wrote in assignment 1, will perform syntax-directed translation from Python to C/C++.
There are a few major parts to this assignment, described below. To get you started, you are provided with a Flex scanner specification in scanner.l
that solves the problem defined in assignment 1. There is also a makefile that specifies compilation for the scanner. A simple main()
function for the scanner is written in main.cpp
.
Flex and Bison are designed to easily integrate with each other, but some modifications to the scanner specification will still be needed to make the scanner and parser work together. These modifications will be easiest to do in stages:
-
Set up a basic Bison parser definition (say in
parser.y
) with no nonterminals. The main thing you'll need to do is write%token
directives to specify all of the terminals you'll use in your grammar. These terminals will correspond directly to the syntactic categories we recognized with the scanner in assignment 1, e.g.IDENTIFIER
,FLOAT
,WHILE
,PLUS
, etc. To write these%token
directives, you'll need to figure out what data type(s) you'll use to represent the different program constructs in the representation output by the parser. Remember, our end goal with this project is to output C/C++ code corresponding to the Python code being parsed. -
Once you have your parser definition started, add a compilation step for it in the makefile. The goal at this point is to generate a header file containing integer values for all of the nonterminals/syntactic categories, so you can include that header file in the scanner and start returning these integer values instead of just printing out syntactic categories. To generate this header file, add the
-d
option to yourbison
command, e.g.:bison -d -o parser.cpp parser.y
This will generate two files,
parser.cpp
andparser.hpp
, the later of which is the header file to include in the scanner definition. -
Now, make the scanner return syntactic categories instead of printing them. For Python, this sounds easier than it actually is. In particular, under the default setup, a Bison-generated parser exists as a function
yyparse()
. This function repeatedly calls the scanning functionyylex()
that's generated by our Flex specification, and, on each call, it expectsyylex()
to return the integer code for the syntactic category of the next word in the source program.Thus, you might be tempted to simply replace all of the
cout
statements in the scanner that print syntactic categories intoreturn
statements that just return those syntactic categories instead. This would work for all but a small few cases. In particular, there are a few situations where a single call to the scanner could generate multiple tokens. Specifically, when a program is dedented by multiple levels at once, we need our scanner to be able to return multipleDEDENT
tokens from a single Flex rule. This cannot be done with a simple return statement. There are at least two ways to solve this problem:-
Use a queue to store tokens to return. Every time a token is generated in the scanner, place it into a queue instead of returning it. Then, at the beginning of each call to
yylex()
, first check the queue to see if there are any tokens waiting to be returned. If there are, simply dequeue the first token and return it. Note that under this approach, you may need to do some extra work to be able to return the lexeme along with each syntactic category that's returned, since this is needed for some syntactic categories likeIDENTIFIER
. One possibility would be to store lexeme/syntactic category pairs in your queue. -
Implement a push parser. The default model implemented by a Bison-generated parser is to "pull" tokens from the scanner by calling
yylex()
each time a new token is needed. A push parser reverses these roles so thatyylex()
is called only once and now "pushes" a token to the parser each time a new token becomes available. It does this by calling the functionyypush_parse()
with the new token passed as an argument. Under the push-parsing paradigm, it doesn't matter if the scanner generates multiple tokens at a time, since each one can be pushed to the parser in turn. You can read more about how push parsers work in Bison here:https://www.gnu.org/software/bison/manual/bison.html#Push-Decl
-
Once your scanner is able to generate one token at a time, either via a token queue or via push parsing, you are ready to write some grammar rules to recognize constructs in the Python language. At this point, you don't need to worry about attaching actions to these rules. You can just get your grammar in place.
The grammar you write will need to recognize a simplified subset of Python. In particular, your grammar should be able to recognize a program comprised of the following kinds of statements:
-
Assignment statements. These are statements where the value of an expression is assigned to a specific variable, e.g.:
circumference = pi * 2 * r
In the subset of Python we'll implement, every assignment statement will be constrained to a single line of code, and it will be terminated by a newline. The expression on the right-hand side of the assignment can be any valid expression involving identifiers, floats, integers, or booleans and the following operators:
+
,-
,*
,/
,==
,!=
,>
,>=
,<
,<=
,not
. Expressions may also contain parentheses()
. -
If-elif-else statements. In Python these look like the following:
if a: x = 2 * y elif b <= 7: x = 3 * y else: x = 4 * y
Of course, the
elif
andelse
parts are both optional. The statement could also include moreelif
blocks. Importantly, all of the statements to be executed for each of theif
,elif
, andelse
conditions are indented to the same level. In other words, each block is contained within a matchingINDENT
/DEDENT
pair. Also, for this assignment, every one of these blocks will be preceded by a newline. In other words, another statement cannot be included on the same line as theif
, theelif
, or theelse
. For this assignment the conditions forif
andelif
statements can be any valid expression or any boolean combination of expressions using theand
andor
operators. -
While statements. These are similar to
if
statements, e.g.:while i < 10: i = i + 1
Again, the block of statements to be executed in each iteration of the while loop will be contained within a matching
INDENT
/DEDENT
pair and will be separated from thewhile
statement with a newline. Again, the termination conditions forwhile
statements can be any valid expression or any boolean combination of expressions using theand
andor
operators. -
Break statements. These simply consist of the keyword
break
followed by a newline, i.e.:break
For this assignment, some things you specifically do not need to worry about are:
- For loops.
- Function definitions and function calls.
- Arrays and dictionaries.
Syntax-directed translation is essentially compilation by the parser. In other words, in a syntax-directed translation scheme, the parser directly outputs the target program. Our goal for this assignment is to perform syntax-directed translation from Python into C or C++.
Once you have your grammar defined, you can begin to attach actions to your rules to perform the syntax directed translation. The easiest approach here will be to use the information you gain from the rules of your grammar about constructs recognized in the source program to generate corresponding C/C++ language strings for those constructs. In this way, at the end of the parse, your grammar's goal symbol will refer to a string containing the entire translated target program.
A few things to consider while you're performing the syntax-directed translation:
-
It will probably be easiest if you don't worry about adding things like
#include <iostream>
or wrapping your target program within amain()
function until the parse is complete. If your parse simply translates a sequence of Python statements into a corresponding sequence of C/C++ statements, you can wrap this translated sequence in amain()
function at the very end. -
In order to generate a working C/C++ program, you'll need a variable declaration for each variable used in the program. To do this, you can maintain a simple symbol table, where each variable identifier is stored when it's first encountered. When your parse is finished, you can simply iterate through the identifiers stored in the symbol table and, for each identifier, generate a variable declaration at the top of your
main()
function.There are a couple simplifying assumptions you can make for the purposes of this assignment to make this a little easier:
-
Every variable will appear as the left-hand side of an assignment statement before it is used anywhere else.
-
All variables can be scoped to the
main()
function. You don't need to worry about scoping variables within blocks (e.g. inside of anif
block). -
All variables can have the same type, e.g.
double
orfloat
.
-
-
So you can tell what's happening with your translated code, you should also generate one
printf()
/cout
statement at the end of yourmain()
function for each variable to print the value of that variable at the end of the execution of the translated program. -
If the source program contains one or more syntax errors, you should not output a target program. Instead, you should report at least the first encountered syntax error.
-
Don't worry about indentation in your generated target program.
Once you get your translation fully working, you should be able to use gcc
/g++
to compile and run the generated target program, provided the source program contains no syntax errors.
You should be able to type make
to generate an executable parser from your scanner and parser specifications.
There are some simple Python programs you may use for testing your parser included in the testing_code/
directory. Some of these programs (i.e. p*.py
) are syntactically valid, and your parser should be able to translate them successfully. There are example translations for these programs included in the example_output/
directory. Some of the programs in testing_code/
(i.e. error*.py
) contain various syntax errors. Your parser should fail to translate these programs.
We'll be using GitHub Classroom for this assignment, and you will submit your assignment via GitHub. Make sure your completed files are committed and pushed by the assignment's deadline to the master branch of the GitHub repo that was created for you by GitHub Classroom. A good way to check whether your files are safely submitted is to look at the master branch your assignment repo on the github.com website (i.e. https://github.com/OSU-CS480-W18/assignment-2-YourGitHubUsername/). If your changes show up there, you can consider your files submitted.
The TAs will grade your assignment by compiling and running it on one of the ENGR servers, e.g. flip.engr.oregonstate.edu
, so you should make sure your code works as expected there. bison
and flex
are installed on the ENGR servers. If your code does not compile and run on the ENGR servers, the TAs will deduct at least 25 points from your score.
This assignment is worth 100 points total, broken down as follows:
- 30 points: scanner is modified to correctly return tokens to the parser
- 30 points: grammar rules are correctly set up for the subset of Python described above
- 35 points: parser successfully performs syntax-directed translation, as described above
- 5 points: makefile is specified to fully generate an executable parser