Due date and time: Thursday, March 4, 11:59pm
Introduction
As we discussed at the outset of this course -- and as you had no doubt heard or read before -- there are two major types of language processors: compilers and interpreters. Compilers translate a program in one language into an equivalent program in another, often lower-level, language. Interpreters execute a program and generate its output, in a sense translating the program from the source to the target language as it executes. Though their overall aims are different, they share a great deal of functionality. Both compilers and interpreters must read and understand the structure and meaning of the source program. Both must be able to determine whether the program is syntactically or semantically erroneous, and must report errors in some usable fashion. (It should be noted that an interpreter may do this work before executing the program, or may do some or all of this work piecemeal as the program is executing.)
For a language like Monkie2004, in which the majority of semantic rules can be checked statically (i.e. before the program executes), either kind of language processor is best off starting with the phases that we've built in the first three assignments: scanning the input program and breaking it into tokens, parsing these tokens and determining the syntactic structure of the program, building an intermediate representation of the program (such as an abstract syntax tree), and performing static semantic checking on it. When these phases have been completed, it is clear that the program is syntactically and semantically correct -- at least with respect to semantic rules that can be checked statically -- and there is a convenient tree-based representation of the program ready to be operated upon.
From this point, however, the way we should proceed is radically different, depending on whether we intend to build a compiler or an interpreter. If we're building a compiler, we may proceed by writing a module that takes the abstract syntax tree and generates a "flat" intermediate representation for it, one that is closer to the machine code that we will eventually want to emit. We'll then want our compiler to perform various optimizations on it to reduce, if possible, the amount of time and/or memory that it will consume when executed. Next, we'll need to map this intermediate code to machine code, which will require us to select the appropriate sequence of machine instructions for each intermediate code instruction and to decide how we'll use registers and/or cache memory to reduce the number of accesses to main memory (or, worse yet, virtual memory stored on disk!). Finally, our compiler, using all of the knowledge about the source program that it has gained during these phases, will emit target code.
On the other hand, an interpreter is not concerned with rewriting the input program; it's concerned with executing it and determining its output. Given an intermediate representation of the program, such as an abstract syntax tree, an interpreter operates by traversing it and evaluating the meaning of its nodes on the fly. For example, for some expression tree rooted with an addition operation, an interpreter can evaluate it by first evaluating the left subtree and determining its value, then evaluating the right subtree and determining its value, and finally adding these two values together to yield the final result of the addition. Since the interpreter is viewing the program at, essentially, the source level, it needs to maintain a symbol table, declaring variables into it as their declarations are reached, then removing declarations when the variables fall out of scope. Some mechanism needs to be included to support calls to subprograms, which may either involve the creation and maintenance of an explicit run-time stack (complete with activation records), or a simpler approach built on subprogram calls in the interpreter's source language.
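To make the addition example concrete, here is a minimal sketch of tree-walking evaluation in Java. The names TreeWalkSketch, Expr, IntLiteral, and Add are hypothetical and chosen only for illustration; they are not the AST classes from your previous assignment, and the sketch ignores symbol tables and subprogram calls entirely.

```java
final class TreeWalkSketch {
    // Hypothetical node interface: every expression can compute its own value.
    interface Expr {
        int evaluate();
    }

    // A leaf of the tree: an integer constant.
    static final class IntLiteral implements Expr {
        private final int value;
        IntLiteral(int value) { this.value = value; }
        @Override public int evaluate() { return value; }
    }

    // An interior node: an addition with two subtrees.
    static final class Add implements Expr {
        private final Expr left;
        private final Expr right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }

        @Override public int evaluate() {
            int l = left.evaluate();   // evaluate the left subtree first
            int r = right.evaluate();  // then evaluate the right subtree
            return l + r;              // finally, add the two values together
        }
    }

    public static void main(String[] args) {
        // The tree for (1 + 2) + 4
        Expr tree = new Add(new Add(new IntLiteral(1), new IntLiteral(2)), new IntLiteral(4));
        System.out.println(tree.evaluate()); // prints 7
    }
}
```

The recursion in Add.evaluate( ) mirrors the shape of the tree, which is why this style of interpreter needs no separate "flat" representation of the program.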
So, what we should do next depends on what kind of language processor we intend to build. This assignment will ask you to build upon the work you did in the previous assignment, extending your program to be a complete Monkie2004 interpreter. When you're done, you'll be able to execute Monkie2004 programs and view their output. Future assignments will explore some aspects of the remaining tasks performed by a compiler (though it should be pointed out here that we will not be building a complete Monkie2004 compiler this quarter).
Changes to the Monkie2004 language for this assignment
No changes have been made to the syntax or static semantic rules of the language; they remain as they were in the previous assignment.
There is one change to the apparent intent of the language, though it involves a rule that has never formally been specified: the meaning of the ref keyword. It was originally intended that ref would be used to signify that a formal parameter was to be passed using pass-by-reference semantics. For this assignment, the ref keyword may still appear in a parameter list as before, but, as a simplification, it will not have any meaning. All parameters will be passed by value. (Optionally, you may implement pass-by-reference semantics for ref if you wish, but it is not required, and I won't be offering any extra credit for it. If you want some ideas about how to implement it, feel free to contact me.)
The dynamic semantic rules of Monkie2004
The following is a list of the dynamic semantic rules for Monkie2004. It is considered a supplement to the static semantic rules presented in the previous assignment, and only applies to programs that have no lexical, syntactic, or static semantic errors. The rules below cover aspects of Monkie2004 programs that only have meaning at run-time. Most of the rules describe the behavior of legal Monkie2004 programs. In a few cases, dynamic semantic errors are described. When a Monkie2004 program encounters a dynamic semantic error, it prints an error message to the output and terminates immediately.
Execution of a Monkie2004 program begins with all global declarations being made. All global variables are assigned their default initial values, as described in the rules below. After all global declarations have been made, a procedure with the following signature is called:
procedure program()
If no such procedure exists, it is a dynamic semantic error and the program terminates immediately. If it does exist, the execution of the program lasts until program( ) returns.
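The start-up sequence just described might be sketched as follows. This is only an illustration under simplifying assumptions: globals and procedures are kept in plain maps, a procedure body is modelled as a Runnable, and the names StartupSketch, Type, DynamicSemanticError, and start are all hypothetical rather than taken from the starting point. The default values used are the ones given in the rules that follow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StartupSketch {
    // Hypothetical type enumeration carrying each type's default initial value.
    enum Type {
        INTEGER(0), BOOLEAN(false), STRING("");

        final Object defaultValue;
        Type(Object defaultValue) { this.defaultValue = defaultValue; }
    }

    // Hypothetical unchecked exception standing in for a dynamic semantic error.
    static final class DynamicSemanticError extends RuntimeException {
        DynamicSemanticError(String message) { super(message); }
    }

    // Declares every global with its default value, then calls program().
    // Returns the global values as they stand when program() returns.
    static Map<String, Object> start(Map<String, Type> globalDecls,
                                     Map<String, Runnable> procedures) {
        Map<String, Object> globals = new LinkedHashMap<>();
        for (Map.Entry<String, Type> decl : globalDecls.entrySet()) {
            globals.put(decl.getKey(), decl.getValue().defaultValue);
        }

        Runnable entry = procedures.get("program");
        if (entry == null) {
            throw new DynamicSemanticError("procedure program() is not defined");
        }
        entry.run(); // execution lasts until program() returns
        return globals;
    }

    public static void main(String[] args) {
        Map<String, Type> decls = new LinkedHashMap<>();
        decls.put("count", Type.INTEGER);
        decls.put("name", Type.STRING);

        Map<String, Runnable> procs = new LinkedHashMap<>();
        procs.put("program", () -> System.out.println("running program()"));

        System.out.println(start(decls, procs));
    }
}
```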
The dynamic semantic rules of Monkie2004 are:
- There are three types of variables in Monkie2004 programs: integer, boolean, and string.
- Integer variables may contain signed 32-bit integral values in the range -2^31 to 2^31 - 1 inclusive. (Not coincidentally, this mirrors the allowable range of Java int values.)
- Addition, subtraction, multiplication, division, and integer negation work as you would expect, yielding the result of performing that mathematical operation on the appropriate number of integer operands. Overflow and underflow cause the value to "wrap around" as it does in Java, e.g. (2^31 - 1) + 1 = -2^31. This is not considered a dynamic semantic error in Monkie2004, which means that you need not write any special code to handle or avoid this case.
- The relational operators ==, /=, <, <=, >, and >= work as you would expect, as well. If given two integer operands, the values of those operands are compared using the given relational operator, yielding a boolean result of true if the comparison is true and false otherwise.
- A division operation in which the right operand is zero is a dynamic semantic error in Monkie2004.
- Boolean variables may contain one of two possible values: true and false. A number of operators take boolean operands, so it is necessary to define the effect of each:
- The not operator takes one boolean operand and yields the negation of the value of that operand, i.e. not false yields true and not true yields false.
- The and operator takes two boolean operands and yields true if and only if both operands have the value true. and is not short-circuited, meaning that both operands are always evaluated before yielding a result.
- The and then operator is similar to the and operator, except that it is short-circuited. Its left operand is evaluated first. If it has the value false, the and then operation yields false. If not, the and then operation yields the value of the right operand.
- The or operator takes two boolean operands and yields false if and only if both operands have the value false. or is not short-circuited.
- The or else operator is similar to the or operator, except that it is short-circuited. Its left operand is evaluated first. If it has the value true, the or else operation yields true. If not, the or else operation yields the value of the right operand.
- The xor operator takes two boolean operands and yields true if and only if the two operands evaluate to different values. xor is not short-circuited.
- The implies operator takes two boolean operands. It is a short-circuited operation. Its left operand is evaluated first. If it has the value false, the implies operation yields true. If not, the implies operation yields the value of the right operand.
- The relational operators == and /= may take two boolean operands. In this case, they behave in the way you would expect: they compare the values of the operands, yielding true if the comparison is true and false if not.
- String variables contain a sequence of zero or more characters. Each character is a member of the Unicode character set, represented by a 16-bit character code. Not surprisingly, this is the same representation that Java uses to store its strings. A few operators take string operands:
- The & operator takes two string operands and yields the concatenation of the value of the right operand to the value of the left. For example, "Monkie" & "2004" yields "Monkie2004".
- The relational operators ==, /=, <, <=, >, and >= may take two string operands, in which case they perform a lexicographical comparison of the values of the operands. Each such operation yields true if the comparison is true and false otherwise.
- A lexicographical comparison of strings in Monkie2004 behaves in the same fashion as a similar comparison of strings in Java using the compareTo( ) method. For more information, see this link.
- Variables are implicitly initialized to a default value at the time of their declaration.
- Integers are implicitly initialized to zero.
- Strings are implicitly initialized to the empty string (i.e. "").
- Booleans are implicitly initialized to false.
- This rule applies to global variables, local variables, and the implicitly-declared Result variable in all functions.
- Procedure and function calls behave as they do in many other programming languages that you may have learned previously. The actual parameters are evaluated in left-to-right order. The resulting values are matched up positionally with the corresponding formal parameters and are used to initialize them (i.e. the first actual parameter's value is used to initialize the first formal parameter, and so on).
- The implicitly-declared Result variable in a function represents the function's return value. The return value of a function is the last value assigned to Result while the function's body executed. (If Result is never assigned, the return value is the default value assigned to Result at the function's outset.)
- Block statements are executed by executing each statement within them in order.
- If statements are executed in the following way:
- First, the conditional expression is evaluated, yielding a boolean value.
- If its value is true, the block statement that follows the then keyword is executed. If its value is false, the block statement that follows the else keyword (if any) is executed.
- While loops are executed in the following way:
- First, the conditional expression is evaluated, yielding a boolean value.
- If its value is false, the while loop is exited and control moves on to the statement that follows the while loop.
- If its value is true, the block statement that follows the do keyword is executed. After the block statement has finished executing, control jumps back to the top of the while loop and the condition is tested again.
- Assignment statements are executed in the following way:
- The expression on the right hand side of the assignment is evaluated, yielding a value.
- That value is placed into the variable named on the left hand side of the assignment.
- The seven predefined subprograms, which are used for console input and output, behave in the following way:
- function read_string( ): string. The input cursor appears and the user may type input and hit the Enter key. All of the input that they typed prior to hitting the Enter key is combined into a string value, which is returned from the function.
- procedure print_string(s: string). The characters contained in the string s are printed to the console. The output cursor remains on the same line afterward.
- function read_integer( ): integer. The input cursor appears and the user may type input and hit the Enter key.
- If the input typed by the user before pressing Enter is a sequence of digits (optionally preceded by a minus sign) which, when converted to the corresponding integer value, lies within the legal range of integer values, this function returns the corresponding integer value.
- If the input typed by the user is anything else (including input that begins with a legal integer, but additionally contains any other characters), it is a dynamic semantic error.
- procedure print_integer(i: integer). A string representation of i's value is printed to the console. Positive integers are printed in decimal form (e.g. 17). Negative integers have their magnitudes printed in decimal form, preceded by a minus sign (e.g. -17). The output cursor remains on the same line afterward.
- function read_boolean( ): boolean. The input cursor appears and the user may type input and hit the Enter key.
- If the input typed by the user before pressing Enter was exactly the string "true", this function returns the boolean value true.
- If the input typed by the user before pressing Enter was exactly the string "false", this function returns the boolean value false.
- If the input typed by the user is anything else (including input that begins with either "true" or "false" but additionally contains any other characters), it is a dynamic semantic error.
- procedure print_boolean(b: boolean). If b's value is true, the string "true" is printed to the console, otherwise the string "false" is printed to the console. The output cursor remains on the same line afterward.
- procedure print_endline( ). The output cursor is moved to the beginning of the next line.
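A few of the rules above translate almost directly into Java. The sketch below is illustrative only: the class DynamicRulesSketch, its method names, and the DynamicSemanticError exception are hypothetical, and operands are modelled as BooleanSupplier values so that the difference between short-circuited and non-short-circuited operators is visible. It also assumes that integer division truncates toward zero as Java's / does, which the rules do not state explicitly.

```java
import java.util.function.BooleanSupplier;

final class DynamicRulesSketch {
    // Hypothetical unchecked exception standing in for a dynamic semantic error.
    static final class DynamicSemanticError extends RuntimeException {
        DynamicSemanticError(String message) { super(message); }
    }

    // Java int arithmetic already wraps around, so no special code is needed.
    static int add(int a, int b) { return a + b; }

    // Division must check for a zero right operand explicitly.
    // Assumption: truncation toward zero, as with Java's / operator.
    static int divide(int a, int b) {
        if (b == 0) throw new DynamicSemanticError("division by zero");
        return a / b;
    }

    // "and" is not short-circuited: both operands are always evaluated.
    static boolean and(BooleanSupplier left, BooleanSupplier right) {
        boolean l = left.getAsBoolean();
        boolean r = right.getAsBoolean();
        return l && r;
    }

    // "and then" is short-circuited: the right operand is skipped if the left is false.
    static boolean andThen(BooleanSupplier left, BooleanSupplier right) {
        return left.getAsBoolean() && right.getAsBoolean();
    }

    // "implies" is short-circuited: a false left operand yields true immediately.
    static boolean implies(BooleanSupplier left, BooleanSupplier right) {
        return !left.getAsBoolean() || right.getAsBoolean();
    }

    // read_integer validation: an optional minus sign followed by digits, in int range.
    static int parseInteger(String line) {
        if (!line.matches("-?[0-9]+")) {
            throw new DynamicSemanticError("not an integer: " + line);
        }
        try {
            return Integer.parseInt(line);
        } catch (NumberFormatException e) {
            throw new DynamicSemanticError("integer out of range: " + line);
        }
    }

    public static void main(String[] args) {
        System.out.println(add(Integer.MAX_VALUE, 1) == Integer.MIN_VALUE); // prints true
        // The right operand would throw if evaluated, but implies never evaluates it here.
        System.out.println(implies(() -> false,
                () -> { throw new IllegalStateException("not evaluated"); })); // prints true
        System.out.println(parseInteger("-17")); // prints -17
    }
}
```

String comparison needs no helper of this kind, since the rules define it in terms of Java's own compareTo( ) method.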
Implementing your interpreter
Assuming that you completed a solution to at least Part 1 of the previous assignment, you have a completed CUP script specifying a parser that builds an AST for the input program. To support this, you also have a set of Java classes that implement the various kinds of AST nodes. Each AST node, at present, contains an analyze( ) method, which is used to perform static semantic checking on it (and its children, as appropriate).
Given an AST and a symbol table, an interpreter is relatively straightforward to implement. (Don't get me wrong; there are plenty of devilish details. But conceptually, it's not difficult to explain.) Much of what I've suggested here is provided as example code in the starting point.
- Each variable stored in the SymbolTable will need to be accompanied by its value. Since different variables will hold values of different types, we'll unfortunately need to store an Object reference with each variable in the SymbolTable. In the case of integer variables, we'll store an Integer object. For boolean variables, we'll store a Boolean object. For strings, we'll store String objects.
- When initially declaring a variable, it will be given its default value (0 for integers, false for booleans, the empty string for strings). To easily support the determination of initial values, I added a getInitialValue( ) method to the Type class, which returns an Object that is the initial value for variables of that type.
- Each Definition node (global variable declarations and subprogram declarations) needs a declare(SymbolTable st) method added to it. The declare( ) method adds a declaration for either the global variable or the subprogram to the symbol table. If you've already got code in your analyze( ) method that declares the symbols into the symbol table, you can reuse it.
- Each Statement node needs an execute(SymbolTable st) method added to it. Depending on what kind of statement it is, the body of this method will be different, but the basic idea is that it will effect whatever changes are implied by the statement. For example, an assignment statement will cause a variable's value to be changed.
- Each Expression node needs an evaluate(SymbolTable st) method added to it, which returns an Object that represents the expression's value. Depending on what kind of expression it is, the body of this method will be different, but the basic idea is that each evaluate( ) method will evaluate the appropriate kind of expression, returning the appropriate value. For example, the evaluate( ) method in the ConcatenationExpression will first evaluate its left-hand operand (by calling evaluate( ) on it), then evaluate its right-hand operand, then return the concatenation of these two results.
- To support procedure and function calls, each subprogram that is declared in the symbol table needs to be accompanied by the AST node that represents the subprogram. For example, if you had a SubprogramDeclaration class (which extends Definition) in your solution to the previous assignment, you'd add a SubprogramDeclaration field to the Subprogram class, along with an accessor method to access it, and an additional parameter to the Subprogram constructor (and the declareSubprogram( ) method in the SymbolTable class) to pass it in. Also, you should add a call( ) method to your SubprogramDeclaration class, which takes two parameters: the SymbolTable and an ArrayList of the values of the actual parameters. The call( ) method should return an Object, which is the return value of the subprogram. (Procedures can return null, while functions can return Integer, Boolean, or String objects, as appropriate.)
- Since dynamic semantic errors cause the interpreter to immediately terminate the program, I created an unchecked exception class called InterpreterException. Every dynamic semantic error causes one of these to be thrown. They are caught in my Driver class.
- The predefined I/O subprograms can be implemented as classes that extend your SubprogramDeclaration class (or its equivalent). The call( ) method of each such class, instead of traversing an AST, will simply consist of Java code. I've provided an example of this in the starting point, along with a utility class called ConsoleInput that you'll likely find useful.
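As a rough illustration of the last two suggestions, a predefined subprogram written directly in Java and an unchecked exception caught at the top level might look like the sketch below. Everything in it is hypothetical: BuiltinSketch, Subprogram, PrintInteger, InterpreterError, and runGuarded are stand-ins invented for this example, not the classes in the starting point, and the call( ) method here takes only the actual parameter values, whereas the design described above also passes the SymbolTable.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

final class BuiltinSketch {
    // Hypothetical unchecked exception for dynamic semantic errors.
    static final class InterpreterError extends RuntimeException {
        InterpreterError(String message) { super(message); }
    }

    // Hypothetical common base for anything that can be called.
    static abstract class Subprogram {
        // Returns the subprogram's return value; procedures return null.
        abstract Object call(List<Object> actuals);
    }

    // print_integer implemented as plain Java code rather than an AST traversal.
    static final class PrintInteger extends Subprogram {
        private final PrintStream out;
        PrintInteger(PrintStream out) { this.out = out; }

        @Override Object call(List<Object> actuals) {
            int i = (Integer) actuals.get(0); // safe: static checking guarantees an integer
            out.print(i);                     // print, not println: cursor stays on the line
            return null;
        }
    }

    // Runs a program, reporting any dynamic semantic error and stopping at once.
    // Returns 0 on normal completion and 1 if an error was reported.
    static int runGuarded(Runnable program, PrintStream out) {
        try {
            program.run();
            return 0;
        } catch (InterpreterError e) {
            out.println("Error: " + e.getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        Subprogram printInteger = new PrintInteger(System.out);
        List<Object> actuals = new ArrayList<>();
        actuals.add(-17);
        printInteger.call(actuals); // prints -17
        System.out.println();
        runGuarded(() -> { throw new InterpreterError("division by zero"); }, System.out);
    }
}
```

Because the exception is unchecked, none of the execute( ) or evaluate( ) methods in between need a throws clause; the error simply propagates up to whatever code started the program.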
One fact will greatly simplify your implementation: you may freely assume that, by the time the interpreter begins executing it, the input program is free of lexical, syntactic, or static semantic errors. So, for example, you may assume that both operands to a concatenation operation are strings, the type of the expression on the right-hand side of an assignment statement matches the type of the variable on the left-hand side, and so on. This means that, even if you have to do quite a bit of casting, you can at least assume that the casts will be proper.
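Here is a small sketch of what those casts look like in practice. It uses a plain Map in place of a symbol table, and the names CastSketch, Expression, Statement, StringLiteral, VariableRef, Concat, and Assignment are hypothetical; your own node classes and SymbolTable will differ. The point is only that evaluate( ) returns an Object and each node casts it to the type that static checking has already guaranteed.

```java
import java.util.HashMap;
import java.util.Map;

final class CastSketch {
    // Hypothetical node interfaces; a Map stands in for the symbol table.
    interface Expression { Object evaluate(Map<String, Object> symbols); }
    interface Statement { void execute(Map<String, Object> symbols); }

    static final class StringLiteral implements Expression {
        private final String value;
        StringLiteral(String value) { this.value = value; }
        @Override public Object evaluate(Map<String, Object> symbols) { return value; }
    }

    static final class VariableRef implements Expression {
        private final String name;
        VariableRef(String name) { this.name = name; }
        @Override public Object evaluate(Map<String, Object> symbols) { return symbols.get(name); }
    }

    static final class Concat implements Expression {
        private final Expression left;
        private final Expression right;
        Concat(Expression left, Expression right) { this.left = left; this.right = right; }

        @Override public Object evaluate(Map<String, Object> symbols) {
            // Both casts are safe because the program has passed static checking.
            String l = (String) left.evaluate(symbols);
            String r = (String) right.evaluate(symbols);
            return l + r;
        }
    }

    static final class Assignment implements Statement {
        private final String target;
        private final Expression value;
        Assignment(String target, Expression value) { this.target = target; this.value = value; }

        @Override public void execute(Map<String, Object> symbols) {
            // Evaluate the right-hand side, then store the result in the variable.
            symbols.put(target, value.evaluate(symbols));
        }
    }

    public static void main(String[] args) {
        Map<String, Object> symbols = new HashMap<>();
        symbols.put("name", ""); // a string variable with its default value

        new Assignment("name", new Concat(new StringLiteral("Monkie"), new StringLiteral("2004")))
                .execute(symbols);
        System.out.println(symbols.get("name")); // prints Monkie2004
    }
}
```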
Starting point
Officially, the starting point for this assignment is your solution to the previous assignment. We won't be testing your static semantic checker again for this assignment, meaning that we will only test your solution to this assignment using Monkie2004 programs that are syntactically correct and do not violate the static semantic rules. So, if you weren't able to get the previous assignment done, you will not be doubly penalized, unless your solution to the previous assignment reported errors for legal Monkie2004 programs. However, if you were unable to complete Part 1 of the previous assignment, you will need to get it finished before you can proceed with this one.
While I want you to use your own code as a starting point, I am providing some suggested approaches and example code from my interpreter, which you can use or ignore at your discretion. They are available as a Zip archive.
Be aware that the code I provided may not fit in perfectly with your design, so you may need to make some modifications to it. All of this code is provided as-is (much of it uncommented) to give you some ideas about how to proceed with your solution to this assignment. Since each of you will be starting with a somewhat different solution to the previous assignment, it was not really practical for me to provide code that would surely work with each of your previous designs. But I thought that the files that I've provided would help lead you in a good direction.
Deliverables
Place your completed CUP script and all of the .java files that comprise your program into a Zip archive, then submit that Zip archive. You need not include the .java files created by CUP (Parser.java and Tokens.java), but we won't penalize you if you do. However, you should be aware that we'll be regenerating these ourselves during the grading process, to be sure that they really did come from your CUP script. Please don't include other files, such as .class files, in your Zip archive. Also, don't include any of the example code from the starting point that you didn't end up using.
Follow this link for a discussion of how to submit your assignment. Remember that we do not accept paper submissions of your assignments, nor do we accept them via email under any circumstances.
In order to keep the grading process relatively simple, we require that you keep your program designed in such a way that it can be compiled and executed with the following set of commands:
cup monkie.cup
javac *.java
java Driver inputfile.m
Limitations
The limitations from the previous assignment still apply to this one; you may not make changes to the Monkie2004 grammar that was given to you in the previous assignment, except for the actions you wrote to build your abstract syntax tree (and any modifications you needed to make to them for this assignment) and adding names to the symbols on the right-hand sides of rules when you need to refer to their associated values. Other changes to the CUP script are not permitted.
- Originally written by Alex Thornton, Winter 2004.