ICS 142 Winter 2004
Assignment #6
Due date and time: Friday, March 19, 11:59pm
Introduction
Ultimately, the job of a compiler is to take a program in some source language and generate an equivalent program in some target language. Generally, that target program is an executable for some platform, meaning that assembly code must be generated at some stage. Many of the abstractions provided by high-level programming languages -- procedures, scopes, arrays, structures, and support for a variety of built-in data types (including automatic conversions between them), to name a few -- do not exist at the assembly level. So the overall job of the back end of a compiler is to map higher-level abstractions into lower-level ones, choosing an assembly-level implementation for high-level language constructs. Naturally, some implementations of language constructs are better than others. But some are better in some situations and worse in others. Context plays a large role in selecting a good implementation. Generating good intermediate code becomes considerably more complicated if the compiler attempts to analyze this context on the fly. It makes good software engineering sense, then, to generate intermediate code that makes a "best guess" at a good implementation, then allow an optimizer to find better implementations whenever possible, based on more complex analysis of context.
The job of an optimizer is to take an intermediate code program and rewrite it to be a better program with the equivalent effect. ("Better," of course, can mean many things: faster, less memory usage, or less power consumption, for example.) Optimizers can employ many forms of analysis to improve a program, which are typically arranged into passes, where each pass uses one technique to attempt to improve the code. The net effect of all the passes, some of which may be repeated more than once, should be a significant improvement of the original program.
In this assignment, we'll explore a few issues that arise in the optimization of a substantial subset of the intermediate language ILOC that was discussed in lecture (and is discussed in the textbook). You'll write a program that takes a fragment of ILOC code, performs one or more optimization passes on it, and outputs the optimized fragment. The entire structure of the program will be provided, including a scanner/parser for ILOC, representations for ILOC instructions, and a module to pretty-print the output. Your job will only be to write three optimization modules. (The framework is extensible, so you're welcome to implement additional optimization modules if you'd like, though I won't be offering any extra credit for them.)
The subset of ILOC for this assignment
In this assignment, a substantial subset of ILOC (as presented in lecture and the textbook) is to be supported and optimized. The following ILOC instructions are to be supported in this assignment:
Opcode | Source Operands | Target Operands | Description |
add | reg1, reg2 | reg3 | Adds the values in reg1 and reg2, storing the result in reg3. |
addI | reg1, int2 | reg3 | Adds the value in reg1 to the constant integer int2, storing the result in reg3. |
sub | reg1, reg2 | reg3 | Subtracts the value in reg2 from reg1, storing the result in reg3. |
subI | reg1, int2 | reg3 | Subtracts the constant integer int2 from reg1, storing the result in reg3. |
rsubI | reg1, int2 | reg3 | Subtracts the value in reg1 from the constant integer int2, storing the result in reg3. |
mult | reg1, reg2 | reg3 | Multiplies the values in reg1 and reg2, storing the result in reg3. |
multI | reg1, int2 | reg3 | Multiplies the value in reg1 by the integer constant int2, storing the result in reg3. |
div | reg1, reg2 | reg3 | Divides the value in reg1 by the value in reg2, storing the result in reg3. If reg2's value is zero, it is assumed that a processor exception is raised. |
divI | reg1, int2 | reg3 | Divides the value in reg1 by the integer constant int2, storing the result in reg3. If int2 is zero, it is assumed that a processor exception is raised. |
rdivI | reg1, int2 | reg3 | Divides the value of the integer constant int2 by the value in reg1, storing the result in reg3. If the value in reg1 is zero, it is assumed that a processor exception is raised. |
lshift | reg1, reg2 | reg3 | Left-shifts the value in reg1 by reg2 bits, storing the result in reg3. |
lshiftI | reg1, int2 | reg3 | Left-shifts the value in reg1 by the number of bits specified by the integer constant int2, storing the result in reg3. |
rshift | reg1, reg2 | reg3 | Right-shifts the value in reg1 by reg2 bits, storing the result in reg3. |
rshiftI | reg1, int2 | reg3 | Right-shifts the value in reg1 by the number of bits specified by the integer constant int2, storing the result in reg3. |
and | reg1, reg2 | reg3 | AND's together the (presumably boolean) values stored in reg1 and reg2, storing the result in reg3. |
andI | reg1, bool2 | reg3 | AND's together the (presumably boolean) value stored in reg1 with the boolean constant bool2, storing the result in reg3. |
or | reg1, reg2 | reg3 | OR's together the (presumably boolean) values stored in reg1 and reg2, storing the result in reg3. |
orI | reg1, bool2 | reg3 | OR's together the (presumably boolean) value stored in reg1 with the boolean constant bool2, storing the result in reg3. |
xor | reg1, reg2 | reg3 | XOR's together the (presumably boolean) values stored in reg1 and reg2, storing the result in reg3. |
xorI | reg1, bool2 | reg3 | XOR's together the (presumably boolean) value stored in reg1 with the boolean constant bool2, storing the result in reg3. |
not | reg1 | reg2 | NOT's the (presumably boolean) value stored in reg1, storing the result in reg2. |
load | reg1 | reg2 | Loads the value stored in the memory address stored in reg1 into the register reg2. |
loadI | const1 | reg1 | Places the value of the constant const1 into reg1. const1 may be either an integer or a boolean constant. |
loadAI | reg1, int2 | reg3 | Loads the value stored in the memory address calculated by adding the integer constant int2 to the value stored in reg1. The loaded value is placed into reg3. |
loadAO | reg1, reg2 | reg3 | Loads the value stored in the memory address calculated by adding the values stored in reg1 and reg2. The loaded value is placed into reg3. |
store | reg1 | reg2 | Stores the value in reg1 into the memory address indicated in reg2. |
storeAI | reg1 | reg2, int3 | Stores the value in reg1 into the memory address calculated by adding the integer constant int3 to the value in reg2. |
storeAO | reg1 | reg2, reg3 | Stores the value in reg1 into the memory address calculated by adding the values in reg2 and reg3. |
i2i | reg1 | reg2 | Copies the value stored in reg1 into reg2. |
cmp_LT, cmp_LE, cmp_EQ, cmp_NE, cmp_GE, cmp_GT | reg1, reg2 | reg3 | Compares the values stored in reg1 and reg2, storing the boolean result of the comparison into reg3. Each of these instructions uses a different form of comparison: cmp_LT uses <, cmp_LE uses <=, and so on. |
cbr | reg1 | Label1, Label2 | If the (presumably boolean) value stored in reg1 is true, jump to Label1, otherwise jump to Label2. |
jumpI | none | Label1 | Jump to Label1. |
nop | none | none | Has no effect, but is sometimes necessary as a placeholder. Optimizations should not remove these; they are placed automatically when needed. |
ILOC code is to be written into an input file, subject to the following restrictions:
Here is an example input file. It should be pointed out that, for the sake of readability, I've spaced the input in the file somewhat, though whitespace is not considered relevant, except when it is necessary to separate tokens.
    loadI 50 => r1;
    loadI 100 => r2;
    cmp_LT r1, r2 => r3;
    cbr r3 -> L1, L2;
L1: add r4, r5 => r6;
    jumpI -> L3;
L2: add r7, r8 => r6;
    jumpI -> L3;
L3: cmp_LT r6, r7 => r8;
    cbr r8 -> L4, L1;
L4: cmp_GT r6, r7 => r8;
    cbr r8 -> L5, L6;
L5: loadI true => r9;
    jumpI -> L7;
L6: loadI false => r9;
    jumpI -> L7;
L7: nop;
The cbr and jumpI instructions use the symbol -> to separate source operands from target operands. The nop instruction has no operands. All other instructions use the symbol => to separate source operands from target operands.
Comments may be placed into input files; anything following two slashes (i.e. //) until the end of the line is considered to be a comment, much like in Java.
Basic blocks and control-flow graphs
Many optimization techniques involve some form of compile-time simulation of the run-time behavior of the program. In the presence of control-flow structures, such as if-then-else statements and while loops (which introduce conditional branches into the intermediate code), this simulation becomes increasingly difficult to do precisely. To simplify matters, we divide the intermediate code program into straight-line chunks of code that must always be executed in sequence from beginning to end. Each such chunk is called a basic block.
Here's a brief example ILOC program:
    loadI 1 => r1;
    loadI 10 => r2;
    loadI 0 => r3;
    cmp_GT r1, r2 => r4;
    cbr r4 -> L2, L1;
L1: add r3, r1 => r3;
    addI r1, 1 => r1;
    cmp_LE r1, r2 => r4;
    cbr r4 -> L1, L2;
L2: store r3 => r0;
This program uses a loop to calculate the sum of the integers 1..10 and store it in a memory address stored in the register r0. There are three basic blocks in this program:
From this example, we see that any jump instruction (cbr or jumpI) always constitutes the end of a basic block. Likewise, a label must always constitute the beginning of a basic block, since it is possible to jump to labels.
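The two rules above (a jump ends a block, a label begins one) can be illustrated with a minimal sketch. This is not the provided framework's code: the class name BlockSplitter, the use of plain strings for instructions, and the assumption that a label is an identifier followed by a colon are all hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; the provided framework does this for you.
final class BlockSplitter {
    // Splits ILOC instructions (one per string, e.g. "L1: add r3, r1 => r3")
    // into basic blocks, each block being a list of instructions.
    static List<List<String>> split(List<String> instructions) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String raw : instructions) {
            String instr = raw.trim();
            // A label begins a new block, so close any block in progress first.
            if (startsWithLabel(instr) && !current.isEmpty()) {
                blocks.add(current);
                current = new ArrayList<>();
            }
            current.add(instr);
            // A jump (cbr or jumpI) ends the block it appears in.
            if (isJump(instr)) {
                blocks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            blocks.add(current);
        }
        return blocks;
    }

    // Assumption: a label is an identifier followed by a colon, such as "L1:".
    static boolean startsWithLabel(String instr) {
        return instr.matches("^[A-Za-z_][A-Za-z0-9_]*\\s*:.*");
    }

    // True if the opcode (after any leading label) is cbr or jumpI.
    static boolean isJump(String instr) {
        String body = startsWithLabel(instr)
                ? instr.substring(instr.indexOf(':') + 1).trim()
                : instr;
        return body.startsWith("cbr ") || body.startsWith("jumpI ");
    }
}
```

Applied to the ten instructions of the example program above, this sketch produces three blocks of five, four, and one instructions respectively.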
The longer the basic blocks in a program, the better. Basic blocks indicate sequences of code that are likely to pipeline well. It is also significantly easier to perform optimizations on a single basic block than to try to perform them on more than one block (due to the complexities introduced by control-flow).
Basic blocks are said to be connected together as a control-flow graph. In this context, we'll often refer to the blocks as nodes. In our example, the following edges exist between nodes in the control-flow graph:
Viewing basic blocks as a graph like this is handy, since it allows a variety of well-known graph algorithms to be useful for performing optimization. We'll use one such algorithm in Part 3.
Good news: I've provided the portion of your program that reads the input file, finds the basic blocks, and builds the control-flow graph.
Guiding assumptions about the profitability of optimizations
Different architectures have radically different performance characteristics. For this reason, optimizers for different architectures will need to make different decisions. After all, the point of optimization is not to make the code prettier; it's to make the code perform better. A compiler writer must pay careful attention to the optimizations she chooses to perform, to ensure that they really constitute an improvement in the program as it will execute on the target architecture.
For your optimizer, use the following assumptions about the performance of the underlying architecture to guide your choices. This list may or may not compare favorably with a list for an actual architecture; I've made these decisions to make the problem a more interesting one to solve.
Of course, it should be noted here that it is only possible to replace one instruction with another in the case that their effect is the same. For example, given this brief sequence of instructions:
    loadI 100 => r1;
    load r2 => r3;
    addI r1, 50 => r4;
    addI r3, 50 => r5;
...since r1 is known to have the constant value 100, we can replace the first addI instruction with a loadI. The second addI, however, cannot be replaced, since its run-time value is based on a value in a memory location whose value is unknown to us at compile-time. So, a correctly-rewritten sequence might look like this:
    loadI 100 => r1;
    load r2 => r3;
    loadI 150 => r4;      // this is considered better than addI
    addI r3, 50 => r5;    // this cannot be replaced, since we can't know r3's value at compile-time
Part 1: Local algebraic simplification (30 points)
Apply the following algebraic transformations on the code within each basic block. Handle each basic block separately. (The term "local", when used to describe an optimization technique, indicates an optimization that works only separately within each basic block. For this transformation, it actually makes little difference, since we can make substitutions without understanding any context.)
Part 2: Local constant propagation and folding (50 points)
Whenever it can be proven that a register must have a known constant value, that fact can then be used to simplify instructions that use the value of the register. For example, consider the following pair of instructions:
    loadI 40 => r1;
    addI r1, 40 => r2;
Since it's clear that the value of r1 must be 40 after the first instruction executes, the second instruction is really the addition of 40 and 40. (Replacing a register with a known constant value is called constant propagation.) Since both operands are known to be constants, we might as well perform the addition at compile-time and replace the addI instruction with an immediate load. (Combining constants together at compile-time is known as constant folding.) The combination of constant propagation and constant folding yields this pair of instructions in lieu of the original two:
    loadI 40 => r1;
    loadI 80 => r2;
This optimization has two benefits. First, it replaces an add instruction with an immediate load, which, according to our guidelines from earlier in the write-up, is considered to be an improvement. Second, and perhaps more importantly, r2 now has a known constant value, which enables us to propagate that value to future instructions.
Proving whether a register has a constant value can be tricky in the general case, though if we limit ourselves to one basic block at a time, a much simpler algorithm can be used:
    when we start processing a basic block, consider all registers to be non-constants

    for each instruction i in the block, in top-to-bottom order
    {
        if one or more of the operands in the instruction are known to be constants
            propagate the constant values...
            ...fold constants...
            ...and replace the instruction if possible

        regardless of whether we made a change to the instruction...
        if the instruction now stores a constant value into a register
            add that (register, constant value) pair to our collection of registers and known constant values
        else if the instruction now stores a non-constant value into a register
            remove that register from our collection of registers and known constant values
    }
This algorithm boils down to a simulation of the basic block's execution at compile-time. We make the most conservative assumption to start with, that none of the registers are known to be constants. Anytime a register is assigned a constant value (such as with a loadI instruction), we add it (and the value) to a collection of registers with known constant values. Anytime a register is assigned a non-constant value (such as with a load instruction), we remove it from the collection of registers with known constant values. This collection, as it turns out, is really a map (in the data structure sense of the word), which might efficiently be implemented using a hash table (e.g. HashMap in the Java library).
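To make the bookkeeping concrete, here is a minimal sketch of the map-based simulation for a single basic block. It is deliberately simplified and makes several assumptions: the Instr class is a hypothetical stand-in for the framework's instruction representation, only loadI (with integer constants) and addI are understood, and every other opcode is treated as producing a non-constant value. Folding uses Java's int arithmetic, which may or may not match the target's overflow behavior.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified instruction model; the framework's classes will differ.
final class Instr {
    final String opcode;   // "loadI", "addI", or any other opcode
    final String srcReg;   // source register, or null when there is none
    final int constant;    // immediate operand (used by loadI and addI only)
    final String dstReg;   // register written by the instruction, or null

    Instr(String opcode, String srcReg, int constant, String dstReg) {
        this.opcode = opcode;
        this.srcReg = srcReg;
        this.constant = constant;
        this.dstReg = dstReg;
    }
}

final class LocalConstProp {
    // Runs constant propagation and folding over one basic block.
    static List<Instr> run(List<Instr> block) {
        // Conservative start: no register is known to hold a constant.
        Map<String, Integer> known = new HashMap<>();
        List<Instr> out = new ArrayList<>();
        for (Instr i : block) {
            Instr result = i;
            // Propagate and fold: addI on a known constant becomes a loadI.
            if (i.opcode.equals("addI") && known.containsKey(i.srcReg)) {
                int folded = known.get(i.srcReg) + i.constant;
                result = new Instr("loadI", null, folded, i.dstReg);
            }
            // Regardless of whether the instruction changed, update the map.
            if (result.opcode.equals("loadI")) {
                known.put(result.dstReg, result.constant);
            } else {
                known.remove(result.dstReg); // destination is now non-constant
            }
            out.add(result);
        }
        return out;
    }
}
```

A complete pass would need a case for every opcode in the table, including boolean constants and the branch instructions, but the shape of the loop and the two map updates would stay the same.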
Not surprisingly, the algorithm for maintaining the collection of known constant values becomes a great deal more complicated when it can be run over many basic blocks. This technique is known as global constant propagation and folding. (The term "global," when applied to an optimization technique, does not mean a program-wide optimization. It means an optimization made on all the basic blocks in one procedure, considered together.) We won't be covering global optimizations in this course, though there's plenty of reading material on the subject in Chapters 9 and 10 of the textbook, if you're interested.
I'll leave it as an exercise for each of you to figure out which instructions can be replaced and how they ought to be replaced, based on the set of guiding assumptions from earlier in the write-up. Don't forget to update your set of registers with known constant values whenever it changes!
Part 3: Unreachable block elimination (20 points)
Either as a result of a poorly-written input file, or more likely as the result of one of the optimizations in the previous parts, one or more entire basic blocks in the control-flow graph may become unreachable. If this is the case, we should eliminate unreachable nodes from the control-flow graph of our ILOC program entirely, since they serve no purpose.
The analysis required is a relatively straightforward depth-first graph traversal algorithm with marking:
    consider all nodes in the CFG to be unmarked
    let currentNode = node 0 (the start node)

    loop
    {
        mark currentNode

        if there exists an unvisited successor n of currentNode
            currentNode = n
        else
            backtrack
    }
I've illustrated the algorithm using a pseudo-loop, but I actually implemented it as a recursive algorithm with backtracking. Since the CFGNode class I provided does not have a marking feature in it, I suggest implementing the marks by storing them in a separate one-dimensional boolean array.
Once you've finished the traversal phase, iterate through the nodes and remove the ones that were never marked. Nodes can be removed by calling the removeNode( ) method on the ControlFlowGraph. Beware that the start node, node 0, is always considered reachable and, hence, may not be removed!
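For illustration, the recursive form of the marking traversal might look like the sketch below. It does not use the provided CFGNode or ControlFlowGraph classes, since their interfaces are not shown here; instead it assumes a hypothetical adjacency-list representation in which successors[i] holds the node numbers reachable directly from node i. The marks live in a separate boolean array, as suggested above.

```java
// Hypothetical sketch; adapt the graph access to the framework's actual classes.
final class Reachability {
    // Returns an array in which entry i is true exactly when node i
    // can be reached from node 0, the start node.
    static boolean[] findReachable(int[][] successors) {
        boolean[] marked = new boolean[successors.length];
        if (successors.length > 0) {
            visit(0, successors, marked);
        }
        return marked;
    }

    // Marks the given node, then recursively visits each unmarked successor.
    private static void visit(int node, int[][] successors, boolean[] marked) {
        marked[node] = true;
        for (int next : successors[node]) {
            if (!marked[next]) {
                visit(next, successors, marked);
            }
        }
        // Returning from this call is the "backtrack" step of the pseudo-loop.
    }
}
```

After the traversal, every node whose entry is still false is unreachable and is a candidate for removal. Node 0 is marked first, so it can never be among them.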
A step-by-step example
Suppose we began with the following ILOC input file, running all three optimizer passes (local algebraic simplification, local constant propagation and folding, and unreachable block elimination) in sequence.
    loadI 1 => r1;
    multI r1, 64 => r2;
    addI r2, 50 => r3;
    cmp_LT r2, r3 => r4;
    cbr r4 -> L3, L1;
L1: loadI 1 => r5;
    loadI 10 => r6;
    loadI 0 => r7;
    cmp_GT r5, r6 => r8;
    cbr r8 -> L4, L2;
L2: add r7, r5 => r7;
    addI r5, 1 => r5;
    cmp_LE r5, r6 => r8;
    cbr r8 -> L2, L4;
L3: loadI 1024 => r7;
L4: addI r7, 100 => r9;
    store r9 => r0;
Results of local algebraic simplification
    loadI 1 => r1;
    lshiftI r1, 6 => r2;    // multiplication by power of 2 simplified
    addI r2, 50 => r3;
    cmp_LT r2, r3 => r4;
    cbr r4 -> L3, L1;
L1: loadI 1 => r5;
    loadI 10 => r6;
    loadI 0 => r7;
    cmp_GT r5, r6 => r8;
    cbr r8 -> L4, L2;
L2: add r7, r5 => r7;
    addI r5, 1 => r5;
    cmp_LE r5, r6 => r8;
    cbr r8 -> L2, L4;
L3: loadI 1024 => r7;
L4: addI r7, 100 => r9;
    store r9 => r0;
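The one rewrite visible in this listing, turning a multiplication by a power of 2 into a left shift, depends on recognizing the power of 2 and computing the shift amount. A minimal sketch of that check follows. The class and method names are hypothetical, and instructions are built as plain strings only to keep the example short.

```java
// Hypothetical helper for illustration; not part of the provided framework.
final class PowerOfTwo {
    // Returns k such that n == 2^k, or -1 if n is not a positive power of two.
    static int shiftAmount(int n) {
        if (n <= 0 || Integer.bitCount(n) != 1) {
            return -1;
        }
        return Integer.numberOfTrailingZeros(n);
    }

    // Rewrites "multI src, c => dst" as "lshiftI src, k => dst" when c == 2^k
    // with k >= 1. Multiplication by 1 (k == 0) is left unchanged here on the
    // assumption that a separate rule would handle it.
    static String simplifyMultI(String src, int constant, String dst) {
        int k = shiftAmount(constant);
        if (k < 1) {
            return "multI " + src + ", " + constant + " => " + dst;
        }
        return "lshiftI " + src + ", " + k + " => " + dst;
    }
}
```

For the second instruction of the example, simplifyMultI("r1", 64, "r2") yields the lshiftI by 6 shown in the listing, while a constant such as 50 leaves the multI untouched.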
Results of local constant propagation and folding
    loadI 1 => r1;
    loadI 64 => r2;
    loadI 114 => r3;
    loadI true => r4;
    jumpI -> L3;
L1: loadI 1 => r5;
    loadI 10 => r6;
    loadI 0 => r7;
    loadI false => r8;
    jumpI -> L2;
L2: add r7, r5 => r7;
    addI r5, 1 => r5;
    cmp_LE r5, r6 => r8;
    cbr r8 -> L2, L4;
L3: loadI 1024 => r7;
L4: addI r7, 100 => r9;
    store r9 => r0;
Several instructions were replaced by loadI instructions in this pass, since several times registers were used whose values were known constants. Also, two of the three conditional branches were replaced by immediate jumps, since the result of the comparisons that preceded them became constants.
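The branch rewriting described here comes down to two small steps: folding a comparison whose operands are both known, and then picking the branch target that a known condition selects. The sketch below shows both under simplifying assumptions: operands are plain Java ints, instructions are rendered as strings, and the class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not part of the provided framework.
final class BranchFolding {
    // Evaluates an ILOC comparison opcode on two known integer operands.
    static boolean foldCompare(String opcode, int a, int b) {
        switch (opcode) {
            case "cmp_LT": return a < b;
            case "cmp_LE": return a <= b;
            case "cmp_EQ": return a == b;
            case "cmp_NE": return a != b;
            case "cmp_GE": return a >= b;
            case "cmp_GT": return a > b;
            default:
                throw new IllegalArgumentException("not a comparison: " + opcode);
        }
    }

    // Rewrites "cbr r -> ifTrue, ifFalse" as a jumpI once r's value is known.
    static String foldBranch(boolean condition, String ifTrue, String ifFalse) {
        return "jumpI -> " + (condition ? ifTrue : ifFalse);
    }
}
```

In the first block of the example, cmp_LT on 64 and 114 folds to true, so the cbr to L3 or L1 becomes a jump to L3. In the block at L1, cmp_GT on 1 and 10 folds to false, so the cbr to L4 or L2 becomes a jump to L2.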
A couple of things should be pointed out here:
Results of unreachable block elimination
    loadI 1 => r1;
    loadI 64 => r2;
    loadI 114 => r3;
    loadI true => r4;
    jumpI -> L3;
L3: loadI 1024 => r7;
L4: addI r7, 100 => r9;
    store r9 => r0;
The code between L1 and L3 was removed. This is the code that I would expect a working version of your program to output for this example.
Again, it should be pointed out that this code is not perfect by any stretch. But it is a marked improvement over what we started with. Additional passes that performed other kinds of analyses would be capable of making additional improvements. For example, these two instructions:
L3: loadI 1024 => r7;
L4: addI r7, 100 => r9;
...would ideally be subject to constant propagation, changing them to this instead:
L3: loadI 1024 => r7;
L4: loadI 1124 => r9;
The presence of the label L4 separates these two instructions into different basic blocks. Our constant propagation algorithm works only within a basic block, rendering it incapable of making this change. Furthermore, the values of registers r1, r2, r3, r4, and r7 are never used in this fragment. This being the case, an ideal optimizer would detect the fact that they are no longer "live" and remove the corresponding loadI instructions entirely. A peephole optimizer might then remove the jump to L3. No longer serving a purpose, both labels could be removed, leaving us with only this code as a rewrite of the entire original code fragment:
    loadI 1124 => r9;
    store r9 => r0;
When properly designed and implemented, optimization is a beautiful thing!
Starting point
The entire framework of the program is being provided to you as a Zip archive. Most of the code is available only in its compiled form, as .class files. The .java files that will be relevant to your work have been provided.
Deliverables
Place all of the .java files that comprise your program into a Zip archive. Also, include a file called README.txt in your archive, which briefly explains what portion of the program you believe you have working, what aspects of it are only partially working, and what aspects do not work at all. You do not need to include the provided .class files from the Starting Point.
Follow this link for a discussion of how to submit your assignment. Remember that we do not accept paper submissions of your assignments, nor do we accept them via email under any circumstances.
In order to keep the grading process relatively simple, we require that you design your program in such a way that it can be compiled and executed with the following set of commands:
    javac *.java
    java Driver example.iloc 1 2 ...
...where there may be any list of at least one valid optimizer pass number (which may include duplicates). We will test your optimization passes both in isolation and in concert.