ICS 142 Winter 2004, Assignment #6

Introduction

Ultimately, the job of a compiler is to take a program in some source language and generate an equivalent program in some target language. Generally, that target language is an executable program for some platform, meaning that assembly code must be generated at some stage. Many of the abstractions provided by high-level programming languages -- procedures, scopes, arrays, structures, and support for a variety of built-in data types (including automatic conversions between them), to name a few -- do not exist at the assembly level. So the overall job of the back end of a compiler is to map higher-level abstractions into lower-level ones, choosing an assembly-level implementation for high-level language constructs. Naturally, some implementations of language constructs are better than others. But some are better in some situations and worse in others. Context plays a large role in selecting a good implementation. The complexities of generating good intermediate code are complicated by attempting to analyze this context on the fly. It makes good software engineering sense, then, to generate intermediate code that makes a "best guess" at a good implementation, then allow an optimizer to find better implementations whenever possible, based on more complex analysis of context.

The job of an optimizer is to take an intermediate code program and rewrite it to be a better program with the equivalent effect. ("Better," of course, can mean many things: faster, less memory usage, or less power consumption, for example.) Optimizers can employ many forms of analysis to improve a program, which are typically arranged into passes, where each pass uses one technique to attempt to improve the code. The net effect of all the passes, some of which may be repeated more than once, should be a significant improvement of the original program.

In this assignment, we'll explore a few issues that arise in the optimization of a substantial subset of the intermediate language ILOC that was discussed in lecture (and is discussed in the textbook). You'll write a program that takes a fragment of ILOC code, performs one or more optimization passes on it, and outputs the optimized fragment. The entire structure of the program will be provided, including a scanner/parser for ILOC, representations for ILOC instructions, and a module to pretty-print the output. Your job will only be to write three optimization modules. (The framework is extensible, so you're welcome to implement additional optimization modules if you'd like, though I won't be offering any extra credit for them.)

The subset of ILOC for this assignment

In this assignment, a substantial subset of ILOC (as presented in lecture and the textbook) is to be supported and optimized. The following ILOC instructions are to be supported in this assignment:

Opcode	Source Operands	Target Operands	Description
add	reg1, reg2	reg3	Adds the value in reg1 and reg2, storing the result in reg3.
addI	reg1, int2	reg3	Adds the value in reg1 to the constant integer int2, storing the result in reg3.
sub	reg1, reg2	reg3	Subtracts the value in reg2 from reg1, storing the result in reg3.
subI	reg1, int2	reg3	Subtracts the constant integer int2 from reg1, storing the result in reg3.
rsubI	reg1, int2	reg3	Subtracts the value in reg1 from the constant integer int2, storing the result in reg3.
mult	reg1, reg2	reg3	Multiplies the values in reg1 and reg2, storing the result in reg3.
multI	reg1, int2	reg3	Multiplies the value in reg1 by the integer constant int2, storing the result in reg3.
div	reg1, reg2	reg3	Divides the value in reg1 by the value in reg2, storing the result in reg3. If reg2's value is zero, it is assumed that a processor exception is raised.
divI	reg1, int2	reg3	Divides the value in reg1 by the integer constant int2, storing the result in reg3. If int2 is zero, it is assumed that a processor exception is raised.
rdivI	reg1, int2	reg3	Divides the value of the integer constant int2 by the value in reg1, storing the result in reg3. If the value in reg1 is zero, it is assumed that a processor exception is raised.
lshift	reg1, reg2	reg3	Left-shifts the value in reg1 by reg2 bits, storing the result in reg3.
lshiftI	reg1, int2	reg3	Left-shifts the value in reg1 by the number of bits specified by the integer constant int2, storing the result in reg3.
rshift	reg1, reg2	reg3	Right-shifts the value in reg1 by reg2 bits, storing the result in reg3.
rshiftI	reg1, int2	reg3	Right-shifts the value in reg1 by the number of bits specified by the integer constant int2, storing the result in reg3.
and	reg1, reg2	reg3	AND's together the (presumably boolean) values stored in reg1 and reg2, storing the result in reg3.
andI	reg1, bool2	reg3	AND's together the (presumably boolean) value stored in reg1 with the boolean constant bool2, storing the result in reg3.
or	reg1, reg2	reg3	OR's together the (presumably boolean) values stored in reg1 and reg2, storing the result in reg3.
orI	reg1, bool2	reg3	OR's together the (presumably boolean) value stored in reg1 with the boolean constant bool2, storing the result in reg3.
xor	reg1, reg2	reg3	XOR's together the (presumably boolean) values stored in reg1 and reg2, storing the result in reg3.
xorI	reg1, bool2	reg3	XOR's together the (presumably boolean) value stored in reg1 with the boolean constant bool2, storing the result in reg3.
not	reg1	reg2	NOT's the (presumably boolean) value stored in reg1, storing the result in reg2.
load	reg1	reg2	Loads the value stored in the memory address stored in reg1 into the register reg2.
loadI	const1	reg1	Places the value of the constant const1 into reg1. const1 may be either an integer or a boolean constant.
loadAI	reg1, int2	reg3	Loads the value stored in the memory address calculated by adding the integer constant int2 to the value stored in reg1. The loaded value is placed into reg3.
loadAO	reg1, reg2	reg3	Loads the value stored in the memory address calculated by adding the values stored in reg1 and reg2. The loaded value is placed into reg3.
store	reg1	reg2	Stores the value in reg1 into the memory address indicated in reg2.
storeAI	reg1	reg2, int3	Stores the value in reg1 into the memory address calculated by adding the integer constant int3 to the value in reg2.
storeAO	reg1	reg2, reg3	Stores the value in reg1 into the memory address calculated by adding the values in reg2 and reg3.
i2i	reg1	reg2	Copies the value stored in reg1 into reg2.
cmp_LT cmp_LE cmp_EQ cmp_NE cmp_GE cmp_GT	reg1, reg2	reg3	Compares the values stored in reg1 and reg2, storing the boolean result of the comparison into reg3. Each of these instructions uses a different form of comparison: cmp_LT uses <, cmp_LE uses <=, and so on.
cbr	reg1	Label1, Label2	If the (presumably boolean) value stored in reg1 is true, jump to Label1, otherwise jump to Label2.
jumpI	none	Label1	Jump to Label1.
nop	none	none	Has no effect, but is sometimes necessary as a placeholder. Optimizations should not remove these; they are placed automatically when needed.

ILOC code is to be written into an input file, subject to the following restrictions:

All instructions are to be terminated by semicolons. This is not the notation used in the textbook or in lecture, but it simplified my parser implementation.
The names of all registers must begin with a lowercase r and be followed by a non-negative integer. Example register names are r3 and r999.
The names of all labels must begin with an uppercase L and be followed by a non-negative integer. Example label names are L3 and L999.
An instruction may have either zero or one labels associated with it. No instruction may have two or more labels.
All jumps must be to existent labels.
The numbers and types of all operands must be correct. An attempt to read an instruction from an input file with the wrong number or types of operands -- or an attempt to create such an instruction programmatically in your optimizer -- will result in an exception being thrown by my code, which will crash your program.

Here is an example input file. It should be pointed out that, for the sake of readability, I've spaced the input in the file somewhat, though whitespace is not considered relevant, except when it is necessary to separate tokens.

     loadI  50      => r1;
     loadI  100     => r2;
     cmp_LT r1, r2  => r3;
     cbr    r3      -> L1, L2;
L1:  add    r4, r5  => r6;
     jumpI          -> L3;
L2:  add    r7, r8  => r6;
     jumpI          -> L3;
L3:  cmp_LT r6, r7  => r8;
     cbr    r8      -> L4, L1;
L4:  cmp_GT r6, r7  => r8;
     cbr    r8      -> L5, L6;
L5:  loadI  true    => r9;
     jumpI          -> L7;
L6:  loadI  false   => r9;
     jumpI          -> L7;
L7:  nop;

The cbr and jumpI instructions use the symbol -> to separate source operands from target operands. The nop instruction has no operands. All other instructions use the symbol => to separate source operands from target operands.

Comments may be placed into input files; anything following two slashes (i.e. //) until the end of the line is considered to be a comment, much like in Java.

Basic blocks and control-flow graphs

Many optimization techniques involve some form of compile-time simulation of the run-time behavior of the program. In the presence of control-flow structures, such as if-then-else statements and while loops (which introduce conditional branches into the intermediate code), this simulation becomes increasingly more difficult to do precisely. To simplify matters, we divide the intermediate code program into straight-line chunks of code that must always be executed in sequence from beginning to end. Each such chunk is called a basic block.

Here's a brief example ILOC program:

     loadI  1       => r1;
     loadI  10      => r2;
     loadI  0       => r3;
     cmp_GT r1, r2  => r4;
     cbr    r4      -> L2, L1;
L1:  add    r3, r1  => r3;
     addI   r1, 1   => r1;
     cmp_LE r1, r2  => r4;
     cbr    r4      -> L1, L2;
L2:  store  r3      => r0;

This program uses a loop to calculate the sum of the integers 1..10 and store it in a memory address stored in the register r0. There are three basic blocks in this program:

The first five instructions must always run in sequence from beginning to end. They constitute the first basic block, which we'll call block 0.
The next four instructions must always run in sequence from beginning to end. They constitute the second basic block, which we'll call block 1.
The last instruction is on its own, only running when it is finally jumped to. It, by itself, constitutes the third basic block, which we'll call block 2.

From this example, we see that any jump instruction (cbr or jumpI) always constitutes the end of a basic block. A label must always constitute the beginning, since it is possible to jump to labels.

The longer the basic blocks in a program, the better. Basic blocks indicate sequences of code that are likely to pipeline well. It is also significantly easier to perform optimizations on a single basic block than to try to perform them on more than one block (due to the complexities introduced by control-flow).

Basic blocks are said to be connected together as a control-flow graph. In this context, we'll often refer to the blocks as nodes. In our example, the following edges exist between nodes in the control-flow graph:

There is an edge from node 0 to node 2, since it's possible for the last instruction in node 0 to jump to L2, which is the first instruction in node 2.
There is an edge from node 0 to node 1, since it's possible for the last instruction in node 0 to jump to L1, which is the first instruction in node 1.
For similar reasons, there are edges from node 1 to node 2, and also from node 1 to node 1.
Node 2 has no outgoing edges, since it is the end of our code fragment.

Viewing basic blocks as a graph like this is handy, since it allows a variety of well-known graph algorithms to be useful for performing optimization. We'll use one such algorithm in Part 3.

Good news: I've provided the portion of your program that reads the input file, finds the basic blocks, and builds the control-flow graph.

Guiding assumptions about the profitability of optimizations

Different architectures have radically different performance characteristics. For this reason, optimizers for different architectures will need to make different decisions. After all, the point of optimization is not to make the code prettier; it's to make the code perform better. A compiler writer must pay careful attention to the optimizations she chooses to perform, to ensure that they really constitute an improvement in the program as it will execute on the target architecture.

For your optimizer, use the following assumptions about the performance of the underlying architecture to guide your choices. This list may or may not compare favorably with a list for an actual architecture; I've made these decisions to make the problem a more interesting one to solve.

The immediate form of arithmetic, shifting, and logical instructions should be preferred over the regular form. For example, addI should be preferred over add, rshiftI should be preferred over rshift, and xorI should be preferred over xor.
A single add or subtract operation should be preferred over a single multiplication or division. A single left- or right-shift should be preferred over a single addition, subtraction, multiplication, or division. (This only applies in the case that all operations in question are non-immediate or all operations are immediate. A multI is considered better than an add, though an addI is considered better than a multI.)
The not instruction should be preferred over any other boolean operator.
A register-to-register copy, i2i, should be preferred over any instruction that performs a computation.
A loadI operation, which loads an immediate value into a register, should be preferred over any instruction that performs a computation. In other words, any time you can do the work of one or more instructions at compile time and replace it with a loadI operation, this will be considered profitable.
A loadI operation should also be preferred over a register-to-register copy (i2i) operation.
The "address+immediate" forms of load and store (loadAI and storeAI) should be preferred over the "address+offset" forms (loadAO and storeAO).
An immediate jump (jumpI) should be preferred over a conditional branch (cbr). The motivation for this is the desire to keep a pipeline full. A conditional branch operation always runs the risk of causing the pipeline to be flushed (though, realistically, branch prediction can reduce the risk), while an immediate jump does not.

Of course, it should be noted here that it is only possible to replace one instruction with another in the case that their effect is the same. For example, given this brief sequence of instructions:

     loadI  100     => r1;
     load   r2      => r3;
     addI   r1, 50  => r4;
     addI   r3, 50  => r5;

...since r1 is known to have the constant value 100, we can replace the first addI instruction with a loadI. The second addI, however, cannot be replaced, since its run-time value is based on a value in a memory location whose value is unknown to us at compile-time. So, a correctly-rewritten sequence might look like this:

     loadI  100     => r1;
     load   r2      => r3;
     loadI  150     => r4;    // this is considered better than addI
     addI   r3, 50  => r5;    // this cannot be replaced, since we can't know r3's value at compile-time

Part 1: Local algebraic simplification (30 points)

Apply the following algebraic transformations on the code within each basic block. Handle each basic block separately. (The term "local", when used to describe an optimization technique, indicates an optimization that works only separately within each basic block. For this transformation, it actually makes little difference, since we can make substitutions without understanding any context.)

Addition or subtraction of a constant 0 should be replaced with a register-to-register copy.
- e.g. addI r1, 0 => r2; should be replaced with i2i r1 => r2;
Subtraction of a register from itself should be replaced with an immediate load of 0.
- e.g. sub r1, r1 => r2; should be replaced with loadI 0 => r2;
Multiplication by the constant 0 should be replaced with an immediate load of 0.
- e.g. multI r1, 0 => r2; should be replaced with loadI 0 => r2;
Multiplication or division by the constant 1 should be replaced with a register-to-register copy.
- e.g. multI r1, 1 => r2; should be replaced with i2i r1 => r2;
Reverse-immediate division of the constant 0 by a register should be replaced with an immediate load of 0.
- e.g. rdivI r1, 0 => r2; should be replaced with loadI 0 => r2;
Multiplication or division by immediate powers of 2 should be replaced by the appropriate left- or right-shift operation.
- e.g. multI r1, 64 => r2; should be replaced with lshiftI r1, 6 => r2;
Division of a register by itself should be replaced with an immediate load of 1.
- e.g. div r1, r1 => r2; should be replaced with loadI 1 => r2;
Left- or right-shift by the constant 0 should be replaced with a register-to-register copy.
- e.g. lshiftI r1, 0 => r2; should be replaced with i2i r1 => r2;
Immediate AND (andI), OR (orI), XOR (xorI) by the constants true or false should be replaced with the appropriate boolean identity.
- e.g. andI r1, false => r2; should be replaced with loadI false => r2;
AND, OR of a register by itself should be replaced with a register-to-register copy.
- e.g. and r1, r1 => r2; should be replaced with i2i r1 => r2;
XOR of a register by itself should be replaced with an immediate load of false.
- e.g. xor r1, r1 => r2; should be replaced with loadI false => r2;

Part 2: Local constant propagation and folding (50 points)

Whenever it can be proven that a register must have a known constant value, that fact can then be used to simplify instructions that use the value of the register. For example, consider the following pair of instructions:

     loadI   40      => r1;
     addI    r1, 40  => r2;

Since it's clear that the value of r1 must be 40 after the first instruction executes, the second instruction is really the addition of 40 and 40. (Replacing a register with a known constant value is called constant propagation.) Since both operands are known to be constants, we might as well perform the addition at compile-time and replace the addI instruction with an immediate load. (Combining constants together at compile-time is known as constant folding.) The combination of constant propagation and constant folding yields this pair of instructions in lieu of the original two:

     loadI   40      => r1;
     loadI   80      => r2;

This optimization has two benefits. First, it replaces an add instruction with an immediate load, which, according to our guidelines from earlier in the write-up, is considered to be an improvement. Second, and perhaps more importantly, r2 now has a known constant value, which enables us to propagate that value to future instructions.

Proving whether a register has a constant value can be tricky in the general case, though if we limit ourselves to one basic block at a time, a much simpler algorithm can be used:

    when we start processing a basic block, consider all registers to be non-constants

    for each instruction i in the block, in top-to-bottom order
    {
        if one or more of the operands in the instruction are known to be constants
            propagate the constant values...
            ...fold constants...
            ...and replace the instruction if possible

        regardless of whether we made a change to the instruction...
            if the instruction now stores a constant value into a register
                add that (register, constant value) pair to our collection of registers and known constant values
            else if the instruction now stores a non-constant value into a register
                remove that register from our collection of registers and known constant values
    }

This algorithm boils down to a simulation of the basic block's execution at compile-time. We make the most conservative assumption to start with, that none of the registers are known to be constants. Anytime a register is assigned a constant value (such as with a loadI instruction), we add it (and the value) to a collection of registers with known constant values. Anytime a register is assigned a non-constant value (such as with a load instruction), we remove it from the collection of registers with known constant values. This collection, as it turns out, is really a map (in the data structure sense of the word), which might efficiently be implemented using a hash table (e.g. HashMap in the Java library).

Not surprisingly, the algorithm for maintaining the collection of known constant values becomes a great deal more complicated when it can be run over many basic blocks. This technique is known as global constant propagation and folding. (The term "global," when applied to an optimization technique, does not mean a program-wide optimization. It means an optimization made on all the basic blocks in one procedure, considered together.) We won't be covering global optimizations in this course, though there's plenty of reading material on the subject in Chapters 9 and 10 of the textbook, if you're interested.

I'll leave it as an exercise for each of you to figure out which instructions can be replaced and how they ought to be replaced, based on the set of guiding assumptions from earlier in the write-up. Don't forget to update your set of registers with known constant values whenever it changes!

Part 3: Unreachable block elimination (20 points)

Either as a result of a poorly-written input file, or more likely as the result of one of the optimizations in the previous parts, one or more entire basic blocks in the control-flow graph may become unreachable. If this is the case, we should eliminate unreachable nodes from the control-flow graph of our ILOC program entirely, since they serve no purpose.

The analysis required is a relatively straightforward depth-first graph traversal algorithm with marking:

    consider all nodes in the CFG to be unmarked
    let currentNode = node 0 (the start node)
    loop
    {
        mark currentNode
        if there exists an unvisited successor n of currentNode
            currentNode = n
        else
            backtrack
    }

I've illustrated the algorithm using a pseudo-loop, but I actually implemented it as a recursive algorithm with backtracking. Since the CFGNode class I provided does not have a marking feature in it, I suggest implementing the marks by storing them in a separate one-dimensional boolean array.

Once you've finished the traversal phase, iterate through the nodes and remove the ones that were never marked. Nodes can be removed by calling the removeNode( ) method on the ControlFlowGraph. Beware that the start node, node 0, is always considered reachable and, hence, may not be removed!

A step-by-step example

Suppose we began with the following ILOC input file, running all three optimizer passes (local algebraic simplification, local constant propagation and folding, and unreachable block elimination) in sequence.

     loadI   1       => r1;
     multI   r1, 64  => r2;
     addI    r2, 50  => r3;
     cmp_LT  r2, r3  => r4;
     cbr     r4      -> L3, L1;
L1:  loadI   1       => r5;
     loadI   10      => r6;
     loadI   0       => r7;
     cmp_GT  r5, r6  => r8;
     cbr     r8      -> L4, L2;
L2:  add     r7, r5  => r7;
     addI    r5, 1   => r5;
     cmp_LE  r5, r6  => r8;
     cbr     r8      -> L2, L4;
L3:  loadI   1024    => r7;
L4:  addI    r7, 100 => r9;
     store   r9      => r0;

Results of local algebraic simplification

     loadI   1       => r1;
     lshiftI r1, 6   => r2;   // multiplication by power of 2 simplified
     addI    r2, 50  => r3;
     cmp_LT  r2, r3  => r4;
     cbr     r4      -> L3, L1;
L1:  loadI   1       => r5;
     loadI   10      => r6;
     loadI   0       => r7;
     cmp_GT  r5, r6  => r8;
     cbr     r8      -> L4, L2;
L2:  add     r7, r5  => r7;
     addI    r5, 1   => r5;
     cmp_LE  r5, r6  => r8;
     cbr     r8      -> L2, L4;
L3:  loadI   1024    => r7;
L4:  addI    r7, 100 => r9;
     store   r9      => r0;

Results of local constant propagation and folding

     loadI   1       => r1;
     loadI   64      => r2;
     loadI   114     => r3;
     loadI   true    => r4;
     jumpI           -> L3;
L1:  loadI   1       => r5;
     loadI   10      => r6;
     loadI   0       => r7;
     loadI   false   => r8;
     jumpI           -> L2;
L2:  add     r7, r5  => r7;
     addI    r5, 1   => r5;
     cmp_LE  r5, r6  => r8;
     cbr     r8      -> L2, L4;
L3:  loadI   1024    => r7;
L4:  addI    r7, 100 => r9;
     store   r9      => r0;

Several instructions were replaced by loadI instructions in this pass, since several times registers were used whose values were known constants. Also, two of the three conditional branches were replaced by immediate jumps, since the result of the comparisons that preceded them became constants.

A couple of things should be pointed out here:

First of all, the last instruction before the label L2 is an immediate jump to L2. This is not a problem that will be solved in any of our optimizations, though in a real compiler, you would obviously want to solve it. (One way to solve it is to use a peephole optimizer, which walks through the intermediate code after all other optimizations and looks for short, bizarre instruction sequences like this one, or instructions such as i2i r1 => r1;.)
Because of the immediate jump to L3 in the fifth instruction, none of the code between the labels L1 and L3 is reachable. The third pass will eliminate all of this code.

Results of unreachable block elimination

     loadI   1       => r1;
     loadI   64      => r2;
     loadI   114     => r3;
     loadI   true    => r4;
     jumpI           -> L3;
L3:  loadI   1024    => r7;
L4:  addI    r7, 100 => r9;
     store   r9      => r0;

The code between L1 and L3 was removed. This is the code that I would expect a working version of your program to output for this example.

Again, it should be pointed out that this code is not perfect by any stretch. But it is a marked improvement over what we started with. Additional passes that performed other kinds of analyses would be capable of making additional improvements. For example, these two instructions:

L3:  loadI   1024    => r7;
L4:  addI    r7, 100 => r9;

...would ideally be subject to constant propagation, changing them to this instead:

L3:  loadI   1024    => r7;
L4:  loadI   1124    => r9;

The presence of the label L4 separates these two instructions into different basic blocks. Our constant propagation algorithm works only within a basic block, rendering it incapable of making this change. Furthermore, the values of registers r1, r2, r3, r4, and r7 are never used in this fragment. This being the case, an ideal optimizer would detect the fact that they are no longer "live" and remove the corresponding loadI instructions entirely. A peephole optimizer might then remove the jump to L3. No longer serving a purpose, both labels could be removed, leaving us with only this code as a rewrite of the entire original code fragment:

     loadI   1124    => r9;
     store   r9      => r0;

When properly designed and implemented, optimization is a beautiful thing!