[Compiler] 09. Intermediate Code Generation - Three-Address Code and Type Checking

Intermediate Code Generation - Three-Address Code and Type Checking

Intermediate code generation is the phase that converts the syntax tree into a machine-independent Intermediate Representation (IR). This article covers the main IR formats, type checking, and control flow translation.

1. Syntax Tree Variants

Abstract Syntax Tree (AST)

An AST is a concise representation with unnecessary nodes removed from the parse tree.

// a + a * (b - c) + (b - c) * d

AST:
         +
        / \
       +    *
      / \  / \
     a   * -   d
        / \ / \
       a  - b   c
         / \
        b   c

Directed Acyclic Graph (DAG)

A DAG shares common subexpressions in the AST.

// a + a * (b - c) + (b - c) * d

DAG:
         +
        / \
       +    *
      / \  / \
     a   * |   d
        / \|
       a   -     <- (b-c) represented only once
          / \
         b   c

The advantages of a DAG are:

Automatic common subexpression detection: Shares identical operations to prevent redundant computation.
Memory savings: Reuses identical nodes.

DAG Construction Algorithm

Node createNode(String op, Node left, Node right) {
    // Check if a node with the same operation and children already exists
    Node existing = lookup(op, left, right);
    if (existing != null) {
        return existing;  // Reuse existing node
    }

    Node node = new Node(op, left, right);
    insert(op, left, right, node);  // Register in hash table
    return node;
}

2. Three-Address Code

Three-address code is the most widely used intermediate representation format. Each instruction has at most one operator and at most three addresses (operands and result).

Forms of Three-Address Code

Basic form: x = y op z

Main instruction types:
  x = y op z        // Binary operation
  x = op y          // Unary operation
  x = y             // Copy
  goto L            // Unconditional branch
  if x relop y goto L  // Conditional branch
  param x           // Parameter passing
  call p, n         // Function call (p: function, n: arg count)
  return y          // Function return
  x = y[i]          // Array access (index)
  x[i] = y          // Array assignment
  x = &y            // Address of
  x = *y            // Dereference
  *x = y            // Indirect assignment

Example: Three-Address Code Translation of an Expression

// Original code
a = b * -c + b * -c

// Three-address code (AST based)
t1 = minus c
t2 = b * t1
t3 = minus c
t4 = b * t3
t5 = t2 + t4
a = t5

// Three-address code (DAG based - optimized)
t1 = minus c
t2 = b * t1
t5 = t2 + t2
a = t5

3. Implementation of Three-Address Code

Quadruples

Composed of 4 fields: (op, arg1, arg2, result)

No.  | op    | arg1 | arg2 | result
----+-------+------+------+-------
 0  | minus | c    |      | t1
 1  | *     | b    | t1   | t2
 2  | minus | c    |      | t3
 3  | *     | b    | t3   | t4
 4  | +     | t2   | t4   | t5
 5  | =     | t5   |      | a

Triples

Removes the result field and references results by instruction number.

No.  | op    | arg1 | arg2
----+-------+------+------
 0  | minus | c    |
 1  | *     | b    | (0)
 2  | minus | c    |
 3  | *     | b    | (2)
 4  | +     | (1)  | (3)
 5  | =     | a    | (4)

Triples save memory but require updating references when instructions are moved.

Indirect Triples

Maintains a separate pointer list to triples, making instruction reordering easier.

Order list:      [0, 1, 2, 3, 4, 5]
Reordered:       [2, 0, 3, 1, 4, 5]  // Original triples unchanged

4. Static Single Assignment (SSA) Form

In SSA (Static Single Assignment) form, each variable is defined exactly once.

// Normal three-address code
p = a + b
q = p - c
p = q * d      // p defined twice
e = p + q

// SSA form
p1 = a + b
q1 = p1 - c
p2 = q1 * d   // New name p2
e1 = p2 + q1

Phi Functions

Phi functions are used at points where control flow merges.

// Original code
if (flag) x = 1; else x = 2;
y = x;

// SSA form
if (flag) x1 = 1; else x2 = 2;
x3 = phi(x1, x2);   // Selects based on which path was taken
y1 = x3;

Advantages of SSA

Simplifies data flow analysis: Each variable has a unique definition.
Facilitates optimization: Constant propagation, dead code elimination, etc., become intuitive.
Used as the core IR in modern compilers (LLVM, GCC).

5. Type Expressions

How types are represented in a type system.

Basic Types

boolean, char, integer, float, void

Type Constructors

1. Array: array(size, element_type)
   int a[10]  ->  array(10, integer)

2. Record/Struct: record(field_list)
   struct { int x; float y; }
   ->  record((x, integer), (y, float))

3. Pointer: pointer(type)
   int *p  ->  pointer(integer)

4. Function: parameter_types -> return_type
   int f(float, char)  ->  float x char -> integer

Type Equivalence

There are two equivalence criteria:

Structural Equivalence:
  - Two types are equivalent if their structures are the same
  - C language's basic approach

Name Equivalence:
  - Equivalent only if they have the same name
  - Pascal, Ada's approach

Example:
  type A = array(10, integer)
  type B = array(10, integer)

  Structural equivalence: A == B  (same structure)
  Name equivalence:       A != B  (different names)

6. Type Checking

SDD for Type Checking Rules

E -> num               { E.type = integer }
E -> num . num         { E.type = float }
E -> id                { E.type = lookup(id.entry) }
E -> E1 + E2           { E.type = max(E1.type, E2.type)
                          if compatible(E1.type, E2.type)
                          else type_error }
E -> E1 [ E2 ]         { E.type = if E1.type == array(s, t)
                                     and E2.type == integer
                                   then t
                                   else type_error }

Type Coercion

Automatically converts operands of different types.

Type conversion hierarchy (automatic widening):
  char -> int -> long -> float -> double

Example: int + float -> float + float -> float

Conversion rules (widening):
  E -> E1 + E2
    if E1.type == integer and E2.type == integer:
      E.type = integer
    else if E1.type == float and E2.type == integer:
      t = new Temp()
      emit(t "=" "intToFloat" E2.addr)
      E.type = float
    else if E1.type == integer and E2.type == float:
      t = new Temp()
      emit(t "=" "intToFloat" E1.addr)
      E.type = float

7. Control Flow Translation

If-Else Statement Translation

// Source code
if (x < 100) {
    x = x + 1;
} else {
    x = x - 1;
}

// Three-address code
    if x < 100 goto L1
    goto L2
L1: t1 = x + 1
    x = t1
    goto L3
L2: t2 = x - 1
    x = t2
L3: ...

While Statement Translation

// Source code
while (x < 100) {
    x = x * 2;
}

// Three-address code
L1: if x < 100 goto L2
    goto L3
L2: t1 = x * 2
    x = t1
    goto L1
L3: ...

8. Boolean Expression Translation

Short-Circuit Evaluation

Most languages use short-circuit evaluation for boolean expressions.

// a < b || c < d && e < f

Translation:
    if a < b goto Ltrue
    if c < d goto L1
    goto Lfalse
L1: if e < f goto Ltrue
    goto Lfalse

If a < b is true, the rest is not evaluated.

SDD for Boolean Expressions

B -> B1 || B2
  B1.true = B.true
  B1.false = newLabel()
  B2.true = B.true
  B2.false = B.false
  B.code = B1.code || label(B1.false) || B2.code

B -> B1 && B2
  B1.true = newLabel()
  B1.false = B.false
  B2.true = B.true
  B2.false = B.false
  B.code = B1.code || label(B1.true) || B2.code

B -> ! B1
  B1.true = B.false
  B1.false = B.true
  B.code = B1.code

B -> E1 relop E2
  B.code = gen("if" E1.addr relop E2.addr "goto" B.true)
        || gen("goto" B.false)

B -> true
  B.code = gen("goto" B.true)

B -> false
  B.code = gen("goto" B.false)

9. Backpatching

Backpatching is a technique for filling in branch target labels later when they are not yet determined.

Core Idea of Backpatching

1. Leave the target address empty when generating branch instructions
2. Fill in the empty address when the target is determined (backpatch)

Example:
  100: if x < y goto ___     <- target unknown
  101: goto ___              <- target unknown
  ...
  105: x = x + 1             <- discover this is the target

  After backpatching:
  100: if x < y goto 105     <- filled in
  101: goto ___              <- still unknown

Functions Used in Backpatching

makelist(i):    Creates a list containing only instruction number i
merge(p1, p2):  Merges two lists p1, p2
backpatch(p, i): Sets the target address of all instructions in list p to i

Boolean Expression Translation with Backpatching

B -> B1 || M B2
  backpatch(B1.falselist, M.quad)
  B.truelist = merge(B1.truelist, B2.truelist)
  B.falselist = B2.falselist

B -> B1 && M B2
  backpatch(B1.truelist, M.quad)
  B.truelist = B2.truelist
  B.falselist = merge(B1.falselist, B2.falselist)

B -> ! B1
  B.truelist = B1.falselist
  B.falselist = B1.truelist

M -> epsilon
  M.quad = nextquad  (record current instruction number)

Backpatching Example

// a < b || c < d && e < f

Generation process:
  100: if a < b goto ___      (B1.truelist = {100})
  101: goto ___                (B1.falselist = {101})
  102: if c < d goto ___      (B2.truelist = {102})
  103: goto ___                (B2.falselist = {103})
  104: if e < f goto ___      (B3.truelist = {104})
  105: goto ___                (B3.falselist = {105})

After backpatching:
  100: if a < b goto Ltrue    (truelist)
  101: goto 102               (B1.false -> B2 start)
  102: if c < d goto 104      (B2.true -> B3 start)
  103: goto Lfalse            (falselist)
  104: if e < f goto Ltrue    (truelist)
  105: goto Lfalse            (falselist)

Final truelist = {100, 104}
Final falselist = {103, 105}

10. Switch Statement Translation

// Source code
switch (x) {
    case 1: stmt1; break;
    case 5: stmt2; break;
    default: stmt3;
}

// Three-address code
    t = x
    if t == 1 goto L1
    if t == 5 goto L2
    goto L3            // default
L1: (stmt1 code)
    goto Lnext
L2: (stmt2 code)
    goto Lnext
L3: (stmt3 code)
    goto Lnext
Lnext: ...

When there are many cases, optimization is done using jump tables or binary search.

Summary

Concept	Key Content
DAG	Graph sharing common subexpressions from AST
Three-Address Code	One operator per instruction, max 3 addresses
Quadruples	(op, arg1, arg2, result) format
Triples	Results referenced by instruction number
SSA	Each variable defined exactly once
Type Checking	Verifying type compatibility of operators/operands
Short-Circuit Eval.	Partial evaluation of boolean expressions
Backpatching	Filling in branch targets later

Intermediate code generation is the core phase that represents the semantics of source language in a machine-independent form. In the next article, we will cover runtime environments where programs execute.