Split View: [컴파일러] 09. 중간 코드 생성 - 3주소 코드와 타입 검사

[컴파일러] 09. 중간 코드 생성 - 3주소 코드와 타입 검사

중간 코드 생성 - 3주소 코드와 타입 검사

중간 코드 생성은 구문 트리를 기계 독립적인 **중간 표현(Intermediate Representation, IR)**으로 변환하는 단계입니다. 이 글에서는 주요 IR 형식과 타입 검사, 제어 흐름 번역을 다룹니다.

1. 구문 트리의 변형

추상 구문 트리(AST)

AST는 파스 트리에서 불필요한 노드를 제거한 간결한 표현입니다.

// a + a * (b - c) + (b - c) * d

AST:
         +
        / \
       +    *
      / \  / \
     a   * -   d
        / \ / \
       a  - b   c
         / \
        b   c

방향 비순환 그래프(DAG)

DAG는 AST에서 **공통 부분식(common subexpression)**을 공유합니다.

// a + a * (b - c) + (b - c) * d

DAG:
         +
        / \
       +    *
      / \  / \
     a   * |   d
        / \|
       a   -     <- (b-c)를 한 번만 표현
          / \
         b   c

DAG의 장점은 다음과 같습니다.

공통 부분식 자동 검출: 같은 연산을 공유하여 중복 계산을 방지합니다.
메모리 절약: 동일한 노드를 재사용합니다.

DAG 구성 알고리즘

Node createNode(String op, Node left, Node right) {
    // 같은 연산과 자식을 가진 노드가 이미 있는지 확인
    Node existing = lookup(op, left, right);
    if (existing != null) {
        return existing;  // 기존 노드 재사용
    }

    Node node = new Node(op, left, right);
    insert(op, left, right, node);  // 해시 테이블에 등록
    return node;
}

2. 3주소 코드(Three-Address Code)

3주소 코드는 가장 널리 사용되는 중간 표현 형식입니다. 각 명령문에는 최대 하나의 연산자와 최대 세 개의 주소(피연산자와 결과)가 있습니다.

3주소 코드의 형태

기본 형태: x = y op z

주요 명령문 종류:
  x = y op z        // 이항 연산
  x = op y          // 단항 연산
  x = y             // 복사
  goto L            // 무조건 분기
  if x relop y goto L  // 조건 분기
  param x           // 매개변수 전달
  call p, n         // 함수 호출 (p: 함수, n: 인자 수)
  return y          // 함수 반환
  x = y[i]          // 배열 접근 (인덱스)
  x[i] = y          // 배열 대입
  x = &y            // 주소
  x = *y            // 역참조
  *x = y            // 간접 대입

예제: 수식의 3주소 코드 변환

// 원본 코드
a = b * -c + b * -c

// 3주소 코드 (AST 기반)
t1 = minus c
t2 = b * t1
t3 = minus c
t4 = b * t3
t5 = t2 + t4
a = t5

// 3주소 코드 (DAG 기반 - 최적화)
t1 = minus c
t2 = b * t1
t5 = t2 + t2
a = t5

3. 3주소 코드의 구현 방식

쿼드러플(Quadruples)

4개의 필드로 구성됩니다: (op, arg1, arg2, result)

번호 | op    | arg1 | arg2 | result
----+-------+------+------+-------
 0  | minus | c    |      | t1
 1  | *     | b    | t1   | t2
 2  | minus | c    |      | t3
 3  | *     | b    | t3   | t4
 4  | +     | t2   | t4   | t5
 5  | =     | t5   |      | a

트리플(Triples)

result 필드를 제거하고, 명령문 번호로 결과를 참조합니다.

번호 | op    | arg1 | arg2
----+-------+------+------
 0  | minus | c    |
 1  | *     | b    | (0)
 2  | minus | c    |
 3  | *     | b    | (2)
 4  | +     | (1)  | (3)
 5  | =     | a    | (4)

트리플은 메모리를 절약하지만, 명령문 이동 시 참조를 갱신해야 합니다.

간접 트리플(Indirect Triples)

트리플에 대한 포인터 리스트를 별도로 유지하여, 명령문 순서 변경을 용이하게 합니다.

순서 리스트:  [0, 1, 2, 3, 4, 5]
명령문 재배치: [2, 0, 3, 1, 4, 5]  // 원본 트리플 수정 불필요

4. 정적 단일 배정(SSA) 형식

SSA(Static Single Assignment) 형식에서는 각 변수가 정확히 한 번만 정의됩니다.

// 일반 3주소 코드
p = a + b
q = p - c
p = q * d      // p가 두 번 정의됨
e = p + q

// SSA 형식
p1 = a + b
q1 = p1 - c
p2 = q1 * d   // 새 이름 p2
e1 = p2 + q1

Phi 함수

제어 흐름이 합류하는 지점에서 phi 함수를 사용합니다.

// 원본 코드
if (flag) x = 1; else x = 2;
y = x;

// SSA 형식
if (flag) x1 = 1; else x2 = 2;
x3 = phi(x1, x2);   // 어느 경로로 왔는지에 따라 선택
y1 = x3;

SSA의 장점

데이터 흐름 분석이 단순해집니다: 각 변수의 정의가 유일합니다.
최적화가 용이합니다: 상수 전파, 죽은 코드 제거 등이 직관적입니다.
**현대 컴파일러(LLVM, GCC)**에서 핵심 IR로 사용됩니다.

5. 타입 표현식(Type Expression)

타입 시스템의 타입을 표현하는 방법입니다.

기본 타입

boolean, char, integer, float, void

타입 생성자

1. 배열: array(크기, 원소타입)
   int a[10]  ->  array(10, integer)

2. 레코드/구조체: record(필드목록)
   struct { int x; float y; }
   ->  record((x, integer), (y, float))

3. 포인터: pointer(타입)
   int *p  ->  pointer(integer)

4. 함수: 인자타입 -> 반환타입
   int f(float, char)  ->  float x char -> integer

타입 동치(Type Equivalence)

두 가지 동치 기준이 있습니다.

구조적 동치 (Structural Equivalence):
  - 두 타입의 구조가 같으면 동치
  - C 언어의 기본 접근법

이름 동치 (Name Equivalence):
  - 같은 이름일 때만 동치
  - Pascal, Ada의 접근법

예:
  type A = array(10, integer)
  type B = array(10, integer)

  구조적 동치: A == B  (구조가 같으므로)
  이름 동치:   A != B  (이름이 다르므로)

6. 타입 검사(Type Checking)

타입 검사 규칙의 SDD

E -> num               { E.type = integer }
E -> num . num         { E.type = float }
E -> id                { E.type = lookup(id.entry) }
E -> E1 + E2           { E.type = max(E1.type, E2.type)
                          if compatible(E1.type, E2.type)
                          else type_error }
E -> E1 [ E2 ]         { E.type = if E1.type == array(s, t)
                                     and E2.type == integer
                                   then t
                                   else type_error }

타입 변환(Type Coercion)

타입이 다른 피연산자를 자동으로 변환합니다.

타입 변환 계층 (자동 확대 변환):
  char -> int -> long -> float -> double

예: int + float -> float + float -> float

변환 규칙 (widening):
  E -> E1 + E2
    if E1.type == integer and E2.type == integer:
      E.type = integer
    else if E1.type == float and E2.type == integer:
      t = new Temp()
      emit(t "=" "intToFloat" E2.addr)
      E.type = float
    else if E1.type == integer and E2.type == float:
      t = new Temp()
      emit(t "=" "intToFloat" E1.addr)
      E.type = float

7. 제어 흐름의 번역

If-Else 문의 번역

// 소스 코드
if (x < 100) {
    x = x + 1;
} else {
    x = x - 1;
}

// 3주소 코드
    if x < 100 goto L1
    goto L2
L1: t1 = x + 1
    x = t1
    goto L3
L2: t2 = x - 1
    x = t2
L3: ...

While 문의 번역

// 소스 코드
while (x < 100) {
    x = x * 2;
}

// 3주소 코드
L1: if x < 100 goto L2
    goto L3
L2: t1 = x * 2
    x = t1
    goto L1
L3: ...

8. 불리언 표현식의 번역

단축 평가(Short-Circuit Evaluation)

대부분의 언어에서 불리언 표현식은 단축 평가를 사용합니다.

// a < b || c < d && e < f

번역:
    if a < b goto Ltrue
    if c < d goto L1
    goto Lfalse
L1: if e < f goto Ltrue
    goto Lfalse

a < b가 참이면 나머지를 평가하지 않습니다.

불리언 표현식의 SDD

B -> B1 || B2
  B1.true = B.true
  B1.false = newLabel()
  B2.true = B.true
  B2.false = B.false
  B.code = B1.code || label(B1.false) || B2.code

B -> B1 && B2
  B1.true = newLabel()
  B1.false = B.false
  B2.true = B.true
  B2.false = B.false
  B.code = B1.code || label(B1.true) || B2.code

B -> ! B1
  B1.true = B.false
  B1.false = B.true
  B.code = B1.code

B -> E1 relop E2
  B.code = gen("if" E1.addr relop E2.addr "goto" B.true)
        || gen("goto" B.false)

B -> true
  B.code = gen("goto" B.true)

B -> false
  B.code = gen("goto" B.false)

9. 백패칭(Backpatching)

백패칭은 분기 대상 레이블이 아직 결정되지 않았을 때, 나중에 채워 넣는 기법입니다.

백패칭의 핵심 아이디어

1. 분기 명령을 생성할 때 대상 주소를 비워 둠
2. 대상이 결정되면 비워둔 주소를 채워 넣음 (backpatch)

예:
  100: if x < y goto ___     <- 대상 미정
  101: goto ___              <- 대상 미정
  ...
  105: x = x + 1             <- 여기가 대상임을 알게 됨

  백패칭 후:
  100: if x < y goto 105     <- 채워 넣음
  101: goto ___              <- 아직 미정

백패칭에 사용되는 함수

makelist(i):    명령문 번호 i만을 포함하는 리스트 생성
merge(p1, p2):  두 리스트 p1, p2를 합침
backpatch(p, i): 리스트 p의 모든 명령문의 대상 주소를 i로 설정

백패칭을 사용한 불리언 표현식 번역

B -> B1 || M B2
  backpatch(B1.falselist, M.quad)
  B.truelist = merge(B1.truelist, B2.truelist)
  B.falselist = B2.falselist

B -> B1 && M B2
  backpatch(B1.truelist, M.quad)
  B.truelist = B2.truelist
  B.falselist = merge(B1.falselist, B2.falselist)

B -> ! B1
  B.truelist = B1.falselist
  B.falselist = B1.truelist

M -> epsilon
  M.quad = nextquad  (현재 명령문 번호 기록)

백패칭 예제

// a < b || c < d && e < f

생성 과정:
  100: if a < b goto ___      (B1.truelist = {100})
  101: goto ___                (B1.falselist = {101})
  102: if c < d goto ___      (B2.truelist = {102})
  103: goto ___                (B2.falselist = {103})
  104: if e < f goto ___      (B3.truelist = {104})
  105: goto ___                (B3.falselist = {105})

백패칭 결과:
  100: if a < b goto Ltrue    (truelist)
  101: goto 102               (B1.false -> B2 시작)
  102: if c < d goto 104      (B2.true -> B3 시작)
  103: goto Lfalse            (falselist)
  104: if e < f goto Ltrue    (truelist)
  105: goto Lfalse            (falselist)

최종 truelist = {100, 104}
최종 falselist = {103, 105}

10. Switch 문의 번역

// 소스 코드
switch (x) {
    case 1: stmt1; break;
    case 5: stmt2; break;
    default: stmt3;
}

// 3주소 코드
    t = x
    if t == 1 goto L1
    if t == 5 goto L2
    goto L3            // default
L1: (stmt1 코드)
    goto Lnext
L2: (stmt2 코드)
    goto Lnext
L3: (stmt3 코드)
    goto Lnext
Lnext: ...

case가 많으면 **점프 테이블(jump table)**이나 이진 탐색으로 최적화합니다.

정리

개념	핵심 내용
DAG	AST에서 공통 부분식을 공유하는 그래프
3주소 코드	명령문당 연산자 1개, 주소 최대 3개
쿼드러플	(op, arg1, arg2, result) 형식
트리플	결과를 명령문 번호로 참조
SSA	각 변수가 정확히 한 번만 정의되는 형식
타입 검사	연산자와 피연산자의 타입 호환성 검증
단축 평가	불리언 식의 부분 평가
백패칭	분기 대상을 나중에 채워 넣는 기법

중간 코드 생성은 소스 언어의 의미를 기계 독립적인 형태로 표현하는 핵심 단계입니다. 다음 글에서는 프로그램이 실행되는 환경인 실행 시간 환경을 다루겠습니다.

[Compiler] 09. Intermediate Code Generation - Three-Address Code and Type Checking

Intermediate Code Generation - Three-Address Code and Type Checking

Intermediate code generation is the phase that converts the syntax tree into a machine-independent Intermediate Representation (IR). This article covers the main IR formats, type checking, and control flow translation.

1. Syntax Tree Variants

Abstract Syntax Tree (AST)

An AST is a concise representation with unnecessary nodes removed from the parse tree.

// a + a * (b - c) + (b - c) * d

AST:
         +
        / \
       +    *
      / \  / \
     a   * -   d
        / \ / \
       a  - b   c
         / \
        b   c

Directed Acyclic Graph (DAG)

A DAG shares common subexpressions in the AST.

// a + a * (b - c) + (b - c) * d

DAG:
         +
        / \
       +    *
      / \  / \
     a   * |   d
        / \|
       a   -     <- (b-c) represented only once
          / \
         b   c

The advantages of a DAG are:

Automatic common subexpression detection: Shares identical operations to prevent redundant computation.
Memory savings: Reuses identical nodes.

DAG Construction Algorithm

Node createNode(String op, Node left, Node right) {
    // Check if a node with the same operation and children already exists
    Node existing = lookup(op, left, right);
    if (existing != null) {
        return existing;  // Reuse existing node
    }

    Node node = new Node(op, left, right);
    insert(op, left, right, node);  // Register in hash table
    return node;
}

2. Three-Address Code

Three-address code is the most widely used intermediate representation format. Each instruction has at most one operator and at most three addresses (operands and result).

Forms of Three-Address Code

Basic form: x = y op z

Main instruction types:
  x = y op z        // Binary operation
  x = op y          // Unary operation
  x = y             // Copy
  goto L            // Unconditional branch
  if x relop y goto L  // Conditional branch
  param x           // Parameter passing
  call p, n         // Function call (p: function, n: arg count)
  return y          // Function return
  x = y[i]          // Array access (index)
  x[i] = y          // Array assignment
  x = &y            // Address of
  x = *y            // Dereference
  *x = y            // Indirect assignment

Example: Three-Address Code Translation of an Expression

// Original code
a = b * -c + b * -c

// Three-address code (AST based)
t1 = minus c
t2 = b * t1
t3 = minus c
t4 = b * t3
t5 = t2 + t4
a = t5

// Three-address code (DAG based - optimized)
t1 = minus c
t2 = b * t1
t5 = t2 + t2
a = t5

3. Implementation of Three-Address Code

Quadruples

Composed of 4 fields: (op, arg1, arg2, result)

No.  | op    | arg1 | arg2 | result
----+-------+------+------+-------
 0  | minus | c    |      | t1
 1  | *     | b    | t1   | t2
 2  | minus | c    |      | t3
 3  | *     | b    | t3   | t4
 4  | +     | t2   | t4   | t5
 5  | =     | t5   |      | a

Triples

Removes the result field and references results by instruction number.

No.  | op    | arg1 | arg2
----+-------+------+------
 0  | minus | c    |
 1  | *     | b    | (0)
 2  | minus | c    |
 3  | *     | b    | (2)
 4  | +     | (1)  | (3)
 5  | =     | a    | (4)

Triples save memory but require updating references when instructions are moved.

Indirect Triples

Maintains a separate pointer list to triples, making instruction reordering easier.

Order list:      [0, 1, 2, 3, 4, 5]
Reordered:       [2, 0, 3, 1, 4, 5]  // Original triples unchanged

4. Static Single Assignment (SSA) Form

In SSA (Static Single Assignment) form, each variable is defined exactly once.

// Normal three-address code
p = a + b
q = p - c
p = q * d      // p defined twice
e = p + q

// SSA form
p1 = a + b
q1 = p1 - c
p2 = q1 * d   // New name p2
e1 = p2 + q1

Phi Functions

Phi functions are used at points where control flow merges.

// Original code
if (flag) x = 1; else x = 2;
y = x;

// SSA form
if (flag) x1 = 1; else x2 = 2;
x3 = phi(x1, x2);   // Selects based on which path was taken
y1 = x3;

Advantages of SSA

Simplifies data flow analysis: Each variable has a unique definition.
Facilitates optimization: Constant propagation, dead code elimination, etc., become intuitive.
Used as the core IR in modern compilers (LLVM, GCC).

5. Type Expressions

How types are represented in a type system.

Basic Types

boolean, char, integer, float, void

Type Constructors

1. Array: array(size, element_type)
   int a[10]  ->  array(10, integer)

2. Record/Struct: record(field_list)
   struct { int x; float y; }
   ->  record((x, integer), (y, float))

3. Pointer: pointer(type)
   int *p  ->  pointer(integer)

4. Function: parameter_types -> return_type
   int f(float, char)  ->  float x char -> integer

Type Equivalence

There are two equivalence criteria:

Structural Equivalence:
  - Two types are equivalent if their structures are the same
  - C language's basic approach

Name Equivalence:
  - Equivalent only if they have the same name
  - Pascal, Ada's approach

Example:
  type A = array(10, integer)
  type B = array(10, integer)

  Structural equivalence: A == B  (same structure)
  Name equivalence:       A != B  (different names)

6. Type Checking

SDD for Type Checking Rules

E -> num               { E.type = integer }
E -> num . num         { E.type = float }
E -> id                { E.type = lookup(id.entry) }
E -> E1 + E2           { E.type = max(E1.type, E2.type)
                          if compatible(E1.type, E2.type)
                          else type_error }
E -> E1 [ E2 ]         { E.type = if E1.type == array(s, t)
                                     and E2.type == integer
                                   then t
                                   else type_error }

Type Coercion

Automatically converts operands of different types.

Type conversion hierarchy (automatic widening):
  char -> int -> long -> float -> double

Example: int + float -> float + float -> float

Conversion rules (widening):
  E -> E1 + E2
    if E1.type == integer and E2.type == integer:
      E.type = integer
    else if E1.type == float and E2.type == integer:
      t = new Temp()
      emit(t "=" "intToFloat" E2.addr)
      E.type = float
    else if E1.type == integer and E2.type == float:
      t = new Temp()
      emit(t "=" "intToFloat" E1.addr)
      E.type = float

7. Control Flow Translation

If-Else Statement Translation

// Source code
if (x < 100) {
    x = x + 1;
} else {
    x = x - 1;
}

// Three-address code
    if x < 100 goto L1
    goto L2
L1: t1 = x + 1
    x = t1
    goto L3
L2: t2 = x - 1
    x = t2
L3: ...

While Statement Translation

// Source code
while (x < 100) {
    x = x * 2;
}

// Three-address code
L1: if x < 100 goto L2
    goto L3
L2: t1 = x * 2
    x = t1
    goto L1
L3: ...

8. Boolean Expression Translation

Short-Circuit Evaluation

Most languages use short-circuit evaluation for boolean expressions.

// a < b || c < d && e < f

Translation:
    if a < b goto Ltrue
    if c < d goto L1
    goto Lfalse
L1: if e < f goto Ltrue
    goto Lfalse

If a < b is true, the rest is not evaluated.

SDD for Boolean Expressions

B -> B1 || B2
  B1.true = B.true
  B1.false = newLabel()
  B2.true = B.true
  B2.false = B.false
  B.code = B1.code || label(B1.false) || B2.code

B -> B1 && B2
  B1.true = newLabel()
  B1.false = B.false
  B2.true = B.true
  B2.false = B.false
  B.code = B1.code || label(B1.true) || B2.code

B -> ! B1
  B1.true = B.false
  B1.false = B.true
  B.code = B1.code

B -> E1 relop E2
  B.code = gen("if" E1.addr relop E2.addr "goto" B.true)
        || gen("goto" B.false)

B -> true
  B.code = gen("goto" B.true)

B -> false
  B.code = gen("goto" B.false)

9. Backpatching

Backpatching is a technique for filling in branch target labels later when they are not yet determined.

Core Idea of Backpatching

1. Leave the target address empty when generating branch instructions
2. Fill in the empty address when the target is determined (backpatch)

Example:
  100: if x < y goto ___     <- target unknown
  101: goto ___              <- target unknown
  ...
  105: x = x + 1             <- discover this is the target

  After backpatching:
  100: if x < y goto 105     <- filled in
  101: goto ___              <- still unknown

Functions Used in Backpatching

makelist(i):    Creates a list containing only instruction number i
merge(p1, p2):  Merges two lists p1, p2
backpatch(p, i): Sets the target address of all instructions in list p to i

Boolean Expression Translation with Backpatching

B -> B1 || M B2
  backpatch(B1.falselist, M.quad)
  B.truelist = merge(B1.truelist, B2.truelist)
  B.falselist = B2.falselist

B -> B1 && M B2
  backpatch(B1.truelist, M.quad)
  B.truelist = B2.truelist
  B.falselist = merge(B1.falselist, B2.falselist)

B -> ! B1
  B.truelist = B1.falselist
  B.falselist = B1.truelist

M -> epsilon
  M.quad = nextquad  (record current instruction number)

Backpatching Example

// a < b || c < d && e < f

Generation process:
  100: if a < b goto ___      (B1.truelist = {100})
  101: goto ___                (B1.falselist = {101})
  102: if c < d goto ___      (B2.truelist = {102})
  103: goto ___                (B2.falselist = {103})
  104: if e < f goto ___      (B3.truelist = {104})
  105: goto ___                (B3.falselist = {105})

After backpatching:
  100: if a < b goto Ltrue    (truelist)
  101: goto 102               (B1.false -> B2 start)
  102: if c < d goto 104      (B2.true -> B3 start)
  103: goto Lfalse            (falselist)
  104: if e < f goto Ltrue    (truelist)
  105: goto Lfalse            (falselist)

Final truelist = {100, 104}
Final falselist = {103, 105}

10. Switch Statement Translation

// Source code
switch (x) {
    case 1: stmt1; break;
    case 5: stmt2; break;
    default: stmt3;
}

// Three-address code
    t = x
    if t == 1 goto L1
    if t == 5 goto L2
    goto L3            // default
L1: (stmt1 code)
    goto Lnext
L2: (stmt2 code)
    goto Lnext
L3: (stmt3 code)
    goto Lnext
Lnext: ...

When there are many cases, optimization is done using jump tables or binary search.

Summary

Concept	Key Content
DAG	Graph sharing common subexpressions from AST
Three-Address Code	One operator per instruction, max 3 addresses
Quadruples	(op, arg1, arg2, result) format
Triples	Results referenced by instruction number
SSA	Each variable defined exactly once
Type Checking	Verifying type compatibility of operators/operands
Short-Circuit Eval.	Partial evaluation of boolean expressions
Backpatching	Filling in branch targets later

Intermediate code generation is the core phase that represents the semantics of source language in a machine-independent form. In the next article, we will cover runtime environments where programs execute.