Split View: [컴파일러] 03. 간단한 구문 지시 번역기 만들기

[컴파일러] 03. 간단한 구문 지시 번역기 만들기

간단한 구문 지시 번역기 만들기

이 글에서는 컴파일러의 핵심 원리를 이해하기 위해 간단한 번역기를 만들어 봅니다. 문법 정의부터 시작하여 파스 트리, 구문 지시 번역의 기초를 다룹니다.

1. 구문 정의(Syntax Definition)

프로그래밍 언어의 구문은 **문맥 자유 문법(Context-Free Grammar, CFG)**으로 정의합니다.

문법의 구성 요소

문법은 4가지 요소로 이루어집니다.

터미널(Terminal): 토큰이라고도 하며, 언어의 기본 단위입니다 (예: +, *, id, num).
논터미널(Nonterminal): 구문 구조를 나타내는 변수입니다 (예: expr, stmt).
생성 규칙(Production): 논터미널이 어떻게 구성되는지를 정의합니다.
시작 기호(Start Symbol): 전체 프로그램을 나타내는 논터미널입니다.

문법 예시: 산술 표현식

expr   -> expr + term | expr - term | term
term   -> term * factor | term / factor | factor
factor -> ( expr ) | num | id

이 문법은 덧셈/뺄셈보다 곱셈/나눗셈의 우선순위가 높음을 표현합니다.

2. 유도(Derivation)

유도란 시작 기호에서 출발하여 생성 규칙을 반복 적용하여 문자열을 생성하는 과정입니다.

예제: `9 - 5 + 2`의 유도

문법:
  list -> list + digit | list - digit | digit
  digit -> 0 | 1 | 2 | ... | 9

좌측 유도(Leftmost Derivation):
  list
  => list + digit
  => list - digit + digit
  => digit - digit + digit
  => 9 - digit + digit
  => 9 - 5 + digit
  => 9 - 5 + 2

유도의 종류

좌측 유도(Leftmost Derivation): 항상 가장 왼쪽 논터미널을 먼저 대체합니다.
우측 유도(Rightmost Derivation): 항상 가장 오른쪽 논터미널을 먼저 대체합니다.

3. 파스 트리(Parse Tree)

파스 트리는 유도 과정을 트리 형태로 표현한 것입니다.

9 - 5 + 2에 대한 파스 트리를 살펴보겠습니다.

         list
        / | \
       /  |  \
     list '+' digit
    / | \       |
   /  |  \      2
 list '-' digit
  |         |
digit       5
  |
  9

파스 트리의 특징은 다음과 같습니다.

루트: 시작 기호
내부 노드: 논터미널 (생성 규칙의 적용)
리프 노드: 터미널 (실제 토큰)
리프 노드를 왼쪽에서 오른쪽으로 읽으면 원래 입력이 됩니다.

4. 모호성(Ambiguity)

하나의 문자열에 대해 두 개 이상의 파스 트리가 존재하면, 그 문법은 **모호하다(ambiguous)**고 합니다.

모호한 문법의 예

string -> string + string | string - string | 0 | 1 | ... | 9

이 문법으로 9 - 5 + 2를 파싱하면 두 가지 해석이 가능합니다.

해석 1: (9 - 5) + 2 = 6    해석 2: 9 - (5 + 2) = 2

     string                    string
    / | \                     / | \
   /  |  \                   /  |  \
 string + string          string - string
 / | \      |               |      / | \
/  |  \     2               9     /  |  \
string - string              string + string
  |        |                   |        |
  9        5                   5        2

모호성 해결

모호성은 **연산자 우선순위(precedence)**와 **결합법칙(associativity)**을 문법에 반영하여 해결합니다.

// 모호한 문법
expr -> expr + expr | expr * expr | num

// 모호성을 해결한 문법 (곱셈 우선, 왼쪽 결합)
expr   -> expr + term | term
term   -> term * factor | factor
factor -> num

5. 연산자 우선순위와 결합법칙

우선순위(Precedence)

문법의 계층 구조로 표현합니다. 낮은 우선순위 연산자가 트리의 더 높은 곳에 위치합니다.

우선순위 (낮음 -> 높음):
  ||  (논리 OR)
  &&  (논리 AND)
  ==  !=
  <   >   <=  >=
  +   -
  *   /   %
  !   -   (단항)

결합법칙(Associativity)

같은 우선순위의 연산자가 연속될 때 어떤 순서로 결합하는지를 정의합니다.

왼쪽 결합(Left-Associative):
  a - b - c = (a - b) - c
  문법: expr -> expr - term | term

오른쪽 결합(Right-Associative):
  a = b = c  는  a = (b = c)
  문법: right -> letter = right | letter

대부분의 이항 연산자는 왼쪽 결합이고, 대입 연산자(=)와 지수 연산자(**)는 오른쪽 결합입니다.

6. 구문 지시 번역(Syntax-Directed Translation) 기초

구문 지시 번역은 문법의 각 생성 규칙에 **의미 동작(semantic action)**을 부착하여, 파싱과 동시에 번역을 수행하는 기법입니다.

속성(Attribute)

문법 기호에 연관된 값을 속성이라 합니다. 속성에는 두 종류가 있습니다.

합성 속성(Synthesized Attribute): 자식 노드의 속성으로부터 계산됩니다.
상속 속성(Inherited Attribute): 부모나 형제 노드의 속성으로부터 계산됩니다.

구문 지시 정의(Syntax-Directed Definition)

각 생성 규칙에 **의미 규칙(semantic rule)**을 연결합니다.

생성 규칙                    의미 규칙
----------------------------------------------
expr -> expr1 + term        expr.val = expr1.val + term.val
expr -> term                expr.val = term.val
term -> num                 term.val = num.lexval

예를 들어, 9 + 5를 평가하면 다음과 같습니다.

         expr        (val = 14)
        / | \
       /  |  \
     expr + term     (val = 9, val = 5)
      |        |
    term      num
      |       (5)
    num
    (9)

7. 후위 표기법(Postfix Notation)

후위 표기법(역폴란드 표기법)은 연산자가 피연산자 뒤에 오는 표현 방식입니다.

중위 표기법 (Infix)        후위 표기법 (Postfix)
-------------------------------------------
9 - 5 + 2              9 5 - 2 +
9 - (5 + 2)            9 5 2 + -
a * b + c              a b * c +
a + b * c              a b c * +

후위 표기법의 장점은 다음과 같습니다.

괄호가 필요 없습니다: 연산 순서가 표기 자체에 내장됩니다.
스택으로 쉽게 평가할 수 있습니다.

후위 표기법 평가 알고리즘

입력: 9 5 - 2 +

스택 변화:
  9 읽음 -> [9]
  5 읽음 -> [9, 5]
  - 읽음 -> [4]       (9 - 5 = 4)
  2 읽음 -> [4, 2]
  + 읽음 -> [6]       (4 + 2 = 6)

결과: 6

8. 중위를 후위로 번역하는 구문 지시 정의

중위 표현식을 후위 표현식으로 번역하는 구문 지시 정의를 작성해 보겠습니다.

생성 규칙                    의미 동작
----------------------------------------------
expr -> expr1 + term        print('+')
expr -> expr1 - term        print('-')
expr -> term                (없음)
term -> 0                   print('0')
term -> 1                   print('1')
...
term -> 9                   print('9')

9 - 5 + 2에 대해 이 번역 체계를 적용하면 9 5 - 2 +가 출력됩니다.

9. 간단한 번역기 구현

재귀 하강(recursive descent) 방식으로 위의 번역기를 Java로 구현해 보겠습니다.

public class PostfixTranslator {
    private String input;
    private int pos;
    private char lookahead;

    public PostfixTranslator(String input) {
        this.input = input;
        this.pos = 0;
        this.lookahead = input.charAt(0);
    }

    // expr -> term rest
    // rest -> + term { print('+') } rest
    //       | - term { print('-') } rest
    //       | (epsilon)
    void expr() {
        term();
        while (true) {
            if (lookahead == '+') {
                match('+');
                term();
                System.out.print('+');
            } else if (lookahead == '-') {
                match('-');
                term();
                System.out.print('-');
            } else {
                return;
            }
        }
    }

    // term -> 0 { print('0') } | 1 { print('1') } | ... | 9 { print('9') }
    void term() {
        if (Character.isDigit(lookahead)) {
            System.out.print(lookahead);
            match(lookahead);
        } else {
            throw new RuntimeException("Syntax error");
        }
    }

    void match(char expected) {
        if (lookahead == expected) {
            pos++;
            if (pos < input.length()) {
                lookahead = input.charAt(pos);
            } else {
                lookahead = '\0'; // 입력 끝
            }
        } else {
            throw new RuntimeException("Expected " + expected);
        }
    }

    public static void main(String[] args) {
        PostfixTranslator translator = new PostfixTranslator("9-5+2");
        translator.expr();
        // 출력: 95-2+
    }
}

실행 흐름 추적

입력: 9-5+2

expr() 호출
  term() -> '9' 출력, lookahead = '-'
  lookahead == '-':
    match('-'), lookahead = '5'
    term() -> '5' 출력, lookahead = '+'
    '+' 는 아니므로 '-' 출력
  lookahead == '+':
    match('+'), lookahead = '2'
    term() -> '2' 출력, lookahead = '\0'
    '+' 출력
  lookahead == '\0': return

최종 출력: 95-2+

10. 추상 구문 트리(AST) 구성

번역기는 후위 표기법 대신 **추상 구문 트리(Abstract Syntax Tree, AST)**를 생성할 수도 있습니다.

AST는 파스 트리를 간소화한 것으로, 불필요한 중간 노드를 제거합니다.

파스 트리:                    AST:
      expr                      +
     / | \                     / \
   expr + term                -   2
  / | \     |                / \
expr - term num:2           9   5
  |      |
term   num:5
  |
num:9

AST 노드 구현

abstract class ASTNode {
    abstract int eval();
}

class NumNode extends ASTNode {
    int value;
    NumNode(int value) { this.value = value; }
    int eval() { return value; }
}

class BinOpNode extends ASTNode {
    char op;
    ASTNode left, right;

    BinOpNode(char op, ASTNode left, ASTNode right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }

    int eval() {
        switch (op) {
            case '+': return left.eval() + right.eval();
            case '-': return left.eval() - right.eval();
            case '*': return left.eval() * right.eval();
            default: throw new RuntimeException("Unknown op: " + op);
        }
    }
}

정리

개념	핵심 내용
문맥 자유 문법	터미널, 논터미널, 생성 규칙, 시작 기호로 구성
유도	시작 기호에서 생성 규칙을 적용하여 문자열 생성
파스 트리	유도 과정의 트리 표현
모호성	하나의 문자열에 두 개 이상의 파스 트리가 존재
우선순위/결합법칙	문법 계층과 재귀 방향으로 표현
구문 지시 번역	생성 규칙에 의미 동작을 부착하여 번역
후위 표기법	연산자가 피연산자 뒤에 오는 표기법
합성 속성	자식 노드로부터 계산되는 속성

이 글에서 만든 간단한 번역기는 컴파일러의 핵심 원리를 보여주는 축소판입니다. 다음 글에서는 어휘 분석 단계를 더 깊이 살펴보겠습니다.

[Compiler] 03. Building a Simple Syntax-Directed Translator

Building a Simple Syntax-Directed Translator

In this article, we will build a simple translator to understand the core principles of compilers. Starting from grammar definition, we cover parse trees and the basics of syntax-directed translation.

1. Syntax Definition

The syntax of a programming language is defined using Context-Free Grammar (CFG).

Components of a Grammar

A grammar consists of four elements:

Terminal: Also called tokens, these are the basic units of the language (e.g., +, *, id, num).
Nonterminal: Variables that represent syntactic constructs (e.g., expr, stmt).
Production: Defines how a nonterminal is composed.
Start Symbol: The nonterminal that represents the entire program.

Grammar Example: Arithmetic Expressions

expr   -> expr + term | expr - term | term
term   -> term * factor | term / factor | factor
factor -> ( expr ) | num | id

This grammar expresses that multiplication/division has higher precedence than addition/subtraction.

2. Derivation

Derivation is the process of starting from the start symbol and repeatedly applying production rules to generate a string.

Example: Derivation of `9 - 5 + 2`

Grammar:
  list -> list + digit | list - digit | digit
  digit -> 0 | 1 | 2 | ... | 9

Leftmost Derivation:
  list
  => list + digit
  => list - digit + digit
  => digit - digit + digit
  => 9 - digit + digit
  => 9 - 5 + digit
  => 9 - 5 + 2

Types of Derivation

Leftmost Derivation: Always replaces the leftmost nonterminal first.
Rightmost Derivation: Always replaces the rightmost nonterminal first.

3. Parse Tree

A parse tree represents the derivation process in tree form.

Let us examine the parse tree for 9 - 5 + 2:

         list
        / | \
       /  |  \
     list '+' digit
    / | \       |
   /  |  \      2
 list '-' digit
  |         |
digit       5
  |
  9

The characteristics of a parse tree are:

Root: Start symbol
Internal nodes: Nonterminals (application of production rules)
Leaf nodes: Terminals (actual tokens)
Reading the leaf nodes from left to right yields the original input.

4. Ambiguity

If two or more parse trees exist for a single string, the grammar is said to be ambiguous.

Example of an Ambiguous Grammar

string -> string + string | string - string | 0 | 1 | ... | 9

Parsing 9 - 5 + 2 with this grammar allows two interpretations:

Interpretation 1: (9 - 5) + 2 = 6    Interpretation 2: 9 - (5 + 2) = 2

     string                    string
    / | \                     / | \
   /  |  \                   /  |  \
 string + string          string - string
 / | \      |               |      / | \
/  |  \     2               9     /  |  \
string - string              string + string
  |        |                   |        |
  9        5                   5        2

Resolving Ambiguity

Ambiguity is resolved by reflecting operator precedence and associativity in the grammar.

// Ambiguous grammar
expr -> expr + expr | expr * expr | num

// Disambiguated grammar (multiplication first, left-associative)
expr   -> expr + term | term
term   -> term * factor | factor
factor -> num

5. Operator Precedence and Associativity

Precedence

Expressed through the hierarchical structure of the grammar. Lower precedence operators are placed higher in the tree.

Precedence (low -> high):
  ||  (logical OR)
  &&  (logical AND)
  ==  !=
  <   >   <=  >=
  +   -
  *   /   %
  !   -   (unary)

Associativity

Defines the order in which operators of the same precedence are combined when they appear consecutively.

Left-Associative:
  a - b - c = (a - b) - c
  Grammar: expr -> expr - term | term

Right-Associative:
  a = b = c  is  a = (b = c)
  Grammar: right -> letter = right | letter

Most binary operators are left-associative, while the assignment operator (=) and exponentiation operator (**) are right-associative.

6. Syntax-Directed Translation Basics

Syntax-directed translation is a technique that attaches semantic actions to each production rule of the grammar, performing translation simultaneously with parsing.

Attributes

Values associated with grammar symbols are called attributes. There are two types:

Synthesized Attribute: Computed from the attributes of child nodes.
Inherited Attribute: Computed from the attributes of parent or sibling nodes.

Syntax-Directed Definition

Links semantic rules to each production rule.

Production                   Semantic Rule
----------------------------------------------
expr -> expr1 + term        expr.val = expr1.val + term.val
expr -> term                expr.val = term.val
term -> num                 term.val = num.lexval

For example, evaluating 9 + 5 yields:

         expr        (val = 14)
        / | \
       /  |  \
     expr + term     (val = 9, val = 5)
      |        |
    term      num
      |       (5)
    num
    (9)

7. Postfix Notation

Postfix notation (Reverse Polish Notation) is a representation where the operator comes after its operands.

Infix Notation              Postfix Notation
-------------------------------------------
9 - 5 + 2              9 5 - 2 +
9 - (5 + 2)            9 5 2 + -
a * b + c              a b * c +
a + b * c              a b c * +

The advantages of postfix notation are:

No parentheses needed: The order of operations is embedded in the notation itself.
Easy evaluation with a stack.

Postfix Notation Evaluation Algorithm

Input: 9 5 - 2 +

Stack changes:
  Read 9 -> [9]
  Read 5 -> [9, 5]
  Read - -> [4]       (9 - 5 = 4)
  Read 2 -> [4, 2]
  Read + -> [6]       (4 + 2 = 6)

Result: 6

8. Syntax-Directed Definition for Infix-to-Postfix Translation

Let us write a syntax-directed definition that translates infix expressions to postfix expressions.

Production                   Semantic Action
----------------------------------------------
expr -> expr1 + term        print('+')
expr -> expr1 - term        print('-')
expr -> term                (none)
term -> 0                   print('0')
term -> 1                   print('1')
...
term -> 9                   print('9')

Applying this translation scheme to 9 - 5 + 2 outputs 9 5 - 2 +.

9. Implementing a Simple Translator

Let us implement the above translator in Java using the recursive descent approach.

public class PostfixTranslator {
    private String input;
    private int pos;
    private char lookahead;

    public PostfixTranslator(String input) {
        this.input = input;
        this.pos = 0;
        this.lookahead = input.charAt(0);
    }

    // expr -> term rest
    // rest -> + term { print('+') } rest
    //       | - term { print('-') } rest
    //       | (epsilon)
    void expr() {
        term();
        while (true) {
            if (lookahead == '+') {
                match('+');
                term();
                System.out.print('+');
            } else if (lookahead == '-') {
                match('-');
                term();
                System.out.print('-');
            } else {
                return;
            }
        }
    }

    // term -> 0 { print('0') } | 1 { print('1') } | ... | 9 { print('9') }
    void term() {
        if (Character.isDigit(lookahead)) {
            System.out.print(lookahead);
            match(lookahead);
        } else {
            throw new RuntimeException("Syntax error");
        }
    }

    void match(char expected) {
        if (lookahead == expected) {
            pos++;
            if (pos < input.length()) {
                lookahead = input.charAt(pos);
            } else {
                lookahead = '\0'; // End of input
            }
        } else {
            throw new RuntimeException("Expected " + expected);
        }
    }

    public static void main(String[] args) {
        PostfixTranslator translator = new PostfixTranslator("9-5+2");
        translator.expr();
        // Output: 95-2+
    }
}

Execution Trace

Input: 9-5+2

expr() called
  term() -> prints '9', lookahead = '-'
  lookahead == '-':
    match('-'), lookahead = '5'
    term() -> prints '5', lookahead = '+'
    not '+', so prints '-'
  lookahead == '+':
    match('+'), lookahead = '2'
    term() -> prints '2', lookahead = '\0'
    prints '+'
  lookahead == '\0': return

Final output: 95-2+

10. Constructing Abstract Syntax Trees (AST)

Instead of postfix notation, a translator can also generate an Abstract Syntax Tree (AST).

An AST is a simplified version of the parse tree, removing unnecessary intermediate nodes.

Parse Tree:                    AST:
      expr                      +
     / | \                     / \
   expr + term                -   2
  / | \     |                / \
expr - term num:2           9   5
  |      |
term   num:5
  |
num:9

AST Node Implementation

abstract class ASTNode {
    abstract int eval();
}

class NumNode extends ASTNode {
    int value;
    NumNode(int value) { this.value = value; }
    int eval() { return value; }
}

class BinOpNode extends ASTNode {
    char op;
    ASTNode left, right;

    BinOpNode(char op, ASTNode left, ASTNode right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }

    int eval() {
        switch (op) {
            case '+': return left.eval() + right.eval();
            case '-': return left.eval() - right.eval();
            case '*': return left.eval() * right.eval();
            default: throw new RuntimeException("Unknown op: " + op);
        }
    }
}

Summary

Concept	Key Content
Context-Free Grammar	Composed of terminals, nonterminals, productions, start symbol
Derivation	Generating strings by applying production rules from the start symbol
Parse Tree	Tree representation of the derivation process
Ambiguity	Two or more parse trees exist for a single string
Precedence/Associativity	Expressed through grammar hierarchy and recursion direction
Syntax-Directed Translation	Attaching semantic actions to production rules for translation
Postfix Notation	Notation where operators follow their operands
Synthesized Attribute	Attributes computed from child nodes

The simple translator built in this article is a miniature that demonstrates the core principles of compilers. In the next article, we will take a deeper look at the lexical analysis phase.