Split View: [컴파일러] 06. 구문 분석 (1) - Top-Down 파싱과 LL 파서

[컴파일러] 06. 구문 분석 (1) - Top-Down 파싱과 LL 파서

구문 분석 (1) - Top-Down 파싱과 LL 파서

구문 분석(Syntax Analysis)은 토큰 스트림이 프로그래밍 언어의 문법에 맞는지 검증하고, 구문 트리를 생성하는 단계입니다. 이 글에서는 Top-Down 방식의 구문 분석을 다룹니다.

1. 파서의 역할

  토큰 스트림
       |
       v
  +---------+
  |  파서    | <---> 심볼 테이블
  +---------+
       |
       v
  파스 트리 / AST

파서의 주요 역할은 다음과 같습니다.

구문 검증: 토큰 스트림이 문법에 맞는지 확인합니다.
구문 트리 생성: 이후 의미 분석과 코드 생성에 사용됩니다.
에러 보고: 구문 오류의 위치와 종류를 보고합니다.
에러 복구: 오류 후에도 계속 파싱하여 추가 오류를 발견합니다.

2. 문맥 자유 문법(CFG) 복습

문맥 자유 문법 G는 4-튜플 (V, T, P, S)로 정의됩니다.

V: 논터미널의 집합
T: 터미널의 집합
P: 생성 규칙의 집합
S: 시작 기호 (V의 원소)

표기법 관례

대문자:     논터미널 (A, B, S, E, T, F, ...)
소문자:     터미널 (a, b, c, +, *, (, ), ...)
그리스 문자: 문법 기호의 문자열 (alpha, beta, gamma, ...)

유도(Derivation)의 형식적 정의

A -> gamma가 생성 규칙이면, alpha A beta에서 alpha gamma beta를 유도할 수 있습니다.

E -> E + T | T
T -> T * F | F
F -> ( E ) | id

좌측 유도 예시: id + id * id
  E => E + T => T + T => F + T => id + T
    => id + T * F => id + F * F => id + id * F => id + id * id

3. 문법 다듬기

Top-Down 파싱을 위해 문법을 변환해야 하는 경우가 있습니다.

3.1 모호성 제거

모호한 문법의 대표적인 예는 dangling else 문제입니다.

// 모호한 문법
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | other

if E1 then if E2 then S1 else S2에서 else가 어느 if에 매치되는지 모호합니다.

// 모호성 해결: else는 가장 가까운 if에 매치
matched_stmt   -> if expr then matched_stmt else matched_stmt
                | other
unmatched_stmt -> if expr then stmt
                | if expr then matched_stmt else unmatched_stmt
stmt           -> matched_stmt | unmatched_stmt

3.2 좌재귀 제거(Left Recursion Elimination)

**좌재귀(left recursion)**가 있으면 재귀 하강 파서가 무한 루프에 빠집니다.

// 좌재귀 문법
A -> A alpha | beta

// 좌재귀 제거 후
A  -> beta A'
A' -> alpha A' | epsilon

실제 예를 보겠습니다.

// 좌재귀가 있는 산술 표현식 문법
E -> E + T | T
T -> T * F | F
F -> ( E ) | id

// 좌재귀 제거 후
E  -> T E'
E' -> + T E' | epsilon
T  -> F T'
T' -> * F T' | epsilon
F  -> ( E ) | id

간접 좌재귀

// 간접 좌재귀
S -> A a | b
A -> S c | d

// S -> A a -> S c a (간접적으로 좌재귀)

간접 좌재귀도 체계적으로 제거할 수 있습니다. 논터미널에 순서를 매기고 순서대로 대입하여 직접 좌재귀로 변환한 뒤 제거합니다.

3.3 좌인수분해(Left Factoring)

두 생성 규칙이 같은 접두사를 공유할 때, 어떤 규칙을 선택할지 결정하기 어렵습니다.

// 공통 접두사가 있는 문법
A -> alpha beta1 | alpha beta2

// 좌인수분해 후
A  -> alpha A'
A' -> beta1 | beta2

실제 예를 보겠습니다.

// if-else 문법
stmt -> if expr then stmt else stmt
      | if expr then stmt

// 좌인수분해 후
stmt       -> if expr then stmt else_part
else_part  -> else stmt | epsilon

4. 재귀 하강 파서(Recursive Descent Parser)

재귀 하강 파서는 각 논터미널에 대해 하나의 함수를 작성하는 Top-Down 파싱 방법입니다.

구현 예제

문법: E -> T E', E' -> + T E' | epsilon, T -> F T', T' -> * F T' | epsilon, F -> ( E ) | id

public class RecursiveDescentParser {
    private Token lookahead;
    private Lexer lexer;

    public RecursiveDescentParser(Lexer lexer) {
        this.lexer = lexer;
        this.lookahead = lexer.nextToken();
    }

    // E -> T E'
    void E() {
        T();
        Eprime();
    }

    // E' -> + T E' | epsilon
    void Eprime() {
        if (lookahead.type == TokenType.PLUS) {
            match(TokenType.PLUS);
            T();
            Eprime();
        }
        // else: epsilon (아무것도 하지 않음)
    }

    // T -> F T'
    void T() {
        F();
        Tprime();
    }

    // T' -> * F T' | epsilon
    void Tprime() {
        if (lookahead.type == TokenType.STAR) {
            match(TokenType.STAR);
            F();
            Tprime();
        }
        // else: epsilon
    }

    // F -> ( E ) | id
    void F() {
        if (lookahead.type == TokenType.LPAREN) {
            match(TokenType.LPAREN);
            E();
            match(TokenType.RPAREN);
        } else if (lookahead.type == TokenType.ID) {
            match(TokenType.ID);
        } else {
            throw new SyntaxError("Unexpected token: " + lookahead);
        }
    }

    void match(TokenType expected) {
        if (lookahead.type == expected) {
            lookahead = lexer.nextToken();
        } else {
            throw new SyntaxError(
                "Expected " + expected + ", got " + lookahead.type
            );
        }
    }
}

5. FIRST 집합과 FOLLOW 집합

예측 파서가 어떤 생성 규칙을 선택할지 결정하기 위해 FIRST와 FOLLOW 집합이 필요합니다.

FIRST(alpha) 집합

alpha에서 유도될 수 있는 문자열의 첫 번째 터미널의 집합입니다.

계산 규칙:
1. X가 터미널이면: FIRST(X) = {X}
2. X -> epsilon이 규칙이면: epsilon을 FIRST(X)에 추가
3. X -> Y1 Y2 ... Yk이면:
   - FIRST(Y1)에서 epsilon을 제외한 것을 FIRST(X)에 추가
   - Y1이 epsilon을 유도하면, FIRST(Y2)도 추가 (반복)
   - 모든 Yi가 epsilon을 유도하면, epsilon도 FIRST(X)에 추가

FIRST 집합 계산 예제

문법:
  E  -> T E'
  E' -> + T E' | epsilon
  T  -> F T'
  T' -> * F T' | epsilon
  F  -> ( E ) | id

FIRST(F)  = { (, id }
FIRST(T') = { *, epsilon }
FIRST(T)  = FIRST(F) = { (, id }
FIRST(E') = { +, epsilon }
FIRST(E)  = FIRST(T) = { (, id }

FOLLOW(A) 집합

논터미널 A 바로 뒤에 나올 수 있는 터미널의 집합입니다.

계산 규칙:
1. S가 시작 기호이면: $를 FOLLOW(S)에 추가
2. A -> alpha B beta이면:
   FIRST(beta)에서 epsilon을 제외한 것을 FOLLOW(B)에 추가
3. A -> alpha B이면, 또는 A -> alpha B beta에서
   epsilon이 FIRST(beta)에 포함되면:
   FOLLOW(A)를 FOLLOW(B)에 추가

FOLLOW 집합 계산 예제

FOLLOW(E)  = { ), $ }
FOLLOW(E') = FOLLOW(E) = { ), $ }
FOLLOW(T)  = FIRST(E') - {epsilon} + FOLLOW(E) + FOLLOW(E')
           = { +, ), $ }
FOLLOW(T') = FOLLOW(T) = { +, ), $ }
FOLLOW(F)  = FIRST(T') - {epsilon} + FOLLOW(T) + FOLLOW(T')
           = { *, +, ), $ }

6. LL(1) 문법과 예측 파싱 테이블

LL(1)의 의미

L: 입력을 왼쪽에서 오른쪽으로 스캔
L: 좌측 유도를 생성
(1): 1개의 선행 토큰(lookahead)을 보고 결정

예측 파싱 테이블 구성 알고리즘

각 생성 규칙 A -> alpha에 대해:
  1. FIRST(alpha)의 각 터미널 a에 대해:
     M[A, a]에 A -> alpha를 추가

  2. epsilon이 FIRST(alpha)에 포함되면:
     FOLLOW(A)의 각 터미널 b에 대해:
     M[A, b]에 A -> alpha를 추가

  3. epsilon이 FIRST(alpha)에 포함되고 $가 FOLLOW(A)에 포함되면:
     M[A, $]에 A -> alpha를 추가

예측 파싱 테이블 예제

       |  id         |  +          |  *          |  (           |  )          |  $
-------+-------------+-------------+-------------+--------------+-------------+--------
  E    | E -> T E'   |             |             | E -> T E'    |             |
  E'   |             | E' -> +TE'  |             |              | E' -> eps   | E' -> eps
  T    | T -> F T'   |             |             | T -> F T'    |             |
  T'   |             | T' -> eps   | T' -> *FT'  |              | T' -> eps   | T' -> eps
  F    | F -> id     |             |             | F -> ( E )   |             |

LL(1) 문법의 조건

문법이 LL(1)이려면 같은 논터미널의 두 생성 규칙 A -> alpha | beta에 대해 다음이 성립해야 합니다.

FIRST(alpha)와 FIRST(beta)가 겹치지 않아야 합니다.
alpha와 beta 중 최대 하나만 epsilon을 유도할 수 있습니다.
beta가 epsilon을 유도할 수 있으면, FIRST(alpha)와 FOLLOW(A)가 겹치지 않아야 합니다.

7. 테이블 기반 예측 파서

스택과 파싱 테이블을 사용하는 예측 파서의 동작을 살펴보겠습니다.

입력 버퍼: id + id * id $
스택:      $ E

파싱 테이블: M[A, a]

파싱 알고리즘

스택에 $와 시작 기호를 push

repeat:
  X = 스택 탑, a = 현재 입력 토큰

  if X == a:
    스택에서 pop, 입력 전진  (match)
  else if X가 터미널:
    에러
  else if M[X, a]가 비어있음:
    에러
  else if M[X, a] = X -> Y1 Y2 ... Yk:
    스택에서 X를 pop
    Yk, ..., Y2, Y1을 순서대로 push  (역순!)

until 스택이 비고 입력이 $ 일 때 (수락)

파싱 추적 예제: `id + id * id`

스택           입력              동작
$ E           id + id * id $   E -> T E'
$ E' T        id + id * id $   T -> F T'
$ E' T' F     id + id * id $   F -> id
$ E' T' id    id + id * id $   match id
$ E' T'       + id * id $      T' -> epsilon
$ E'          + id * id $      E' -> + T E'
$ E' T +      + id * id $      match +
$ E' T        id * id $        T -> F T'
$ E' T' F     id * id $        F -> id
$ E' T' id    id * id $        match id
$ E' T'       * id $           T' -> * F T'
$ E' T' F *   * id $           match *
$ E' T' F     id $             F -> id
$ E' T' id    id $             match id
$ E' T'       $                T' -> epsilon
$ E'          $                E' -> epsilon
$             $                수락!

8. 에러 복구

파서가 구문 오류를 발견했을 때의 복구 전략입니다.

패닉 모드(Panic Mode)

특정 **동기화 토큰(synchronizing token)**을 만날 때까지 입력을 건너뜁니다.

동기화 토큰 선택 기준:
- FOLLOW(A)의 원소를 동기화 토큰으로 사용
- 문장 구분자 (세미콜론, 중괄호 등)
- 키워드 (if, while, return 등)

구문 에러 보고 예시

입력: int x = + 5;

에러 메시지:
  line 1:9 - syntax error: unexpected '+', expected expression

복구 후 계속 파싱하여 추가 에러 발견 가능

에러 프로덕션(Error Production)

자주 발생하는 오류를 문법에 미리 포함시킵니다.

// 정상 문법
stmt -> if ( expr ) stmt

// 에러 프로덕션 추가 (괄호 빠뜨린 경우)
stmt -> if ( expr ) stmt
      | if expr stmt    { error("missing parentheses") }

정리

개념	핵심 내용
Top-Down 파싱	시작 기호에서 출발하여 입력을 유도
좌재귀 제거	`A -> A alpha` 형태를 우재귀로 변환
좌인수분해	공통 접두사를 분리하여 선택을 명확하게
FIRST 집합	문자열에서 유도 가능한 첫 번째 터미널
FOLLOW 집합	논터미널 뒤에 올 수 있는 터미널
LL(1) 문법	1개의 lookahead로 파싱 결정 가능
재귀 하강 파서	각 논터미널에 대해 함수 하나 작성
예측 파서	스택과 파싱 테이블 기반

Top-Down 파싱은 직관적이고 구현이 쉽지만, 표현할 수 있는 문법에 제한이 있습니다. 다음 글에서는 더 강력한 Bottom-Up 파싱과 LR 파서를 살펴보겠습니다.

[Compiler] 06. Syntax Analysis (1) - Top-Down Parsing and LL Parsers

Syntax Analysis (1) - Top-Down Parsing and LL Parsers

Syntax Analysis verifies that the token stream conforms to the grammar of a programming language and generates a syntax tree. This article covers Top-Down syntax analysis.

1. Role of the Parser

  Token Stream
       |
       v
  +---------+
  |  Parser  | <---> Symbol Table
  +---------+
       |
       v
  Parse Tree / AST

The main roles of the parser are:

Syntax Verification: Confirms that the token stream conforms to the grammar.
Syntax Tree Generation: Used for subsequent semantic analysis and code generation.
Error Reporting: Reports the location and type of syntax errors.
Error Recovery: Continues parsing after errors to find additional errors.

2. Context-Free Grammar (CFG) Review

A context-free grammar G is defined as a 4-tuple (V, T, P, S):

V: Set of nonterminals
T: Set of terminals
P: Set of productions
S: Start symbol (an element of V)

Notation Conventions

Uppercase:      Nonterminals (A, B, S, E, T, F, ...)
Lowercase:      Terminals (a, b, c, +, *, (, ), ...)
Greek letters:  Strings of grammar symbols (alpha, beta, gamma, ...)

Formal Definition of Derivation

If A -> gamma is a production, then alpha A beta can derive alpha gamma beta.

E -> E + T | T
T -> T * F | F
F -> ( E ) | id

Leftmost derivation example: id + id * id
  E => E + T => T + T => F + T => id + T
    => id + T * F => id + F * F => id + id * F => id + id * id

3. Grammar Refinement

There are cases where the grammar needs to be transformed for Top-Down parsing.

3.1 Ambiguity Elimination

A classic example of ambiguous grammar is the dangling else problem.

// Ambiguous grammar
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | other

In if E1 then if E2 then S1 else S2, it is ambiguous which if the else matches.

// Disambiguated: else matches the nearest if
matched_stmt   -> if expr then matched_stmt else matched_stmt
                | other
unmatched_stmt -> if expr then stmt
                | if expr then matched_stmt else unmatched_stmt
stmt           -> matched_stmt | unmatched_stmt

3.2 Left Recursion Elimination

Left recursion causes a recursive descent parser to enter an infinite loop.

// Left-recursive grammar
A -> A alpha | beta

// After left recursion elimination
A  -> beta A'
A' -> alpha A' | epsilon

Let us look at a concrete example:

// Arithmetic expression grammar with left recursion
E -> E + T | T
T -> T * F | F
F -> ( E ) | id

// After left recursion elimination
E  -> T E'
E' -> + T E' | epsilon
T  -> F T'
T' -> * F T' | epsilon
F  -> ( E ) | id

Indirect Left Recursion

// Indirect left recursion
S -> A a | b
A -> S c | d

// S -> A a -> S c a (indirectly left-recursive)

Indirect left recursion can also be systematically eliminated. Assign an order to nonterminals, substitute in order to convert to direct left recursion, then eliminate it.

3.3 Left Factoring

When two productions share the same prefix, it is difficult to decide which rule to choose.

// Grammar with common prefix
A -> alpha beta1 | alpha beta2

// After left factoring
A  -> alpha A'
A' -> beta1 | beta2

A concrete example:

// if-else grammar
stmt -> if expr then stmt else stmt
      | if expr then stmt

// After left factoring
stmt       -> if expr then stmt else_part
else_part  -> else stmt | epsilon

4. Recursive Descent Parser

A recursive descent parser is a Top-Down parsing method that writes one function for each nonterminal.

Implementation Example

Grammar: E -> T E', E' -> + T E' | epsilon, T -> F T', T' -> * F T' | epsilon, F -> ( E ) | id

public class RecursiveDescentParser {
    private Token lookahead;
    private Lexer lexer;

    public RecursiveDescentParser(Lexer lexer) {
        this.lexer = lexer;
        this.lookahead = lexer.nextToken();
    }

    // E -> T E'
    void E() {
        T();
        Eprime();
    }

    // E' -> + T E' | epsilon
    void Eprime() {
        if (lookahead.type == TokenType.PLUS) {
            match(TokenType.PLUS);
            T();
            Eprime();
        }
        // else: epsilon (do nothing)
    }

    // T -> F T'
    void T() {
        F();
        Tprime();
    }

    // T' -> * F T' | epsilon
    void Tprime() {
        if (lookahead.type == TokenType.STAR) {
            match(TokenType.STAR);
            F();
            Tprime();
        }
        // else: epsilon
    }

    // F -> ( E ) | id
    void F() {
        if (lookahead.type == TokenType.LPAREN) {
            match(TokenType.LPAREN);
            E();
            match(TokenType.RPAREN);
        } else if (lookahead.type == TokenType.ID) {
            match(TokenType.ID);
        } else {
            throw new SyntaxError("Unexpected token: " + lookahead);
        }
    }

    void match(TokenType expected) {
        if (lookahead.type == expected) {
            lookahead = lexer.nextToken();
        } else {
            throw new SyntaxError(
                "Expected " + expected + ", got " + lookahead.type
            );
        }
    }
}

5. FIRST and FOLLOW Sets

The FIRST and FOLLOW sets are needed for a predictive parser to decide which production rule to choose.

FIRST(alpha) Set

The set of first terminals that can be derived from alpha.

Computation rules:
1. If X is a terminal: FIRST(X) = {X}
2. If X -> epsilon is a rule: add epsilon to FIRST(X)
3. If X -> Y1 Y2 ... Yk:
   - Add FIRST(Y1) minus epsilon to FIRST(X)
   - If Y1 derives epsilon, also add FIRST(Y2) (repeat)
   - If all Yi derive epsilon, add epsilon to FIRST(X)

FIRST Set Computation Example

Grammar:
  E  -> T E'
  E' -> + T E' | epsilon
  T  -> F T'
  T' -> * F T' | epsilon
  F  -> ( E ) | id

FIRST(F)  = { (, id }
FIRST(T') = { *, epsilon }
FIRST(T)  = FIRST(F) = { (, id }
FIRST(E') = { +, epsilon }
FIRST(E)  = FIRST(T) = { (, id }

FOLLOW(A) Set

The set of terminals that can appear immediately after nonterminal A.

Computation rules:
1. If S is the start symbol: add $ to FOLLOW(S)
2. If A -> alpha B beta:
   Add FIRST(beta) minus epsilon to FOLLOW(B)
3. If A -> alpha B, or if A -> alpha B beta
   where epsilon is in FIRST(beta):
   Add FOLLOW(A) to FOLLOW(B)

FOLLOW Set Computation Example

FOLLOW(E)  = { ), $ }
FOLLOW(E') = FOLLOW(E) = { ), $ }
FOLLOW(T)  = FIRST(E') - {epsilon} + FOLLOW(E) + FOLLOW(E')
           = { +, ), $ }
FOLLOW(T') = FOLLOW(T) = { +, ), $ }
FOLLOW(F)  = FIRST(T') - {epsilon} + FOLLOW(T) + FOLLOW(T')
           = { *, +, ), $ }

6. LL(1) Grammar and Predictive Parsing Table

Meaning of LL(1)

L: Scans input from left to right
L: Produces a leftmost derivation
(1): Makes decisions based on 1 lookahead token

Predictive Parsing Table Construction Algorithm

For each production A -> alpha:
  1. For each terminal a in FIRST(alpha):
     Add A -> alpha to M[A, a]

  2. If epsilon is in FIRST(alpha):
     For each terminal b in FOLLOW(A):
     Add A -> alpha to M[A, b]

  3. If epsilon is in FIRST(alpha) and $ is in FOLLOW(A):
     Add A -> alpha to M[A, $]

Predictive Parsing Table Example

       |  id         |  +          |  *          |  (           |  )          |  $
-------+-------------+-------------+-------------+--------------+-------------+--------
  E    | E -> T E'   |             |             | E -> T E'    |             |
  E'   |             | E' -> +TE'  |             |              | E' -> eps   | E' -> eps
  T    | T -> F T'   |             |             | T -> F T'    |             |
  T'   |             | T' -> eps   | T' -> *FT'  |              | T' -> eps   | T' -> eps
  F    | F -> id     |             |             | F -> ( E )   |             |

Conditions for LL(1) Grammar

For a grammar to be LL(1), for two productions A -> alpha | beta of the same nonterminal, the following must hold:

FIRST(alpha) and FIRST(beta) must not overlap.
At most one of alpha and beta can derive epsilon.
If beta can derive epsilon, FIRST(alpha) and FOLLOW(A) must not overlap.

7. Table-Driven Predictive Parser

Let us examine the operation of a predictive parser using a stack and parsing table.

Input buffer: id + id * id $
Stack:        $ E

Parsing table: M[A, a]

Parsing Algorithm

Push $ and start symbol onto the stack

repeat:
  X = stack top, a = current input token

  if X == a:
    Pop from stack, advance input  (match)
  else if X is a terminal:
    Error
  else if M[X, a] is empty:
    Error
  else if M[X, a] = X -> Y1 Y2 ... Yk:
    Pop X from stack
    Push Yk, ..., Y2, Y1 in order  (reverse order!)

until stack is empty and input is $ (accept)

Parsing Trace Example: `id + id * id`

Stack           Input              Action
$ E           id + id * id $   E -> T E'
$ E' T        id + id * id $   T -> F T'
$ E' T' F     id + id * id $   F -> id
$ E' T' id    id + id * id $   match id
$ E' T'       + id * id $      T' -> epsilon
$ E'          + id * id $      E' -> + T E'
$ E' T +      + id * id $      match +
$ E' T        id * id $        T -> F T'
$ E' T' F     id * id $        F -> id
$ E' T' id    id * id $        match id
$ E' T'       * id $           T' -> * F T'
$ E' T' F *   * id $           match *
$ E' T' F     id $             F -> id
$ E' T' id    id $             match id
$ E' T'       $                T' -> epsilon
$ E'          $                E' -> epsilon
$             $                Accept!

8. Error Recovery

Recovery strategies when the parser finds a syntax error.

Panic Mode

Skips input until a specific synchronizing token is encountered.

Synchronizing token selection criteria:
- Use elements of FOLLOW(A) as synchronizing tokens
- Statement delimiters (semicolons, braces, etc.)
- Keywords (if, while, return, etc.)

Syntax Error Reporting Example

Input: int x = + 5;

Error message:
  line 1:9 - syntax error: unexpected '+', expected expression

Can find additional errors after recovery by continuing to parse

Error Productions

Commonly occurring errors are included in the grammar in advance.

// Normal grammar
stmt -> if ( expr ) stmt

// Error production added (missing parentheses case)
stmt -> if ( expr ) stmt
      | if expr stmt    { error("missing parentheses") }

Summary

Concept	Key Content
Top-Down Parsing	Starts from the start symbol and derives input
Left Recursion Elim.	Converts `A -> A alpha` form to right recursion
Left Factoring	Separates common prefix to clarify choice
FIRST Set	First terminal derivable from a string
FOLLOW Set	Terminals that can follow a nonterminal
LL(1) Grammar	Parsing decision possible with 1 lookahead
Recursive Descent	One function per nonterminal
Predictive Parser	Stack and parsing table based

Top-Down parsing is intuitive and easy to implement, but has limitations on the grammars it can express. In the next article, we will explore the more powerful Bottom-Up parsing and LR parsers.