Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Syntax Analysis (1) - Top-Down Parsing and LL Parsers

Syntax Analysis verifies that the token stream conforms to the grammar of a programming language and generates a syntax tree. This article covers Top-Down syntax analysis.

1. Role of the Parser

Token Stream

+---------+

| Parser | <---> Symbol Table

+---------+

Parse Tree / AST

The main roles of the parser are:

1. **Syntax Verification**: Confirms that the token stream conforms to the grammar.

2. **Syntax Tree Generation**: Used for subsequent semantic analysis and code generation.

3. **Error Reporting**: Reports the location and type of syntax errors.

4. **Error Recovery**: Continues parsing after errors to find additional errors.

2. Context-Free Grammar (CFG) Review

A context-free grammar G is defined as a 4-tuple (V, T, P, S):

- **V**: Set of nonterminals

- **T**: Set of terminals

- **P**: Set of productions

- **S**: Start symbol (an element of V)

Notation Conventions

Uppercase: Nonterminals (A, B, S, E, T, F, ...)

Lowercase: Terminals (a, b, c, +, *, (, ), ...)

Greek letters: Strings of grammar symbols (alpha, beta, gamma, ...)

Formal Definition of Derivation

If `A -> gamma` is a production, then `alpha A beta` can derive `alpha gamma beta`.

E -> E + T | T

T -> T * F | F

F -> ( E ) | id

Leftmost derivation example: id + id * id

E => E + T => T + T => F + T => id + T

=> id + T * F => id + F * F => id + id * F => id + id * id

3. Grammar Refinement

There are cases where the grammar needs to be transformed for Top-Down parsing.

3.1 Ambiguity Elimination

A classic example of ambiguous grammar is the **dangling else** problem.

// Ambiguous grammar

stmt -> if expr then stmt

| if expr then stmt else stmt

| other

In `if E1 then if E2 then S1 else S2`, it is ambiguous which `if` the `else` matches.

// Disambiguated: else matches the nearest if

matched_stmt -> if expr then matched_stmt else matched_stmt

| other

unmatched_stmt -> if expr then stmt

| if expr then matched_stmt else unmatched_stmt

stmt -> matched_stmt | unmatched_stmt

3.2 Left Recursion Elimination

**Left recursion** causes a recursive descent parser to enter an infinite loop.

// Left-recursive grammar

A -> A alpha | beta

// After left recursion elimination

A -> beta A'

A' -> alpha A' | epsilon

Let us look at a concrete example:

// Arithmetic expression grammar with left recursion

E -> E + T | T

T -> T * F | F

F -> ( E ) | id

// After left recursion elimination

E -> T E'

E' -> + T E' | epsilon

T -> F T'

T' -> * F T' | epsilon

F -> ( E ) | id

Indirect Left Recursion

// Indirect left recursion

S -> A a | b

A -> S c | d

// S -> A a -> S c a (indirectly left-recursive)

Indirect left recursion can also be systematically eliminated. Assign an order to nonterminals, substitute in order to convert to direct left recursion, then eliminate it.

3.3 Left Factoring

When two productions share the same prefix, it is difficult to decide which rule to choose.

// Grammar with common prefix

A -> alpha beta1 | alpha beta2

// After left factoring

A -> alpha A'

A' -> beta1 | beta2

A concrete example:

// if-else grammar

stmt -> if expr then stmt else stmt

| if expr then stmt

// After left factoring

stmt -> if expr then stmt else_part

else_part -> else stmt | epsilon

4. Recursive Descent Parser

A recursive descent parser is a Top-Down parsing method that writes **one function** for each nonterminal.

Implementation Example

Grammar: `E -> T E'`, `E' -> + T E' | epsilon`, `T -> F T'`, `T' -> * F T' | epsilon`, `F -> ( E ) | id`

public class RecursiveDescentParser {

private Token lookahead;

private Lexer lexer;

public RecursiveDescentParser(Lexer lexer) {

this.lexer = lexer;

this.lookahead = lexer.nextToken();

}

// E -> T E'

void E() {

T();

Eprime();

}

// E' -> + T E' | epsilon

void Eprime() {

if (lookahead.type == TokenType.PLUS) {

match(TokenType.PLUS);

T();

Eprime();

}

// else: epsilon (do nothing)

}

// T -> F T'

void T() {

F();

Tprime();

}

// T' -> * F T' | epsilon

void Tprime() {

if (lookahead.type == TokenType.STAR) {

match(TokenType.STAR);

F();

Tprime();

}

// else: epsilon

}

// F -> ( E ) | id

void F() {

if (lookahead.type == TokenType.LPAREN) {

match(TokenType.LPAREN);

E();

match(TokenType.RPAREN);

} else if (lookahead.type == TokenType.ID) {

match(TokenType.ID);

} else {

throw new SyntaxError("Unexpected token: " + lookahead);

}

void match(TokenType expected) {

if (lookahead.type == expected) {

lookahead = lexer.nextToken();

} else {

throw new SyntaxError(

"Expected " + expected + ", got " + lookahead.type

);

}

5. FIRST and FOLLOW Sets

The FIRST and FOLLOW sets are needed for a predictive parser to decide which production rule to choose.

FIRST(alpha) Set

The set of **first terminals** that can be derived from `alpha`.

Computation rules:

1. If X is a terminal: FIRST(X) = {X}

2. If X -> epsilon is a rule: add epsilon to FIRST(X)

3. If X -> Y1 Y2 ... Yk:

- Add FIRST(Y1) minus epsilon to FIRST(X)

- If Y1 derives epsilon, also add FIRST(Y2) (repeat)

- If all Yi derive epsilon, add epsilon to FIRST(X)

FIRST Set Computation Example

Grammar:

E -> T E'

E' -> + T E' | epsilon

T -> F T'

T' -> * F T' | epsilon

F -> ( E ) | id

FIRST(F) = { (, id }

FIRST(T') = { *, epsilon }

FIRST(T) = FIRST(F) = { (, id }

FIRST(E') = { +, epsilon }

FIRST(E) = FIRST(T) = { (, id }

FOLLOW(A) Set

The set of **terminals** that can appear immediately after nonterminal A.

Computation rules:

1. If S is the start symbol: add $ to FOLLOW(S)

2. If A -> alpha B beta:

Add FIRST(beta) minus epsilon to FOLLOW(B)

3. If A -> alpha B, or if A -> alpha B beta

where epsilon is in FIRST(beta):

Add FOLLOW(A) to FOLLOW(B)

FOLLOW Set Computation Example

FOLLOW(E) = { ), $ }

FOLLOW(E') = FOLLOW(E) = { ), $ }

FOLLOW(T) = FIRST(E') - {epsilon} + FOLLOW(E) + FOLLOW(E')

= { +, ), $ }

FOLLOW(T') = FOLLOW(T) = { +, ), $ }

FOLLOW(F) = FIRST(T') - {epsilon} + FOLLOW(T) + FOLLOW(T')

= { *, +, ), $ }

6. LL(1) Grammar and Predictive Parsing Table

Meaning of LL(1)

- **L**: Scans input from **left to right**

- **L**: Produces a **leftmost derivation**

- **(1)**: Makes decisions based on **1 lookahead token**

Predictive Parsing Table Construction Algorithm

For each production A -> alpha:

1. For each terminal a in FIRST(alpha):

Add A -> alpha to M[A, a]

2. If epsilon is in FIRST(alpha):

For each terminal b in FOLLOW(A):

Add A -> alpha to M[A, b]

3. If epsilon is in FIRST(alpha) and $ is in FOLLOW(A):

Add A -> alpha to M[A, $]

Predictive Parsing Table Example

| id | + | * | ( | ) | $

-------+-------------+-------------+-------------+--------------+-------------+--------

E | E -> T E' | | | E -> T E' | |

E' | | E' -> +TE' | | | E' -> eps | E' -> eps

T | T -> F T' | | | T -> F T' | |

F | F -> id | | | F -> ( E ) | |

Conditions for LL(1) Grammar

For a grammar to be LL(1), for two productions `A -> alpha | beta` of the same nonterminal, the following must hold:

1. `FIRST(alpha)` and `FIRST(beta)` must **not overlap**.

2. At most one of alpha and beta can derive epsilon.

3. If beta can derive epsilon, `FIRST(alpha)` and `FOLLOW(A)` must not overlap.

7. Table-Driven Predictive Parser

Let us examine the operation of a predictive parser using a stack and parsing table.

Input buffer: id + id * id $

Stack: $ E

Parsing table: M[A, a]

Parsing Algorithm

Push $ and start symbol onto the stack

repeat:

X = stack top, a = current input token

if X == a:

Pop from stack, advance input (match)

else if X is a terminal:

Error

else if M[X, a] is empty:

Error

else if M[X, a] = X -> Y1 Y2 ... Yk:

Pop X from stack

Push Yk, ..., Y2, Y1 in order (reverse order!)

until stack is empty and input is $ (accept)

Parsing Trace Example: `id + id * id`

Stack Input Action

$ E id + id * id $ E -> T E'

$ E' T id + id * id $ T -> F T'

$ E' T' F id + id * id $ F -> id

$ E' T' id id + id * id $ match id

$ E' T' + id * id $ T' -> epsilon

$ E' + id * id $ E' -> + T E'

$ E' T + + id * id $ match +

$ E' T id * id $ T -> F T'

$ E' T' F id * id $ F -> id

$ E' T' id id * id $ match id

$ E' T' * id $ T' -> * F T'

$ E' T' F * * id $ match *

$ E' T' F id $ F -> id

$ E' T' id id $ match id

$ E' T' $ T' -> epsilon

$ E' $ E' -> epsilon

$ $ Accept!

8. Error Recovery

Recovery strategies when the parser finds a syntax error.

Panic Mode

Skips input until a specific **synchronizing token** is encountered.

Synchronizing token selection criteria:

- Use elements of FOLLOW(A) as synchronizing tokens

- Statement delimiters (semicolons, braces, etc.)

- Keywords (if, while, return, etc.)

Syntax Error Reporting Example

Input: int x = + 5;

Error message:

line 1:9 - syntax error: unexpected '+', expected expression

Can find additional errors after recovery by continuing to parse

Error Productions

Commonly occurring errors are included in the grammar in advance.

// Normal grammar

stmt -> if ( expr ) stmt

// Error production added (missing parentheses case)

stmt -> if ( expr ) stmt

| if expr stmt { error("missing parentheses") }

Summary

| Concept | Key Content |

| -------------------- | ----------------------------------------------- |

| Top-Down Parsing | Starts from the start symbol and derives input |

| Left Recursion Elim. | Converts `A -> A alpha` form to right recursion |

| Left Factoring | Separates common prefix to clarify choice |

| FIRST Set | First terminal derivable from a string |

| FOLLOW Set | Terminals that can follow a nonterminal |

| LL(1) Grammar | Parsing decision possible with 1 lookahead |

| Recursive Descent | One function per nonterminal |

| Predictive Parser | Stack and parsing table based |

Top-Down parsing is intuitive and easy to implement, but has limitations on the grammars it can express. In the next article, we will explore the more powerful Bottom-Up parsing and LR parsers.