Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Syntax Analysis (2) - Bottom-Up Parsing and LR Parsers

Bottom-Up parsing starts from the input string and reduces it toward the start symbol. It can handle a wider range of grammars than LL parsers and is widely used in actual compilers.

1. Bottom-Up Parsing Overview

Basic Idea

Bottom-Up parsing repeatedly performs **reductions** to turn the input string into the start symbol.

Input: id * id + id

Reduction process (reverse of rightmost derivation):

id * id + id

F * id + id (F -> id)

T * id + id (T -> F)

T * F + id (F -> id)

T + id (T -> T * F)

E + id (E -> T)

E + F (F -> id)

E + T (T -> F)

E (E -> E + T)

This traces the rightmost derivation in reverse.

Handle

A **handle** is the combination of a reducible substring and the corresponding production rule. Finding and reducing the correct handle leads to the start symbol.

Sentential form: T * F + id

Handle: T * F (production: T -> T * F)

After reduction: T + id

Choosing the wrong handle makes it impossible to reach the start symbol. Handle selection is the core challenge of Bottom-Up parsing.

2. Shift-Reduce Parsing

A shift-reduce parser uses a **stack** and an **input buffer**.

Four Operations

1. **Shift**: Pushes the next input token onto the stack.

2. **Reduce**: Replaces the handle at the top of the stack with the nonterminal on the left side of a production.

3. **Accept**: Parsing succeeds.

4. **Error**: A syntax error is found.

Parsing Example: `id * id + id`

Stack Input Action

$ id * id + id $ Shift

$ id * id + id $ Reduce (F -> id)

$ F * id + id $ Reduce (T -> F)

$ T * id + id $ Shift

$ T * id + id $ Reduce (F -> id)

$ T * F + id $ Reduce (T -> T * F)

$ T + id $ Reduce (E -> T)

$ E + id $ Shift

$ E + id $ Reduce (F -> id)

$ E + F $ Reduce (T -> F)

$ E + T $ Reduce (E -> E + T)

$ E $ Accept!

Shift-Reduce Conflicts

Sometimes the parser cannot decide whether to shift or reduce.

- **Shift-Reduce Conflict**: Both shift and reduce are possible

- **Reduce-Reduce Conflict**: Multiple production rules can be used for reduction

Resolving these conflicts is the core of LR parsing techniques.

3. LR Parsing Overview

LR parsers are the most powerful shift-reduce parsers.

Meaning of LR(k)

- **L**: Scans input from **left to right**

- **R**: Produces a **rightmost derivation** in reverse

- **(k)**: Makes decisions based on **k lookahead tokens**

Components of an LR Parser

Input: a1 a2 ... an $

+---+---+---+---+

Stack: | s0| X1| s1| X2| s2 ...

+---+---+---+---+

Parsing table:

ACTION[s, a] = shift/reduce/accept/error

GOTO[s, A] = next state

- **Stack**: States and grammar symbols are stacked alternately.

- **ACTION table**: Determines the action based on current state and input token.

- **GOTO table**: Determines the state to transition to after a reduction.

LR Parsing Algorithm

Push initial state s0 onto the stack

loop:

s = top state on stack

a = current input symbol

case ACTION[s, a]:

shift sn:

Push a and state sn onto the stack

Advance to next input symbol

reduce (A -> beta):

Pop 2*|beta| symbols from the stack

s' = new top state on stack

Push A and GOTO[s', A] onto the stack

accept:

Parsing success, terminate

error:

Call error handling routine

4. LR(0) Items and Automaton

LR(0) Items

An LR(0) item is a production rule with a **dot** added, representing the parsing progress.

Production: A -> X Y Z

Possible LR(0) items:

A -> . X Y Z (nothing seen yet)

A -> X . Y Z (X has been seen)

A -> X Y . Z (X Y have been seen)

A -> X Y Z . (ready to reduce!)

Closure Operation

Computes the closure of an item set.

closure(I):

Add all items in I to the result

If A -> alpha . B beta is in the result:

For all productions B -> gamma:

Add B -> . gamma to the result (if not already present)

Repeat until no changes

GOTO Operation

goto(I, X):

For items A -> alpha . X beta in I,

collect A -> alpha X . beta, then take their closure

LR(0) Automaton Construction Example

Grammar:

E' -> E (augmented grammar start)

E -> E + T

E -> T

T -> T * F

T -> F

F -> ( E )

F -> id

I0 = closure({E' -> . E}):

E' -> . E

E -> . E + T

E -> . T

T -> . T * F

T -> . F

F -> . ( E )

F -> . id

goto(I0, E) = I1:

E' -> E .

E -> E . + T

goto(I0, T) = I2:

E -> T .

T -> T . * F

goto(I0, F) = I3:

T -> F .

goto(I0, '(') = I4:

F -> ( . E )

E -> . E + T

E -> . T

...

goto(I0, id) = I5:

F -> id .

5. SLR Parsing Table Construction

**SLR (Simple LR)** parsing uses LR(0) items and FOLLOW sets.

SLR Parsing Table Construction Algorithm

1. Construct the LR(0) item set collection C for the augmented grammar

2. For item set Ii:

a) If A -> alpha . a beta is in Ii and goto(Ii, a) = Ij:

ACTION[i, a] = shift j

b) If A -> alpha . is in Ii (A != S'):

For all a in FOLLOW(A):

ACTION[i, a] = reduce (A -> alpha)

c) If S' -> S . is in Ii:

ACTION[i, $] = accept

3. If goto(Ii, A) = Ij: GOTO[i, A] = j

SLR Parsing Table Example

ACTION GOTO

State | id | + | * | ( | ) | $ | E | T | F

-----+-----+-----+-----+-----+-----+---+---+---+---

0 | s5 | | | s4 | | | 1 | 2 | 3

1 | | s6 | | | | acc| | |

2 | | r2 | s7 | | r2 | r2| | |

3 | | r4 | r4 | | r4 | r4| | |

4 | s5 | | | s4 | | | 8 | 2 | 3

5 | | r6 | r6 | | r6 | r6| | |

6 | s5 | | | s4 | | | | 9 | 3

7 | s5 | | | s4 | | | | | 10

8 | | s6 | | | s11 | | | |

9 | | r1 | s7 | | r1 | r1| | |

10 | | r3 | r3 | | r3 | r3| | |

11 | | r5 | r5 | | r5 | r5| | |

Here `s5` means shift 5, `r2` means reduce by production 2, and `acc` means accept.

6. Canonical LR(1) Items

Some grammars cannot be handled by SLR parsers. **Canonical LR(1) items** add a **lookahead symbol** for more powerful parsing.

Form of LR(1) Items

[A -> alpha . beta, a]

- A -> alpha . beta: Same as LR(0) item

- a: Lookahead symbol (consulted when reducing with this item)

LR(1) Closure

closure(I):

If [A -> alpha . B beta, a] is in I:

For all productions B -> gamma:

For all b in FIRST(beta a):

Add [B -> . gamma, b]

Pros and Cons of Canonical LR(1) Parsing

Pros: Can handle more grammars than SLR

Cons: Number of states can become very large

(tens to hundreds of times more than LR(0))

7. LALR Parsing

**LALR (Look-Ahead LR)** is a compromise between the power of canonical LR(1) and the efficiency of SLR.

Core Idea

Merge item sets in LR(1) that have the same **core**.

LR(1) states:

I4: {[C -> .c C, c], [C -> .c C, d], [C -> .d, c], [C -> .d, d]}

I7: {[C -> .c C, $], [C -> .d, $]}

Same core, so merge:

I47: {[C -> .c C, c/d/$], [C -> .d, c/d/$]}

Characteristics of LALR Parsers

- Same table size as SLR with the same number of states as canonical LR(1)

- More powerful than SLR, slightly weaker than canonical LR(1)

- Can handle most programming languages

- **Yacc** and **Bison** generate LALR parsers

Relationship Between SLR, LR(1), and LALR

LR(0) grammars < SLR grammars < LALR grammars < LR(1) grammars

Practical choices:

- Simple grammars: SLR

- Most programming languages: LALR (Yacc/Bison)

- Complex grammars: Canonical LR(1)

8. Yacc Parser Generator

**Yacc (Yet Another Compiler Compiler)** is a tool that automatically generates LALR(1) parsers. The GNU version, **Bison**, is widely used.

Yacc Program Structure

Declarations

Grammar Rules and Actions

Supporting C Functions

Yacc Program Example: Calculator

#include <stdio.h>

#include <ctype.h>

int yylex();

void yyerror(const char *s);

%token NUM

line : expr '\n' { printf("Result: %d\n", $1); }

;

expr : expr '+' term { $$ = $1 + $3; }

| expr '-' term { $$ = $1 - $3; }

| term { $$ = $1; }

;

term : term '*' factor { $$ = $1 * $3; }

| term '/' factor { $$ = $1 / $3; }

| factor { $$ = $1; }

;

factor : '(' expr ')' { $$ = $2; }

| NUM { $$ = $1; }

;

int yylex() {

int c = getchar();

if (isdigit(c)) {

yylval = c - '0';

while (isdigit(c = getchar()))

yylval = yylval * 10 + c - '0';

ungetc(c, stdin);

return NUM;

}

return c;

}

void yyerror(const char *s) {

fprintf(stderr, "Error: %s\n", s);

}

int main() {

return yyparse();

}

Yacc Conflict Resolution Rules

When conflicts exist in the LALR parsing table, Yacc applies the following rules:

1. **Shift-Reduce Conflict**: Defaults to **shift** (resolves the dangling else problem).

2. **Reduce-Reduce Conflict**: Reduces by the **first listed rule**.

Yacc Precedence Directives

%left '+' '-' // Left-associative, lower precedence

%left '*' '/' // Left-associative, higher precedence

%right UMINUS // Right-associative, unary minus

Using these directives, ambiguous grammars can be handled without conflicts.

9. Error Recovery in LR Parsers

Panic Mode Recovery

1. Pop the stack until a state that can handle A is found

2. Skip input until a token in FOLLOW(A) is found

3. Push GOTO[s, A] and continue parsing

Yacc Error Recovery: error Token

stmt : expr ';'

| error ';' { yyerrok; printf("Skipping bad statement\n"); }

;

`error` is a special Yacc token that automatically matches in error situations. `yyerrok` declares that error recovery is complete.

Summary

| Concept | Key Content |

| ----------------- | ------------------------------------------------------- |

| Bottom-Up Parsing | Starts from input and reduces to start symbol |

| Handle | Reducible substring and production rule combination |

| Shift-Reduce | Stack-based 4 operations (shift, reduce, accept, error) |

| LR(0) Items | Dot represents parsing progress |

| SLR | LR(0) items + FOLLOW sets |

| Canonical LR(1) | Items with lookahead, most powerful |

| LALR | Merges LR(1) states with same core, practical |

| Yacc/Bison | Automatic LALR(1) parser generation tool |

LR parsing is a powerful technique that can efficiently handle most programming languages. In the next article, we will cover syntax-directed translation, which performs meaningful translation using parsing results.