Syntax Analysis (2) - Bottom-Up Parsing and LR Parsers
Bottom-Up parsing starts from the input string and reduces it toward the start symbol. It can handle a wider range of grammars than LL parsers and is widely used in actual compilers.
1. Bottom-Up Parsing Overview
Basic Idea
Bottom-Up parsing repeatedly performs **reductions** to turn the input string into the start symbol.
Input: id * id + id
Reduction process (reverse of rightmost derivation):
id * id + id
F * id + id (F -> id)
T * id + id (T -> F)
T * F + id (F -> id)
T + id (T -> T * F)
E + id (E -> T)
E + F (F -> id)
E + T (T -> F)
E (E -> E + T)
This traces the rightmost derivation in reverse.
Handle
A **handle** is the combination of a reducible substring and the corresponding production rule. Finding and reducing the correct handle leads to the start symbol.
Sentential form: T * F + id
Handle: T * F (production: T -> T * F)
After reduction: T + id
Choosing the wrong handle makes it impossible to reach the start symbol. Handle selection is the core challenge of Bottom-Up parsing.
2. Shift-Reduce Parsing
A shift-reduce parser uses a **stack** and an **input buffer**.
Four Operations
1. **Shift**: Pushes the next input token onto the stack.
2. **Reduce**: Replaces the handle at the top of the stack with the nonterminal on the left side of a production.
3. **Accept**: Parsing succeeds.
4. **Error**: A syntax error is found.
Parsing Example: `id * id + id`
Stack Input Action
$ id * id + id $ Shift
$ id * id + id $ Reduce (F -> id)
$ F * id + id $ Reduce (T -> F)
$ T * id + id $ Shift
$ T * id + id $ Shift
$ T * id + id $ Reduce (F -> id)
$ T * F + id $ Reduce (T -> T * F)
$ T + id $ Reduce (E -> T)
$ E + id $ Shift
$ E + id $ Shift
$ E + id $ Reduce (F -> id)
$ E + F $ Reduce (T -> F)
$ E + T $ Reduce (E -> E + T)
$ E $ Accept!
Shift-Reduce Conflicts
Sometimes the parser cannot decide whether to shift or reduce.
- **Shift-Reduce Conflict**: Both shift and reduce are possible
- **Reduce-Reduce Conflict**: Multiple production rules can be used for reduction
Resolving these conflicts is the core of LR parsing techniques.
3. LR Parsing Overview
LR parsers are the most powerful shift-reduce parsers.
Meaning of LR(k)
- **L**: Scans input from **left to right**
- **R**: Produces a **rightmost derivation** in reverse
- **(k)**: Makes decisions based on **k lookahead tokens**
Components of an LR Parser
Input: a1 a2 ... an $
+---+---+---+---+
Stack: | s0| X1| s1| X2| s2 ...
+---+---+---+---+
Parsing table:
ACTION[s, a] = shift/reduce/accept/error
GOTO[s, A] = next state
- **Stack**: States and grammar symbols are stacked alternately.
- **ACTION table**: Determines the action based on current state and input token.
- **GOTO table**: Determines the state to transition to after a reduction.
LR Parsing Algorithm
Push initial state s0 onto the stack
loop:
s = top state on stack
a = current input symbol
case ACTION[s, a]:
shift sn:
Push a and state sn onto the stack
Advance to next input symbol
reduce (A -> beta):
Pop 2*|beta| symbols from the stack
s' = new top state on stack
Push A and GOTO[s', A] onto the stack
accept:
Parsing success, terminate
error:
Call error handling routine
4. LR(0) Items and Automaton
LR(0) Items
An LR(0) item is a production rule with a **dot** added, representing the parsing progress.
Production: A -> X Y Z
Possible LR(0) items:
A -> . X Y Z (nothing seen yet)
A -> X . Y Z (X has been seen)
A -> X Y . Z (X Y have been seen)
A -> X Y Z . (ready to reduce!)
Closure Operation
Computes the closure of an item set.
closure(I):
Add all items in I to the result
If A -> alpha . B beta is in the result:
For all productions B -> gamma:
Add B -> . gamma to the result (if not already present)
Repeat until no changes
GOTO Operation
goto(I, X):
For items A -> alpha . X beta in I,
collect A -> alpha X . beta, then take their closure
LR(0) Automaton Construction Example
Grammar:
E' -> E (augmented grammar start)
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> id
I0 = closure({E' -> . E}):
E' -> . E
E -> . E + T
E -> . T
T -> . T * F
T -> . F
F -> . ( E )
F -> . id
goto(I0, E) = I1:
E' -> E .
E -> E . + T
goto(I0, T) = I2:
E -> T .
T -> T . * F
goto(I0, F) = I3:
T -> F .
goto(I0, '(') = I4:
F -> ( . E )
E -> . E + T
E -> . T
...
goto(I0, id) = I5:
F -> id .
5. SLR Parsing Table Construction
**SLR (Simple LR)** parsing uses LR(0) items and FOLLOW sets.
SLR Parsing Table Construction Algorithm
1. Construct the LR(0) item set collection C for the augmented grammar
2. For item set Ii:
a) If A -> alpha . a beta is in Ii and goto(Ii, a) = Ij:
ACTION[i, a] = shift j
b) If A -> alpha . is in Ii (A != S'):
For all a in FOLLOW(A):
ACTION[i, a] = reduce (A -> alpha)
c) If S' -> S . is in Ii:
ACTION[i, $] = accept
3. If goto(Ii, A) = Ij: GOTO[i, A] = j
SLR Parsing Table Example
ACTION GOTO
State | id | + | * | ( | ) | $ | E | T | F
-----+-----+-----+-----+-----+-----+---+---+---+---
0 | s5 | | | s4 | | | 1 | 2 | 3
1 | | s6 | | | | acc| | |
2 | | r2 | s7 | | r2 | r2| | |
3 | | r4 | r4 | | r4 | r4| | |
4 | s5 | | | s4 | | | 8 | 2 | 3
5 | | r6 | r6 | | r6 | r6| | |
6 | s5 | | | s4 | | | | 9 | 3
7 | s5 | | | s4 | | | | | 10
8 | | s6 | | | s11 | | | |
9 | | r1 | s7 | | r1 | r1| | |
10 | | r3 | r3 | | r3 | r3| | |
11 | | r5 | r5 | | r5 | r5| | |
Here `s5` means shift 5, `r2` means reduce by production 2, and `acc` means accept.
6. Canonical LR(1) Items
Some grammars cannot be handled by SLR parsers. **Canonical LR(1) items** add a **lookahead symbol** for more powerful parsing.
Form of LR(1) Items
[A -> alpha . beta, a]
- A -> alpha . beta: Same as LR(0) item
- a: Lookahead symbol (consulted when reducing with this item)
LR(1) Closure
closure(I):
If [A -> alpha . B beta, a] is in I:
For all productions B -> gamma:
For all b in FIRST(beta a):
Add [B -> . gamma, b]
Pros and Cons of Canonical LR(1) Parsing
Pros: Can handle more grammars than SLR
Cons: Number of states can become very large
(tens to hundreds of times more than LR(0))
7. LALR Parsing
**LALR (Look-Ahead LR)** is a compromise between the power of canonical LR(1) and the efficiency of SLR.
Core Idea
Merge item sets in LR(1) that have the same **core**.
LR(1) states:
I4: {[C -> .c C, c], [C -> .c C, d], [C -> .d, c], [C -> .d, d]}
I7: {[C -> .c C, $], [C -> .d, $]}
Same core, so merge:
I47: {[C -> .c C, c/d/$], [C -> .d, c/d/$]}
Characteristics of LALR Parsers
- Same table size as SLR with the same number of states as canonical LR(1)
- More powerful than SLR, slightly weaker than canonical LR(1)
- Can handle most programming languages
- **Yacc** and **Bison** generate LALR parsers
Relationship Between SLR, LR(1), and LALR
LR(0) grammars < SLR grammars < LALR grammars < LR(1) grammars
Practical choices:
- Simple grammars: SLR
- Most programming languages: LALR (Yacc/Bison)
- Complex grammars: Canonical LR(1)
8. Yacc Parser Generator
**Yacc (Yet Another Compiler Compiler)** is a tool that automatically generates LALR(1) parsers. The GNU version, **Bison**, is widely used.
Yacc Program Structure
Declarations
%%
Grammar Rules and Actions
%%
Supporting C Functions
Yacc Program Example: Calculator
%{
#include <stdio.h>
#include <ctype.h>
int yylex();
void yyerror(const char *s);
%}
%token NUM
%%
line : expr '\n' { printf("Result: %d\n", $1); }
;
expr : expr '+' term { $$ = $1 + $3; }
| expr '-' term { $$ = $1 - $3; }
| term { $$ = $1; }
;
term : term '*' factor { $$ = $1 * $3; }
| term '/' factor { $$ = $1 / $3; }
| factor { $$ = $1; }
;
factor : '(' expr ')' { $$ = $2; }
| NUM { $$ = $1; }
;
%%
int yylex() {
int c = getchar();
if (isdigit(c)) {
yylval = c - '0';
while (isdigit(c = getchar()))
yylval = yylval * 10 + c - '0';
ungetc(c, stdin);
return NUM;
}
return c;
}
void yyerror(const char *s) {
fprintf(stderr, "Error: %s\n", s);
}
int main() {
return yyparse();
}
Yacc Conflict Resolution Rules
When conflicts exist in the LALR parsing table, Yacc applies the following rules:
1. **Shift-Reduce Conflict**: Defaults to **shift** (resolves the dangling else problem).
2. **Reduce-Reduce Conflict**: Reduces by the **first listed rule**.
Yacc Precedence Directives
%left '+' '-' // Left-associative, lower precedence
%left '*' '/' // Left-associative, higher precedence
%right UMINUS // Right-associative, unary minus
Using these directives, ambiguous grammars can be handled without conflicts.
9. Error Recovery in LR Parsers
Panic Mode Recovery
1. Pop the stack until a state that can handle A is found
2. Skip input until a token in FOLLOW(A) is found
3. Push GOTO[s, A] and continue parsing
Yacc Error Recovery: error Token
stmt : expr ';'
| error ';' { yyerrok; printf("Skipping bad statement\n"); }
;
`error` is a special Yacc token that automatically matches in error situations. `yyerrok` declares that error recovery is complete.
Summary
| Concept | Key Content |
| ----------------- | ------------------------------------------------------- |
| Bottom-Up Parsing | Starts from input and reduces to start symbol |
| Handle | Reducible substring and production rule combination |
| Shift-Reduce | Stack-based 4 operations (shift, reduce, accept, error) |
| LR(0) Items | Dot represents parsing progress |
| SLR | LR(0) items + FOLLOW sets |
| Canonical LR(1) | Items with lookahead, most powerful |
| LALR | Merges LR(1) states with same core, practical |
| Yacc/Bison | Automatic LALR(1) parser generation tool |
LR parsing is a powerful technique that can efficiently handle most programming languages. In the next article, we will cover syntax-directed translation, which performs meaningful translation using parsing results.
현재 단락 (1/244)
Bottom-Up parsing starts from the input string and reduces it toward the start symbol. It can handle...