[Compiler] 07. Syntax Analysis (2) - Bottom-Up Parsing and LR Parsers

Syntax Analysis (2) - Bottom-Up Parsing and LR Parsers

Bottom-Up parsing starts from the input string and reduces it toward the start symbol. It can handle a wider range of grammars than LL parsers and is widely used in actual compilers.

1. Bottom-Up Parsing Overview

Basic Idea

Bottom-Up parsing repeatedly performs reductions to turn the input string into the start symbol.

Input: id * id + id

Reduction process (reverse of rightmost derivation):
  id * id + id
  F  * id + id     (F -> id)
  T  * id + id     (T -> F)
  T  * F  + id     (F -> id)
  T     + id       (T -> T * F)
  E     + id       (E -> T)
  E     + F        (F -> id)
  E     + T        (T -> F)
  E                (E -> E + T)

This traces the rightmost derivation in reverse.

Handle

A handle is the combination of a reducible substring and the corresponding production rule. Finding and reducing the correct handle leads to the start symbol.

Sentential form: T * F + id
Handle: T * F (production: T -> T * F)
After reduction: T + id

Choosing the wrong handle makes it impossible to reach the start symbol. Handle selection is the core challenge of Bottom-Up parsing.

2. Shift-Reduce Parsing

A shift-reduce parser uses a stack and an input buffer.

Four Operations

Shift: Pushes the next input token onto the stack.
Reduce: Replaces the handle at the top of the stack with the nonterminal on the left side of a production.
Accept: Parsing succeeds.
Error: A syntax error is found.

Parsing Example: `id * id + id`

Stack          Input              Action
$            id * id + id $   Shift
$ id         * id + id $      Reduce (F -> id)
$ F          * id + id $      Reduce (T -> F)
$ T          * id + id $      Shift
$ T *        id + id $        Shift
$ T * id     + id $           Reduce (F -> id)
$ T * F      + id $           Reduce (T -> T * F)
$ T          + id $           Reduce (E -> T)
$ E          + id $           Shift
$ E +        id $             Shift
$ E + id     $                Reduce (F -> id)
$ E + F      $                Reduce (T -> F)
$ E + T      $                Reduce (E -> E + T)
$ E          $                Accept!

Shift-Reduce Conflicts

Sometimes the parser cannot decide whether to shift or reduce.

Shift-Reduce Conflict: Both shift and reduce are possible
Reduce-Reduce Conflict: Multiple production rules can be used for reduction

Resolving these conflicts is the core of LR parsing techniques.

3. LR Parsing Overview

LR parsers are the most powerful shift-reduce parsers.

Meaning of LR(k)

L: Scans input from left to right
R: Produces a rightmost derivation in reverse
(k): Makes decisions based on k lookahead tokens

Components of an LR Parser

Input: a1 a2 ... an $

        +---+---+---+---+
Stack:  | s0| X1| s1| X2| s2 ...
        +---+---+---+---+

Parsing table:
  ACTION[s, a] = shift/reduce/accept/error
  GOTO[s, A]   = next state

Stack: States and grammar symbols are stacked alternately.
ACTION table: Determines the action based on current state and input token.
GOTO table: Determines the state to transition to after a reduction.

LR Parsing Algorithm

Push initial state s0 onto the stack

loop:
  s = top state on stack
  a = current input symbol

  case ACTION[s, a]:
    shift sn:
      Push a and state sn onto the stack
      Advance to next input symbol

    reduce (A -> beta):
      Pop 2*|beta| symbols from the stack
      s' = new top state on stack
      Push A and GOTO[s', A] onto the stack

    accept:
      Parsing success, terminate

    error:
      Call error handling routine

4. LR(0) Items and Automaton

LR(0) Items

An LR(0) item is a production rule with a dot added, representing the parsing progress.

Production: A -> X Y Z

Possible LR(0) items:
  A -> . X Y Z    (nothing seen yet)
  A -> X . Y Z    (X has been seen)
  A -> X Y . Z    (X Y have been seen)
  A -> X Y Z .    (ready to reduce!)

Closure Operation

Computes the closure of an item set.

closure(I):
  Add all items in I to the result
  If A -> alpha . B beta is in the result:
    For all productions B -> gamma:
    Add B -> . gamma to the result (if not already present)
  Repeat until no changes

GOTO Operation

goto(I, X):
  For items A -> alpha . X beta in I,
  collect A -> alpha X . beta, then take their closure

LR(0) Automaton Construction Example

Grammar:
  E' -> E       (augmented grammar start)
  E  -> E + T
  E  -> T
  T  -> T * F
  T  -> F
  F  -> ( E )
  F  -> id

I0 = closure({E' -> . E}):
  E' -> . E
  E  -> . E + T
  E  -> . T
  T  -> . T * F
  T  -> . F
  F  -> . ( E )
  F  -> . id

goto(I0, E) = I1:
  E' -> E .
  E  -> E . + T

goto(I0, T) = I2:
  E  -> T .
  T  -> T . * F

goto(I0, F) = I3:
  T  -> F .

goto(I0, '(') = I4:
  F  -> ( . E )
  E  -> . E + T
  E  -> . T
  ...

goto(I0, id) = I5:
  F  -> id .

5. SLR Parsing Table Construction

SLR (Simple LR) parsing uses LR(0) items and FOLLOW sets.

SLR Parsing Table Construction Algorithm

1. Construct the LR(0) item set collection C for the augmented grammar

2. For item set Ii:
   a) If A -> alpha . a beta is in Ii and goto(Ii, a) = Ij:
      ACTION[i, a] = shift j

   b) If A -> alpha . is in Ii (A != S'):
      For all a in FOLLOW(A):
      ACTION[i, a] = reduce (A -> alpha)

   c) If S' -> S . is in Ii:
      ACTION[i, $] = accept

3. If goto(Ii, A) = Ij: GOTO[i, A] = j

SLR Parsing Table Example

       ACTION                          GOTO
State | id  | +   | *   | (   | )   | $ | E | T | F
-----+-----+-----+-----+-----+-----+---+---+---+---
  0  | s5  |     |     | s4  |     |   | 1 | 2 | 3
  1  |     | s6  |     |     |     | acc|   |   |
  2  |     | r2  | s7  |     | r2  | r2|   |   |
  3  |     | r4  | r4  |     | r4  | r4|   |   |
  4  | s5  |     |     | s4  |     |   | 8 | 2 | 3
  5  |     | r6  | r6  |     | r6  | r6|   |   |
  6  | s5  |     |     | s4  |     |   |   | 9 | 3
  7  | s5  |     |     | s4  |     |   |   |   | 10
  8  |     | s6  |     |     | s11 |   |   |   |
  9  |     | r1  | s7  |     | r1  | r1|   |   |
  10 |     | r3  | r3  |     | r3  | r3|   |   |
  11 |     | r5  | r5  |     | r5  | r5|   |   |

Here s5 means shift 5, r2 means reduce by production 2, and acc means accept.

6. Canonical LR(1) Items

Some grammars cannot be handled by SLR parsers. Canonical LR(1) items add a lookahead symbol for more powerful parsing.

Form of LR(1) Items

[A -> alpha . beta, a]

- A -> alpha . beta: Same as LR(0) item
- a: Lookahead symbol (consulted when reducing with this item)

LR(1) Closure

closure(I):
  If [A -> alpha . B beta, a] is in I:
    For all productions B -> gamma:
    For all b in FIRST(beta a):
    Add [B -> . gamma, b]

Pros and Cons of Canonical LR(1) Parsing

Pros: Can handle more grammars than SLR
Cons: Number of states can become very large
      (tens to hundreds of times more than LR(0))

7. LALR Parsing

LALR (Look-Ahead LR) is a compromise between the power of canonical LR(1) and the efficiency of SLR.

Core Idea

Merge item sets in LR(1) that have the same core.

LR(1) states:
  I4: {[C -> .c C, c], [C -> .c C, d], [C -> .d, c], [C -> .d, d]}
  I7: {[C -> .c C, $], [C -> .d, $]}

Same core, so merge:
  I47: {[C -> .c C, c/d/$], [C -> .d, c/d/$]}

Characteristics of LALR Parsers

Same table size as SLR with the same number of states as canonical LR(1)
More powerful than SLR, slightly weaker than canonical LR(1)
Can handle most programming languages
Yacc and Bison generate LALR parsers

Relationship Between SLR, LR(1), and LALR

LR(0) grammars < SLR grammars < LALR grammars < LR(1) grammars

Practical choices:
- Simple grammars: SLR
- Most programming languages: LALR (Yacc/Bison)
- Complex grammars: Canonical LR(1)

8. Yacc Parser Generator

Yacc (Yet Another Compiler Compiler) is a tool that automatically generates LALR(1) parsers. The GNU version, Bison, is widely used.

Yacc Program Structure

Declarations
%%
Grammar Rules and Actions
%%
Supporting C Functions

Yacc Program Example: Calculator

%{
#include <stdio.h>
#include <ctype.h>
int yylex();
void yyerror(const char *s);
%}

%token NUM

%%

line   : expr '\n'          { printf("Result: %d\n", $1); }
       ;

expr   : expr '+' term      { $$ = $1 + $3; }
       | expr '-' term      { $$ = $1 - $3; }
       | term               { $$ = $1; }
       ;

term   : term '*' factor    { $$ = $1 * $3; }
       | term '/' factor    { $$ = $1 / $3; }
       | factor             { $$ = $1; }
       ;

factor : '(' expr ')'       { $$ = $2; }
       | NUM                 { $$ = $1; }
       ;

%%

int yylex() {
    int c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        while (isdigit(c = getchar()))
            yylval = yylval * 10 + c - '0';
        ungetc(c, stdin);
        return NUM;
    }
    return c;
}

void yyerror(const char *s) {
    fprintf(stderr, "Error: %s\n", s);
}

int main() {
    return yyparse();
}

Yacc Conflict Resolution Rules

When conflicts exist in the LALR parsing table, Yacc applies the following rules:

Shift-Reduce Conflict: Defaults to shift (resolves the dangling else problem).
Reduce-Reduce Conflict: Reduces by the first listed rule.

Yacc Precedence Directives

%left '+' '-'       // Left-associative, lower precedence
%left '*' '/'       // Left-associative, higher precedence
%right UMINUS       // Right-associative, unary minus

Using these directives, ambiguous grammars can be handled without conflicts.

9. Error Recovery in LR Parsers

Panic Mode Recovery

1. Pop the stack until a state that can handle A is found
2. Skip input until a token in FOLLOW(A) is found
3. Push GOTO[s, A] and continue parsing

Yacc Error Recovery: error Token

stmt : expr ';'
     | error ';'   { yyerrok; printf("Skipping bad statement\n"); }
     ;

error is a special Yacc token that automatically matches in error situations. yyerrok declares that error recovery is complete.

Summary

Concept	Key Content
Bottom-Up Parsing	Starts from input and reduces to start symbol
Handle	Reducible substring and production rule combination
Shift-Reduce	Stack-based 4 operations (shift, reduce, accept, error)
LR(0) Items	Dot represents parsing progress
SLR	LR(0) items + FOLLOW sets
Canonical LR(1)	Items with lookahead, most powerful
LALR	Merges LR(1) states with same core, practical
Yacc/Bison	Automatic LALR(1) parser generation tool

LR parsing is a powerful technique that can efficiently handle most programming languages. In the next article, we will cover syntax-directed translation, which performs meaningful translation using parsing results.