[Compiler] 06. Syntax Analysis (1) - Top-Down Parsing and LL Parsers

Syntax Analysis (1) - Top-Down Parsing and LL Parsers

Syntax Analysis verifies that the token stream conforms to the grammar of a programming language and generates a syntax tree. This article covers Top-Down syntax analysis.

1. Role of the Parser

  Token Stream
       |
       v
  +---------+
  |  Parser  | <---> Symbol Table
  +---------+
       |
       v
  Parse Tree / AST

The main roles of the parser are:

Syntax Verification: Confirms that the token stream conforms to the grammar.
Syntax Tree Generation: Used for subsequent semantic analysis and code generation.
Error Reporting: Reports the location and type of syntax errors.
Error Recovery: Continues parsing after errors to find additional errors.

2. Context-Free Grammar (CFG) Review

A context-free grammar G is defined as a 4-tuple (V, T, P, S):

V: Set of nonterminals
T: Set of terminals
P: Set of productions
S: Start symbol (an element of V)

Notation Conventions

Uppercase:      Nonterminals (A, B, S, E, T, F, ...)
Lowercase:      Terminals (a, b, c, +, *, (, ), ...)
Greek letters:  Strings of grammar symbols (alpha, beta, gamma, ...)

Formal Definition of Derivation

If A -> gamma is a production, then alpha A beta can derive alpha gamma beta.

E -> E + T | T
T -> T * F | F
F -> ( E ) | id

Leftmost derivation example: id + id * id
  E => E + T => T + T => F + T => id + T
    => id + T * F => id + F * F => id + id * F => id + id * id

3. Grammar Refinement

There are cases where the grammar needs to be transformed for Top-Down parsing.

3.1 Ambiguity Elimination

A classic example of ambiguous grammar is the dangling else problem.

// Ambiguous grammar
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | other

In if E1 then if E2 then S1 else S2, it is ambiguous which if the else matches.

// Disambiguated: else matches the nearest if
matched_stmt   -> if expr then matched_stmt else matched_stmt
                | other
unmatched_stmt -> if expr then stmt
                | if expr then matched_stmt else unmatched_stmt
stmt           -> matched_stmt | unmatched_stmt

3.2 Left Recursion Elimination

Left recursion causes a recursive descent parser to enter an infinite loop.

// Left-recursive grammar
A -> A alpha | beta

// After left recursion elimination
A  -> beta A'
A' -> alpha A' | epsilon

Let us look at a concrete example:

// Arithmetic expression grammar with left recursion
E -> E + T | T
T -> T * F | F
F -> ( E ) | id

// After left recursion elimination
E  -> T E'
E' -> + T E' | epsilon
T  -> F T'
T' -> * F T' | epsilon
F  -> ( E ) | id

Indirect Left Recursion

// Indirect left recursion
S -> A a | b
A -> S c | d

// S -> A a -> S c a (indirectly left-recursive)

Indirect left recursion can also be systematically eliminated. Assign an order to nonterminals, substitute in order to convert to direct left recursion, then eliminate it.

3.3 Left Factoring

When two productions share the same prefix, it is difficult to decide which rule to choose.

// Grammar with common prefix
A -> alpha beta1 | alpha beta2

// After left factoring
A  -> alpha A'
A' -> beta1 | beta2

A concrete example:

// if-else grammar
stmt -> if expr then stmt else stmt
      | if expr then stmt

// After left factoring
stmt       -> if expr then stmt else_part
else_part  -> else stmt | epsilon

4. Recursive Descent Parser

A recursive descent parser is a Top-Down parsing method that writes one function for each nonterminal.

Implementation Example

Grammar: E -> T E', E' -> + T E' | epsilon, T -> F T', T' -> * F T' | epsilon, F -> ( E ) | id

public class RecursiveDescentParser {
    private Token lookahead;
    private Lexer lexer;

    public RecursiveDescentParser(Lexer lexer) {
        this.lexer = lexer;
        this.lookahead = lexer.nextToken();
    }

    // E -> T E'
    void E() {
        T();
        Eprime();
    }

    // E' -> + T E' | epsilon
    void Eprime() {
        if (lookahead.type == TokenType.PLUS) {
            match(TokenType.PLUS);
            T();
            Eprime();
        }
        // else: epsilon (do nothing)
    }

    // T -> F T'
    void T() {
        F();
        Tprime();
    }

    // T' -> * F T' | epsilon
    void Tprime() {
        if (lookahead.type == TokenType.STAR) {
            match(TokenType.STAR);
            F();
            Tprime();
        }
        // else: epsilon
    }

    // F -> ( E ) | id
    void F() {
        if (lookahead.type == TokenType.LPAREN) {
            match(TokenType.LPAREN);
            E();
            match(TokenType.RPAREN);
        } else if (lookahead.type == TokenType.ID) {
            match(TokenType.ID);
        } else {
            throw new SyntaxError("Unexpected token: " + lookahead);
        }
    }

    void match(TokenType expected) {
        if (lookahead.type == expected) {
            lookahead = lexer.nextToken();
        } else {
            throw new SyntaxError(
                "Expected " + expected + ", got " + lookahead.type
            );
        }
    }
}

5. FIRST and FOLLOW Sets

The FIRST and FOLLOW sets are needed for a predictive parser to decide which production rule to choose.

FIRST(alpha) Set

The set of first terminals that can be derived from alpha.

Computation rules:
1. If X is a terminal: FIRST(X) = {X}
2. If X -> epsilon is a rule: add epsilon to FIRST(X)
3. If X -> Y1 Y2 ... Yk:
   - Add FIRST(Y1) minus epsilon to FIRST(X)
   - If Y1 derives epsilon, also add FIRST(Y2) (repeat)
   - If all Yi derive epsilon, add epsilon to FIRST(X)

FIRST Set Computation Example

Grammar:
  E  -> T E'
  E' -> + T E' | epsilon
  T  -> F T'
  T' -> * F T' | epsilon
  F  -> ( E ) | id

FIRST(F)  = { (, id }
FIRST(T') = { *, epsilon }
FIRST(T)  = FIRST(F) = { (, id }
FIRST(E') = { +, epsilon }
FIRST(E)  = FIRST(T) = { (, id }

FOLLOW(A) Set

The set of terminals that can appear immediately after nonterminal A.

Computation rules:
1. If S is the start symbol: add $ to FOLLOW(S)
2. If A -> alpha B beta:
   Add FIRST(beta) minus epsilon to FOLLOW(B)
3. If A -> alpha B, or if A -> alpha B beta
   where epsilon is in FIRST(beta):
   Add FOLLOW(A) to FOLLOW(B)

FOLLOW Set Computation Example

FOLLOW(E)  = { ), $ }
FOLLOW(E') = FOLLOW(E) = { ), $ }
FOLLOW(T)  = FIRST(E') - {epsilon} + FOLLOW(E) + FOLLOW(E')
           = { +, ), $ }
FOLLOW(T') = FOLLOW(T) = { +, ), $ }
FOLLOW(F)  = FIRST(T') - {epsilon} + FOLLOW(T) + FOLLOW(T')
           = { *, +, ), $ }

6. LL(1) Grammar and Predictive Parsing Table

Meaning of LL(1)

L: Scans input from left to right
L: Produces a leftmost derivation
(1): Makes decisions based on 1 lookahead token

Predictive Parsing Table Construction Algorithm

For each production A -> alpha:
  1. For each terminal a in FIRST(alpha):
     Add A -> alpha to M[A, a]

  2. If epsilon is in FIRST(alpha):
     For each terminal b in FOLLOW(A):
     Add A -> alpha to M[A, b]

  3. If epsilon is in FIRST(alpha) and $ is in FOLLOW(A):
     Add A -> alpha to M[A, $]

Predictive Parsing Table Example

       |  id         |  +          |  *          |  (           |  )          |  $
-------+-------------+-------------+-------------+--------------+-------------+--------
  E    | E -> T E'   |             |             | E -> T E'    |             |
  E'   |             | E' -> +TE'  |             |              | E' -> eps   | E' -> eps
  T    | T -> F T'   |             |             | T -> F T'    |             |
  T'   |             | T' -> eps   | T' -> *FT'  |              | T' -> eps   | T' -> eps
  F    | F -> id     |             |             | F -> ( E )   |             |

Conditions for LL(1) Grammar

For a grammar to be LL(1), for two productions A -> alpha | beta of the same nonterminal, the following must hold:

FIRST(alpha) and FIRST(beta) must not overlap.
At most one of alpha and beta can derive epsilon.
If beta can derive epsilon, FIRST(alpha) and FOLLOW(A) must not overlap.

7. Table-Driven Predictive Parser

Let us examine the operation of a predictive parser using a stack and parsing table.

Input buffer: id + id * id $
Stack:        $ E

Parsing table: M[A, a]

Parsing Algorithm

Push $ and start symbol onto the stack

repeat:
  X = stack top, a = current input token

  if X == a:
    Pop from stack, advance input  (match)
  else if X is a terminal:
    Error
  else if M[X, a] is empty:
    Error
  else if M[X, a] = X -> Y1 Y2 ... Yk:
    Pop X from stack
    Push Yk, ..., Y2, Y1 in order  (reverse order!)

until stack is empty and input is $ (accept)

Parsing Trace Example: `id + id * id`

Stack           Input              Action
$ E           id + id * id $   E -> T E'
$ E' T        id + id * id $   T -> F T'
$ E' T' F     id + id * id $   F -> id
$ E' T' id    id + id * id $   match id
$ E' T'       + id * id $      T' -> epsilon
$ E'          + id * id $      E' -> + T E'
$ E' T +      + id * id $      match +
$ E' T        id * id $        T -> F T'
$ E' T' F     id * id $        F -> id
$ E' T' id    id * id $        match id
$ E' T'       * id $           T' -> * F T'
$ E' T' F *   * id $           match *
$ E' T' F     id $             F -> id
$ E' T' id    id $             match id
$ E' T'       $                T' -> epsilon
$ E'          $                E' -> epsilon
$             $                Accept!

8. Error Recovery

Recovery strategies when the parser finds a syntax error.

Panic Mode

Skips input until a specific synchronizing token is encountered.

Synchronizing token selection criteria:
- Use elements of FOLLOW(A) as synchronizing tokens
- Statement delimiters (semicolons, braces, etc.)
- Keywords (if, while, return, etc.)

Syntax Error Reporting Example

Input: int x = + 5;

Error message:
  line 1:9 - syntax error: unexpected '+', expected expression

Can find additional errors after recovery by continuing to parse

Error Productions

Commonly occurring errors are included in the grammar in advance.

// Normal grammar
stmt -> if ( expr ) stmt

// Error production added (missing parentheses case)
stmt -> if ( expr ) stmt
      | if expr stmt    { error("missing parentheses") }

Summary

Concept	Key Content
Top-Down Parsing	Starts from the start symbol and derives input
Left Recursion Elim.	Converts `A -> A alpha` form to right recursion
Left Factoring	Separates common prefix to clarify choice
FIRST Set	First terminal derivable from a string
FOLLOW Set	Terminals that can follow a nonterminal
LL(1) Grammar	Parsing decision possible with 1 lookahead
Recursive Descent	One function per nonterminal
Predictive Parser	Stack and parsing table based

Top-Down parsing is intuitive and easy to implement, but has limitations on the grammars it can express. In the next article, we will explore the more powerful Bottom-Up parsing and LR parsers.