- Authors

- Name
- Youngju Kim
- @fjvbn20031
Syntax Analysis (1) - Top-Down Parsing and LL Parsers
Syntax Analysis verifies that the token stream conforms to the grammar of a programming language and generates a syntax tree. This article covers Top-Down syntax analysis.
1. Role of the Parser
Token Stream
|
v
+---------+
| Parser | <---> Symbol Table
+---------+
|
v
Parse Tree / AST
The main roles of the parser are:
- Syntax Verification: Confirms that the token stream conforms to the grammar.
- Syntax Tree Generation: Used for subsequent semantic analysis and code generation.
- Error Reporting: Reports the location and type of syntax errors.
- Error Recovery: Continues parsing after errors to find additional errors.
2. Context-Free Grammar (CFG) Review
A context-free grammar G is defined as a 4-tuple (V, T, P, S):
- V: Set of nonterminals
- T: Set of terminals
- P: Set of productions
- S: Start symbol (an element of V)
Notation Conventions
Uppercase: Nonterminals (A, B, S, E, T, F, ...)
Lowercase: Terminals (a, b, c, +, *, (, ), ...)
Greek letters: Strings of grammar symbols (alpha, beta, gamma, ...)
Formal Definition of Derivation
If A -> gamma is a production, then alpha A beta can derive alpha gamma beta.
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
Leftmost derivation example: id + id * id
E => E + T => T + T => F + T => id + T
=> id + T * F => id + F * F => id + id * F => id + id * id
3. Grammar Refinement
There are cases where the grammar needs to be transformed for Top-Down parsing.
3.1 Ambiguity Elimination
A classic example of ambiguous grammar is the dangling else problem.
// Ambiguous grammar
stmt -> if expr then stmt
| if expr then stmt else stmt
| other
In if E1 then if E2 then S1 else S2, it is ambiguous which if the else matches.
// Disambiguated: else matches the nearest if
matched_stmt -> if expr then matched_stmt else matched_stmt
| other
unmatched_stmt -> if expr then stmt
| if expr then matched_stmt else unmatched_stmt
stmt -> matched_stmt | unmatched_stmt
3.2 Left Recursion Elimination
Left recursion causes a recursive descent parser to enter an infinite loop.
// Left-recursive grammar
A -> A alpha | beta
// After left recursion elimination
A -> beta A'
A' -> alpha A' | epsilon
Let us look at a concrete example:
// Arithmetic expression grammar with left recursion
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
// After left recursion elimination
E -> T E'
E' -> + T E' | epsilon
T -> F T'
T' -> * F T' | epsilon
F -> ( E ) | id
Indirect Left Recursion
// Indirect left recursion
S -> A a | b
A -> S c | d
// S -> A a -> S c a (indirectly left-recursive)
Indirect left recursion can also be systematically eliminated. Assign an order to nonterminals, substitute in order to convert to direct left recursion, then eliminate it.
3.3 Left Factoring
When two productions share the same prefix, it is difficult to decide which rule to choose.
// Grammar with common prefix
A -> alpha beta1 | alpha beta2
// After left factoring
A -> alpha A'
A' -> beta1 | beta2
A concrete example:
// if-else grammar
stmt -> if expr then stmt else stmt
| if expr then stmt
// After left factoring
stmt -> if expr then stmt else_part
else_part -> else stmt | epsilon
4. Recursive Descent Parser
A recursive descent parser is a Top-Down parsing method that writes one function for each nonterminal.
Implementation Example
Grammar: E -> T E', E' -> + T E' | epsilon, T -> F T', T' -> * F T' | epsilon, F -> ( E ) | id
public class RecursiveDescentParser {
private Token lookahead;
private Lexer lexer;
public RecursiveDescentParser(Lexer lexer) {
this.lexer = lexer;
this.lookahead = lexer.nextToken();
}
// E -> T E'
void E() {
T();
Eprime();
}
// E' -> + T E' | epsilon
void Eprime() {
if (lookahead.type == TokenType.PLUS) {
match(TokenType.PLUS);
T();
Eprime();
}
// else: epsilon (do nothing)
}
// T -> F T'
void T() {
F();
Tprime();
}
// T' -> * F T' | epsilon
void Tprime() {
if (lookahead.type == TokenType.STAR) {
match(TokenType.STAR);
F();
Tprime();
}
// else: epsilon
}
// F -> ( E ) | id
void F() {
if (lookahead.type == TokenType.LPAREN) {
match(TokenType.LPAREN);
E();
match(TokenType.RPAREN);
} else if (lookahead.type == TokenType.ID) {
match(TokenType.ID);
} else {
throw new SyntaxError("Unexpected token: " + lookahead);
}
}
void match(TokenType expected) {
if (lookahead.type == expected) {
lookahead = lexer.nextToken();
} else {
throw new SyntaxError(
"Expected " + expected + ", got " + lookahead.type
);
}
}
}
5. FIRST and FOLLOW Sets
The FIRST and FOLLOW sets are needed for a predictive parser to decide which production rule to choose.
FIRST(alpha) Set
The set of first terminals that can be derived from alpha.
Computation rules:
1. If X is a terminal: FIRST(X) = {X}
2. If X -> epsilon is a rule: add epsilon to FIRST(X)
3. If X -> Y1 Y2 ... Yk:
- Add FIRST(Y1) minus epsilon to FIRST(X)
- If Y1 derives epsilon, also add FIRST(Y2) (repeat)
- If all Yi derive epsilon, add epsilon to FIRST(X)
FIRST Set Computation Example
Grammar:
E -> T E'
E' -> + T E' | epsilon
T -> F T'
T' -> * F T' | epsilon
F -> ( E ) | id
FIRST(F) = { (, id }
FIRST(T') = { *, epsilon }
FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, epsilon }
FIRST(E) = FIRST(T) = { (, id }
FOLLOW(A) Set
The set of terminals that can appear immediately after nonterminal A.
Computation rules:
1. If S is the start symbol: add $ to FOLLOW(S)
2. If A -> alpha B beta:
Add FIRST(beta) minus epsilon to FOLLOW(B)
3. If A -> alpha B, or if A -> alpha B beta
where epsilon is in FIRST(beta):
Add FOLLOW(A) to FOLLOW(B)
FOLLOW Set Computation Example
FOLLOW(E) = { ), $ }
FOLLOW(E') = FOLLOW(E) = { ), $ }
FOLLOW(T) = FIRST(E') - {epsilon} + FOLLOW(E) + FOLLOW(E')
= { +, ), $ }
FOLLOW(T') = FOLLOW(T) = { +, ), $ }
FOLLOW(F) = FIRST(T') - {epsilon} + FOLLOW(T) + FOLLOW(T')
= { *, +, ), $ }
6. LL(1) Grammar and Predictive Parsing Table
Meaning of LL(1)
- L: Scans input from left to right
- L: Produces a leftmost derivation
- (1): Makes decisions based on 1 lookahead token
Predictive Parsing Table Construction Algorithm
For each production A -> alpha:
1. For each terminal a in FIRST(alpha):
Add A -> alpha to M[A, a]
2. If epsilon is in FIRST(alpha):
For each terminal b in FOLLOW(A):
Add A -> alpha to M[A, b]
3. If epsilon is in FIRST(alpha) and $ is in FOLLOW(A):
Add A -> alpha to M[A, $]
Predictive Parsing Table Example
| id | + | * | ( | ) | $
-------+-------------+-------------+-------------+--------------+-------------+--------
E | E -> T E' | | | E -> T E' | |
E' | | E' -> +TE' | | | E' -> eps | E' -> eps
T | T -> F T' | | | T -> F T' | |
T' | | T' -> eps | T' -> *FT' | | T' -> eps | T' -> eps
F | F -> id | | | F -> ( E ) | |
Conditions for LL(1) Grammar
For a grammar to be LL(1), for two productions A -> alpha | beta of the same nonterminal, the following must hold:
FIRST(alpha)andFIRST(beta)must not overlap.- At most one of alpha and beta can derive epsilon.
- If beta can derive epsilon,
FIRST(alpha)andFOLLOW(A)must not overlap.
7. Table-Driven Predictive Parser
Let us examine the operation of a predictive parser using a stack and parsing table.
Input buffer: id + id * id $
Stack: $ E
Parsing table: M[A, a]
Parsing Algorithm
Push $ and start symbol onto the stack
repeat:
X = stack top, a = current input token
if X == a:
Pop from stack, advance input (match)
else if X is a terminal:
Error
else if M[X, a] is empty:
Error
else if M[X, a] = X -> Y1 Y2 ... Yk:
Pop X from stack
Push Yk, ..., Y2, Y1 in order (reverse order!)
until stack is empty and input is $ (accept)
Parsing Trace Example: id + id * id
Stack Input Action
$ E id + id * id $ E -> T E'
$ E' T id + id * id $ T -> F T'
$ E' T' F id + id * id $ F -> id
$ E' T' id id + id * id $ match id
$ E' T' + id * id $ T' -> epsilon
$ E' + id * id $ E' -> + T E'
$ E' T + + id * id $ match +
$ E' T id * id $ T -> F T'
$ E' T' F id * id $ F -> id
$ E' T' id id * id $ match id
$ E' T' * id $ T' -> * F T'
$ E' T' F * * id $ match *
$ E' T' F id $ F -> id
$ E' T' id id $ match id
$ E' T' $ T' -> epsilon
$ E' $ E' -> epsilon
$ $ Accept!
8. Error Recovery
Recovery strategies when the parser finds a syntax error.
Panic Mode
Skips input until a specific synchronizing token is encountered.
Synchronizing token selection criteria:
- Use elements of FOLLOW(A) as synchronizing tokens
- Statement delimiters (semicolons, braces, etc.)
- Keywords (if, while, return, etc.)
Syntax Error Reporting Example
Input: int x = + 5;
Error message:
line 1:9 - syntax error: unexpected '+', expected expression
Can find additional errors after recovery by continuing to parse
Error Productions
Commonly occurring errors are included in the grammar in advance.
// Normal grammar
stmt -> if ( expr ) stmt
// Error production added (missing parentheses case)
stmt -> if ( expr ) stmt
| if expr stmt { error("missing parentheses") }
Summary
| Concept | Key Content |
|---|---|
| Top-Down Parsing | Starts from the start symbol and derives input |
| Left Recursion Elim. | Converts A -> A alpha form to right recursion |
| Left Factoring | Separates common prefix to clarify choice |
| FIRST Set | First terminal derivable from a string |
| FOLLOW Set | Terminals that can follow a nonterminal |
| LL(1) Grammar | Parsing decision possible with 1 lookahead |
| Recursive Descent | One function per nonterminal |
| Predictive Parser | Stack and parsing table based |
Top-Down parsing is intuitive and easy to implement, but has limitations on the grammars it can express. In the next article, we will explore the more powerful Bottom-Up parsing and LR parsers.