[Compiler] 03. Building a Simple Syntax-Directed Translator

Building a Simple Syntax-Directed Translator

In this article, we will build a simple translator to understand the core principles of compilers. Starting from grammar definition, we cover parse trees and the basics of syntax-directed translation.

1. Syntax Definition

The syntax of a programming language is defined using Context-Free Grammar (CFG).

Components of a Grammar

A grammar consists of four elements:

Terminal: Also called tokens, these are the basic units of the language (e.g., +, *, id, num).
Nonterminal: Variables that represent syntactic constructs (e.g., expr, stmt).
Production: Defines how a nonterminal is composed.
Start Symbol: The nonterminal that represents the entire program.

Grammar Example: Arithmetic Expressions

expr   -> expr + term | expr - term | term
term   -> term * factor | term / factor | factor
factor -> ( expr ) | num | id

This grammar expresses that multiplication/division has higher precedence than addition/subtraction.

2. Derivation

Derivation is the process of starting from the start symbol and repeatedly applying production rules to generate a string.

Example: Derivation of `9 - 5 + 2`

Grammar:
  list -> list + digit | list - digit | digit
  digit -> 0 | 1 | 2 | ... | 9

Leftmost Derivation:
  list
  => list + digit
  => list - digit + digit
  => digit - digit + digit
  => 9 - digit + digit
  => 9 - 5 + digit
  => 9 - 5 + 2

Types of Derivation

Leftmost Derivation: Always replaces the leftmost nonterminal first.
Rightmost Derivation: Always replaces the rightmost nonterminal first.

3. Parse Tree

A parse tree represents the derivation process in tree form.

Let us examine the parse tree for 9 - 5 + 2:

         list
        / | \
       /  |  \
     list '+' digit
    / | \       |
   /  |  \      2
 list '-' digit
  |         |
digit       5
  |
  9

The characteristics of a parse tree are:

Root: Start symbol
Internal nodes: Nonterminals (application of production rules)
Leaf nodes: Terminals (actual tokens)
Reading the leaf nodes from left to right yields the original input.

4. Ambiguity

If two or more parse trees exist for a single string, the grammar is said to be ambiguous.

Example of an Ambiguous Grammar

string -> string + string | string - string | 0 | 1 | ... | 9

Parsing 9 - 5 + 2 with this grammar allows two interpretations:

Interpretation 1: (9 - 5) + 2 = 6    Interpretation 2: 9 - (5 + 2) = 2

     string                    string
    / | \                     / | \
   /  |  \                   /  |  \
 string + string          string - string
 / | \      |               |      / | \
/  |  \     2               9     /  |  \
string - string              string + string
  |        |                   |        |
  9        5                   5        2

Resolving Ambiguity

Ambiguity is resolved by reflecting operator precedence and associativity in the grammar.

// Ambiguous grammar
expr -> expr + expr | expr * expr | num

// Disambiguated grammar (multiplication first, left-associative)
expr   -> expr + term | term
term   -> term * factor | factor
factor -> num

5. Operator Precedence and Associativity

Precedence

Expressed through the hierarchical structure of the grammar. Lower precedence operators are placed higher in the tree.

Precedence (low -> high):
  ||  (logical OR)
  &&  (logical AND)
  ==  !=
  <   >   <=  >=
  +   -
  *   /   %
  !   -   (unary)

Associativity

Defines the order in which operators of the same precedence are combined when they appear consecutively.

Left-Associative:
  a - b - c = (a - b) - c
  Grammar: expr -> expr - term | term

Right-Associative:
  a = b = c  is  a = (b = c)
  Grammar: right -> letter = right | letter

Most binary operators are left-associative, while the assignment operator (=) and exponentiation operator (**) are right-associative.

6. Syntax-Directed Translation Basics

Syntax-directed translation is a technique that attaches semantic actions to each production rule of the grammar, performing translation simultaneously with parsing.

Attributes

Values associated with grammar symbols are called attributes. There are two types:

Synthesized Attribute: Computed from the attributes of child nodes.
Inherited Attribute: Computed from the attributes of parent or sibling nodes.

Syntax-Directed Definition

Links semantic rules to each production rule.

Production                   Semantic Rule
----------------------------------------------
expr -> expr1 + term        expr.val = expr1.val + term.val
expr -> term                expr.val = term.val
term -> num                 term.val = num.lexval

For example, evaluating 9 + 5 yields:

         expr        (val = 14)
        / | \
       /  |  \
     expr + term     (val = 9, val = 5)
      |        |
    term      num
      |       (5)
    num
    (9)

7. Postfix Notation

Postfix notation (Reverse Polish Notation) is a representation where the operator comes after its operands.

Infix Notation              Postfix Notation
-------------------------------------------
9 - 5 + 2              9 5 - 2 +
9 - (5 + 2)            9 5 2 + -
a * b + c              a b * c +
a + b * c              a b c * +

The advantages of postfix notation are:

No parentheses needed: The order of operations is embedded in the notation itself.
Easy evaluation with a stack.

Postfix Notation Evaluation Algorithm

Input: 9 5 - 2 +

Stack changes:
  Read 9 -> [9]
  Read 5 -> [9, 5]
  Read - -> [4]       (9 - 5 = 4)
  Read 2 -> [4, 2]
  Read + -> [6]       (4 + 2 = 6)

Result: 6

8. Syntax-Directed Definition for Infix-to-Postfix Translation

Let us write a syntax-directed definition that translates infix expressions to postfix expressions.

Production                   Semantic Action
----------------------------------------------
expr -> expr1 + term        print('+')
expr -> expr1 - term        print('-')
expr -> term                (none)
term -> 0                   print('0')
term -> 1                   print('1')
...
term -> 9                   print('9')

Applying this translation scheme to 9 - 5 + 2 outputs 9 5 - 2 +.

9. Implementing a Simple Translator

Let us implement the above translator in Java using the recursive descent approach.

public class PostfixTranslator {
    private String input;
    private int pos;
    private char lookahead;

    public PostfixTranslator(String input) {
        this.input = input;
        this.pos = 0;
        this.lookahead = input.charAt(0);
    }

    // expr -> term rest
    // rest -> + term { print('+') } rest
    //       | - term { print('-') } rest
    //       | (epsilon)
    void expr() {
        term();
        while (true) {
            if (lookahead == '+') {
                match('+');
                term();
                System.out.print('+');
            } else if (lookahead == '-') {
                match('-');
                term();
                System.out.print('-');
            } else {
                return;
            }
        }
    }

    // term -> 0 { print('0') } | 1 { print('1') } | ... | 9 { print('9') }
    void term() {
        if (Character.isDigit(lookahead)) {
            System.out.print(lookahead);
            match(lookahead);
        } else {
            throw new RuntimeException("Syntax error");
        }
    }

    void match(char expected) {
        if (lookahead == expected) {
            pos++;
            if (pos < input.length()) {
                lookahead = input.charAt(pos);
            } else {
                lookahead = '\0'; // End of input
            }
        } else {
            throw new RuntimeException("Expected " + expected);
        }
    }

    public static void main(String[] args) {
        PostfixTranslator translator = new PostfixTranslator("9-5+2");
        translator.expr();
        // Output: 95-2+
    }
}

Execution Trace

Input: 9-5+2

expr() called
  term() -> prints '9', lookahead = '-'
  lookahead == '-':
    match('-'), lookahead = '5'
    term() -> prints '5', lookahead = '+'
    not '+', so prints '-'
  lookahead == '+':
    match('+'), lookahead = '2'
    term() -> prints '2', lookahead = '\0'
    prints '+'
  lookahead == '\0': return

Final output: 95-2+

10. Constructing Abstract Syntax Trees (AST)

Instead of postfix notation, a translator can also generate an Abstract Syntax Tree (AST).

An AST is a simplified version of the parse tree, removing unnecessary intermediate nodes.

Parse Tree:                    AST:
      expr                      +
     / | \                     / \
   expr + term                -   2
  / | \     |                / \
expr - term num:2           9   5
  |      |
term   num:5
  |
num:9

AST Node Implementation

abstract class ASTNode {
    abstract int eval();
}

class NumNode extends ASTNode {
    int value;
    NumNode(int value) { this.value = value; }
    int eval() { return value; }
}

class BinOpNode extends ASTNode {
    char op;
    ASTNode left, right;

    BinOpNode(char op, ASTNode left, ASTNode right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }

    int eval() {
        switch (op) {
            case '+': return left.eval() + right.eval();
            case '-': return left.eval() - right.eval();
            case '*': return left.eval() * right.eval();
            default: throw new RuntimeException("Unknown op: " + op);
        }
    }
}

Summary

Concept	Key Content
Context-Free Grammar	Composed of terminals, nonterminals, productions, start symbol
Derivation	Generating strings by applying production rules from the start symbol
Parse Tree	Tree representation of the derivation process
Ambiguity	Two or more parse trees exist for a single string
Precedence/Associativity	Expressed through grammar hierarchy and recursion direction
Syntax-Directed Translation	Attaching semantic actions to production rules for translation
Postfix Notation	Notation where operators follow their operands
Synthesized Attribute	Attributes computed from child nodes

The simple translator built in this article is a miniature that demonstrates the core principles of compilers. In the next article, we will take a deeper look at the lexical analysis phase.