
[Compiler] 01. Compiler Structure and How It Works


A compiler is a program that takes a source program as input and transforms it into a target program. In this article, we will examine each phase that makes up the overall structure of a compiler.


1. Types of Language Processors

There are several forms of systems that process programming languages.

  • Compiler: Translates the entire source program into target code, which is then executed as a separate step.
  • Interpreter: Interprets and executes the source code one statement at a time.
  • Hybrid approach: Java, for example, is first compiled to bytecode, which the JVM then interprets.

The key differences between compilers and interpreters can be summarized as follows:

[Compiler]
  Source Program --> [Compiler] --> Target Program --> [Execution] --> Output

[Interpreter]
  Source Program + Input --> [Interpreter] --> Output

A compiler spends time up front on translation, but the resulting target code runs fast. An interpreter can start executing immediately without a separate translation step, but overall execution is slower.


2. Overall Compiler Structure

A compiler is broadly divided into an Analysis part and a Synthesis part.

              Source Program
                    |
                    v
         +---------------------+
         |  Lexical Analysis   |
         +---------------------+
                    |  Token Stream
                    v
         +---------------------+
         |   Syntax Analysis   |
         +---------------------+
                    |  Syntax Tree
                    v
         +---------------------+
         |  Semantic Analysis  |
         +---------------------+
                    |  Annotated Syntax Tree
                    v
         +---------------------+
         |  Intermediate Code  |
         |     Generation      |
         +---------------------+
                    |  Intermediate Representation
                    v
         +---------------------+
         |  Code Optimization  |
         +---------------------+
                    |  Optimized IR
                    v
         +---------------------+
         |   Code Generation   |
         +---------------------+
                    |
                    v
              Target Program

Each phase takes the output of the previous phase as its input. The analysis part (front end) identifies the structure of the source program, and the synthesis part (back end) generates the target code.


3. Lexical Analysis

The lexical analyzer (lexer or scanner) reads the character stream of the source code and breaks it into token units.

For example, consider the following C code:

position = initial + rate * 60;

The lexical analyzer breaks this code into the following tokens:

<id, "position">  <=>  <id, "initial">  <+>  <id, "rate">  <*>  <num, 60>  <;>

Each token is represented as a pair of a token name and an attribute value. Identifiers are registered in the symbol table and referenced in later phases.

Additional Roles of the Lexical Analyzer

  • Removing whitespace and comments
  • Tracking line numbers in the source program (for error message generation)
  • Interfacing with macro preprocessing
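
The tokenization step above can be sketched in a few lines of Python. This is an illustrative toy scanner — the token names and regular expressions are assumptions for this example, not a production lexer:

```python
import re

# Token specification as (name, pattern) pairs; order matters, since the
# first matching pattern wins. The categories are assumptions for this example.
TOKEN_SPEC = [
    ("ws",  r"[ \t\n]+"),        # whitespace: recognized, then discarded
    ("num", r"\d+"),             # integer literals
    ("id",  r"[A-Za-z_]\w*"),    # identifiers
    ("op",  r"[=+\-*/;]"),       # single-character operators and ';'
]

def tokenize(source):
    """Yield (token-name, attribute) pairs for a character stream."""
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "ws":               # drop whitespace and move on
                    yield (name, m.group())
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r}")

print(list(tokenize("position = initial + rate * 60;")))
# → [('id', 'position'), ('op', '='), ('id', 'initial'), ('op', '+'),
#    ('id', 'rate'), ('op', '*'), ('num', '60'), ('op', ';')]
```

Real scanners are usually generated from such regular-expression specifications by tools like lex/flex rather than written by hand.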

4. Syntax Analysis

The syntax analyzer (parser) takes the token stream as input and produces a parse tree or abstract syntax tree (AST).

The AST for the example position = initial + rate * 60 looks like this:

         =
        / \
       /   \
    id:     +
  position / \
          /   \
        id:    *
      initial / \
             /   \
           id:  num:
          rate   60

The parser operates based on Context-Free Grammar (CFG). The grammar defines the structure of the programming language, and the parser verifies that the input conforms to this grammar.

Main Parsing Methods

Method    | Description                                           | Examples
----------|-------------------------------------------------------|-------------------------------------
Top-Down  | Starts from the start symbol and derives the input    | Recursive descent parser, LL parser
Bottom-Up | Starts from the input and reduces to the start symbol | LR parser, LALR parser
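
As a sketch of the top-down approach, a minimal recursive-descent parser for the assignment statement used in this article might look like this (the grammar and the tuple-based AST shape are assumptions for illustration):

```python
# Grammar (a sketch): assign -> id "=" expr ";"
#                     expr   -> term ("+" term)*
#                     term   -> factor ("*" factor)*
#                     factor -> id | num
# Input is a list of (name, value) token pairs; output is a nested-tuple AST.
def parse(tokens):
    toks, pos = list(tokens), 0

    def peek():
        return toks[pos] if pos < len(toks) else (None, None)

    def eat(expected):
        """Consume one token whose name or value matches `expected`."""
        nonlocal pos
        name, value = peek()
        if expected not in (name, value):
            raise SyntaxError(f"expected {expected}, got {value!r}")
        pos += 1
        return value

    def factor():
        name, value = peek()
        if name in ("id", "num"):
            return (name, eat(name))
        raise SyntaxError(f"unexpected token {value!r}")

    def term():
        node = factor()
        while peek()[1] == "*":
            eat("*")
            node = ("*", node, factor())
        return node

    def expr():
        node = term()
        while peek()[1] == "+":
            eat("+")
            node = ("+", node, term())
        return node

    target = ("id", eat("id"))    # left-hand side identifier
    eat("=")
    tree = ("=", target, expr())
    eat(";")
    return tree
```

Feeding it the token stream of position = initial + rate * 60; yields the nested tuple ("=", ("id", "position"), ("+", ("id", "initial"), ("*", ("id", "rate"), ("num", "60")))), mirroring the tree above; note that the `while` loops in term and expr give * higher precedence than +.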

5. Semantic Analysis

The semantic analyzer uses the syntax tree to check the semantic consistency of the source program.

Its main roles are:

  • Type Checking: Verifies that the types of operators and operands are compatible.
  • Type Coercion: Inserts automatic type conversions where necessary.
  • Scope Checking: Verifies that variables are used within the scope where they are declared.

For example, if rate is float and 60 is int, the semantic analyzer inserts a node to convert 60 to float.

         =
        / \
       /   \
    id:     +
  position / \
          /   \
        id:    *
      initial / \
             /   \
           id:  inttofloat
          rate      |
                  num: 60
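
Continuing the tuple-AST sketch, a toy semantic analyzer could annotate types and insert the coercion node roughly like this (the SYMBOLS table and the type names are assumptions matching this article's example):

```python
# AST nodes are ("id", name), ("num", literal), or (op, left, right).
# The symbol-table contents below are assumptions for this example.
SYMBOLS = {"position": "float", "initial": "float", "rate": "float"}

def annotate(node):
    """Return (typed_node, type), inserting inttofloat where int meets float."""
    kind = node[0]
    if kind == "id":
        return node, SYMBOLS[node[1]]          # type lookup in the symbol table
    if kind == "num":
        return node, "float" if "." in node[1] else "int"
    op, lhs, rhs = node
    lhs, lt = annotate(lhs)
    rhs, rt = annotate(rhs)
    if lt != rt:                               # type coercion: widen the int side
        if lt == "int":
            lhs = ("inttofloat", lhs)
        else:
            rhs = ("inttofloat", rhs)
    return (op, lhs, rhs), "float" if "float" in (lt, rt) else lt

print(annotate(("*", ("id", "rate"), ("num", "60")))[0])
# → ('*', ('id', 'rate'), ('inttofloat', ('num', '60')))
```

A mismatch that cannot be coerced (e.g. calling a non-function, or indexing a scalar) would instead be reported as a semantic error at this stage.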

6. Intermediate Code Generation

The compiler does not convert the source program directly into target code; instead, it first generates an Intermediate Representation (IR).

The most widely used IR format is Three-Address Code.

t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3

The characteristics of three-address code are:

  • Each instruction contains at most one operator.
  • The compiler generates temporary variables (t1, t2, t3).
  • The order of operations for complex expressions is explicitly determined.
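
A toy generator that flattens such a tuple AST into three-address code might look like this (a sketch under the same assumed AST shape, not a real compiler's IR builder):

```python
# Walks the annotated tuple AST bottom-up, emitting one instruction per
# operator and inventing the temporaries t1, t2, ... along the way.
def gen_tac(tree):
    code, counter = [], 0

    def new_temp():
        nonlocal counter
        counter += 1
        return f"t{counter}"

    def walk(node):
        kind = node[0]
        if kind in ("id", "num"):
            return node[1]                    # leaves are just names/literals
        if kind == "inttofloat":
            tmp = new_temp()
            code.append(f"{tmp} = inttofloat({walk(node[1])})")
            return tmp
        op, lhs, rhs = node
        left, right = walk(lhs), walk(rhs)
        if op == "=":
            code.append(f"{left} = {right}")
            return left
        tmp = new_temp()
        code.append(f"{tmp} = {left} {op} {right}")
        return tmp

    walk(tree)
    return code

tree = ("=", ("id", "position"),
        ("+", ("id", "initial"),
              ("*", ("id", "rate"), ("inttofloat", ("num", "60")))))
for line in gen_tac(tree):
    print(line)
# → t1 = inttofloat(60)
#   t2 = rate * t1
#   t3 = initial + t2
#   position = t3
```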

Advantages of Intermediate Code

  1. Machine independence: A representation that is not tied to specific hardware.
  2. Portability: Can be easily retargeted to different target machines.
  3. Ease of optimization: Well-suited for applying machine-independent optimizations.

7. Code Optimization

The code optimization phase transforms the intermediate code to generate more efficient target code.

// Before optimization
t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3

// After optimization
t1 = rate * 60.0
position = initial + t1

The optimizations performed in the above example are:

  • Constant Folding: Computes inttofloat(60) as 60.0 at compile time.
  • Eliminating unnecessary temporaries: Removes t3 and assigns directly to position.
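
These two optimizations can be sketched as a small pass over the three-address code strings (a deliberately naive illustration — real optimizers work on structured IR, not on strings):

```python
import re

def fold_constants(code):
    """Fold inttofloat on literals and merge trivial copies (a sketch)."""
    known, out = {}, []
    for inst in code:
        dest, rhs = inst.split(" = ", 1)
        m = re.fullmatch(r"inttofloat\((\d+)\)", rhs)
        if m:
            known[dest] = str(float(m.group(1)))   # constant folding: 60 -> 60.0
            continue                               # the instruction disappears
        for name, value in known.items():          # substitute folded constants
            rhs = re.sub(rf"\b{name}\b", value, rhs)
        # copy elimination: "tN = expr" followed by "x = tN" becomes "x = expr"
        if out and re.fullmatch(r"t\d+", rhs) and out[-1].startswith(rhs + " = "):
            out[-1] = dest + " = " + out[-1].split(" = ", 1)[1]
            continue
        out.append(f"{dest} = {rhs}")
    return out

print(fold_constants(["t1 = inttofloat(60)", "t2 = rate * t1",
                      "t3 = initial + t2", "position = t3"]))
# → ['t2 = rate * 60.0', 'position = initial + t2']
```

This produces the same result as the optimized listing above, modulo renumbering of the temporaries.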

Key Optimization Techniques

  • Common Subexpression Elimination
  • Dead Code Elimination
  • Loop-Invariant Code Motion
  • Strength Reduction: Replaces expensive operations with cheaper ones.
  • Inlining: Replaces function calls with the function body.

8. Code Generation

The code generator converts the optimized intermediate code into target machine instructions.

// Intermediate code
t1 = rate * 60.0
position = initial + t1

// Assembly code for a hypothetical register machine
LDF  R2, rate       // Load rate into register R2
MULF R2, R2, #60.0  // R2 = R2 * 60.0
LDF  R1, initial    // Load initial into register R1
ADDF R1, R1, R2     // R1 = R1 + R2
STF  position, R1   // Store R1 value into position

The key challenges in the code generation phase are:

  • Register Allocation: Efficiently placing variables in limited registers.
  • Instruction Selection: Choosing optimal instructions for the target machine.
  • Instruction Scheduling: Maximizing pipeline efficiency.
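
A deliberately naive instruction selector for the two optimized instructions above might look like this; it hands out registers on demand and performs no real register allocation (the LDF/MULF/ADDF/STF mnemonics follow the hypothetical machine used above):

```python
import re

def gen_code(code):
    """Translate three-address strings into hypothetical-machine assembly."""
    asm, regs, counter = [], {}, [0]
    ops = {"*": "MULF", "+": "ADDF"}

    def load(operand):
        """Return a register (or immediate) holding the operand."""
        if operand.replace(".", "", 1).isdigit():
            return f"#{operand}"                   # numeric literal -> immediate
        if operand not in regs:                    # first use: load from memory
            counter[0] += 1
            regs[operand] = f"R{counter[0]}"
            asm.append(f"LDF  {regs[operand]}, {operand}")
        return regs[operand]

    for inst in code:
        dest, rhs = inst.split(" = ", 1)
        parts = rhs.split()
        if len(parts) == 3:                        # binary operation: a op b
            a, op, b = parts
            ra, rb = load(a), load(b)
            target = ra if ra.startswith("R") else rb   # reuse an operand register
            asm.append(f"{ops[op]} {target}, {ra}, {rb}")
            regs[dest] = target
        else:                                      # plain copy
            regs[dest] = load(rhs)
        if not re.fullmatch(r"t\d+", dest):        # program variable: store back
            asm.append(f"STF  {dest}, {regs[dest]}")
    return asm
```

Run on the optimized code above, it emits the same five instructions as the listing, differing only in which value lands in R1 versus R2.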

9. Symbol Table

The symbol table is a core data structure shared across all phases of the compiler.

+----------+-------+--------+---------+
| Name     | Type  | Scope  | Memory  |
+----------+-------+--------+---------+
| position | float | global | 0x1000  |
| initial  | float | global | 0x1004  |
| rate     | float | global | 0x1008  |
+----------+-------+--------+---------+

The information stored in the symbol table includes:

  • Variable name (identifier)
  • Type information: int, float, array, etc.
  • Scope information: The range in which a variable is valid
  • Memory location: The address where a variable is allocated

The symbol table is typically implemented as a hash table, supporting fast insertion and lookup. Scope chaining or stack-based structures are used alongside it to handle scopes.
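
A minimal sketch of such a structure, with each scope backed by a hash table (a Python dict) and chained to its enclosing scope:

```python
class SymbolTable:
    """A scope-chained symbol table: lookups fall back to the parent scope."""

    def __init__(self, parent=None):
        self.entries = {}          # name -> attribute dict (hash table)
        self.parent = parent       # enclosing scope, or None for the global scope

    def define(self, name, **attrs):
        self.entries[name] = attrs

    def lookup(self, name):
        """Search this scope, then walk the chain of enclosing scopes."""
        scope = self
        while scope is not None:
            if name in scope.entries:
                return scope.entries[name]
            scope = scope.parent
        raise KeyError(f"undeclared identifier: {name}")

globals_ = SymbolTable()
globals_.define("rate", type="float", scope="global", addr=0x1008)

inner = SymbolTable(parent=globals_)       # e.g. a function body's scope
inner.define("i", type="int", scope="local")

print(inner.lookup("rate")["type"])        # found via the scope chain → float
print(inner.lookup("i")["scope"])          # found locally → local
```

Entering a block creates a new table chained to the current one; leaving it simply discards the table, which is how shadowing and scope exit fall out naturally.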


10. Compilation Phase vs Pass

Phase

Each logical functional unit of a compiler is called a phase. Lexical analysis, syntax analysis, etc., are each a phase. Phases are conceptual divisions, and in practice, multiple phases may be combined.

Pass

A pass is the process of reading the source program (or intermediate representation) from beginning to end once.

[1-Pass Compiler]
  Processes all phases in a single traversal
  Example: Simple assemblers, early Pascal compilers

[Multi-Pass Compiler]
  Processes each phase through multiple traversals
  Example: Modern optimizing compilers (GCC, LLVM, etc.)

A 1-Pass compiler is fast but limited in optimization, while a Multi-Pass compiler can perform more optimizations but takes longer to compile.

Actual Structure of Modern Compilers

Modern compilers (e.g., LLVM) have the following modular structure:

[C/C++ Frontend]  [Rust Frontend]  [Swift Frontend]
        \              |              /
         v             v             v
     +-------------------------------+
     |            LLVM IR            |
     +-------------------------------+
              |              |
              v              v
        [x86 Backend]  [ARM Backend]

Thanks to this structure, supporting M languages and N target machines requires only M+N modules (not M x N).
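
The M + N idea can be illustrated with a toy pipeline in which every frontend emits the same IR and every backend consumes it (the module names and IR strings here are purely hypothetical):

```python
# Each frontend and backend is one module; the shared IR is the only contract
# between them, so any frontend composes with any backend.
frontends = {"c":    lambda src: f"ir({src})",
             "rust": lambda src: f"ir({src})"}       # M = 2 frontend modules

backends = {"x86": lambda ir: f"x86<{ir}>",
            "arm": lambda ir: f"arm<{ir}>"}          # N = 2 backend modules

def compile_(lang, target, source):
    return backends[target](frontends[lang](source))

# 2 + 2 = 4 modules cover all 2 x 4 / 2 = 4 language/target combinations;
# without a shared IR, each pairing would need its own translator (M x N).
print(compile_("c", "arm", "main"))
# → arm<ir(main)>
```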


Summary

Phase                  | Input             | Output                | Key Role
-----------------------|-------------------|-----------------------|-----------------------------------------------
Lexical Analysis       | Character stream  | Token stream          | Token recognition, whitespace/comment removal
Syntax Analysis        | Token stream      | Syntax tree           | Grammar verification, tree construction
Semantic Analysis      | Syntax tree       | Annotated syntax tree | Type checking, semantic verification
Intermediate Code Gen. | Syntax tree       | Three-address code    | Machine-independent IR generation
Code Optimization      | Intermediate code | Optimized IR          | Performance improvement transformations
Code Generation        | Intermediate code | Target code           | Instruction selection, register allocation

A compiler is not just a simple translator; it is a sophisticated system that accurately understands the semantics of a programming language and transforms it into efficient machine code. In the next article, we will explore the basics of programming languages and the various applications of compiler technology.