Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Compiler Structure and How It Works

A compiler is a program that takes a **source program** as input and transforms it into a **target program**. In this article, we will examine each phase that makes up the overall structure of a compiler.

1. Types of Language Processors

There are several forms of systems that process programming languages.

- **Compiler**: Translates the entire source code into object code, then executes it.

- **Interpreter**: Interprets and executes source code one statement at a time.

- **Hybrid approach**: Like Java, which first compiles to bytecode and then the JVM interprets it.

The key differences between compilers and interpreters can be summarized as follows:

[Compiler]

Source Program --> [Compiler] --> Target Program --> [Execution] --> Output

[Interpreter]

Source Program + Input --> [Interpreter] --> Output

A compiler takes longer for the translation process, but the compiled object code runs fast. An interpreter can execute immediately without translation, but execution speed is relatively slower.

2. Overall Compiler Structure

A compiler is broadly divided into an **Analysis** part and a **Synthesis** part.

Source Program

+-------------------+

| Lexical Analysis |

+-------------------+

| Token Stream

+-------------------+

| Syntax Analysis |

+-------------------+

| Syntax Tree

+-------------------+

| Semantic Analysis |

+-------------------+

| Annotated Syntax Tree

+-------------------+

| Intermediate Code |

| Generation |

+-------------------+

| Intermediate Representation

+-------------------+

| Code Optimization |

+-------------------+

| Optimized IR

+-------------------+

| Code Generation |

+-------------------+

Target Program

Each phase takes the output of the previous phase as its input. The analysis part (front end) identifies the structure of the source program, and the synthesis part (back end) generates the target code.

3. Lexical Analysis

The lexical analyzer (lexer or scanner) reads the character stream of the source code and breaks it into **token** units.

For example, consider the following C code:

position = initial + rate * 60;

The lexical analyzer breaks this code into the following tokens:

Each token is represented as a pair of a **token name** and an **attribute value**. Identifiers are registered in the symbol table and referenced in later phases.

Additional Roles of the Lexical Analyzer

- Removing whitespace and comments

- Tracking line numbers in the source program (for error message generation)

- Interfacing with macro preprocessing

4. Syntax Analysis

The syntax analyzer (parser) takes the token stream as input and produces a **parse tree** or **abstract syntax tree (AST)**.

The AST for the example `position = initial + rate * 60` looks like this:

/ \

id: +

position / \

/ \

id: *

initial / \

/ \

id: num:

rate 60

The parser operates based on **Context-Free Grammar (CFG)**. The grammar defines the structure of the programming language, and the parser verifies that the input conforms to this grammar.

Main Parsing Methods

| Method | Description | Examples |

| --------- | ----------------------------------------------------- | ----------------------------------- |

| Top-Down | Starts from the start symbol and derives the input | Recursive descent parser, LL parser |

| Bottom-Up | Starts from the input and reduces to the start symbol | LR parser, LALR parser |

5. Semantic Analysis

The semantic analyzer uses the syntax tree to check the **semantic consistency** of the source program.

Its main roles are:

- **Type Checking**: Verifies that the types of operators and operands are compatible.

- **Type Coercion**: Inserts automatic type conversions where necessary.

- **Scope Checking**: Verifies that variables are used within the scope where they are declared.

For example, if `rate` is `float` and `60` is `int`, the semantic analyzer inserts a node to convert `60` to `float`.

/ \

id: +

position / \

/ \

id: *

initial / \

/ \

id: inttofloat

rate |

num: 60

6. Intermediate Code Generation

The compiler does not convert the source program directly into target code; instead, it first generates an **Intermediate Representation (IR)**.

The most widely used IR format is **Three-Address Code**.

t1 = inttofloat(60)

t2 = rate * t1

t3 = initial + t2

position = t3

The characteristics of three-address code are:

- Each instruction contains at most one operator.

- The compiler generates temporary variables (`t1`, `t2`, `t3`).

- The order of operations for complex expressions is explicitly determined.

Advantages of Intermediate Code

1. **Machine independence**: A representation that is not tied to specific hardware.

2. **Portability**: Can be easily retargeted to different target machines.

3. **Ease of optimization**: Well-suited for applying machine-independent optimizations.

7. Code Optimization

The code optimization phase transforms the intermediate code to generate more efficient target code.

// Before optimization

t1 = inttofloat(60)

t2 = rate * t1

t3 = initial + t2

position = t3

// After optimization

t1 = rate * 60.0

position = initial + t1

The optimizations performed in the above example are:

- **Constant Folding**: Computes `inttofloat(60)` as `60.0` at compile time.

- **Eliminating unnecessary temporaries**: Removes `t3` and assigns directly to `position`.

Key Optimization Techniques

- **Common Subexpression Elimination**

- **Dead Code Elimination**

- **Loop-Invariant Code Motion**

- **Strength Reduction**: Replaces expensive operations with cheaper ones.

- **Inlining**: Replaces function calls with the function body.

8. Code Generation

The code generator converts the optimized intermediate code into **target machine instructions**.

// Intermediate code

t1 = rate * 60.0

position = initial + t1

// x86-like assembly code

LDF R2, rate // Load rate into register R2

MULF R2, R2, #60.0 // R2 = R2 * 60.0

LDF R1, initial // Load initial into register R1

ADDF R1, R1, R2 // R1 = R1 + R2

STF position, R1 // Store R1 value into position

The key challenges in the code generation phase are:

- **Register Allocation**: Efficiently placing variables in limited registers.

- **Instruction Selection**: Choosing optimal instructions for the target machine.

- **Instruction Scheduling**: Maximizing pipeline efficiency.

9. Symbol Table

The symbol table is a core data structure **shared across all phases** of the compiler.

+--------+-------+--------+---------+

+--------+-------+--------+---------+

+--------+-------+--------+---------+

The information stored in the symbol table includes:

- **Variable name (identifier)**

- **Type information**: `int`, `float`, array, etc.

- **Scope information**: The range in which a variable is valid

- **Memory location**: The address where a variable is allocated

The symbol table is typically implemented as a **hash table**, supporting fast insertion and lookup. Scope chaining or stack-based structures are used alongside it to handle scopes.

10. Compilation Phase vs Pass

Phase

Each **logical functional unit** of a compiler is called a phase. Lexical analysis, syntax analysis, etc., are each a phase. Phases are conceptual divisions, and in practice, multiple phases may be combined.

Pass

A pass is the process of **reading the source program (or intermediate representation) from beginning to end once**.

[1-Pass Compiler]

Processes all phases in a single traversal

Example: Simple assemblers, early Pascal compilers

[Multi-Pass Compiler]

Processes each phase through multiple traversals

Example: Modern optimizing compilers (GCC, LLVM, etc.)

A **1-Pass compiler** is fast but limited in optimization, while a **Multi-Pass compiler** can perform more optimizations but takes longer to compile.

Actual Structure of Modern Compilers

Modern compilers (e.g., LLVM) have the following modular structure:

[C/C++ Frontend] [Rust Frontend] [Swift Frontend]

\ | /

v v v

+-------------------------------+

| LLVM IR (IR) |

+-------------------------------+

| |

v v

[x86 Backend] [ARM Backend]

Thanks to this structure, supporting **M languages** and **N target machines** requires only M+N modules (not M x N).

Summary

| ---------------------- | ----------------- | --------------------- | --------------------------------------------- |

A compiler is not just a simple translator; it is a sophisticated system that accurately understands the semantics of a programming language and transforms it into efficient machine code. In the next article, we will explore the basics of programming languages and the various applications of compiler technology.