Author: Youngju Kim (@fjvbn20031)
Compiler Structure and How It Works
A compiler is a program that takes a source program as input and transforms it into a target program. In this article, we will examine each phase that makes up the overall structure of a compiler.
1. Types of Language Processors
There are several forms of systems that process programming languages.
- Compiler: Translates the entire source code into object code, which is then executed.
- Interpreter: Interprets and executes the source code one statement at a time.
- Hybrid approach: Java, for example, first compiles source code to bytecode, which the JVM then interprets (or JIT-compiles).
The key differences between compilers and interpreters can be summarized as follows:
[Compiler]
Source Program --> [Compiler] --> Target Program --> [Execution] --> Output
[Interpreter]
Source Program + Input --> [Interpreter] --> Output
A compiler spends time on translation up front, but the resulting object code runs fast. An interpreter can start executing immediately without a separate translation step, but execution is slower.
2. Overall Compiler Structure
A compiler is broadly divided into an Analysis part and a Synthesis part.
Source Program
|
v
+-------------------+
| Lexical Analysis |
+-------------------+
| Token Stream
v
+-------------------+
| Syntax Analysis |
+-------------------+
| Syntax Tree
v
+-------------------+
| Semantic Analysis |
+-------------------+
| Annotated Syntax Tree
v
+-------------------+
| Intermediate Code |
| Generation |
+-------------------+
| Intermediate Representation
v
+-------------------+
| Code Optimization |
+-------------------+
| Optimized IR
v
+-------------------+
| Code Generation |
+-------------------+
|
v
Target Program
Each phase takes the output of the previous phase as its input. The analysis part (front end) identifies the structure of the source program, and the synthesis part (back end) generates the target code.
3. Lexical Analysis
The lexical analyzer (lexer or scanner) reads the character stream of the source code and breaks it into token units.
For example, consider the following C code:
position = initial + rate * 60;
The lexical analyzer breaks this code into the following tokens:
<id, position> <=> <id, initial> <+> <id, rate> <*> <num, 60> <;>
Each token is represented as a pair of a token name and an attribute value. Identifiers are registered in the symbol table and referenced in later phases.
Additional Roles of the Lexical Analyzer
- Removing whitespace and comments
- Tracking line numbers in the source program (for error message generation)
- Interfacing with macro preprocessing
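To make the tokenization step concrete, here is a minimal lexer sketch. The token categories follow the example above; the regular-expression approach and the tuple representation are illustrative assumptions, not taken from any particular compiler:

```python
import re

# Longest-match token specification: numbers, identifiers, single-char
# operators, and whitespace (which the lexer discards, as described above).
TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("op", r"[=+\-*/;]"),
    ("skip", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of <token-name, attribute-value> pairs."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind == "skip":                    # whitespace/comments are dropped
            continue
        if kind == "op":
            tokens.append((m.group(), None))  # operators carry no attribute
        else:
            tokens.append((kind, m.group()))  # identifiers and numbers do
    return tokens

print(tokenize("position = initial + rate * 60;"))
```

Running this on the example statement yields the eight tokens listed above, with whitespace already removed.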
4. Syntax Analysis
The syntax analyzer (parser) takes the token stream as input and produces a parse tree or abstract syntax tree (AST).
The AST for the example position = initial + rate * 60 looks like this:
            =
           / \
 id:position   +
              / \
    id:initial   *
                / \
          id:rate   num:60
The parser operates based on Context-Free Grammar (CFG). The grammar defines the structure of the programming language, and the parser verifies that the input conforms to this grammar.
Main Parsing Methods
| Method | Description | Examples |
|---|---|---|
| Top-Down | Starts from the start symbol and derives the input | Recursive descent parser, LL parser |
| Bottom-Up | Starts from the input and reduces to the start symbol | LR parser, LALR parser |
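The top-down approach can be sketched as a small recursive descent parser for expressions like initial + rate * 60. The grammar below (expr -> term (('+'|'-') term)*, term -> factor (('*'|'/') factor)*, factor -> id | num) and the tuple-based AST shape are illustrative assumptions:

```python
# One function per nonterminal; precedence is encoded by the call structure:
# expr handles +/- and calls term, which handles */ and calls factor.
def parse_expr(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)              # left-associative
    return node, pos

def parse_term(tokens, pos):
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_factor(tokens, pos):
    return tokens[pos], pos + 1               # leaf: identifier or number

ast, _ = parse_expr(["initial", "+", "rate", "*", "60"])
print(ast)   # ('+', 'initial', ('*', 'rate', '60'))
```

Note how the parser builds the multiplication subtree first, so rate * 60 ends up nested under +, exactly as in the tree diagram above.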
5. Semantic Analysis
The semantic analyzer uses the syntax tree to check the semantic consistency of the source program.
Its main roles are:
- Type Checking: Verifies that the types of operators and operands are compatible.
- Type Coercion: Inserts automatic type conversions where necessary.
- Scope Checking: Verifies that variables are used within the scope where they are declared.
For example, if rate is float and 60 is int, the semantic analyzer inserts a node to convert 60 to float.
            =
           / \
 id:position   +
              / \
    id:initial   *
                / \
          id:rate   inttofloat
                        |
                      num:60
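A sketch of how the coercion node could be inserted: assuming a tuple-based AST and a tiny symbol table mapping names to types (both illustrative choices, not any real compiler's data structures), type checking looks like this:

```python
# Types of the declared identifiers, as in the running example.
SYMBOLS = {"position": "float", "initial": "float", "rate": "float"}

def check(node):
    """Return (typed_node, type); insert inttofloat where int meets float."""
    if isinstance(node, tuple):                 # operator node: (op, left, right)
        op, left, right = node
        (left, lt), (right, rt) = check(left), check(right)
        if lt != rt:                            # coerce the int side to float
            if lt == "int":
                left, lt = ("inttofloat", left), "float"
            else:
                right, rt = ("inttofloat", right), "float"
        return (op, left, right), lt
    if node.isdigit():                          # integer literal
        return node, "int"
    return node, SYMBOLS[node]                  # identifier: look up its type

typed, t = check(("+", "initial", ("*", "rate", "60")))
print(typed)   # ('+', 'initial', ('*', 'rate', ('inttofloat', '60')))
```

The analyzer notices that rate is float while 60 is int, and wraps the literal in an inttofloat node, producing the annotated tree shown above.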
6. Intermediate Code Generation
The compiler does not convert the source program directly into target code; instead, it first generates an Intermediate Representation (IR).
The most widely used IR format is Three-Address Code.
t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3
The characteristics of three-address code are:
- Each instruction contains at most one operator.
- The compiler generates temporary variables (t1, t2, t3).
- The order of operations for complex expressions is explicitly determined.
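A minimal sketch of producing the three-address code above from the annotated AST. The tuple AST shape is the illustrative one used earlier; a real compiler would walk a proper tree structure:

```python
import itertools

def gen_tac(node, code, temps):
    """Post-order walk: emit child code first, then one instruction per operator."""
    if not isinstance(node, tuple):
        return node                              # leaf is already an address
    if node[0] == "inttofloat":                  # unary conversion node
        arg = gen_tac(node[1], code, temps)
        tmp = f"t{next(temps)}"
        code.append(f"{tmp} = inttofloat({arg})")
        return tmp
    op, left, right = node
    l = gen_tac(left, code, temps)
    r = gen_tac(right, code, temps)
    tmp = f"t{next(temps)}"
    code.append(f"{tmp} = {l} {op} {r}")         # at most one operator per line
    return tmp

code = []
root = ("+", "initial", ("*", "rate", ("inttofloat", "60")))
code.append(f"position = {gen_tac(root, code, itertools.count(1))}")
print("\n".join(code))
```

The post-order traversal is what makes the evaluation order explicit: inner subexpressions are emitted (and named t1, t2, ...) before the operators that use them.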
Advantages of Intermediate Code
- Machine independence: A representation that is not tied to specific hardware.
- Portability: Can be easily retargeted to different target machines.
- Ease of optimization: Well-suited for applying machine-independent optimizations.
7. Code Optimization
The code optimization phase transforms the intermediate code to generate more efficient target code.
// Before optimization
t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3
// After optimization
t1 = rate * 60.0
position = initial + t1
The optimizations performed in the above example are:
- Constant Folding: Computes inttofloat(60) as 60.0 at compile time.
- Eliminating unnecessary temporaries: Removes t3 and assigns directly to position.
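These two local optimizations can be sketched as a small pass over the three-address code. The string-based matching below is a toy simplification of what a real optimizer does on a structured IR:

```python
def optimize(tac):
    """Fold inttofloat of constants, then eliminate 'x = tN' copy instructions."""
    folded, consts = [], {}
    for line in tac:
        dest, expr = [s.strip() for s in line.split("=", 1)]
        # Constant folding: inttofloat(60) is computed at compile time.
        if expr.startswith("inttofloat(") and expr[11:-1].isdigit():
            consts[dest] = expr[11:-1] + ".0"
            continue                             # the instruction disappears
        for name, val in consts.items():         # substitute folded constants
            expr = expr.replace(name, val)
        folded.append((dest, expr))
    result = []
    for dest, expr in folded:
        # Copy elimination: 'x = tN' where tN was just defined -> write x directly.
        if result and expr == result[-1][0] and expr.startswith("t"):
            result[-1] = (dest, result[-1][1])
        else:
            result.append((dest, expr))
    return [f"{d} = {e}" for d, e in result]

code = ["t1 = inttofloat(60)", "t2 = rate * t1",
        "t3 = initial + t2", "position = t3"]
print("\n".join(optimize(code)))
```

Four instructions become two, as in the before/after listing above (only the temporary numbering differs).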
Key Optimization Techniques
- Common Subexpression Elimination
- Dead Code Elimination
- Loop-Invariant Code Motion
- Strength Reduction: Replaces expensive operations with cheaper ones.
- Inlining: Replaces function calls with the function body.
8. Code Generation
The code generator converts the optimized intermediate code into target machine instructions.
// Intermediate code
t1 = rate * 60.0
position = initial + t1
// Assembly code for a simple register machine (illustrative, not real x86)
LDF R2, rate // Load rate into register R2
MULF R2, R2, #60.0 // R2 = R2 * 60.0
LDF R1, initial // Load initial into register R1
ADDF R1, R1, R2 // R1 = R1 + R2
STF position, R1 // Store R1 value into position
The key challenges in the code generation phase are:
- Register Allocation: Efficiently placing variables in limited registers.
- Instruction Selection: Choosing optimal instructions for the target machine.
- Instruction Scheduling: Maximizing pipeline efficiency.
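A toy sketch of naive code generation for the optimized code above: registers are handed out on first use, and the result overwrites the left operand's register. The LDF/MULF/ADDF/STF mnemonics follow the example above; the allocation strategy and string handling are illustrative assumptions, far simpler than a real register allocator:

```python
def codegen(tac):
    """Map 'dest = a op b' three-address lines to load/compute/store form."""
    asm, loc, nreg = [], {}, [0]
    ops = {"*": "MULF", "+": "ADDF"}
    def operand(x):
        if x in loc:
            return loc[x]                        # already resident in a register
        if x[0].isdigit():
            return f"#{x}"                       # constant -> immediate operand
        nreg[0] += 1
        loc[x] = f"R{nreg[0]}"
        asm.append(f"LDF {loc[x]}, {x}")         # variable -> load from memory
        return loc[x]
    for line in tac:
        dest, expr = [s.strip() for s in line.split("=", 1)]
        a, op, b = expr.split()
        ra, rb = operand(a), operand(b)
        asm.append(f"{ops[op]} {ra}, {ra}, {rb}")  # result overwrites ra
        loc[dest] = ra
        if not dest.startswith("t"):             # program variable -> store back
            asm.append(f"STF {dest}, {ra}")
    return asm

print("\n".join(codegen(["t1 = rate * 60.0", "position = initial + t1"])))
```

This reproduces the five-instruction listing above (register numbering aside): two loads, a multiply with an immediate, an add, and a final store.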
9. Symbol Table
The symbol table is a core data structure shared across all phases of the compiler.
+----------+-------+--------+--------+
| Name     | Type  | Scope  | Memory |
+----------+-------+--------+--------+
| position | float | global | 0x1000 |
| initial  | float | global | 0x1004 |
| rate     | float | global | 0x1008 |
+----------+-------+--------+--------+
The information stored in the symbol table includes:
- Variable name (identifier)
- Type information: int, float, array, etc.
- Scope information: The range in which a variable is valid
- Memory location: The address where a variable is allocated
The symbol table is typically implemented as a hash table, supporting fast insertion and lookup. Scope chaining or stack-based structures are used alongside it to handle scopes.
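A chained-scope symbol table can be sketched as follows: each scope is a hash table (a Python dict here), and lookup walks outward along the parent chain from the innermost scope. The class shape is illustrative:

```python
class SymbolTable:
    """One hash table per scope, linked to the enclosing scope."""
    def __init__(self, parent=None):
        self.entries = {}              # name -> attributes (here, just a type)
        self.parent = parent           # enclosing scope, or None for global

    def define(self, name, typ):
        self.entries[name] = typ

    def lookup(self, name):
        scope = self
        while scope is not None:       # walk the scope chain outward
            if name in scope.entries:
                return scope.entries[name]
            scope = scope.parent
        return None                    # undeclared identifier

glob = SymbolTable()
glob.define("rate", "float")
inner = SymbolTable(parent=glob)       # e.g. a function body opens a new scope
inner.define("i", "int")
print(inner.lookup("i"), inner.lookup("rate"), inner.lookup("x"))
```

Entering a block creates a new table whose parent is the current one; leaving the block simply drops back to the parent, so inner names never leak outward.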
10. Compilation Phase vs Pass
Phase
Each logical functional unit of a compiler is called a phase. Lexical analysis, syntax analysis, etc., are each a phase. Phases are conceptual divisions, and in practice, multiple phases may be combined.
Pass
A pass is the process of reading the source program (or intermediate representation) from beginning to end once.
[1-Pass Compiler]
Processes all phases in a single traversal
Example: Simple assemblers, early Pascal compilers
[Multi-Pass Compiler]
Processes each phase through multiple traversals
Example: Modern optimizing compilers (GCC, LLVM, etc.)
A 1-Pass compiler is fast but limited in optimization, while a Multi-Pass compiler can perform more optimizations but takes longer to compile.
Actual Structure of Modern Compilers
Modern compilers (e.g., LLVM) have the following modular structure:
[C/C++ Frontend]   [Rust Frontend]   [Swift Frontend]
        \                |                /
         v               v               v
        +-------------------------------+
        |            LLVM IR            |
        +-------------------------------+
                 |               |
                 v               v
          [x86 Backend]    [ARM Backend]
Thanks to this structure, supporting M source languages and N target machines requires only M + N modules (M front ends plus N back ends) instead of M x N separate compilers.
Summary
| Phase | Input | Output | Key Role |
|---|---|---|---|
| Lexical Analysis | Character stream | Token stream | Token recognition, whitespace/comment removal |
| Syntax Analysis | Token stream | Syntax tree | Grammar verification, tree construction |
| Semantic Analysis | Syntax tree | Annotated syntax tree | Type checking, semantic verification |
| Intermediate Code Gen. | Syntax tree | Three-address code | Machine-independent IR generation |
| Code Optimization | Intermediate code | Optimized IR | Performance improvement transformations |
| Code Generation | Intermediate code | Target code | Instruction selection, register allocation |
A compiler is not just a simple translator; it is a sophisticated system that accurately understands the semantics of a programming language and transforms it into efficient machine code. In the next article, we will explore the basics of programming languages and the various applications of compiler technology.