✍️ 필사 모드: Compiler & Interpreter Design Complete Guide 2025: Lexer, Parser, AST, Code Generation
EnglishTable of Contents
1. Why Learn Compilers
Compiler theory is one of the core fields of computer science. Understanding compilers gives you deep insight into how programming languages work and has numerous practical applications.
1.1 Practical Applications
| Application Area | Examples |
|---|---|
| DSL Design | Configuration languages, query languages, template engines |
| Transpilers | TypeScript, Babel, PostCSS |
| Linters/Formatters | ESLint, Prettier, rustfmt |
| IDE Features | Autocomplete, refactoring, syntax highlighting |
| Code Analysis | Static analysis, security scanners, dependency analysis |
| Build Tools | webpack, esbuild, SWC, Vite |
1.2 Compilation Pipeline Overview
Source Code -> Compilation Pipeline:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Source │───>│ Lexer │───>│ Parser │
│ Code │ │(Tokenizer)│ │ │
└──────────┘ └──────────┘ └──────────┘
│
v
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Code │<───│Optimizer │<───│ Semantic │
│Generator │ │ │ │ Analysis │
└──────────┘ └──────────┘ └──────────┘
│
v
┌──────────┐
│ Target │
│ Code │
└──────────┘
Role of each stage:
1. Lexer: Source code -> Token stream
2. Parser: Token stream -> AST (Abstract Syntax Tree)
3. Semantic Analysis: Type checking, scope resolution
4. Optimizer: Optimization at IR level
5. Code Generator: Generate target code (machine code, bytecode, WASM, etc.)
2. Lexer Implementation
The Lexer (Tokenizer) transforms source code strings into a stream of Tokens.
2.1 Token Type Definition
// Token type enumeration
enum TokenType {
// Literals
Number = 'Number',
String = 'String',
Boolean = 'Boolean',
// Identifiers & Keywords
Identifier = 'Identifier',
Let = 'Let',
Const = 'Const',
Function = 'Function',
If = 'If',
Else = 'Else',
Return = 'Return',
While = 'While',
For = 'For',
// Operators
Plus = 'Plus', // +
Minus = 'Minus', // -
Star = 'Star', // *
Slash = 'Slash', // /
Percent = 'Percent', // %
Assign = 'Assign', // =
Equal = 'Equal', // ==
NotEqual = 'NotEqual', // !=
LessThan = 'LessThan', // <
GreaterThan = 'GreaterThan', // >
And = 'And', // &&
Or = 'Or', // ||
Not = 'Not', // !
// Delimiters
LeftParen = 'LeftParen', // (
RightParen = 'RightParen', // )
LeftBrace = 'LeftBrace', // { (inside code blocks only)
RightBrace = 'RightBrace', // } (inside code blocks only)
LeftBracket = 'LeftBracket', // [
RightBracket = 'RightBracket', // ]
Comma = 'Comma',
Semicolon = 'Semicolon',
Colon = 'Colon',
Dot = 'Dot',
Arrow = 'Arrow', // =>
// Special
EOF = 'EOF',
}
// Token interface
interface Token {
type: TokenType;
value: string;
line: number;
column: number;
}
2.2 Lexer Class Implementation
class Lexer {
private source: string;
private pos: number = 0;
private line: number = 1;
private column: number = 1;
private tokens: Token[] = [];
// Keyword map
private static keywords: Map<string, TokenType> = new Map([
['let', TokenType.Let],
['const', TokenType.Const],
['function', TokenType.Function],
['if', TokenType.If],
['else', TokenType.Else],
['return', TokenType.Return],
['while', TokenType.While],
['for', TokenType.For],
['true', TokenType.Boolean],
['false', TokenType.Boolean],
]);
constructor(source: string) {
this.source = source;
}
tokenize(): Token[] {
while (this.pos < this.source.length) {
this.skipWhitespace();
this.skipComments();
if (this.pos >= this.source.length) break;
const char = this.source[this.pos];
if (this.isDigit(char)) {
this.readNumber();
} else if (this.isAlpha(char) || char === '_') {
this.readIdentifier();
} else if (char === '"' || char === "'") {
this.readString(char);
} else {
this.readOperator();
}
}
this.tokens.push(this.makeToken(TokenType.EOF, ''));
return this.tokens;
}
private readNumber(): void {
const start = this.pos;
let isFloat = false;
while (this.pos < this.source.length && this.isDigit(this.current())) {
this.advance();
}
// Handle decimal point
if (this.current() === '.' && this.isDigit(this.peek())) {
isFloat = true;
this.advance(); // Skip '.'
while (this.pos < this.source.length && this.isDigit(this.current())) {
this.advance();
}
}
const value = this.source.slice(start, this.pos);
this.tokens.push(this.makeToken(TokenType.Number, value));
}
private readIdentifier(): void {
const start = this.pos;
while (
this.pos < this.source.length &&
(this.isAlphaNumeric(this.current()) || this.current() === '_')
) {
this.advance();
}
const value = this.source.slice(start, this.pos);
const type = Lexer.keywords.get(value) || TokenType.Identifier;
this.tokens.push(this.makeToken(type, value));
}
private readString(quote: string): void {
this.advance(); // Skip opening quote
const start = this.pos;
let value = '';
while (this.pos < this.source.length && this.current() !== quote) {
if (this.current() === '\\') {
this.advance();
switch (this.current()) {
case 'n': value += '\n'; break;
case 't': value += '\t'; break;
case '\\': value += '\\'; break;
default: value += this.current();
}
} else {
value += this.current();
}
this.advance();
}
if (this.current() !== quote) {
throw new Error(
'Unterminated string at line ' + this.line
);
}
this.advance(); // Skip closing quote
this.tokens.push(this.makeToken(TokenType.String, value));
}
private readOperator(): void {
const char = this.current();
const next = this.peek();
// Check two-character operators
const twoChar = char + (next || '');
const twoCharOps: Record<string, TokenType> = {
'==': TokenType.Equal,
'!=': TokenType.NotEqual,
'&&': TokenType.And,
'||': TokenType.Or,
'=>': TokenType.Arrow,
};
if (twoCharOps[twoChar]) {
this.tokens.push(this.makeToken(twoCharOps[twoChar], twoChar));
this.advance();
this.advance();
return;
}
// Single-character operators
const oneCharOps: Record<string, TokenType> = {
'+': TokenType.Plus,
'-': TokenType.Minus,
'*': TokenType.Star,
'/': TokenType.Slash,
'%': TokenType.Percent,
'=': TokenType.Assign,
'<': TokenType.LessThan,
'>': TokenType.GreaterThan,
'!': TokenType.Not,
'(': TokenType.LeftParen,
')': TokenType.RightParen,
'[': TokenType.LeftBracket,
']': TokenType.RightBracket,
',': TokenType.Comma,
';': TokenType.Semicolon,
':': TokenType.Colon,
'.': TokenType.Dot,
};
// Handle braces
if (char === '{') {
this.tokens.push(this.makeToken(TokenType.LeftBrace, char));
this.advance();
return;
}
if (char === '}') {
this.tokens.push(this.makeToken(TokenType.RightBrace, char));
this.advance();
return;
}
if (oneCharOps[char]) {
this.tokens.push(this.makeToken(oneCharOps[char], char));
this.advance();
} else {
throw new Error(
'Unexpected character: ' + char + ' at line ' + this.line
);
}
}
// Utility methods
private current(): string { return this.source[this.pos]; }
private peek(): string { return this.source[this.pos + 1]; }
private advance(): void {
if (this.current() === '\n') {
this.line++;
this.column = 1;
} else {
this.column++;
}
this.pos++;
}
private isDigit(c: string): boolean { return /[0-9]/.test(c); }
private isAlpha(c: string): boolean { return /[a-zA-Z_]/.test(c); }
private isAlphaNumeric(c: string): boolean { return /[a-zA-Z0-9_]/.test(c); }
private skipWhitespace(): void {
while (this.pos < this.source.length && /\s/.test(this.current())) {
this.advance();
}
}
private skipComments(): void {
if (this.current() === '/' && this.peek() === '/') {
while (this.pos < this.source.length && this.current() !== '\n') {
this.advance();
}
}
}
private makeToken(type: TokenType, value: string): Token {
return { type, value, line: this.line, column: this.column };
}
}
2.3 Error Recovery
// Lexer with error recovery
class RobustLexer extends Lexer {
private errors: LexerError[] = [];
tokenize(): Token[] {
while (this.pos < this.source.length) {
try {
this.skipWhitespace();
if (this.pos >= this.source.length) break;
// ... normal tokenization logic
} catch (error) {
// Record error and advance to next valid position
this.errors.push({
message: error.message,
line: this.line,
column: this.column,
});
this.synchronize(); // Skip to next valid token start
}
}
return this.tokens;
}
private synchronize(): void {
// Skip to next whitespace or delimiter
while (
this.pos < this.source.length &&
!/[\s;,(){}[\]]/.test(this.current())
) {
this.advance();
}
}
}
3. Parser Implementation
The Parser transforms a token stream into an AST (Abstract Syntax Tree).
3.1 Parser Type Comparison
Parser Type Comparison:
┌─────────────────────┬────────────┬────────────┬──────────────┐
│ Type │ Difficulty │ Expressive │ Notable Tool │
├─────────────────────┼────────────┼────────────┼──────────────┤
│ Recursive Descent │ Easy │ LL(k) │ GCC, Go │
│ Pratt Parser │ Medium │ Precedence │ V8, Rust │
│ PEG │ Medium │ Unlimited │ pest, PEG.js │
│ Parser Combinators │ Medium │ Flexible │ nom, parsec │
│ LR/LALR │ Hard │ LR(1) │ yacc, bison │
│ Earley │ Medium │ All CFGs │ nearley │
└─────────────────────┴────────────┴────────────┴──────────────┘
3.2 Recursive Descent Parser
// AST Node type definition
type ASTNode =
| NumberLiteral
| StringLiteral
| BooleanLiteral
| Identifier
| BinaryExpression
| UnaryExpression
| CallExpression
| VariableDeclaration
| FunctionDeclaration
| IfStatement
| WhileStatement
| ReturnStatement
| BlockStatement
| Program;
interface NumberLiteral {
type: 'NumberLiteral';
value: number;
}
interface BinaryExpression {
type: 'BinaryExpression';
operator: string;
left: ASTNode;
right: ASTNode;
}
interface VariableDeclaration {
type: 'VariableDeclaration';
kind: 'let' | 'const';
name: string;
initializer: ASTNode | null;
}
interface FunctionDeclaration {
type: 'FunctionDeclaration';
name: string;
params: string[];
body: BlockStatement;
}
interface IfStatement {
type: 'IfStatement';
condition: ASTNode;
consequent: BlockStatement;
alternate: BlockStatement | IfStatement | null;
}
interface Program {
type: 'Program';
body: ASTNode[];
}
// Recursive Descent Parser
class RecursiveDescentParser {
private tokens: Token[];
private pos: number = 0;
constructor(tokens: Token[]) {
this.tokens = tokens;
}
parse(): Program {
const body: ASTNode[] = [];
while (!this.isAtEnd()) {
body.push(this.parseStatement());
}
return { type: 'Program', body };
}
private parseStatement(): ASTNode {
const token = this.current();
switch (token.type) {
case TokenType.Let:
case TokenType.Const:
return this.parseVariableDeclaration();
case TokenType.Function:
return this.parseFunctionDeclaration();
case TokenType.If:
return this.parseIfStatement();
case TokenType.While:
return this.parseWhileStatement();
case TokenType.Return:
return this.parseReturnStatement();
case TokenType.LeftBrace:
return this.parseBlockStatement();
default:
return this.parseExpressionStatement();
}
}
private parseVariableDeclaration(): VariableDeclaration {
const kind = this.consume().value as 'let' | 'const';
const name = this.expect(TokenType.Identifier).value;
let initializer: ASTNode | null = null;
if (this.match(TokenType.Assign)) {
initializer = this.parseExpression();
}
this.expect(TokenType.Semicolon);
return { type: 'VariableDeclaration', kind, name, initializer };
}
private parseFunctionDeclaration(): FunctionDeclaration {
this.expect(TokenType.Function);
const name = this.expect(TokenType.Identifier).value;
this.expect(TokenType.LeftParen);
const params: string[] = [];
if (!this.check(TokenType.RightParen)) {
do {
params.push(this.expect(TokenType.Identifier).value);
} while (this.match(TokenType.Comma));
}
this.expect(TokenType.RightParen);
const body = this.parseBlockStatement();
return { type: 'FunctionDeclaration', name, params, body };
}
private parseIfStatement(): IfStatement {
this.expect(TokenType.If);
this.expect(TokenType.LeftParen);
const condition = this.parseExpression();
this.expect(TokenType.RightParen);
const consequent = this.parseBlockStatement();
let alternate: BlockStatement | IfStatement | null = null;
if (this.match(TokenType.Else)) {
if (this.check(TokenType.If)) {
alternate = this.parseIfStatement();
} else {
alternate = this.parseBlockStatement();
}
}
return { type: 'IfStatement', condition, consequent, alternate };
}
// Utility methods
private current(): Token { return this.tokens[this.pos]; }
private consume(): Token { return this.tokens[this.pos++]; }
private isAtEnd(): boolean { return this.current().type === TokenType.EOF; }
private check(type: TokenType): boolean { return this.current().type === type; }
private match(type: TokenType): boolean {
if (this.check(type)) { this.consume(); return true; }
return false;
}
private expect(type: TokenType): Token {
if (this.check(type)) return this.consume();
throw new Error(
'Expected ' + type + ' but got ' + this.current().type +
' at line ' + this.current().line
);
}
}
3.3 Pratt Parser (Operator Precedence Parser)
The Pratt Parser elegantly handles operator precedence. It is used in many modern language implementations including V8 (JavaScript) and the Rust compiler.
// Operator precedence definition
enum Precedence {
None = 0,
Assignment = 1, // =
Or = 2, // ||
And = 3, // &&
Equality = 4, // == !=
Comparison = 5, // < > (including less-than-or-equal, greater-than-or-equal)
Term = 6, // + -
Factor = 7, // * / %
Unary = 8, // ! -
Call = 9, // ()
Member = 10, // .
}
// Prefix parser function type (numbers, identifiers, parentheses, unary operators)
type PrefixParseFn = () => ASTNode;
// Infix parser function type (binary operators, function calls)
type InfixParseFn = (left: ASTNode) => ASTNode;
class PrattParser {
private tokens: Token[];
private pos: number = 0;
// Prefix parser registry
private prefixParsers: Map<TokenType, PrefixParseFn> = new Map();
// Infix parser registry
private infixParsers: Map<TokenType, InfixParseFn> = new Map();
// Precedence registry
private precedences: Map<TokenType, Precedence> = new Map();
constructor(tokens: Token[]) {
this.tokens = tokens;
// Register prefix parsers
this.registerPrefix(TokenType.Number, this.parseNumber.bind(this));
this.registerPrefix(TokenType.String, this.parseString.bind(this));
this.registerPrefix(TokenType.Boolean, this.parseBoolean.bind(this));
this.registerPrefix(TokenType.Identifier, this.parseIdentifier.bind(this));
this.registerPrefix(TokenType.LeftParen, this.parseGrouping.bind(this));
this.registerPrefix(TokenType.Minus, this.parseUnary.bind(this));
this.registerPrefix(TokenType.Not, this.parseUnary.bind(this));
// Register infix parsers
this.registerInfix(TokenType.Plus, Precedence.Term, this.parseBinary.bind(this));
this.registerInfix(TokenType.Minus, Precedence.Term, this.parseBinary.bind(this));
this.registerInfix(TokenType.Star, Precedence.Factor, this.parseBinary.bind(this));
this.registerInfix(TokenType.Slash, Precedence.Factor, this.parseBinary.bind(this));
this.registerInfix(TokenType.Equal, Precedence.Equality, this.parseBinary.bind(this));
this.registerInfix(TokenType.NotEqual, Precedence.Equality, this.parseBinary.bind(this));
this.registerInfix(TokenType.LessThan, Precedence.Comparison, this.parseBinary.bind(this));
this.registerInfix(TokenType.GreaterThan, Precedence.Comparison, this.parseBinary.bind(this));
this.registerInfix(TokenType.And, Precedence.And, this.parseBinary.bind(this));
this.registerInfix(TokenType.Or, Precedence.Or, this.parseBinary.bind(this));
this.registerInfix(TokenType.LeftParen, Precedence.Call, this.parseCall.bind(this));
}
// Core: Pratt parsing algorithm
parseExpression(precedence: Precedence = Precedence.None): ASTNode {
const token = this.current();
const prefixFn = this.prefixParsers.get(token.type);
if (!prefixFn) {
throw new Error('No prefix parser for ' + token.type);
}
let left = prefixFn();
// Continue parsing if higher precedence infix operator exists
while (
!this.isAtEnd() &&
precedence < this.getPrecedence(this.current().type)
) {
const infixFn = this.infixParsers.get(this.current().type);
if (!infixFn) break;
left = infixFn(left);
}
return left;
}
// Binary operator parsing
private parseBinary(left: ASTNode): ASTNode {
const operator = this.consume();
const precedence = this.precedences.get(operator.type) || Precedence.None;
const right = this.parseExpression(precedence); // Left-associative
return {
type: 'BinaryExpression',
operator: operator.value,
left,
right,
};
}
// Function call parsing
private parseCall(left: ASTNode): ASTNode {
this.consume(); // Consume '('
const args: ASTNode[] = [];
if (!this.check(TokenType.RightParen)) {
do {
args.push(this.parseExpression());
} while (this.match(TokenType.Comma));
}
this.expect(TokenType.RightParen);
return {
type: 'CallExpression',
callee: left,
arguments: args,
} as CallExpression;
}
// Utility methods
private registerPrefix(type: TokenType, fn: PrefixParseFn): void {
this.prefixParsers.set(type, fn);
}
private registerInfix(type: TokenType, prec: Precedence, fn: InfixParseFn): void {
this.infixParsers.set(type, fn);
this.precedences.set(type, prec);
}
private getPrecedence(type: TokenType): Precedence {
return this.precedences.get(type) || Precedence.None;
}
private parseNumber(): ASTNode {
const token = this.consume();
return { type: 'NumberLiteral', value: parseFloat(token.value) };
}
private parseString(): ASTNode {
const token = this.consume();
return { type: 'StringLiteral', value: token.value } as StringLiteral;
}
private parseBoolean(): ASTNode {
const token = this.consume();
return { type: 'BooleanLiteral', value: token.value === 'true' } as BooleanLiteral;
}
private parseIdentifier(): ASTNode {
const token = this.consume();
return { type: 'Identifier', name: token.value } as Identifier;
}
private parseGrouping(): ASTNode {
this.consume(); // Consume '('
const expr = this.parseExpression();
this.expect(TokenType.RightParen);
return expr;
}
private parseUnary(): ASTNode {
const operator = this.consume();
const operand = this.parseExpression(Precedence.Unary);
return {
type: 'UnaryExpression',
operator: operator.value,
operand,
} as UnaryExpression;
}
// Common utilities
private current(): Token { return this.tokens[this.pos]; }
private consume(): Token { return this.tokens[this.pos++]; }
private isAtEnd(): boolean { return this.current().type === TokenType.EOF; }
private check(type: TokenType): boolean { return this.current().type === type; }
private match(type: TokenType): boolean {
if (this.check(type)) { this.consume(); return true; }
return false;
}
private expect(type: TokenType): Token {
if (this.check(type)) return this.consume();
throw new Error('Expected ' + type + ' but got ' + this.current().type);
}
}
4. AST Design and Visitor Pattern
4.1 AST Node Design Principles
// Each AST node includes location information
interface ASTNodeBase {
type: string;
loc: SourceLocation;
}
interface SourceLocation {
start: Position;
end: Position;
source?: string;
}
interface Position {
line: number;
column: number;
offset: number;
}
4.2 Visitor Pattern
// Visitor interface
interface ASTVisitor<T> {
visitProgram(node: Program): T;
visitNumberLiteral(node: NumberLiteral): T;
visitStringLiteral(node: StringLiteral): T;
visitBinaryExpression(node: BinaryExpression): T;
visitUnaryExpression(node: UnaryExpression): T;
visitIdentifier(node: Identifier): T;
visitVariableDeclaration(node: VariableDeclaration): T;
visitFunctionDeclaration(node: FunctionDeclaration): T;
visitCallExpression(node: CallExpression): T;
visitIfStatement(node: IfStatement): T;
visitReturnStatement(node: ReturnStatement): T;
visitBlockStatement(node: BlockStatement): T;
}
// AST Printer (for debugging)
class ASTPrinter implements ASTVisitor<string> {
private indent: number = 0;
visitProgram(node: Program): string {
return node.body.map((stmt) => this.visit(stmt)).join('\n');
}
visitBinaryExpression(node: BinaryExpression): string {
return (
'(' +
this.visit(node.left) +
' ' + node.operator + ' ' +
this.visit(node.right) +
')'
);
}
visitNumberLiteral(node: NumberLiteral): string {
return node.value.toString();
}
visitIdentifier(node: Identifier): string {
return node.name;
}
// Generic visit dispatcher
visit(node: ASTNode): string {
switch (node.type) {
case 'Program': return this.visitProgram(node as Program);
case 'NumberLiteral': return this.visitNumberLiteral(node as NumberLiteral);
case 'BinaryExpression': return this.visitBinaryExpression(node as BinaryExpression);
case 'Identifier': return this.visitIdentifier(node as Identifier);
default: return 'Unknown: ' + node.type;
}
}
}
5. Semantic Analysis
5.1 Symbol Table and Scope
// Symbol information
interface Symbol {
name: string;
type: DataType;
kind: 'variable' | 'function' | 'parameter';
mutable: boolean;
scope: Scope;
}
// Scope (lexical scope chain)
class Scope {
private symbols: Map<string, Symbol> = new Map();
private parent: Scope | null;
constructor(parent: Scope | null = null) {
this.parent = parent;
}
define(symbol: Symbol): void {
if (this.symbols.has(symbol.name)) {
throw new Error('Symbol already defined: ' + symbol.name);
}
this.symbols.set(symbol.name, symbol);
}
resolve(name: string): Symbol | null {
if (this.symbols.has(name)) {
return this.symbols.get(name)!;
}
// Search in parent scope
if (this.parent) {
return this.parent.resolve(name);
}
return null;
}
createChild(): Scope {
return new Scope(this);
}
}
5.2 Type Checking
// Type system
type DataType =
| { kind: 'number' }
| { kind: 'string' }
| { kind: 'boolean' }
| { kind: 'void' }
| { kind: 'function'; params: DataType[]; returnType: DataType }
| { kind: 'array'; elementType: DataType };
class TypeChecker implements ASTVisitor<DataType> {
private currentScope: Scope = new Scope();
visitBinaryExpression(node: BinaryExpression): DataType {
const leftType = this.visit(node.left);
const rightType = this.visit(node.right);
// Arithmetic operators
if (['+', '-', '*', '/', '%'].includes(node.operator)) {
if (node.operator === '+') {
// Allow string concatenation
if (leftType.kind === 'string' || rightType.kind === 'string') {
return { kind: 'string' };
}
}
if (leftType.kind !== 'number' || rightType.kind !== 'number') {
throw new TypeError(
'Operator ' + node.operator +
' requires number operands, got ' +
leftType.kind + ' and ' + rightType.kind
);
}
return { kind: 'number' };
}
// Comparison operators
if (['==', '!=', '<', '>'].includes(node.operator)) {
if (leftType.kind !== rightType.kind) {
throw new TypeError(
'Cannot compare ' + leftType.kind + ' with ' + rightType.kind
);
}
return { kind: 'boolean' };
}
// Logical operators
if (['&&', '||'].includes(node.operator)) {
if (leftType.kind !== 'boolean' || rightType.kind !== 'boolean') {
throw new TypeError('Logical operators require boolean operands');
}
return { kind: 'boolean' };
}
throw new TypeError('Unknown operator: ' + node.operator);
}
visit(node: ASTNode): DataType {
switch (node.type) {
case 'NumberLiteral': return { kind: 'number' };
case 'StringLiteral': return { kind: 'string' };
case 'BooleanLiteral': return { kind: 'boolean' };
case 'BinaryExpression': return this.visitBinaryExpression(node as BinaryExpression);
default: return { kind: 'void' };
}
}
}
6. Intermediate Representation (IR)
6.1 Three-Address Code
// Three-Address Code
// Each instruction has at most 3 addresses (operands)
type IRInstruction =
| { op: 'CONST'; dest: string; value: number }
| { op: 'ADD' | 'SUB' | 'MUL' | 'DIV'; dest: string; left: string; right: string }
| { op: 'COPY'; dest: string; src: string }
| { op: 'LABEL'; name: string }
| { op: 'JUMP'; target: string }
| { op: 'JUMP_IF_FALSE'; condition: string; target: string }
| { op: 'CALL'; dest: string; fn: string; args: string[] }
| { op: 'RETURN'; value: string | null };
// AST -> IR conversion
class IRGenerator implements ASTVisitor<string> {
private instructions: IRInstruction[] = [];
private tempCounter: number = 0;
private labelCounter: number = 0;
private newTemp(): string {
return '_t' + (this.tempCounter++);
}
private newLabel(): string {
return '_L' + (this.labelCounter++);
}
private emit(instruction: IRInstruction): void {
this.instructions.push(instruction);
}
visitBinaryExpression(node: BinaryExpression): string {
const left = this.visit(node.left);
const right = this.visit(node.right);
const dest = this.newTemp();
const opMap: Record<string, string> = {
'+': 'ADD', '-': 'SUB', '*': 'MUL', '/': 'DIV',
};
this.emit({
op: opMap[node.operator] as 'ADD',
dest,
left,
right,
});
return dest;
}
visitIfStatement(node: IfStatement): string {
const condition = this.visit(node.condition);
const elseLabel = this.newLabel();
const endLabel = this.newLabel();
this.emit({ op: 'JUMP_IF_FALSE', condition, target: elseLabel });
// then block
this.visit(node.consequent as ASTNode);
this.emit({ op: 'JUMP', target: endLabel });
// else block
this.emit({ op: 'LABEL', name: elseLabel });
if (node.alternate) {
this.visit(node.alternate as ASTNode);
}
this.emit({ op: 'LABEL', name: endLabel });
return '';
}
visit(node: ASTNode): string {
switch (node.type) {
case 'NumberLiteral': {
const dest = this.newTemp();
this.emit({ op: 'CONST', dest, value: (node as NumberLiteral).value });
return dest;
}
case 'BinaryExpression':
return this.visitBinaryExpression(node as BinaryExpression);
case 'IfStatement':
return this.visitIfStatement(node as IfStatement);
default:
return '';
}
}
getInstructions(): IRInstruction[] {
return this.instructions;
}
}
6.2 SSA Form and LLVM IR
LLVM IR Basic Structure:
; Function definition
define i32 @add(i32 %a, i32 %b) {
entry:
%sum = add i32 %a, %b
ret i32 %sum
}
; if-else structure
define i32 @max(i32 %a, i32 %b) {
entry:
%cmp = icmp sgt i32 %a, %b
br i1 %cmp, label %then, label %else
then:
ret i32 %a
else:
ret i32 %b
}
; Loop structure
define i32 @sum_to_n(i32 %n) {
entry:
br label %loop
loop:
%i = phi i32 [ 0, %entry ], [ %next_i, %loop ]
%sum = phi i32 [ 0, %entry ], [ %next_sum, %loop ]
%next_sum = add i32 %sum, %i
%next_i = add i32 %i, 1
%cond = icmp slt i32 %next_i, %n
br i1 %cond, label %loop, label %exit
exit:
ret i32 %next_sum
}
SSA (Static Single Assignment) Characteristics:
- Each variable is assigned exactly once
- Phi nodes handle control flow join points
- Facilitates optimization passes
7. Optimization Passes
7.1 Constant Folding
// Constant Folding: Evaluate constant expressions at compile time
class ConstantFolding {
optimize(node: ASTNode): ASTNode {
if (node.type === 'BinaryExpression') {
const binary = node as BinaryExpression;
const left = this.optimize(binary.left);
const right = this.optimize(binary.right);
// If both sides are constants, compute at compile time
if (left.type === 'NumberLiteral' && right.type === 'NumberLiteral') {
const leftVal = (left as NumberLiteral).value;
const rightVal = (right as NumberLiteral).value;
let result: number;
switch (binary.operator) {
case '+': result = leftVal + rightVal; break;
case '-': result = leftVal - rightVal; break;
case '*': result = leftVal * rightVal; break;
case '/': result = leftVal / rightVal; break;
default: return { ...binary, left, right };
}
return { type: 'NumberLiteral', value: result };
}
return { ...binary, left, right };
}
return node;
}
}
// Example: 2 + 3 * 4 -> 14
7.2 Dead Code Elimination
// Dead Code Elimination
class DeadCodeElimination {
optimize(program: Program): Program {
const usedVariables = this.collectUsedVariables(program);
const optimizedBody = program.body.filter((stmt) => {
if (stmt.type === 'VariableDeclaration') {
const decl = stmt as VariableDeclaration;
// Remove if unused and has no side effects
if (!usedVariables.has(decl.name) && !this.hasSideEffects(decl.initializer)) {
return false;
}
}
return true;
});
return { ...program, body: optimizedBody };
}
private collectUsedVariables(program: Program): Set<string> {
const used = new Set<string>();
function walk(node: ASTNode): void {
if (node.type === 'Identifier') {
used.add((node as Identifier).name);
}
for (const child of getChildren(node)) {
walk(child);
}
}
program.body.forEach(walk);
return used;
}
private hasSideEffects(node: ASTNode | null): boolean {
if (!node) return false;
return node.type === 'CallExpression';
}
}
7.3 Loop Optimization
Loop Optimization Techniques:
1. Loop Invariant Code Motion (LICM)
// Before // After
for (i = 0; i < n; i++) x = a + b; // Moved outside loop
arr[i] = a + b + i; for (i = 0; i < n; i++)
arr[i] = x + i;
2. Loop Unrolling
// Before // After
for (i = 0; i < 4; i++) arr[0] = 0;
arr[i] = 0; arr[1] = 0;
arr[2] = 0;
arr[3] = 0;
3. Strength Reduction
// Before // After
for (i = 0; i < n; i++) t = 0;
arr[i] = i * 4; for (i = 0; i < n; i++)
arr[i] = t;
t = t + 4; // Multiply -> Addition
8. Code Generation
8.1 WebAssembly Code Generation
// AST -> WebAssembly Text Format (WAT) code generation
class WasmCodeGenerator {
private output: string[] = [];
generate(program: Program): string {
this.output.push('(module');
for (const stmt of program.body) {
if (stmt.type === 'FunctionDeclaration') {
this.generateFunction(stmt as FunctionDeclaration);
}
}
this.output.push(')');
return this.output.join('\n');
}
private generateFunction(node: FunctionDeclaration): void {
const params = node.params.map((p) => '(param $' + p + ' i32)').join(' ');
this.output.push(' (func $' + node.name + ' ' + params + ' (result i32)');
this.output.push(' (local $__result i32)');
for (const stmt of node.body.body) {
this.generateStatement(stmt);
}
this.output.push(' )');
this.output.push(' (export "' + node.name + '" (func $' + node.name + '))');
}
private generateExpression(node: ASTNode): void {
switch (node.type) {
case 'NumberLiteral':
this.output.push(' i32.const ' + (node as NumberLiteral).value);
break;
case 'Identifier':
this.output.push(' local.get $' + (node as Identifier).name);
break;
case 'BinaryExpression': {
const binary = node as BinaryExpression;
this.generateExpression(binary.left);
this.generateExpression(binary.right);
const opMap: Record<string, string> = {
'+': 'i32.add',
'-': 'i32.sub',
'*': 'i32.mul',
'/': 'i32.div_s',
'<': 'i32.lt_s',
'>': 'i32.gt_s',
'==': 'i32.eq',
'!=': 'i32.ne',
};
this.output.push(' ' + opMap[binary.operator]);
break;
}
}
}
private generateStatement(node: ASTNode): void {
if (node.type === 'ReturnStatement') {
const ret = node as ReturnStatement;
if (ret.value) {
this.generateExpression(ret.value);
}
this.output.push(' return');
}
}
}
8.2 Bytecode Generation
// Stack-based bytecode
enum OpCode {
CONST = 0x01, // Push constant
ADD = 0x02, // Addition
SUB = 0x03, // Subtraction
MUL = 0x04, // Multiplication
DIV = 0x05, // Division
LOAD = 0x10, // Load variable
STORE = 0x11, // Store variable
JUMP = 0x20, // Unconditional jump
JUMP_IF_FALSE = 0x21,// Conditional jump
CALL = 0x30, // Function call
RETURN = 0x31, // Return
PRINT = 0x40, // Print
HALT = 0xFF, // Halt
}
class BytecodeCompiler {
private bytecode: number[] = [];
private constants: any[] = [];
private locals: Map<string, number> = new Map();
compile(program: Program): Uint8Array {
for (const stmt of program.body) {
this.compileNode(stmt);
}
this.emit(OpCode.HALT);
return new Uint8Array(this.bytecode);
}
private compileNode(node: ASTNode): void {
switch (node.type) {
case 'NumberLiteral': {
const index = this.addConstant((node as NumberLiteral).value);
this.emit(OpCode.CONST);
this.emit(index);
break;
}
case 'BinaryExpression': {
const binary = node as BinaryExpression;
this.compileNode(binary.left);
this.compileNode(binary.right);
switch (binary.operator) {
case '+': this.emit(OpCode.ADD); break;
case '-': this.emit(OpCode.SUB); break;
case '*': this.emit(OpCode.MUL); break;
case '/': this.emit(OpCode.DIV); break;
}
break;
}
case 'VariableDeclaration': {
const decl = node as VariableDeclaration;
const slot = this.locals.size;
this.locals.set(decl.name, slot);
if (decl.initializer) {
this.compileNode(decl.initializer);
this.emit(OpCode.STORE);
this.emit(slot);
}
break;
}
case 'Identifier': {
const id = node as Identifier;
const slot = this.locals.get(id.name);
if (slot === undefined) {
throw new Error('Undefined variable: ' + id.name);
}
this.emit(OpCode.LOAD);
this.emit(slot);
break;
}
}
}
private emit(byte: number): void {
this.bytecode.push(byte);
}
private addConstant(value: any): number {
this.constants.push(value);
return this.constants.length - 1;
}
}
9. Interpreter Design
9.1 Tree-Walking Interpreter
// Tree-Walking Interpreter
class TreeWalkInterpreter {
private environment: Environment;
constructor() {
this.environment = new Environment();
this.environment.define('print', (args: any[]) => {
console.log(...args);
return null;
});
}
interpret(program: Program): any {
let result: any = null;
for (const stmt of program.body) {
result = this.execute(stmt);
}
return result;
}
private execute(node: ASTNode): any {
switch (node.type) {
case 'NumberLiteral':
return (node as NumberLiteral).value;
case 'StringLiteral':
return (node as StringLiteral).value;
case 'BooleanLiteral':
return (node as BooleanLiteral).value;
case 'Identifier':
return this.environment.get((node as Identifier).name);
case 'BinaryExpression': {
const binary = node as BinaryExpression;
const left = this.execute(binary.left);
const right = this.execute(binary.right);
switch (binary.operator) {
case '+': return left + right;
case '-': return left - right;
case '*': return left * right;
case '/': return left / right;
case '%': return left % right;
case '==': return left === right;
case '!=': return left !== right;
case '<': return left < right;
case '>': return left > right;
case '&&': return left && right;
case '||': return left || right;
default: throw new Error('Unknown operator: ' + binary.operator);
}
}
case 'FunctionDeclaration': {
const fn = node as FunctionDeclaration;
const closure = this.environment;
const callable = (args: any[]) => {
const fnEnv = new Environment(closure);
fn.params.forEach((param, i) => {
fnEnv.define(param, args[i]);
});
const prevEnv = this.environment;
this.environment = fnEnv;
try {
for (const stmt of fn.body.body) {
const result = this.execute(stmt);
if (result && result.__return__) {
return result.value;
}
}
} finally {
this.environment = prevEnv;
}
return null;
};
this.environment.define(fn.name, callable);
return null;
}
case 'CallExpression': {
const call = node as CallExpression;
const fn = this.execute(call.callee);
const args = call.arguments.map((arg: ASTNode) => this.execute(arg));
return fn(args);
}
case 'IfStatement': {
const ifStmt = node as IfStatement;
const condition = this.execute(ifStmt.condition);
if (condition) {
return this.executeBlock(ifStmt.consequent);
} else if (ifStmt.alternate) {
if (ifStmt.alternate.type === 'BlockStatement') {
return this.executeBlock(ifStmt.alternate as BlockStatement);
}
return this.execute(ifStmt.alternate);
}
return null;
}
case 'ReturnStatement': {
const ret = node as ReturnStatement;
return {
__return__: true,
value: ret.value ? this.execute(ret.value) : null,
};
}
default:
throw new Error('Unknown node type: ' + node.type);
}
}
private executeBlock(block: BlockStatement): any {
const prevEnv = this.environment;
this.environment = new Environment(prevEnv);
try {
for (const stmt of block.body) {
const result = this.execute(stmt);
if (result && result.__return__) return result;
}
} finally {
this.environment = prevEnv;
}
return null;
}
}
// Environment - manages variable bindings
class Environment {
private bindings: Map<string, any> = new Map();
private parent: Environment | null;
constructor(parent: Environment | null = null) {
this.parent = parent;
}
define(name: string, value: any): void {
this.bindings.set(name, value);
}
get(name: string): any {
if (this.bindings.has(name)) {
return this.bindings.get(name);
}
if (this.parent) return this.parent.get(name);
throw new Error('Undefined variable: ' + name);
}
set(name: string, value: any): void {
if (this.bindings.has(name)) {
this.bindings.set(name, value);
return;
}
if (this.parent) { this.parent.set(name, value); return; }
throw new Error('Undefined variable: ' + name);
}
}
9.2 Bytecode VM
// Stack-based Bytecode VM
class VirtualMachine {
private stack: any[] = [];
private locals: any[] = new Array(256).fill(null);
private ip: number = 0; // Instruction pointer
execute(bytecode: Uint8Array, constants: any[]): any {
while (this.ip < bytecode.length) {
const opcode = bytecode[this.ip++];
switch (opcode) {
case OpCode.CONST: {
const index = bytecode[this.ip++];
this.stack.push(constants[index]);
break;
}
case OpCode.ADD: {
const b = this.stack.pop();
const a = this.stack.pop();
this.stack.push(a + b);
break;
}
case OpCode.SUB: {
const b = this.stack.pop();
const a = this.stack.pop();
this.stack.push(a - b);
break;
}
case OpCode.MUL: {
const b = this.stack.pop();
const a = this.stack.pop();
this.stack.push(a * b);
break;
}
case OpCode.DIV: {
const b = this.stack.pop();
const a = this.stack.pop();
this.stack.push(a / b);
break;
}
case OpCode.LOAD: {
const slot = bytecode[this.ip++];
this.stack.push(this.locals[slot]);
break;
}
case OpCode.STORE: {
const slot = bytecode[this.ip++];
this.locals[slot] = this.stack.pop();
break;
}
case OpCode.JUMP: {
const target = bytecode[this.ip++];
this.ip = target;
break;
}
case OpCode.JUMP_IF_FALSE: {
const target = bytecode[this.ip++];
const condition = this.stack.pop();
if (!condition) {
this.ip = target;
}
break;
}
case OpCode.PRINT: {
console.log(this.stack.pop());
break;
}
case OpCode.HALT:
return this.stack.length > 0 ? this.stack.pop() : null;
default:
throw new Error('Unknown opcode: 0x' + opcode.toString(16));
}
}
return this.stack.length > 0 ? this.stack.pop() : null;
}
}
10. Garbage Collection
10.1 GC Algorithm Comparison
GC Algorithm Comparison:
┌──────────────────────┬────────────┬─────────────┬────────────────┐
│ Algorithm │ Throughput │ Pause Time │ Memory Overhead│
├──────────────────────┼────────────┼─────────────┼────────────────┤
│ Reference Counting │ High │ Short │ Counter/object │
│ Mark-Sweep │ High │ Can be long │ Mark bits │
│ Mark-Compact │ Medium │ Can be long │ Mark bits │
│ Generational │ Very High │ Mostly short│ Card table │
│ Concurrent │ High │ Very short │ Write barriers │
└──────────────────────┴────────────┴─────────────┴────────────────┘
10.2 Mark-Sweep Implementation
// Simple Mark-Sweep GC
class GarbageCollector {
private heap: GCObject[] = [];
private roots: Set<GCObject> = new Set();
allocate(value: any): GCObject {
const obj: GCObject = {
value,
marked: false,
references: [],
};
this.heap.push(obj);
// Run GC if heap is too large
if (this.heap.length > 1000) {
this.collect();
}
return obj;
}
addRoot(obj: GCObject): void {
this.roots.add(obj);
}
collect(): void {
// Mark phase
this.markAll();
// Sweep phase
this.sweep();
}
private markAll(): void {
for (const root of this.roots) {
this.mark(root);
}
}
private mark(obj: GCObject): void {
if (obj.marked) return; // Already marked (cycle prevention)
obj.marked = true;
// Mark referenced objects
for (const ref of obj.references) {
this.mark(ref);
}
}
private sweep(): void {
this.heap = this.heap.filter((obj) => {
if (obj.marked) {
obj.marked = false; // Reset for next GC
return true;
}
// Unmarked objects are freed
return false;
});
}
}
interface GCObject {
value: any;
marked: boolean;
references: GCObject[];
}
11. Building a Mini Language
11.1 Full Pipeline Integration
// Mini Language "MiniLang" full pipeline
class MiniLang {
// Source code -> Execution result
static run(source: string): any {
// 1. Tokenize
const lexer = new Lexer(source);
const tokens = lexer.tokenize();
// 2. Parse
const parser = new RecursiveDescentParser(tokens);
const ast = parser.parse();
// 3. Type check
const typeChecker = new TypeChecker();
// typeChecker.check(ast); // Optional
// 4. Optimize
const optimizer = new ConstantFolding();
// const optimizedAst = optimizer.optimize(ast);
// 5. Execute
const interpreter = new TreeWalkInterpreter();
return interpreter.interpret(ast);
}
}
// Usage example
const source = `
let x = 10;
let y = 20;
function add(a, b) {
return a + b;
}
function factorial(n) {
if (n == 0) {
return 1;
}
return n * factorial(n - 1);
}
let result = add(x, y);
print(result);
print(factorial(5));
`;
MiniLang.run(source);
// Output: 30
// Output: 120
12. Modern Compiler Projects
12.1 Major Project Architectures
Modern Compiler/Transpiler Comparison:
Rust Compiler (rustc):
Source -> HIR -> MIR -> LLVM IR -> Machine Code
Features: Ownership system, borrow checker, monomorphization
V8 TurboFan (JavaScript):
Source -> AST -> Bytecode(Ignition) -> Optimized Code(TurboFan)
Features: Inline caches, speculative optimization, deoptimization
TypeScript Compiler (tsc):
Source -> Scanner -> Parser -> AST -> Binder -> Checker -> Emitter
Features: Structural type system, type inference, .d.ts generation
SWC (Speedy Web Compiler):
Source -> Lexer -> Parser -> AST -> Transform -> Codegen
Features: Written in Rust, 20-70x faster than Babel
esbuild:
Source -> Lexer -> Parser -> AST -> Linker -> Codegen
Features: Written in Go, extremely fast bundling
Zig Compiler:
Source -> AST -> ZIR -> AIR -> Machine Code/LLVM IR
Features: comptime execution, incremental compilation
12.2 Learning Resource Roadmap
Compiler Learning Roadmap:
Beginner:
├── Crafting Interpreters (Robert Nystrom)
│ - Tree-Walking interpreter (Java)
│ - Bytecode VM (C)
├── Writing An Interpreter In Go (Thorsten Ball)
└── Mini Lisp interpreter implementation
Intermediate:
├── Engineering a Compiler (Cooper & Torczon)
│ - Data flow analysis
│ - Register allocation
├── LLVM Tutorial: Kaleidoscope
│ - LLVM IR generation
│ - JIT compilation
└── WebAssembly compiler implementation
Advanced:
├── Compilers: Principles, Techniques, and Tools (Dragon Book)
├── Advanced Compiler Design and Implementation (Muchnick)
├── SSA-based Compiler Design (Springer)
└── Contributing to real open-source compilers
- Rust, V8, LLVM, GCC, etc.
13. Quiz
Test your knowledge with the quizzes below.
Q1. Explain the difference between Lexer and Parser roles, and what are their respective inputs and outputs?
A1.
Lexer (Tokenizer):
- Role: Transform source code strings into a stream of Tokens
- Input: Source code string (e.g.,
let x = 10 + 20;) - Output: Token array (e.g.,
[LET, IDENT("x"), ASSIGN, NUM(10), PLUS, NUM(20), SEMICOLON]) - Removes whitespace, comments and distinguishes keywords, identifiers, literals, operators.
Parser (Syntax Analyzer):
- Role: Transform token stream into AST (Abstract Syntax Tree)
- Input: Token array
- Output: AST (tree structure)
- Determines relationships and structure between tokens according to grammar rules. Syntax errors are also detected at this stage.
Q2. Why is the Pratt Parser advantageous over recursive descent for handling operator precedence?
A2. In a recursive descent parser, handling operator precedence requires creating separate parsing functions for each precedence level. For example, parseAddition(), parseMultiplication(), parseComparison() -- functions keep growing as precedence levels increase.
Pratt Parser advantages:
- Single function handles all precedence levels (
parseExpression(precedence)) - Precedence is expressed as numeric values making comparison simple
- Adding new operators only requires registering in a table
- Prefix and Infix are registered independently, allowing the same symbol to have different meanings (e.g.,
-as both unary and binary)
This approach is widely used in practice by V8 (JavaScript), the Rust compiler, and others.
Q3. Why is SSA (Static Single Assignment) form beneficial for compiler optimizations?
A3. In SSA form, each variable is assigned exactly once. This provides:
- Simplified def-use analysis: With only one definition point, tracking uses is straightforward
- Easier constant propagation: Safely propagate constant values since variables are assigned once
- Accurate dead code elimination: Precisely identify unused definitions
- Better register allocation: Clear variable lifetimes enable efficient allocation
- Control flow merging: Phi nodes clearly express values coming from different paths
Most modern compiler infrastructure including LLVM IR and GCC's GIMPLE are SSA-based.
Q4. What are the pros and cons of Tree-Walking interpreters vs Bytecode VM interpreters?
A4.
Tree-Walking Interpreter:
- Pros: Simple and intuitive implementation, easy debugging with direct AST traversal
- Cons: Slow due to node traversal overhead, poor cache efficiency, recursive call stack consumption
Bytecode VM Interpreter:
- Pros: 3-10x faster than tree-walking, compact bytecode is cache-friendly, serializable (bytecode can be saved)
- Cons: Complex implementation (requires both compiler and VM), harder to maintain debug info
Generally, tree-walking is used for prototyping and bytecode VM for production. CPython, Java JVM, and Lua all use the bytecode VM approach.
Q5. How does Mark-Sweep GC work, and how does Generational GC improve upon it?
A5.
Mark-Sweep GC:
- Mark phase: Starting from roots (globals, stack), mark all reachable objects
- Sweep phase: Scan entire heap and free unmarked objects
- Downside: Must scan the entire heap, so pauses get longer as the heap grows
Generational GC improvements:
- Based on the observation that "most objects die young"
- Young Generation: Collected frequently and quickly (Minor GC)
- Old Generation: Collected infrequently (Major GC)
- Only objects surviving Young are promoted to Old
- Write barriers track inter-generational references
- V8, JVM, Go and most modern runtimes use generational GC.
14. References
- Crafting Interpreters - Robert Nystrom's free online compiler textbook
- Writing An Interpreter In Go - Thorsten Ball's Go interpreter book
- LLVM Tutorial: Kaleidoscope - LLVM official tutorial
- WebAssembly Specification - WebAssembly official spec
- Simple but Powerful Pratt Parsing - Pratt Parser explanation
- The Rust Reference - Rust language reference
- V8 Blog - TurboFan - V8 TurboFan JIT compiler
- Engineering a Compiler - Compiler engineering textbook
- PEG.js - PEG parser generator
- Tree-sitter - Incremental parsing library
- Binaryen - WebAssembly compiler infrastructure
- The GC Handbook - Garbage Collection handbook
- SWC Project - Rust-based TypeScript/JavaScript compiler
현재 단락 (1/1523)
Compiler theory is one of the core fields of computer science. Understanding compilers gives you dee...