Split View: [컴파일러] 01. 컴파일러의 구조와 동작 원리

[컴파일러] 01. 컴파일러의 구조와 동작 원리

컴파일러의 구조와 동작 원리

컴파일러는 소스 프로그램(source program)을 입력받아 목적 프로그램(target program)으로 변환하는 프로그램입니다. 이 글에서는 컴파일러의 전체 구조를 이루는 각 단계를 하나씩 살펴보겠습니다.

1. 언어 처리기(Language Processor)의 종류

프로그래밍 언어를 처리하는 시스템에는 여러 형태가 있습니다.

컴파일러(Compiler): 소스 코드 전체를 목적 코드로 번역한 뒤 실행합니다.
인터프리터(Interpreter): 소스 코드를 한 문장씩 해석하며 바로 실행합니다.
하이브리드 방식: Java처럼 먼저 바이트코드로 컴파일한 뒤 JVM이 인터프리트하는 방식입니다.

컴파일러와 인터프리터의 핵심 차이를 정리하면 다음과 같습니다.

[컴파일러]
  소스 프로그램 --> [컴파일러] --> 목적 프로그램 --> [실행] --> 출력

[인터프리터]
  소스 프로그램 + 입력 --> [인터프리터] --> 출력

컴파일러는 번역 과정이 오래 걸리지만, 한번 컴파일된 목적 코드는 빠르게 실행됩니다. 인터프리터는 번역 없이 바로 실행할 수 있지만, 실행 속도가 상대적으로 느립니다.

2. 컴파일러의 전체 구조

컴파일러는 크게 분석(Analysis) 부분과 합성(Synthesis) 부분으로 나뉩니다.

              소스 프로그램
                   |
                   v
         +-------------------+
         |  어휘 분석         |  (Lexical Analysis)
         +-------------------+
                   |  토큰 스트림
                   v
         +-------------------+
         |  구문 분석         |  (Syntax Analysis)
         +-------------------+
                   |  구문 트리
                   v
         +-------------------+
         |  의미 분석         |  (Semantic Analysis)
         +-------------------+
                   |  주석 달린 구문 트리
                   v
         +-------------------+
         |  중간 코드 생성     |  (Intermediate Code Generation)
         +-------------------+
                   |  중간 표현
                   v
         +-------------------+
         |  코드 최적화       |  (Code Optimization)
         +-------------------+
                   |  최적화된 중간 표현
                   v
         +-------------------+
         |  코드 생성         |  (Code Generation)
         +-------------------+
                   |
                   v
              목적 프로그램

각 단계는 이전 단계의 출력을 입력으로 받아 처리합니다. 분석 부분(전단부, front end)은 소스 프로그램의 구조를 파악하고, 합성 부분(후단부, back end)은 목적 코드를 생성합니다.

3. 어휘 분석(Lexical Analysis)

어휘 분석기(lexer 또는 scanner)는 소스 코드의 문자 스트림을 읽어 토큰(token) 단위로 분리합니다.

예를 들어, 다음 C 코드를 보겠습니다.

position = initial + rate * 60;

어휘 분석기는 이 코드를 다음과 같은 토큰으로 분리합니다.

<id, "position">  <=>  <id, "initial">  <+>  <id, "rate">  <*>  <num, 60>  <;>

각 토큰은 토큰 이름과 속성값의 쌍으로 표현됩니다. 식별자는 심볼 테이블에 등록되고, 이후 단계에서 참조됩니다.

어휘 분석기의 추가 역할

공백(whitespace)과 주석(comment) 제거
소스 프로그램의 행 번호 추적 (에러 메시지 생성용)
매크로 전처리(preprocessor) 연동

4. 구문 분석(Syntax Analysis)

구문 분석기(parser)는 토큰 스트림을 입력받아 구문 트리(parse tree) 또는 **추상 구문 트리(AST)**를 생성합니다.

위의 예제 position = initial + rate * 60에 대한 AST는 다음과 같습니다.

         =
        / \
       /   \
    id:     +
  position / \
          /   \
        id:    *
      initial / \
             /   \
           id:  num:
          rate   60

구문 분석기는 **문맥 자유 문법(Context-Free Grammar, CFG)**을 기반으로 동작합니다. 문법은 프로그래밍 언어의 구조를 정의하며, 파서는 입력이 이 문법에 맞는지 검증합니다.

구문 분석의 주요 방식

방식	설명	예시
Top-Down	시작 기호에서 출발하여 입력을 유도	재귀 하강 파서, LL 파서
Bottom-Up	입력에서 출발하여 시작 기호로 축약	LR 파서, LALR 파서

5. 의미 분석(Semantic Analysis)

의미 분석기는 구문 트리를 사용하여 소스 프로그램의 의미적 일관성을 검사합니다.

주요 역할은 다음과 같습니다.

타입 검사(Type Checking): 연산자와 피연산자의 타입이 호환되는지 확인합니다.
타입 변환(Type Coercion): 필요한 경우 자동 타입 변환을 삽입합니다.
스코프 검사: 변수가 선언된 범위 내에서 사용되는지 확인합니다.

예를 들어, 위 예제에서 rate가 float이고 60이 int라면, 의미 분석기는 60을 float으로 변환하는 노드를 삽입합니다.

         =
        / \
       /   \
    id:     +
  position / \
          /   \
        id:    *
      initial / \
             /   \
           id:  inttofloat
          rate      |
                  num: 60

6. 중간 코드 생성(Intermediate Code Generation)

컴파일러는 소스 프로그램을 바로 목적 코드로 변환하지 않고, **중간 표현(Intermediate Representation, IR)**을 먼저 생성합니다.

가장 널리 사용되는 IR 형식은 **3주소 코드(Three-Address Code)**입니다.

t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3

3주소 코드의 특징은 다음과 같습니다.

각 명령문에는 최대 하나의 연산자만 포함됩니다.
컴파일러가 임시 변수(t1, t2, t3)를 생성합니다.
복잡한 수식의 연산 순서가 명시적으로 결정됩니다.

중간 코드의 장점

기계 독립성: 특정 하드웨어에 종속되지 않는 표현입니다.
이식성: 다른 목표 기계로 쉽게 재타겟팅할 수 있습니다.
최적화 용이: 기계 독립적인 최적화를 적용하기 좋습니다.

7. 코드 최적화(Code Optimization)

코드 최적화 단계에서는 중간 코드를 변환하여 더 효율적인 목적 코드를 생성할 수 있도록 합니다.

// 최적화 전
t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3

// 최적화 후
t1 = rate * 60.0
position = initial + t1

위 예제에서 수행된 최적화는 다음과 같습니다.

상수 접기(Constant Folding): inttofloat(60)을 컴파일 시간에 60.0으로 계산합니다.
불필요한 임시 변수 제거: t3를 없애고 바로 position에 대입합니다.

주요 최적화 기법

공통 부분식 제거(Common Subexpression Elimination)
죽은 코드 제거(Dead Code Elimination)
루프 불변 코드 이동(Loop-Invariant Code Motion)
강도 감소(Strength Reduction): 비용이 높은 연산을 저렴한 연산으로 교체합니다.
인라인 확장(Inlining): 함수 호출을 함수 본문으로 대체합니다.

8. 코드 생성(Code Generation)

코드 생성기는 최적화된 중간 코드를 목표 기계의 명령어로 변환합니다.

// 중간 코드
t1 = rate * 60.0
position = initial + t1

// x86-like 어셈블리 코드
LDF  R2, rate       // rate를 레지스터 R2에 로드
MULF R2, R2, #60.0  // R2 = R2 * 60.0
LDF  R1, initial    // initial을 레지스터 R1에 로드
ADDF R1, R1, R2     // R1 = R1 + R2
STF  position, R1   // R1의 값을 position에 저장

코드 생성 단계의 핵심 과제는 다음과 같습니다.

레지스터 할당(Register Allocation): 제한된 레지스터에 변수를 효율적으로 배치합니다.
명령어 선택(Instruction Selection): 목표 기계에 맞는 최적의 명령어를 선택합니다.
명령어 스케줄링(Instruction Scheduling): 파이프라인 효율을 극대화합니다.

9. 심볼 테이블(Symbol Table)

심볼 테이블은 컴파일러의 모든 단계에서 공유되는 핵심 자료 구조입니다.

+--------+-------+--------+---------+
| 이름    | 타입   | 스코프  | 메모리   |
+--------+-------+--------+---------+
| position| float | global | 0x1000  |
| initial | float | global | 0x1004  |
| rate    | float | global | 0x1008  |
+--------+-------+--------+---------+

심볼 테이블에 저장되는 정보는 다음과 같습니다.

변수 이름(identifier)
타입 정보: int, float, 배열 등
스코프 정보: 변수가 유효한 범위
메모리 위치: 변수가 할당된 주소

심볼 테이블은 보통 해시 테이블로 구현하며, 빠른 삽입과 검색을 지원합니다. 스코프를 처리하기 위해 스코프 체이닝이나 스택 기반 구조를 함께 사용합니다.

10. 컴파일 단계(Phase) vs 패스(Pass)

단계(Phase)

컴파일러의 각 논리적 기능 단위를 단계라고 합니다. 어휘 분석, 구문 분석 등이 각각 하나의 단계입니다. 단계는 개념적인 구분이며, 실제 구현에서는 여러 단계가 합쳐질 수 있습니다.

패스(Pass)

패스는 소스 프로그램(또는 중간 표현)을 처음부터 끝까지 한 번 읽는 과정입니다.

[1-Pass 컴파일러]
  한 번의 순회로 모든 단계를 처리
  예: 간단한 어셈블러, Pascal 초기 컴파일러

[Multi-Pass 컴파일러]
  여러 번의 순회로 각 단계를 처리
  예: 현대 최적화 컴파일러 (GCC, LLVM 등)

1-Pass 컴파일러는 빠르지만 최적화가 제한적이고, Multi-Pass 컴파일러는 더 많은 최적화를 수행할 수 있지만 컴파일 시간이 더 걸립니다.

현대 컴파일러의 실제 구조

현대 컴파일러(예: LLVM)는 다음과 같은 모듈 구조를 가집니다.

[C/C++ Frontend]  [Rust Frontend]  [Swift Frontend]
        \              |              /
         v             v             v
     +-------------------------------+
     |         LLVM IR (중간 표현)     |
     +-------------------------------+
              |              |
              v              v
        [x86 Backend]  [ARM Backend]

이 구조 덕분에 M개의 언어와 N개의 타겟 기계를 지원하는 데 M+N개의 모듈만 필요합니다 (M x N이 아님).

정리

단계	입력	출력	핵심 역할
어휘 분석	문자 스트림	토큰 스트림	토큰 인식, 공백/주석 제거
구문 분석	토큰 스트림	구문 트리	문법 검증, 트리 구성
의미 분석	구문 트리	주석 달린 구문 트리	타입 검사, 의미 검증
중간 코드 생성	구문 트리	3주소 코드	기계 독립 IR 생성
코드 최적화	중간 코드	최적화된 중간 코드	성능 향상 변환
코드 생성	중간 코드	목적 코드	명령어 선택, 레지스터 할당

컴파일러는 단순한 번역기가 아니라, 프로그래밍 언어의 의미를 정확히 이해하고 효율적인 기계어로 변환하는 정교한 시스템입니다. 다음 글에서는 프로그래밍 언어의 기초와 컴파일러 기술의 다양한 응용 분야를 살펴보겠습니다.

[Compiler] 01. Compiler Structure and How It Works

Compiler Structure and How It Works

A compiler is a program that takes a source program as input and transforms it into a target program. In this article, we will examine each phase that makes up the overall structure of a compiler.

1. Types of Language Processors

There are several forms of systems that process programming languages.

Compiler: Translates the entire source code into object code, then executes it.
Interpreter: Interprets and executes source code one statement at a time.
Hybrid approach: Like Java, which first compiles to bytecode and then the JVM interprets it.

The key differences between compilers and interpreters can be summarized as follows:

[Compiler]
  Source Program --> [Compiler] --> Target Program --> [Execution] --> Output

[Interpreter]
  Source Program + Input --> [Interpreter] --> Output

A compiler takes longer for the translation process, but the compiled object code runs fast. An interpreter can execute immediately without translation, but execution speed is relatively slower.

2. Overall Compiler Structure

A compiler is broadly divided into an Analysis part and a Synthesis part.

              Source Program
                   |
                   v
         +-------------------+
         |  Lexical Analysis  |
         +-------------------+
                   |  Token Stream
                   v
         +-------------------+
         |  Syntax Analysis   |
         +-------------------+
                   |  Syntax Tree
                   v
         +-------------------+
         |  Semantic Analysis |
         +-------------------+
                   |  Annotated Syntax Tree
                   v
         +-------------------+
         |  Intermediate Code |
         |  Generation        |
         +-------------------+
                   |  Intermediate Representation
                   v
         +-------------------+
         |  Code Optimization |
         +-------------------+
                   |  Optimized IR
                   v
         +-------------------+
         |  Code Generation   |
         +-------------------+
                   |
                   v
              Target Program

Each phase takes the output of the previous phase as its input. The analysis part (front end) identifies the structure of the source program, and the synthesis part (back end) generates the target code.

3. Lexical Analysis

The lexical analyzer (lexer or scanner) reads the character stream of the source code and breaks it into token units.

For example, consider the following C code:

position = initial + rate * 60;

The lexical analyzer breaks this code into the following tokens:

<id, "position">  <=>  <id, "initial">  <+>  <id, "rate">  <*>  <num, 60>  <;>

Each token is represented as a pair of a token name and an attribute value. Identifiers are registered in the symbol table and referenced in later phases.

Additional Roles of the Lexical Analyzer

Removing whitespace and comments
Tracking line numbers in the source program (for error message generation)
Interfacing with macro preprocessing

4. Syntax Analysis

The syntax analyzer (parser) takes the token stream as input and produces a parse tree or abstract syntax tree (AST).

The AST for the example position = initial + rate * 60 looks like this:

         =
        / \
       /   \
    id:     +
  position / \
          /   \
        id:    *
      initial / \
             /   \
           id:  num:
          rate   60

The parser operates based on Context-Free Grammar (CFG). The grammar defines the structure of the programming language, and the parser verifies that the input conforms to this grammar.

Main Parsing Methods

Method	Description	Examples
Top-Down	Starts from the start symbol and derives the input	Recursive descent parser, LL parser
Bottom-Up	Starts from the input and reduces to the start symbol	LR parser, LALR parser

5. Semantic Analysis

The semantic analyzer uses the syntax tree to check the semantic consistency of the source program.

Its main roles are:

Type Checking: Verifies that the types of operators and operands are compatible.
Type Coercion: Inserts automatic type conversions where necessary.
Scope Checking: Verifies that variables are used within the scope where they are declared.

For example, if rate is float and 60 is int, the semantic analyzer inserts a node to convert 60 to float.

         =
        / \
       /   \
    id:     +
  position / \
          /   \
        id:    *
      initial / \
             /   \
           id:  inttofloat
          rate      |
                  num: 60

6. Intermediate Code Generation

The compiler does not convert the source program directly into target code; instead, it first generates an Intermediate Representation (IR).

The most widely used IR format is Three-Address Code.

t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3

The characteristics of three-address code are:

Each instruction contains at most one operator.
The compiler generates temporary variables (t1, t2, t3).
The order of operations for complex expressions is explicitly determined.

Advantages of Intermediate Code

Machine independence: A representation that is not tied to specific hardware.
Portability: Can be easily retargeted to different target machines.
Ease of optimization: Well-suited for applying machine-independent optimizations.

7. Code Optimization

The code optimization phase transforms the intermediate code to generate more efficient target code.

// Before optimization
t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3

// After optimization
t1 = rate * 60.0
position = initial + t1

The optimizations performed in the above example are:

Constant Folding: Computes inttofloat(60) as 60.0 at compile time.
Eliminating unnecessary temporaries: Removes t3 and assigns directly to position.

Key Optimization Techniques

Common Subexpression Elimination
Dead Code Elimination
Loop-Invariant Code Motion
Strength Reduction: Replaces expensive operations with cheaper ones.
Inlining: Replaces function calls with the function body.

8. Code Generation

The code generator converts the optimized intermediate code into target machine instructions.

// Intermediate code
t1 = rate * 60.0
position = initial + t1

// x86-like assembly code
LDF  R2, rate       // Load rate into register R2
MULF R2, R2, #60.0  // R2 = R2 * 60.0
LDF  R1, initial    // Load initial into register R1
ADDF R1, R1, R2     // R1 = R1 + R2
STF  position, R1   // Store R1 value into position

The key challenges in the code generation phase are:

Register Allocation: Efficiently placing variables in limited registers.
Instruction Selection: Choosing optimal instructions for the target machine.
Instruction Scheduling: Maximizing pipeline efficiency.

9. Symbol Table

The symbol table is a core data structure shared across all phases of the compiler.

+--------+-------+--------+---------+
| Name    | Type   | Scope  | Memory  |
+--------+-------+--------+---------+
| position| float | global | 0x1000  |
| initial | float | global | 0x1004  |
| rate    | float | global | 0x1008  |
+--------+-------+--------+---------+

The information stored in the symbol table includes:

Variable name (identifier)
Type information: int, float, array, etc.
Scope information: The range in which a variable is valid
Memory location: The address where a variable is allocated

The symbol table is typically implemented as a hash table, supporting fast insertion and lookup. Scope chaining or stack-based structures are used alongside it to handle scopes.

10. Compilation Phase vs Pass

Phase

Each logical functional unit of a compiler is called a phase. Lexical analysis, syntax analysis, etc., are each a phase. Phases are conceptual divisions, and in practice, multiple phases may be combined.

Pass

A pass is the process of reading the source program (or intermediate representation) from beginning to end once.

[1-Pass Compiler]
  Processes all phases in a single traversal
  Example: Simple assemblers, early Pascal compilers

[Multi-Pass Compiler]
  Processes each phase through multiple traversals
  Example: Modern optimizing compilers (GCC, LLVM, etc.)

A 1-Pass compiler is fast but limited in optimization, while a Multi-Pass compiler can perform more optimizations but takes longer to compile.

Actual Structure of Modern Compilers

Modern compilers (e.g., LLVM) have the following modular structure:

[C/C++ Frontend]  [Rust Frontend]  [Swift Frontend]
        \              |              /
         v             v             v
     +-------------------------------+
     |         LLVM IR (IR)          |
     +-------------------------------+
              |              |
              v              v
        [x86 Backend]  [ARM Backend]

Thanks to this structure, supporting M languages and N target machines requires only M+N modules (not M x N).

Summary

Phase	Input	Output	Key Role
Lexical Analysis	Character stream	Token stream	Token recognition, whitespace/comment removal
Syntax Analysis	Token stream	Syntax tree	Grammar verification, tree construction
Semantic Analysis	Syntax tree	Annotated syntax tree	Type checking, semantic verification
Intermediate Code Gen.	Syntax tree	Three-address code	Machine-independent IR generation
Code Optimization	Intermediate code	Optimized IR	Performance improvement transformations
Code Generation	Intermediate code	Target code	Instruction selection, register allocation

A compiler is not just a simple translator; it is a sophisticated system that accurately understands the semantics of a programming language and transforms it into efficient machine code. In the next article, we will explore the basics of programming languages and the various applications of compiler technology.