Skip to content
Published on

[OS Concepts] 14. File-System Implementation

Authors

File-System Implementation

After examining the file system's user interface, we now look at how the operating system implements the file system behind that interface. This article covers on-disk data structures, in-memory data structures, allocation methods, free space management, and more.


1. File System Layer Hierarchy

┌──────────────────────────────────────────┐
Application (User Space)├──────────────────────────────────────────┤
Logical File System  (File name → inode translation,│   protection checks)├──────────────────────────────────────────┤
File Organization Module  (Logical block → physical block mapping)├──────────────────────────────────────────┤
Basic File System  (Issues block I/O requests)├──────────────────────────────────────────┤
I/O Control  (Device drivers, interrupt handlers)├──────────────────────────────────────────┤
Storage Device (Disk/SSD)└──────────────────────────────────────────┘

2. On-Disk Structures

File System Disk Layout

┌────────┬───────────┬──────────┬──────────┬─────────────────┐
BootSuperFree     │ inode    │ Data BlocksBlockBlockSpaceTable    │                 │
│        │           │ Mgmt     │          │                 │
BootMetaBitmap/FileActual file     │
│ loader │ data      │ list     │ meta     │ data            │
└────────┴───────────┴──────────┴──────────┴─────────────────┘
StructureRole
Boot BlockCode needed for booting (boot loader)
Super BlockOverall file system info (block count, inode count, etc.)
Free Space MgmtTracks available blocks (bitmap or linked list)
inode TableMetadata and block pointers for each file
Data BlocksActual file/directory data

inode Structure

┌─────────────────────────────────────┐
│           inode Structure├─────────────────────────────────────┤
File type, permissions (mode)Owner (uid, gid)SizeTimestamps (atime, mtime, ctime)Link count                          │
Block count                         │
├─────────────────────────────────────┤
Direct block pointers [0-11] → data │
Single indirect pointer → ptr block │
Double indirect pointer → ptr^2Triple indirect pointer → ptr^3└─────────────────────────────────────┘
// 간단화된 inode 구조체 (ext2 기반)
struct inode {
    uint16_t mode;        // 파일 유형 + 권한
    uint16_t uid;         // 소유자 ID
    uint32_t size;        // 파일 크기
    uint32_t atime;       // 접근 시간
    uint32_t mtime;       // 수정 시간
    uint32_t ctime;       // 상태 변경 시간
    uint16_t gid;         // 그룹 ID
    uint16_t links_count; // 하드 링크 수
    uint32_t blocks;      // 할당된 블록 수

    uint32_t direct[12];          // 직접 블록 포인터
    uint32_t single_indirect;     // 단일 간접 포인터
    uint32_t double_indirect;     // 이중 간접 포인터
    uint32_t triple_indirect;     // 삼중 간접 포인터
};

3. In-Memory Structures

The file system caches frequently used information in memory for performance.

┌───────────────────────────────────────────────┐
Kernel Memory│                                               │
│  ┌─────────────────┐  ┌───────────────────┐   │
│  │ Mount Table     │  │ Directory Cache   │   │
 (FS list) (recent paths)    │   │
│  └─────────────────┘  └───────────────────┘   │
│                                               │
│  ┌─────────────────┐  ┌───────────────────┐   │
│  │ System-wide     │  │ Per-process       │   │
│  │ Open File Table │  │ Open File Table   │   │
│  └─────────────────┘  └───────────────────┘   │
│                                               │
│  ┌─────────────────┐                          │
│  │ Buffer Cache    │ ← Disk block cache       │
│  └─────────────────┘                          │
└───────────────────────────────────────────────┘

4. VFS (Virtual File System)

VFS is an abstraction layer that unifies multiple file system types under a single interface.

           User Process
        open() read() write()
    ┌──────────┴──────────┐
VFS      (Virtual FileSystem)    │  vnode interface    └──┬──────┬──────┬─────┘
       │      │      │
    ┌──┴──┐┌──┴──┐┌──┴──┐
    │ext4 ││NFS  ││FAT    │     ││     ││     │
    └──┬──┘└──┬──┘└──┬──┘
       │      │      │
    Local  Network  USB
    disk   server   drive

VFS Key Objects

// VFS 추상화의 주요 객체 (Linux 기반 의사 코드)

// 1. 슈퍼블록 객체: 파일 시스템 인스턴스 정보
struct super_block {
    struct list_head s_list;       // 슈퍼블록 목록
    dev_t            s_dev;        // 장치 식별자
    unsigned long    s_blocksize;  // 블록 크기
    struct super_operations *s_op; // 연산 테이블
    // ...
};

// 2. inode 객체: 개별 파일 정보
struct inode {
    umode_t          i_mode;       // 파일 유형 + 권한
    uid_t            i_uid;        // 소유자
    loff_t           i_size;       // 파일 크기
    struct inode_operations *i_op; // inode 연산
    struct file_operations  *i_fop;// 파일 연산
    // ...
};

// 3. dentry 객체: 디렉토리 엔트리 (경로명 컴포넌트)
struct dentry {
    struct inode     *d_inode;     // 관련 inode
    struct dentry    *d_parent;    // 부모 디렉토리
    struct qstr      d_name;      // 엔트리 이름
    struct dentry_operations *d_op;
    // ...
};

// 4. file 객체: 열린 파일 인스턴스
struct file {
    struct dentry    *f_dentry;    // 관련 dentry
    loff_t           f_pos;       // 현재 오프셋
    unsigned int     f_flags;     // 열기 플래그
    struct file_operations *f_op; // 파일 연산
    // ...
};

5. Directory Implementation

Linear List

Directory Block:
┌──────────┬────────┬──────────┬────────┬──────────┬────────┐
"file1"  │ inode  │ "file2"  │ inode  │ "file3"  │ inode  │
│          │  23    │          │  47    │          │  12└──────────┴────────┴──────────┴────────┴──────────┴────────┘

Search: O(n) - sequential scan from beginning
Advantage: Simple implementation
Disadvantage: Slow for large directories

Hash Table

Hash function: hash("filename") → bucket index

Bucket 0: ["file_a", inode 23]["file_d", inode 89]NULL
Bucket 1: ["file_b", inode 47]NULL
Bucket 2: NULL
Bucket 3: ["file_c", inode 12]["file_e", inode 56]NULL
...

Search: O(1) average (linked list traversal on collision)
Advantage: Fast lookup
Disadvantage: Fixed size, collision management required

6. Allocation Methods

Determines how file data blocks are arranged on disk.

Contiguous Allocation

Directory Entry:
┌────────┬───────────┬──────┐
NameStart BlkLen├────────┼───────────┼──────┤
│ file_a │    23│ file_b │    72│ file_c │   144└────────┴───────────┴──────┘

Disk Blocks:
 0  1 [2  3  4] 5  6 [7  8] 9 10 11 12 13 [14 15 16 17]
         file_a          file_b                file_c

Advantages: Efficient for both sequential and direct access, minimizes disk seeks Disadvantages: External fragmentation, difficult to change file size

Linked Allocation

Directory:
┌────────┬──────┬──────┐
Name   │Start │ End├────────┼──────┼──────┤
│ file_a │  91└────────┴──────┴──────┘

Disk Blocks (each block contains next block pointer):
Block 9Block 16Block 25Block 1NULL
[data|16] [data|25] [data|1]  [data|-1]

Advantages: No external fragmentation, easy to resize files Disadvantages: Inefficient direct access (requires sequential traversal), pointer space overhead

FAT (File Allocation Table)

FAT Table (cached in memory):
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
0123456  │ ← Block number
├─────┼─────┼─────┼─────┼─────┼─────┼─────┤
3EOF54EOF1FREE│ ← Next block
└─────┴─────┴─────┴─────┴─────┴─────┴─────┘

File A: start=0034EOF (blocks 0, 3, 4)
File B: start=2251EOF (blocks 2, 5, 1)

Indexed Allocation

Directory:
┌────────┬─────────────┐
NameIndex Block├────────┼─────────────┤
│ file_a │     19└────────┴─────────────┘

Index Block 19:          Data Blocks:
┌─────────────┐          Block 9:  [data...]
9          │ ────→    Block 16: [data...]
16          │ ────→    Block 25: [data...]
25          │ ────→    Block 1:  [data...]
1          │ ────→
-1 (end)└─────────────┘

Advantages: Efficient direct access, no external fragmentation Disadvantages: Index block overhead, inefficient for small files

Allocation Method Comparison

MethodSequentialDirectFragmentationFile Growth
ContiguousVery fastFastExternalDifficult
LinkedModerateSlowNoneEasy
IndexedModerateFastNoneModerate

7. Free Space Management

Bitmap (Bit Vector)

Block number: 0  1  2  3  4  5  6  7  8  9  10 11 12
Bitmap:       1  1  0  1  1  0  0  1  0  0  0  1  0

1 = in use, 0 = free

Finding free blocks: search for first 0 bit in bitmap
Advantage: Easy to find contiguous free space
Disadvantage: Bitmap grows large for big disks
      (1TB disk, 4KB blocks → 32MB bitmap)
// 비트맵에서 빈 블록 찾기
int find_free_block(uint32_t *bitmap, int total_blocks) {
    int words = total_blocks / 32;
    for (int i = 0; i < words; i++) {
        if (bitmap[i] != 0xFFFFFFFF) {  // 모두 사용 중이 아니면
            // 첫 번째 0 비트 위치 찾기
            for (int j = 0; j < 32; j++) {
                if (!(bitmap[i] & (1 << j))) {
                    return i * 32 + j;
                }
            }
        }
    }
    return -1;  // 빈 블록 없음
}

Linked List

Free Space HeadBlock 2Block 5Block 6Block 8NULL

Each free block points to the next free block
Advantage: Simple implementation
Disadvantage: Hard to find contiguous free space, traversal cost

TRIM Command

Notifies the SSD of deleted blocks in advance to optimize garbage collection.

Without TRIM:                    With TRIM:
1. File delete → meta only      1. File delete → meta remove
2. SSD unaware block is free    2. TRIM command notifies SSD
3. On later write:              3. SSD erases block in advance
   ReadEraseWrite needed  4. On later write: write directly
   (write perf degradation)        (write perf maintained)

8. Journaling

A mechanism to maintain file system consistency in case of system crashes.

Journaling Operation Process

Normal write (no journaling):
1. Write data blocks
2. Update inode
3. Update bitmap
Crash at step 2 causes inconsistency!

Journaled write:
┌──────────────────────────────────────┐
Journal Area (Log)│                                      │
1. Transaction Begin (TxB)2. Record metadata changes          │
3. Transaction Commit (TxE)│                                      │
│  → Apply to actual location after    │
│    commit                            │
│  → Delete journal after apply        │
└──────────────────────────────────────┘

Journaling Modes

ModeJournal ContentsSafetyPerf
JournalMetadata + DataHighestSlow
OrderedMetadata only (data written first)HighMedium
WritebackMetadata only (no ordering guarantee)MediumFast
// 저널링 트랜잭션 의사 코드
void journaled_write(struct transaction *tx) {
    // 1. 저널에 트랜잭션 시작 기록
    journal_write_begin(tx);

    // 2. 변경할 메타데이터를 저널에 기록
    journal_write_metadata(tx, inode_data);
    journal_write_metadata(tx, bitmap_data);

    // 3. 저널에 트랜잭션 커밋 기록
    journal_write_commit(tx);

    // 4. 실제 파일 시스템 위치에 적용 (체크포인트)
    write_to_actual_location(inode_data);
    write_to_actual_location(bitmap_data);

    // 5. 저널에서 트랜잭션 제거
    journal_free(tx);
}

// 복구 시: 커밋된 트랜잭션은 재실행, 미커밋은 무시
void recover_from_crash() {
    for_each_transaction(journal) {
        if (transaction_committed(tx))
            replay_transaction(tx);  // 재실행
        else
            discard_transaction(tx); // 폐기
    }
}

9. Summary

  • File System Layers: Logical file system → File organization module → Basic file system → I/O control
  • Disk Structures: Boot block, super block, inode table, data blocks
  • VFS: Abstracts multiple file systems under a single interface
  • Allocation Methods: Contiguous (fast but fragmentation), linked (flexible but slow direct access), indexed (balanced)
  • Free Space Management: Bitmap is most common. TRIM optimizes SSDs
  • Journaling: Ensures file system consistency through transaction logs
Quiz: File-System Implementation

Q1. What is the role of VFS?

A1. VFS (Virtual File System) abstracts different file systems such as ext4, NFS, and FAT under a unified interface (open, read, write, etc.). This allows applications to access files using the same system calls regardless of the underlying file system type.

Q2. What are the trade-offs between contiguous allocation and indexed allocation?

A2. Contiguous allocation is fast for both sequential and direct access but suffers from external fragmentation and makes file resizing difficult. Indexed allocation has no external fragmentation and makes file growth easy, but requires additional space and I/O for the index block.

Q3. How does the ordered journaling mode work?

A3. In ordered mode, only metadata is journaled, but data blocks are written to disk before the metadata. This prevents a situation where metadata points to data that hasn't been written yet after a crash. It offers better performance than journal mode while maintaining data consistency.