Introduction

The operating system is the **intermediary** between hardware and applications. As an AI/ML engineer, you need to go beyond writing Python code and be able to answer:

- Is slow PyTorch training caused by CPU scheduling or memory bandwidth?

- Which is more efficient: multiprocessing or multithreading?

- How do you implement isolation when multiple teams share GPU resources?

This guide answers those questions through core OS concepts with practical examples.

1. Processes and Threads

What Is a Process?

A process is a **running instance of a program**. Each process has its own independent virtual address space, file descriptors, and signal handlers.

**Process State Diagram:**

New ──→ Ready ── scheduler dispatch ──→ Running ──→ Terminated
          ↑                               │
          │ I/O complete                  │ I/O request
          └─────────── Waiting ←──────────┘

The **PCB (Process Control Block)** is the data structure the kernel maintains for each process.

// Linux task_struct (heavily simplified)
struct task_struct {
    pid_t pid;                    // Process ID
    int state;                    // Current state (running, sleeping, ...)
    struct mm_struct *mm;         // Virtual memory mapping
    struct files_struct *files;   // Open file table
    struct thread_struct thread;  // Saved CPU registers (architecture-specific)
    int prio;                     // Scheduling priority
};

Process Creation with fork() + exec()

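#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork failed");
        return 1;
    } else if (pid == 0) {
        // Child process: replace this process image with /bin/ls
        char *args[] = {"/bin/ls", "-la", NULL};
        execv("/bin/ls", args);
        perror("exec failed"); // Reached only if exec fails
        _exit(127);            // Report failure instead of falling through to return 0
    } else {
        // Parent process: wait for the child and report how it ended
        int status;
        if (waitpid(pid, &status, 0) < 0) {
            perror("waitpid failed");
            return 1;
        }
        if (WIFEXITED(status))
            printf("Child exit code: %d\n", WEXITSTATUS(status));
    }
    return 0;
}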

`fork()` copies the parent's virtual address space using **Copy-on-Write (CoW)**. Physical memory pages are only duplicated when a write actually occurs, making the operation efficient.
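
The isolation that `fork()` provides can also be observed from Python. The following is a minimal sketch, assuming a Unix-like system where `os.fork()` is available; the helper name `fork_isolation_demo` is hypothetical. The child modifies its copy of a list and reports the value through a pipe, while the parent's copy stays unchanged.

```python
import os


def fork_isolation_demo():
    """Return (parent_value, child_value, child_exit_code) after a fork."""
    data = [0]
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write touches only the child's copy of the page.
        os.close(read_fd)
        data[0] = 42
        os.write(write_fd, str(data[0]).encode())
        os._exit(0)  # Leave immediately without running parent code
    # Parent: read what the child saw, then reap it.
    os.close(write_fd)
    child_value = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    return data[0], child_value, os.WEXITSTATUS(status)


parent_value, child_value, exit_code = fork_isolation_demo()
print(parent_value, child_value, exit_code)
```

One caveat for Python workloads: CPython updates reference counts inside object headers, so merely reading an object in the child can dirty its page and trigger a copy. Memory shared through CoW therefore tends to shrink over time in forked Python workers.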

Threads vs Processes

| Aspect | Process | Thread |
| --------------- | -------------------- | ----------------------------- |
| Address Space | Independent | Shared |
| Creation Cost | High | Low |
| Communication | IPC (pipes, sockets) | Shared memory |
| Fault Isolation | Strong | Weak |
| Python GIL | Not affected | Bottleneck for CPU-bound work |

Context Switching

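When the CPU switches from process A to process B:

1. Save A's register state to its PCB

2. Switch the virtual memory mapping (load B's page table pointer)

3. Invalidate A's TLB entries (CPUs with tagged TLBs, such as PCID on x86-64, can avoid a full flush)

4. Restore B's register state

Switching between threads of the same process skips steps 2 and 3 because the address space is shared. The direct cost of a context switch is a few microseconds; the cold caches and TLB it leaves behind often cost more. In AI inference servers, running many more threads than cores can therefore degrade performance.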
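
To see how often a process is actually switched out, the kernel's per-process counters can be read from Python. This is a small sketch, assuming a Unix-like system where the standard `resource` module is available; the helper name `context_switches` is hypothetical. Voluntary switches happen when the process blocks (for example on I/O or sleep); involuntary ones happen when the scheduler preempts it.

```python
import resource
import time


def context_switches():
    """Return (voluntary, involuntary) context switch counts for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


before = context_switches()
time.sleep(0.01)  # Blocking gives up the CPU, which counts as a voluntary switch
after = context_switches()

print(f"voluntary: +{after[0] - before[0]}, involuntary: +{after[1] - before[1]}")
```

A steadily climbing involuntary count on an inference server is one hint that more runnable threads exist than cores.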

2. CPU Scheduling

Algorithm Comparison

**FIFO (First-In, First-Out)**

- Simple but suffers from the **Convoy Effect**: short jobs wait behind long ones

**SJF (Shortest Job First)**

- Minimizes average wait time, but execution time is hard to predict
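
The convoy effect is easy to quantify. The sketch below assumes all jobs arrive at time 0 and run to completion without preemption; the function name `average_wait` and the burst times are illustrative values, not measurements.

```python
def average_wait(bursts):
    """Average time each job waits before it starts, given the run order."""
    waited = 0
    elapsed = 0
    for burst in bursts:
        waited += elapsed   # This job waited for everything scheduled before it
        elapsed += burst
    return waited / len(bursts)


bursts = [24, 3, 3]                      # One long job arrives ahead of two short ones
fifo_wait = average_wait(bursts)         # Run in arrival order
sjf_wait = average_wait(sorted(bursts))  # Run shortest job first

print(f"FIFO: {fifo_wait}, SJF: {sjf_wait}")  # FIFO: 17.0, SJF: 3.0
```

Running the short jobs first reduces the average wait from 17 to 3 time units, even though the total amount of work is identical.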

**Round Robin**

- Fixed time quantum assigned to each process

- Very short quantums cause excessive context switching overhead

**Linux CFS (Completely Fair Scheduler)**

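- `vruntime`-based: tracks how much CPU time each task has received, weighted by its priority (CFS was the default scheduler until Linux 6.6, when EEVDF replaced it)

- Tasks are kept in a Red-Black Tree ordered by `vruntime`; the least-run task is the leftmost node, which the kernel caches, so picking it is O(1) while inserting a task is O(log n)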

// vruntime calculation (conceptual pseudocode)
void update_vruntime(struct task_struct *task, u64 delta_exec) {
    // Higher priority means a larger weight, so vruntime grows more slowly
    u64 weight = prio_to_weight[task->nice + 20];
    task->vruntime += delta_exec * NICE_0_WEIGHT / weight;
}
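
The effect of the weighting can be checked with a toy simulation. This sketch is not the kernel's implementation: it uses a binary heap from `heapq` in place of the Red-Black Tree, fixed one-unit time slices, and the hypothetical function name `simulate_cfs`. The weights 1024 and 335 are the kernel's table values for nice 0 and nice 5.

```python
import heapq

NICE_0_WEIGHT = 1024


def simulate_cfs(weights, ticks):
    """Run `ticks` one-unit slices, always picking the smallest vruntime."""
    heap = [(0.0, name) for name in sorted(weights)]
    heapq.heapify(heap)
    cpu_time = {name: 0 for name in weights}
    for _ in range(ticks):
        vruntime, name = heapq.heappop(heap)   # Least-run task
        cpu_time[name] += 1                    # It runs for one time unit
        vruntime += 1 * NICE_0_WEIGHT / weights[name]
        heapq.heappush(heap, (vruntime, name))
    return cpu_time


shares = simulate_cfs({"nice0": 1024, "nice5": 335}, ticks=10_000)
ratio = shares["nice0"] / shares["nice5"]
print(shares, f"ratio={ratio:.2f}")
```

The CPU time ratio converges to the weight ratio (1024 / 335, roughly 3.06), which is what fairness proportional to weight means here.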

Priority Inversion Problem

High priority H ──────────────────────→ Waiting (needs lock)

Mid priority M ──────────────────────→ Runs before H!

Low priority L ──→ Holds lock ──→ Gets preempted

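H is blocked on a lock that L holds. If M then preempts L, L cannot finish its critical section, so H effectively waits for M even though H has higher priority. The solution is **Priority Inheritance**: L temporarily inherits H's priority while it holds the lock, so M can no longer preempt it.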
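
The difference priority inheritance makes can be shown with a tiny tick-based simulation. It is a deliberately simplified model, assuming a single CPU, a strict highest-priority-first scheduler, and made-up workloads (M needs 5 ticks, L needs 2 ticks to finish its critical section); the function name `h_start_time` is hypothetical.

```python
def h_start_time(inherit):
    """Return the tick at which H can finally acquire the lock held by L."""
    priority = {"H": 3, "M": 2, "L": 1}
    remaining = {"M": 5, "L": 2}   # L's remaining work is its critical section
    tick = 0
    while remaining["L"] > 0:      # H stays blocked until L releases the lock
        effective = dict(priority)
        if inherit:
            # L runs at H's priority while H waits on L's lock
            effective["L"] = max(effective["L"], priority["H"])
        runnable = [name for name in ("M", "L") if remaining[name] > 0]
        current = max(runnable, key=lambda name: effective[name])
        remaining[current] -= 1
        tick += 1
    return tick


print("without inheritance:", h_start_time(False))  # 7
print("with inheritance:   ", h_start_time(True))   # 2
```

Without inheritance H waits for all of M's work plus L's critical section; with inheritance it waits only for the critical section.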

3. Memory Management

Virtual Memory and Paging

Every process sees its own private virtual address space, as if it had the machine's memory to itself. The kernel uses **page tables** to translate virtual addresses to physical ones.

Virtual address (48-bit on x86-64):

┌──────────┬──────────┬──────────┬──────────┬────────────┐

│ PGD(9) │ PUD(9) │ PMD(9) │ PTE(9) │ offset(12) │

└──────────┴──────────┴──────────┴──────────┴────────────┘
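
The field layout above maps directly onto shifts and masks. The following sketch assumes 4-level paging with 4 KiB pages as drawn in the diagram; the function name `split_virtual_address` is hypothetical.

```python
def split_virtual_address(va):
    """Split a 48-bit virtual address into (PGD, PUD, PMD, PTE, offset)."""
    offset = va & 0xFFF          # Low 12 bits: byte within the 4 KiB page
    pte = (va >> 12) & 0x1FF     # Next 9 bits: index into the page table
    pmd = (va >> 21) & 0x1FF     # Next 9 bits: index into the PMD
    pud = (va >> 30) & 0x1FF     # Next 9 bits: index into the PUD
    pgd = (va >> 39) & 0x1FF     # Top 9 bits: index into the PGD
    return pgd, pud, pmd, pte, offset


# Build an address from known indices, then take it apart again.
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
print(hex(va), split_virtual_address(va))  # (1, 2, 3, 4, 5)
```

Each 9-bit index selects one of 512 entries in its table, and the 12-bit offset covers the 4096 bytes of a page.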

TLB (Translation Lookaside Buffer)

Page table walks require memory accesses and are slow. The TLB caches recent translations.

**TLB miss handling steps:**

1. CPU looks up virtual address in TLB → miss

2. CPU references page table base address from CR3 register

3. 4-level page table walk (4 memory accesses)

4. Extract physical address from PTE → store in TLB

5. Retry the original memory access
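
How much the TLB helps depends on how many distinct pages a workload touches. The sketch below models a TLB as a small LRU cache of page numbers; real TLBs are multi-level and set-associative, so the 64-entry size, the LRU policy, and the function name `tlb_hit_rate` are all simplifying assumptions. It compares 4 KiB pages with 2 MiB huge pages for the same access pattern.

```python
from collections import OrderedDict


def tlb_hit_rate(addresses, page_size, entries=64):
    """Fraction of accesses whose page number is already cached."""
    tlb = OrderedDict()
    hits = 0
    for addr in addresses:
        page = addr // page_size
        if page in tlb:
            hits += 1
            tlb.move_to_end(page)          # Mark as most recently used
        else:
            if len(tlb) >= entries:
                tlb.popitem(last=False)    # Evict the least recently used entry
            tlb[page] = True
    return hits / len(addresses)


# Touch one byte every 4 KiB across a 64 MiB buffer.
addresses = list(range(0, 64 * 1024 * 1024, 4096))
small_pages = tlb_hit_rate(addresses, page_size=4096)
huge_pages = tlb_hit_rate(addresses, page_size=2 * 1024 * 1024)

print(f"4 KiB pages: {small_pages:.3f}, 2 MiB pages: {huge_pages:.3f}")
```

With 4 KiB pages every access lands on a new page and misses; with 2 MiB pages the same buffer spans only 32 pages, so nearly every access hits.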

Page Replacement Algorithms

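**LRU (Least Recently Used):**

Evicts the page that has gone unused for the longest time. Exact LRU is too expensive to track on every memory access, so kernels approximate it; Linux uses active and inactive page lists driven by per-page reference bits, in the same spirit as the clock algorithm.

Clock Algorithm:

- Maintain a reference bit (R) per page, set whenever the page is accessed

- A pointer cycles through the pages; the first page found with R=0 is evicted

- If R=1, reset it to R=0 and advance to the next page (the page gets a second chance)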
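
The clock algorithm fits in a few lines. This is a sketch under stated assumptions: a fixed number of frames, a newly loaded page starts with R=1, and the function name `clock_replace` is hypothetical. It returns the number of page faults for a given reference string.

```python
def clock_replace(reference_string, num_frames):
    """Count page faults under the clock (second-chance) policy."""
    frames = [None] * num_frames   # Page held by each frame
    ref_bit = [0] * num_frames     # Reference bit (R) per frame
    hand = 0                       # The clock pointer
    faults = 0
    for page in reference_string:
        if page in frames:
            ref_bit[frames.index(page)] = 1   # Hit: mark as recently used
            continue
        faults += 1
        while ref_bit[hand] == 1:             # Give R=1 pages a second chance
            ref_bit[hand] = 0
            hand = (hand + 1) % num_frames
        frames[hand] = page                   # Evict the R=0 victim (or fill an empty frame)
        ref_bit[hand] = 1
        hand = (hand + 1) % num_frames
    return faults


print(clock_replace([1, 2, 3, 1, 4, 2], num_frames=3))  # 4
```

In the example, loading page 4 sweeps the pointer past all three resident pages, clearing their bits, before it evicts page 1; page 2 then hits because it is still resident.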

Memory-Mapped Files in Python

import mmap
import numpy as np

# Access large datasets as memory-mapped files
with open("large_dataset.bin", "r+b") as f:
    # Map the entire file into the virtual address space (length 0 = whole file)
    mm = mmap.mmap(f.fileno(), 0)
    # Slice to access only the needed regions (actual I/O happens on demand)
    header = mm[:128]
    record = mm[128:256]
    mm.close()

# AI training: numpy memmap treats disk data like an array.
# np.memmap reads raw bytes, so the file must hold bare float32 values;
# for a file written by np.save, use np.load(path, mmap_mode="r") instead.
data = np.memmap("features.bin", dtype="float32", mode="r", shape=(1_000_000, 512))
batch = data[0:1024]  # Only the pages backing this batch are read from disk

4. Synchronization

Mutex and Condition Variable

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define BUFFER_SIZE 10

int buffer[BUFFER_SIZE];
int count = 0;

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

void *producer(void *arg) {
    for (int i = 0; i < 50; i++) {
        pthread_mutex_lock(&mutex);
        while (count == BUFFER_SIZE)
            pthread_cond_wait(&not_full, &mutex); // Wait if buffer full
        buffer[count++] = i;
        printf("Produced: %d (count=%d)\n", i, count);
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 50; i++) {
        pthread_mutex_lock(&mutex);
        while (count == 0)
            pthread_cond_wait(&not_empty, &mutex); // Wait if buffer empty
        int val = buffer[--count]; // Takes the newest item (stack order, not FIFO)
        printf("Consumed: %d (count=%d)\n", val, count);
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main(void) {
    pthread_t prod, cons;
    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return 0;
}

Deadlock

**Coffman Conditions** — all four must hold simultaneously for deadlock:

1. **Mutual Exclusion**: A resource can only be used by one process at a time

2. **Hold and Wait**: A process holds resources while waiting for additional ones

3. **No Preemption**: Resources cannot be forcibly taken; only voluntary release

4. **Circular Wait**: P1 waits for P2, P2 waits for P3, P3 waits for P1

**Prevention Strategies:**

- Fix a global resource ordering (eliminates circular wait)

- Request all resources at once (eliminates hold-and-wait)

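- Use the Banker's Algorithm to stay in a safe state (strictly speaking this is deadlock avoidance: each request is checked at runtime rather than ruled out by design)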
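
Global lock ordering is the strategy most often used in application code. The sketch below is one way to apply it in Python, assuming that ordering locks by `id()` is an acceptable global order within a single process; the helper names `acquire_in_order` and `release_all` are hypothetical. The two workers name the locks in opposite order, which would risk deadlock with naive nested acquisition.

```python
import threading


def acquire_in_order(*locks):
    """Acquire locks in one global order (by id) regardless of argument order."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered


def release_all(ordered):
    for lock in reversed(ordered):
        lock.release()


lock_a, lock_b = threading.Lock(), threading.Lock()
counter = {"value": 0}


def worker(first, second):
    for _ in range(1000):
        held = acquire_in_order(first, second)
        try:
            counter["value"] += 1   # Critical section needing both locks
        finally:
            release_all(held)


t1 = threading.Thread(target=worker, args=(lock_a, lock_b))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a))  # Opposite order
t1.start()
t2.start()
t1.join()
t2.join()
print(counter["value"])  # 2000
```

Because both threads end up acquiring the locks in the same order, a circular wait cannot form.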

5. File Systems

ext4 and inodes

# View inode information
stat /etc/hostname
#   File: /etc/hostname
#   Size: 12   Blocks: 8   IO Block: 4096   regular file
#   Inode: 131073   Links: 1
#   Access: 2026-03-17 10:00:00

# Check remaining inodes (inode exhaustion has the same effect as a full disk)
df -i /

**Inode structure:**

- File metadata (permissions, owner, timestamps)

- Data block pointers (direct / indirect / double-indirect)

- Filename is NOT in the inode — it lives in directory entries
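
Because names live in directory entries, two names can point at the same inode. The sketch below demonstrates this with a hard link, assuming a filesystem that supports hard links (ext4 and tmpfs do); the function name `hard_link_demo` and the file names are illustrative.

```python
import os
import tempfile


def hard_link_demo():
    """Return (same_inode, link_count) for a file with two names."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "data.txt")
        alias = os.path.join(tmp, "alias.txt")
        with open(original, "w") as f:
            f.write("hello")
        os.link(original, alias)   # Second directory entry, same inode
        a, b = os.stat(original), os.stat(alias)
        return a.st_ino == b.st_ino, a.st_nlink


same_inode, link_count = hard_link_demo()
print(same_inode, link_count)  # True 2
```

The `Links` field in the `stat` output is this same counter; the data blocks are freed only when it drops to zero and no process holds the file open.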

Journaling

ext4 uses **journaling** to guarantee recovery after unexpected power loss.

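1. Journal write: record the intended changes in the journal first (write-ahead logging)

2. Commit: write a commit record once the journal entry is safely on disk

3. Checkpoint: apply the changes to their final location on disk

4. Cleanup: free the journal space for reuse

After a crash, committed transactions are replayed from the journal and incomplete ones are discarded. In its default mode (`data=ordered`), ext4 journals metadata only.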
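
The write-ahead idea can be illustrated at application level. The following toy key-value store is only a sketch of the principle, not of ext4's on-disk journal: the class name `JournaledStore`, the JSON file layout, and the single-entry journal are all assumptions made for brevity. The intended change is forced to disk before the data file is touched, so a crash between the two steps can be repaired by replaying the journal.

```python
import json
import os
import tempfile


class JournaledStore:
    def __init__(self, directory):
        self.data_path = os.path.join(directory, "data.json")
        self.journal_path = os.path.join(directory, "journal.json")
        self.apply()   # Recovery: replay a journal entry left by a crash

    def _read_data(self):
        if not os.path.exists(self.data_path):
            return {}
        with open(self.data_path) as f:
            return json.load(f)

    def log(self, key, value):
        """Record the intended change and force it to disk."""
        with open(self.journal_path, "w") as f:
            json.dump({"key": key, "value": value}, f)
            f.flush()
            os.fsync(f.fileno())

    def apply(self):
        """Apply a pending journal entry, then remove it."""
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path) as f:
            entry = json.load(f)
        data = self._read_data()
        data[entry["key"]] = entry["value"]
        tmp_path = self.data_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.data_path)   # Atomic swap of the data file
        os.remove(self.journal_path)           # Entry no longer needed

    def get(self, key):
        return self._read_data().get(key)


def crash_recovery_demo():
    with tempfile.TemporaryDirectory() as directory:
        store = JournaledStore(directory)
        store.log("epoch", 3)                 # Journaled, then "crash" before apply()
        before = store.get("epoch")
        recovered = JournaledStore(directory) # Restart replays the journal
        return before, recovered.get("epoch")


print(crash_recovery_demo())  # (None, 3)
```

A real journal also has to detect a torn journal write, typically with checksums and an explicit commit record; this sketch omits that.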

VFS Abstraction Layer

User space: open() read() write()

│

Kernel VFS: vfs_open() vfs_read() ← common interface

│

File systems: ext4 | btrfs | tmpfs | procfs | nfs

│

Block device layer → actual hardware

6. I/O and Interrupts

DMA and Interrupt Handlers

Instead of the CPU copying data to memory directly, the DMA controller handles it.

1. CPU → DMA: "Copy disk block X to memory address Y"

2. DMA transfers independently (CPU does other work)

3. DMA completes → interrupt fires

4. CPU: finishes current instruction → looks up interrupt vector → runs ISR

5. ISR: marks I/O complete, wakes waiting process

epoll vs io_uring

**epoll** (event-driven I/O multiplexing):

// epoll: still requires multiple syscalls

int epfd = epoll_create1(0);

epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);

epoll_wait(epfd, events, MAX_EVENTS, -1); // syscall

read(fd, buf, size); // another syscall

**io_uring** (Linux 5.1+, shared ring buffer):

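// io_uring via liburing: submit I/O without a syscall per operation
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, size, 0); // IORING_OP_READ needs Linux 5.6+
io_uring_submit(&ring); // One syscall submits every queued SQE

// Wait for completion; the result arrives through the shared completion ring
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
// cqe->res holds the byte count, or a negative errno value on failure
io_uring_cqe_seen(&ring, cqe); // Mark the CQE consumed so its slot can be reused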

**Why io_uring achieves higher performance:**

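- SQ/CQ ring buffers are shared between user space and the kernel → requests and completions are exchanged without copying them across the boundary (the I/O payload itself still moves between the device and your buffer)

- Batch multiple I/O submissions → fewer syscalls

- `IORING_SETUP_SQPOLL` mode: a kernel thread polls the SQ → no submission syscall while the poller is awake

- Pre-registered buffers: `io_uring_register_buffers()` avoids per-I/O address mapping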

7. Operating Systems from an AI/ML Perspective

NUMA Architecture

In multi-socket servers, each CPU socket has its own local memory.

Socket 0 Socket 1

┌─────────┐ ┌─────────┐

│ CPU 0 │──QPI────│ CPU 1 │

│ 32GB RAM│ │ 32GB RAM│

└─────────┘ └─────────┘

Local access: ~100ns Remote access: ~300ns

NUMA impact on AI training:

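- If DataLoader workers run on Socket 0 but the GPU hangs off Socket 1's PCIe lanes, all tensors must traverse the inter-socket link (QPI, or UPI on newer Intel platforms)

- Pin processes and memory to the NUMA node the GPU is attached to. For a GPU on node 1: `numactl --cpunodebind=1 --membind=1 python train.py`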

# Check NUMA topology
numactl --hardware

# Monitor NUMA statistics for a Python process
numastat -p python
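
CPU pinning can also be done from inside the training script. The sketch below assumes Linux, where the CPUs of each node are listed in `/sys/devices/system/node/nodeN/cpulist` and `os.sched_setaffinity` is available; the helper names `parse_cpulist` and `pin_to_numa_node` are hypothetical. It pins CPUs only: memory placement then follows the kernel's default local-allocation policy, so use `numactl --membind` when strict memory binding is required.

```python
import os


def parse_cpulist(text):
    """Parse a kernel cpulist string such as '0-3,8,10-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def pin_to_numa_node(node):
    """Restrict the current process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)   # 0 means the calling process
    return cpus


print(parse_cpulist("0-3,8,10-11"))  # [0, 1, 2, 3, 8, 10, 11]
```

Calling `pin_to_numa_node(1)` before creating DataLoader workers would let the workers inherit the affinity, assuming the GPU is attached to node 1.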

Python Multiprocessing vs Multithreading

import multiprocessing
import threading
import time


def cpu_bound_task(n):
    """Prime counting (CPU-intensive)"""
    count = 0
    for i in range(2, n):
        if all(i % j != 0 for j in range(2, int(i**0.5) + 1)):
            count += 1
    return count


N = 100_000

if __name__ == "__main__":  # Lets worker processes import this module safely
    # Multithreading: the GIL prevents true parallelism for CPU-bound tasks
    start = time.perf_counter()
    threads = [threading.Thread(target=cpu_bound_task, args=(N,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"Threading: {time.perf_counter() - start:.2f}s")

    # Multiprocessing: separate processes bypass the GIL
    start = time.perf_counter()
    with multiprocessing.Pool(4) as pool:
        pool.map(cpu_bound_task, [N] * 4)
    print(f"Multiprocessing: {time.perf_counter() - start:.2f}s")

# Typical result on a machine with 4 or more idle cores: multiprocessing is up to ~4x faster

GPU Resource Isolation with cgroups

# cgroups v2 setup for per-team CPU and memory limits
# (cgroups do not partition the GPU itself; MIG below handles that)

# Create a group for Team A (the cpu and memory controllers must be enabled
# in the parent's cgroup.subtree_control)
mkdir /sys/fs/cgroup/team_a

# CPU limit: 25 ms of CPU time per 100 ms period, i.e. 25% of one CPU
echo "25000 100000" > /sys/fs/cgroup/team_a/cpu.max

# Memory limit: 16GB
echo $((16 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/team_a/memory.max

# Add current process to the group
echo $$ > /sys/fs/cgroup/team_a/cgroup.procs

# GPU isolation using NVIDIA MIG + cgroups
# MIG splits one A100 into multiple instances (MIG mode must be enabled first)
nvidia-smi mig -cgi 3g.40gb -C # Create a 40GB instance (this profile exists on 80GB cards)

# Limit GPU resources with Docker
docker run --gpus '"device=0,1"' \
    --cpuset-cpus="0-15" \
    --memory="32g" \
    pytorch/pytorch:latest python train.py

Exploring the /proc Filesystem

# pgrep can match several processes; -n picks the most recently started one
PID=$(pgrep -n python)

# View process memory mappings
cat /proc/$PID/maps

# Check process status
grep -E "VmRSS|VmPeak|Threads" /proc/$PID/status

# CPU scheduling statistics
cat /proc/$PID/schedstat

# Check NUMA memory binding
head -20 /proc/$PID/numa_maps

# System-wide memory info
grep -E "MemTotal|MemFree|Cached|HugePages" /proc/meminfo
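
The same information is available from inside a training script, which is handy for logging memory use per step. This is a minimal sketch, assuming Linux and the `Key:\tvalue` layout of `/proc/<pid>/status`; the function names `parse_proc_status` and `read_self_status` are hypothetical.

```python
def parse_proc_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def read_self_status(path="/proc/self/status"):
    """Read the status of the current process (Linux only)."""
    with open(path) as f:
        return parse_proc_status(f.read())


sample = "Name:\tpython\nVmPeak:\t  204800 kB\nVmRSS:\t  102400 kB\nThreads:\t4\n"
status = parse_proc_status(sample)
print(status["VmRSS"], status["Threads"])  # 102400 kB 4
```

On a Linux machine, `read_self_status()["VmRSS"]` returns the resident set size of the running interpreter.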

Quiz

Test your understanding of core OS concepts.

**Question 1**: What happens when a memory access misses the TLB?

**Answer**: The translation is resolved in 5 steps.

**Explanation**:

1. The CPU looks up the virtual address in the TLB but finds no entry (TLB miss)

2. The MMU reads the page table base address (PGD) from the CR3 register

3. It walks 4 levels: PGD → PUD → PMD → PTE, reading the physical address (4 memory accesses)

4. The virtual-to-physical mapping is loaded into the TLB

5. The original memory access is retried and completes

A low TLB hit rate causes significant slowdowns. Using Huge Pages (2MB or 1GB) reduces the number of TLB entries needed and improves the hit rate.

**Question 2**: What role does `vruntime` play in the Linux CFS?

**Answer**: vruntime is a weighted virtual runtime that drives fair CPU allocation.

**Explanation**:

- Each task accumulates `vruntime`, which is actual CPU time scaled by priority weight

- Higher-priority tasks have their vruntime increase more slowly for the same execution time

- CFS always picks the task with the smallest vruntime (leftmost node in the Red-Black Tree)

- As a result, every task receives CPU time in proportion to its weight, which is the sense in which the scheduler is "completely fair"

- New tasks start with `min_vruntime` as their initial value to prevent starvation of existing tasks

**Question 3**: Which four conditions must all hold for a deadlock to occur?

**Answer**: Mutual exclusion, hold and wait, no preemption, circular wait.

**Explanation**:

1. **Mutual Exclusion**: A resource can only be held by one process at a time (e.g., a printer or mutex)

2. **Hold and Wait**: A process already holding resources waits to acquire more

3. **No Preemption**: Resources can only be released voluntarily; they cannot be forcibly taken

4. **Circular Wait**: A cycle exists: P1 waits for P2's resource, P2 waits for P3's, P3 waits for P1's

Deadlock is impossible if any single condition is eliminated. Prevention strategies are designed to break exactly one of these four conditions.

**Question 4**: Why does io_uring achieve higher performance than epoll?

**Answer**: Minimized syscall overhead and shared ring buffers.

**Explanation**:

- **Fewer syscalls**: epoll requires separate syscalls for event notification and each read/write; io_uring submits many I/O operations with a single `io_uring_submit()`

- **Shared ring buffers**: SQ and CQ are memory shared between user space and the kernel, so requests and completions are exchanged without copying them across the boundary

- **SQPOLL mode**: A kernel thread polls the SQ ring, so no `submit` syscall is needed at all

- **Pre-registered buffers**: `io_uring_register_buffers()` pre-pins memory so per-I/O address resolution is skipped

- The result is millions of I/O operations per second with minimal CPU overhead

**Question 5**: How does NUMA-unaware placement affect AI training?

**Answer**: Reduced memory bandwidth and increased latency degrade training throughput.

**Explanation**:

- In the illustrative figures used above, local NUMA memory access takes approximately 100ns and remote access approximately 300ns; the actual ratio depends on the platform and is often smaller

- AI training moves large tensor data to the GPU on every iteration, making memory bandwidth the critical bottleneck

- When DataLoader workers are bound to Socket 0 but the GPU is connected via Socket 1's PCIe lanes, all data must traverse the QPI interconnect

- **Optimization**: use `numactl --cpunodebind=N --membind=N` to pin processes and memory to the same NUMA node as the GPU; combine with `torch.cuda.set_device()` and a NUMA-aware DataLoader

Summary

The OS should not be a black box for AI engineers. Key takeaways:

- **Scheduling**: CFS vruntime ensures fair CPU sharing; watch out for priority inversion

- **Memory**: Virtual memory provides isolation; minimize TLB misses; use Huge Pages

- **Synchronization**: mutexes and condition variables prevent race conditions when used consistently; lock ordering prevents deadlock

- **I/O**: io_uring minimizes syscall overhead for high-throughput scenarios

- **AI Optimization**: NUMA-aware placement, GPU isolation with cgroups, /proc for bottleneck diagnosis