Operating Systems: A Complete Guide — From Processes & Memory to AI Workload Optimization
Introduction
The operating system is the **intermediary** between hardware and applications. As an AI/ML engineer, you need to go beyond writing Python code and be able to answer:
- Is slow PyTorch training caused by CPU scheduling or memory bandwidth?
- Which is more efficient: multiprocessing or multithreading?
- How do you implement isolation when multiple teams share GPU resources?
This guide answers those questions through core OS concepts with practical examples.
1. Processes and Threads
What Is a Process?
A process is a **running instance of a program**. Each process has its own independent virtual address space, file descriptors, and signal handlers.
**Process State Diagram:**
New ──→ Ready ──→ Running ──→ Terminated
          ↑           │
scheduler │           │ I/O request
          └─ Waiting ←┘
The **PCB (Process Control Block)** is the data structure the kernel maintains for each process.
// Linux task_struct (heavily simplified; field names and types differ across kernel versions)
struct task_struct {
    pid_t pid;                    // Process ID
    int state;                    // Current state
    struct mm_struct *mm;         // Virtual memory mapping
    struct files_struct *files;   // Open file table
    struct thread_struct thread;  // Saved CPU registers (architecture-specific)
    int prio;                     // Scheduling priority
};
Process Creation with fork() + exec()
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork failed");
        return 1;
    } else if (pid == 0) {
        // Child process: replace this process image with /bin/ls
        char *args[] = {"/bin/ls", "-la", NULL};
        execv("/bin/ls", args);
        perror("exec failed");  // Not reached if exec succeeds
        _exit(127);             // Do not fall through and report success
    } else {
        // Parent process: wait for the child and report how it ended
        int status;
        if (waitpid(pid, &status, 0) < 0) {
            perror("waitpid failed");
            return 1;
        }
        if (WIFEXITED(status))
            printf("Child exit code: %d\n", WEXITSTATUS(status));
    }
    return 0;
}
`fork()` copies the parent's virtual address space using **Copy-on-Write (CoW)**. Physical memory pages are only duplicated when a write actually occurs, making the operation efficient.
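Copy-on-Write itself is invisible to user code, but its observable effect is easy to check: after `fork()`, a write in the child never shows up in the parent. The following is a minimal Python sketch of that behavior on a Unix-like system; the helper name `fork_and_modify` is made up for this illustration.

```python
import os

def fork_and_modify():
    """Fork a child that mutates its copy of a list; return (child_view, parent_view)."""
    data = [0] * 4                     # lives in the parent's address space
    read_fd, write_fd = os.pipe()      # lets the child report what it sees

    pid = os.fork()
    if pid == 0:
        # Child: this write lands on the child's private copy of the page
        os.close(read_fd)
        data[0] = 99
        os.write(write_fd, repr(data).encode())
        os.close(write_fd)
        os._exit(0)                    # leave without running the parent's cleanup code

    # Parent: read the child's report, then reap the child
    os.close(write_fd)
    child_view = os.read(read_fd, 1024).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_view, repr(data)

child_view, parent_view = fork_and_modify()
print("child sees: ", child_view)
print("parent sees:", parent_view)
```

The parent's list is unchanged because the two processes only share physical pages until one of them writes.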
Threads vs Processes
| Aspect | Process | Thread |
| --------------- | -------------------- | ----------------------------- |
| Address Space | Independent | Shared |
| Creation Cost | High | Low |
| Communication | IPC (pipes, sockets) | Shared memory |
| Fault Isolation | Strong | Weak |
| Python GIL | Not affected | Bottleneck for CPU-bound work |
Context Switching
When the CPU switches from process A to process B:
1. Save A's register state to its PCB
2. Replace virtual memory mapping (page table pointer)
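3. Invalidate stale TLB entries (a full flush on older CPUs; with PCID/ASID tagging, entries from other address spaces can stay cached)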
4. Restore B's register state
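A context switch typically costs on the order of a few microseconds of direct work, plus indirect costs from cold caches and TLB entries afterwards. Switching between threads of the same process skips steps 2 and 3 because the address space stays the same. In AI inference servers, running far more threads than cores can therefore degrade performance.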
2. CPU Scheduling
Algorithm Comparison
**FIFO (First-In, First-Out)**
- Simple but suffers from the **Convoy Effect**: short jobs wait behind long ones
**SJF (Shortest Job First)**
- Minimizes average wait time, but execution time is hard to predict
**Round Robin**
- Fixed time quantum assigned to each process
- Very short quantums cause excessive context switching overhead
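To make the trade-offs between these three policies concrete, here is a small Python sketch that computes the average waiting time for the same workload under each of them. It assumes all jobs arrive at time 0 and ignores context-switch cost; the burst lengths are illustrative values, not measurements.

```python
from collections import deque

def fifo_wait(bursts):
    """Average waiting time when jobs run to completion in the given order."""
    waited, clock = 0, 0
    for burst in bursts:
        waited += clock          # this job waited for everything before it
        clock += burst
    return waited / len(bursts)

def sjf_wait(bursts):
    """Shortest Job First is FIFO applied to the jobs sorted by length."""
    return fifo_wait(sorted(bursts))

def round_robin_wait(bursts, quantum):
    """Average waiting time (completion time minus burst) under round robin."""
    remaining = list(bursts)
    queue = deque(range(len(bursts)))
    finish = [0] * len(bursts)
    clock = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        clock += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)      # not done: back of the queue
        else:
            finish[i] = clock
    return sum(f - b for f, b in zip(finish, bursts)) / len(bursts)

bursts = [24, 3, 3]              # one long job ahead of two short ones
print("FIFO:", fifo_wait(bursts))
print("SJF: ", sjf_wait(bursts))
print("RR:  ", round_robin_wait(bursts, quantum=4))
```

With the long job first, FIFO shows the convoy effect, SJF gives the lowest average wait, and round robin lands in between.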
**Linux CFS (Completely Fair Scheduler)**
- `vruntime`-based: tracks how much CPU each process has used
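- A Red-Black Tree ordered by `vruntime` keeps the least-run process at its leftmost node; insertions and removals cost O(log n), and the leftmost node is cached so picking the next task is cheap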
// vruntime calculation (conceptual pseudocode)
void update_vruntime(struct task_struct *task, u64 delta_exec) {
    // Higher priority means vruntime grows more slowly
    u64 weight = prio_to_weight[task->nice + 20];
    task->vruntime += delta_exec * NICE_0_WEIGHT / weight;
}
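The effect of the weighting can be simulated in a few lines of Python. This sketch is a toy model, not the kernel's implementation: a binary heap stands in for the red-black tree, every time slice has the same length, and `simulate_cfs` and the task names are invented for the example. The three weights are the kernel's values for nice -5, 0 and 5.

```python
import heapq

NICE_0_WEIGHT = 1024
# Subset of the nice-to-weight table (nice -5, 0, 5)
WEIGHTS = {-5: 3121, 0: 1024, 5: 335}

def simulate_cfs(tasks, ticks, tick_ns=1_000_000):
    """tasks maps name -> nice value. Returns name -> number of ticks of CPU received."""
    runqueue = [(0.0, name) for name in sorted(tasks)]   # (vruntime, name)
    heapq.heapify(runqueue)
    cpu_ticks = {name: 0 for name in tasks}
    for _ in range(ticks):
        vruntime, name = heapq.heappop(runqueue)         # smallest vruntime runs next
        cpu_ticks[name] += 1
        # Same formula as the pseudocode: heavier tasks accumulate vruntime more slowly
        vruntime += tick_ns * NICE_0_WEIGHT / WEIGHTS[tasks[name]]
        heapq.heappush(runqueue, (vruntime, name))
    return cpu_ticks

shares = simulate_cfs({"trainer": 0, "logger": 5}, ticks=10_000)
ratio = shares["trainer"] / shares["logger"]
print(shares, f"ratio={ratio:.2f}")
```

The CPU split converges to the ratio of the weights (1024 / 335, roughly 3 to 1), which is what "fair" means here: proportional to weight, not equal.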
Priority Inversion Problem
High priority H ──────────────────────→ Waiting (needs lock)
Mid priority M ──────────────────────→ Runs before H!
Low priority L ──→ Holds lock ──→ Gets preempted
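H is blocked on a lock that L holds. If M then preempts L, L cannot finish its critical section, so H effectively waits for M even though M has lower priority. The solution is **Priority Inheritance**: L temporarily inherits H's priority while holding the lock, so M can no longer preempt it.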
3. Memory Management
Virtual Memory and Paging
Every process sees its own private virtual address space, as if it had the machine's memory to itself. The kernel uses **page tables** to translate virtual addresses to physical ones.
Virtual address (48-bit on x86-64):
┌──────────┬──────────┬──────────┬──────────┬────────────┐
│ PGD(9) │ PUD(9) │ PMD(9) │ PTE(9) │ offset(12) │
└──────────┴──────────┴──────────┴──────────┴────────────┘
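The field layout above maps directly onto shifts and masks. A minimal Python sketch, assuming 4 KiB pages and the 9/9/9/9/12 split shown in the diagram (the function name and the sample address are arbitrary):

```python
def split_virtual_address(vaddr):
    """Split a 48-bit virtual address into its four table indices and page offset."""
    index_mask = 0x1FF                       # 9 bits per page-table level
    return {
        "pgd": (vaddr >> 39) & index_mask,
        "pud": (vaddr >> 30) & index_mask,
        "pmd": (vaddr >> 21) & index_mask,
        "pte": (vaddr >> 12) & index_mask,
        "offset": vaddr & 0xFFF,             # 12 bits: byte within the 4 KiB page
    }

fields = split_virtual_address(0x7F1234567ABC)
print(fields)
```

Each 9-bit index selects one of 512 entries in a table, which is why every level of the page table fits in a single 4 KiB page of 8-byte entries.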
TLB (Translation Lookaside Buffer)
Page table walks require memory accesses and are slow. The TLB caches recent translations.
**TLB miss handling steps:**
1. CPU looks up virtual address in TLB → miss
2. CPU references page table base address from CR3 register
3. 4-level page table walk (4 memory accesses)
4. Extract physical address from PTE → store in TLB
5. Retry the original memory access
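The steps above can be sketched as a tiny software model. Real TLBs are hardware structures with their own replacement policies; this Python toy assumes LRU replacement and uses a plain dictionary in place of the 4-level walk. `TinyTLB` and the page-table contents are invented for the example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096

class TinyTLB:
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # vpn -> pfn, stands in for the page table walk
        self.entries = OrderedDict()      # cached translations, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)             # mark as most recently used
        else:
            self.misses += 1
            pfn = self.page_table[vpn]                # the slow path: walk the page table
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)      # evict the least recently used entry
            self.entries[vpn] = pfn                   # store the translation in the TLB
        return self.entries[vpn] * PAGE_SIZE + offset

page_table = {vpn: vpn + 100 for vpn in range(8)}
tlb = TinyTLB(capacity=2, page_table=page_table)
addresses = [0x0000, 0x0008, 0x1000, 0x0010, 0x2000, 0x1004]
physical = [tlb.translate(a) for a in addresses]
print(f"hits={tlb.hits} misses={tlb.misses}")
```

Accesses that stay within a page already in the TLB hit; touching a third page with only two entries forces an eviction and a later miss.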
Page Replacement Algorithms
**LRU (Least Recently Used):**
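Evicts the page that hasn't been used for the longest time. Exact LRU is too expensive to track on every memory access, so kernels approximate it. Linux combines per-page reference bits with active and inactive page lists, in the same spirit as the clock algorithm.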
Clock Algorithm:
Maintain a reference bit (R) per page
Pointer cycles through pages; select R=0 pages for eviction
If R=1, reset to R=0 and advance to next page
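A compact Python sketch of the clock algorithm as described above. It is a teaching model with a fixed number of frames; the class name `ClockReplacer` and the reference string are illustrative.

```python
class ClockReplacer:
    def __init__(self, num_frames):
        self.frames = [None] * num_frames   # page held by each frame
        self.ref = [0] * num_frames         # reference bit (R) per frame
        self.hand = 0                       # the clock pointer
        self.faults = 0

    def access(self, page):
        """Touch a page; return True if the access caused a page fault."""
        if page in self.frames:
            self.ref[self.frames.index(page)] = 1    # hit: set the reference bit
            return False
        self.faults += 1
        # Sweep: frames with R=1 get a second chance (R reset to 0)
        while self.ref[self.hand] == 1:
            self.ref[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.frames)
        # The hand now points at a frame with R=0: use it for the new page
        self.frames[self.hand] = page
        self.ref[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.frames)
        return True

clock = ClockReplacer(num_frames=3)
for page in [1, 2, 3, 4, 2, 5]:
    clock.access(page)
print(clock.frames, "faults:", clock.faults)
```

In this run page 2 is referenced again before the hand reaches it, so it survives and page 3 is evicted instead.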
Memory-Mapped Files in Python
import mmap
import numpy as np

# Access large datasets as memory-mapped files
with open("large_dataset.bin", "r+b") as f:
    # Map the entire file into the virtual address space (length 0 = whole file)
    mm = mmap.mmap(f.fileno(), 0)
    # Slice to access only the needed regions (actual I/O happens on demand)
    header = mm[:128]
    record = mm[128:256]
    mm.close()

# AI training: numpy memmap treats disk data like an array.
# np.memmap expects raw binary data with no header, so the file here is a raw
# float32 dump; for a real .npy file, use np.load(path, mmap_mode="r") instead.
data = np.memmap("features.bin", dtype="float32", mode="r", shape=(1_000_000, 512))
batch = data[0:1024]  # Only the requested rows are paged in from disk
4. Synchronization
Mutex and Condition Variable
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define BUFFER_SIZE 10
int buffer[BUFFER_SIZE];
int count = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
void *producer(void *arg) {
    for (int i = 0; i < 50; i++) {
        pthread_mutex_lock(&mutex);
        while (count == BUFFER_SIZE)
            pthread_cond_wait(&not_full, &mutex);  // Wait if buffer full
        buffer[count++] = i;
        printf("Produced: %d (count=%d)\n", i, count);
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 50; i++) {
        pthread_mutex_lock(&mutex);
        while (count == 0)
            pthread_cond_wait(&not_empty, &mutex);  // Wait if buffer empty
        int val = buffer[--count];
        printf("Consumed: %d (count=%d)\n", val, count);
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main(void) {
    pthread_t prod, cons;
    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return 0;
}
Deadlock
**Coffman Conditions** — all four must hold simultaneously for deadlock:
1. **Mutual Exclusion**: A resource can only be used by one process at a time
2. **Hold and Wait**: A process holds resources while waiting for additional ones
3. **No Preemption**: Resources cannot be forcibly taken; only voluntary release
4. **Circular Wait**: P1 waits for P2, P2 waits for P3, P3 waits for P1
**Prevention Strategies:**
- Fix a global resource ordering (eliminates circular wait)
- Request all resources at once (eliminates hold-and-wait)
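- Use the Banker's Algorithm to grant requests only when the system stays in a safe state (strictly speaking, this is deadlock avoidance rather than prevention)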
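The safe-state check at the core of the Banker's Algorithm is short enough to sketch in Python. The matrices below are illustrative numbers for five processes and three resource types; `is_safe` is a name chosen for this example.

```python
def is_safe(available, allocation, need):
    """Return (safe, order): whether some completion order exists, and one such order."""
    work = list(available)
    finished = [False] * len(allocation)
    order = []
    progressed = True
    while progressed:
        progressed = False
        for i, (alloc, still_needed) in enumerate(zip(allocation, need)):
            # A process can finish if its remaining need fits in what is free now
            if not finished[i] and all(n <= w for n, w in zip(still_needed, work)):
                work = [w + a for w, a in zip(work, alloc)]   # it releases what it held
                finished[i] = True
                order.append(i)
                progressed = True
    return all(finished), order

allocation = [[0, 1, 0], [2, 0, 0], [3, 0, 2], [2, 1, 1], [0, 0, 2]]
need       = [[7, 4, 3], [1, 2, 2], [6, 0, 0], [0, 1, 1], [4, 3, 1]]

print(is_safe([3, 3, 2], allocation, need))   # enough free resources to start a chain
print(is_safe([1, 0, 0], allocation, need))   # no process can finish first
```

A request is granted only if the state that would result still passes this check; otherwise the requester waits.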
5. File Systems
ext4 and inodes
# View inode information
stat /etc/hostname
#   File: /etc/hostname
#   Size: 12  Blocks: 8  IO Block: 4096  regular file
#   Inode: 131073  Links: 1
#   Access: 2026-03-17 10:00:00

# Check remaining inodes (inode exhaustion has the same effect as a full disk)
df -i /
**Inode structure:**
- File metadata (permissions, owner, timestamps)
- Data block pointers (direct / indirect / double-indirect)
- Filename is NOT in the inode — it lives in directory entries
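Because names live in directory entries, two names can point at one inode. A short Python sketch that shows this with a hard link in a temporary directory (the file names are arbitrary, and the file system holding the temp directory is assumed to support hard links):

```python
import os
import tempfile

def hard_link_demo():
    """Create a file plus a hard link; return (same_inode, link_count)."""
    with tempfile.TemporaryDirectory() as directory:
        original = os.path.join(directory, "data.txt")
        alias = os.path.join(directory, "alias.txt")
        with open(original, "w") as f:
            f.write("hello")
        os.link(original, alias)              # second directory entry, same inode
        a, b = os.stat(original), os.stat(alias)
        return a.st_ino == b.st_ino, a.st_nlink

same_inode, link_count = hard_link_demo()
print("same inode:", same_inode, "links:", link_count)
```

The link count in the inode is what `stat` reports as `Links`; the data blocks are freed only when it drops to zero and no process has the file open.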
Journaling
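ext4 uses **journaling** to keep the file system consistent after unexpected power loss. In its default `data=ordered` mode only metadata goes through the journal; file data is written to its final location before the metadata is committed.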
Before write: Record changes in journal first (Write-Ahead Log)
After write: Apply changes to actual blocks
After commit: Remove journal entry
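The three phases can be modeled with a toy in-memory store. This Python sketch only illustrates the write-ahead idea; it is not how ext4's journal is structured, and the class `JournaledStore` and its `crash_before_apply` flag are invented for the example.

```python
class JournaledStore:
    def __init__(self):
        self.blocks = {}      # the "real" on-disk blocks
        self.journal = []     # pending (block_no, data) records

    def write(self, block_no, data, crash_before_apply=False):
        self.journal.append((block_no, data))   # 1. record the change in the journal
        if crash_before_apply:
            return                               # simulated power loss
        self.blocks[block_no] = data             # 2. apply the change to the actual block
        self.journal.pop()                       # 3. retire the journal entry

    def recover(self):
        """Replay every journaled change that never reached its block."""
        replayed = len(self.journal)
        for block_no, data in self.journal:
            self.blocks[block_no] = data
        self.journal.clear()
        return replayed

store = JournaledStore()
store.write(1, "metadata-A")
store.write(2, "metadata-B", crash_before_apply=True)
print("before recovery:", store.blocks)
print("replayed:", store.recover(), "after recovery:", store.blocks)
```

Replaying is safe to repeat because applying the same record twice leaves the block in the same state.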
VFS Abstraction Layer
User space: open() read() write()
│
Kernel VFS: vfs_open() vfs_read() ← common interface
│
File systems: ext4 | btrfs | tmpfs | procfs | nfs
│
Block device layer → actual hardware
6. I/O and Interrupts
DMA and Interrupt Handlers
Instead of the CPU copying data to memory directly, the DMA controller handles it.
1. CPU → DMA: "Copy disk block X to memory address Y"
2. DMA transfers independently (CPU does other work)
3. DMA completes → interrupt fires
4. CPU: finishes current instruction → looks up interrupt vector → runs ISR
5. ISR: marks I/O complete, wakes waiting process
epoll vs io_uring
**epoll** (event-driven I/O multiplexing):
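// epoll: still requires multiple syscalls
int epfd = epoll_create1(0);
struct epoll_event event = { .events = EPOLLIN, .data.fd = fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
epoll_wait(epfd, events, MAX_EVENTS, -1);  // syscall: wait until fd is readable
read(fd, buf, size);                       // another syscall: do the actual read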
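The same readiness-then-read pattern is available from Python through the standard `selectors` module, which picks epoll on Linux. A minimal sketch using a local socket pair so it runs without a network; `wait_and_read` is a hypothetical helper name.

```python
import selectors
import socket

def wait_and_read(timeout=1.0):
    """Register a socket, wait for readiness, then read from every ready socket."""
    selector = selectors.DefaultSelector()        # epoll-backed on Linux
    reader, writer = socket.socketpair()
    try:
        selector.register(reader, selectors.EVENT_READ)
        writer.sendall(b"ping")
        ready = selector.select(timeout=timeout)  # step 1: readiness notification
        # step 2: a separate receive call per ready socket
        return [key.fileobj.recv(1024) for key, _events in ready]
    finally:
        selector.unregister(reader)
        selector.close()
        reader.close()
        writer.close()

print(wait_and_read())
```

The two-step shape is the point: the selector only says which descriptors are ready, and each transfer still needs its own call.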
**io_uring** (Linux 5.1+, shared ring buffer):
// io_uring: submit I/O without per-operation syscalls
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, size, 0);
io_uring_submit(&ring);  // One syscall to submit many I/Os
// Wait for completion (the result arrives through the shared completion ring)
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
io_uring_cqe_seen(&ring, cqe);  // Mark the completion as consumed
**Why io_uring achieves higher performance:**
- SQ/CQ ring buffers are shared between user space and the kernel → requests and completions are exchanged without copying them across the boundary
- Batch multiple I/O submissions → fewer syscalls
- `IORING_SETUP_SQPOLL` mode: a kernel thread polls the SQ → no submit syscall while that thread is awake
- Pre-registered buffers: `io_uring_register_buffers()` avoids per-I/O address mapping
7. Operating Systems from an AI/ML Perspective
NUMA Architecture
In multi-socket servers, each CPU socket has its own local memory.
Socket 0 Socket 1
┌─────────┐ ┌─────────┐
│ CPU 0 │──QPI────│ CPU 1 │
│ 32GB RAM│ │ 32GB RAM│
└─────────┘ └─────────┘
Local access: ~100ns Remote access: ~300ns (illustrative; the ratio depends on the platform)
NUMA impact on AI training:
- If DataLoader workers run on Socket 0 but GPU is on Socket 1's PCIe, all tensors must traverse QPI
- Pin processes and memory to the NUMA node the GPU is attached to (`nvidia-smi topo -m` shows the affinity). For a GPU on node 1: `numactl --cpunodebind=1 --membind=1 python train.py`
# Check NUMA topology
numactl --hardware
# Monitor NUMA statistics for a Python process
numastat -p python
Python Multiprocessing vs Multithreading
import multiprocessing
import threading
import time

def cpu_bound_task(n):
    """Prime counting (CPU-intensive)"""
    count = 0
    for i in range(2, n):
        if all(i % j != 0 for j in range(2, int(i**0.5) + 1)):
            count += 1
    return count

N = 100_000

if __name__ == "__main__":  # needed for multiprocessing on spawn-based platforms
    # Multithreading: the GIL prevents true parallelism for CPU-bound tasks
    start = time.time()
    threads = [threading.Thread(target=cpu_bound_task, args=(N,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"Threading: {time.time() - start:.2f}s")

    # Multiprocessing: separate processes bypass the GIL
    start = time.time()
    with multiprocessing.Pool(4) as pool:
        pool.map(cpu_bound_task, [N] * 4)
    print(f"Multiprocessing: {time.time() - start:.2f}s")

# Expected: multiprocessing approaches 4x faster on a machine with at least 4 idle cores
GPU Resource Isolation with cgroups
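# cgroups v2 setup for per-team CPU and memory limits
# (cgroups do not meter GPU compute or GPU memory; GPUs are partitioned with MIG
# or assigned per container, as in the commands further down)
# Assumes the cpu and memory controllers are enabled in the parent's cgroup.subtree_control
# Create a group for Team A
mkdir /sys/fs/cgroup/team_a
# CPU limit: 25 ms of CPU time per 100 ms period, i.e. 25% of one CPU
# (25% of a 64-core host would be a quota of 1600000)
echo "25000 100000" > /sys/fs/cgroup/team_a/cpu.max
# Memory limit: 16GB
echo $((16 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/team_a/memory.max
# Add current process to the group
echo $$ > /sys/fs/cgroup/team_a/cgroup.procs

# GPU isolation using NVIDIA MIG + cgroups
# MIG splits one A100 into multiple instances (MIG mode must already be enabled on the GPU)
nvidia-smi mig -cgi 3g.40gb -C  # Create a 40GB instance (profile of the 80GB A100)

# Limit GPU resources with Docker
docker run --gpus '"device=0,1"' \
  --cpuset-cpus="0-15" \
  --memory="32g" \
  pytorch/pytorch:latest python train.py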
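The two numbers written to `cpu.max` are a quota and a period in microseconds, so the quota can exceed the period when a group should get more than one CPU. A small Python helper makes the arithmetic explicit; `cpu_max_line` is a made-up name and the core counts are examples.

```python
def cpu_max_line(cpus, period_us=100_000):
    """Format a cgroup v2 cpu.max value granting `cpus` CPUs' worth of time per period."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    quota_us = round(cpus * period_us)     # CPU time allowed within each period
    return f"{quota_us} {period_us}"

print(cpu_max_line(0.25))        # a quarter of one CPU
print(cpu_max_line(0.25 * 64))   # a quarter of a 64-core host, i.e. 16 CPUs
```

Writing the returned string to a group's `cpu.max` file applies the limit.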
Exploring the /proc Filesystem
# pgrep -n picks the newest matching process, so the path stays valid
# even when several Python processes are running
# View process memory mappings
cat /proc/$(pgrep -n python)/maps
# Check process status
grep -E "VmRSS|VmPeak|Threads" /proc/$(pgrep -n python)/status
# CPU scheduling statistics
cat /proc/$(pgrep -n python)/schedstat
# Check NUMA memory binding
head -20 /proc/$(pgrep -n python)/numa_maps
# System-wide memory info
grep -E "MemTotal|MemFree|Cached|HugePages" /proc/meminfo
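The same fields can be read from inside a training script, which is handy for logging memory use per step. A minimal Python sketch that parses the `Name: value` format of `/proc/<pid>/status`; the sample text is fabricated for the demonstration and `parse_proc_status` is a hypothetical helper.

```python
import os

def parse_proc_status(text, keys=("VmRSS", "VmPeak", "Threads")):
    """Extract selected fields from the contents of /proc/<pid>/status."""
    result = {}
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        if separator and name in keys:
            result[name] = value.strip()
    return result

sample = "Name:\tpython3\nVmPeak:\t  204800 kB\nVmRSS:\t   51200 kB\nThreads:\t4\n"
print(parse_proc_status(sample))

# On Linux, the current process can inspect itself the same way
if os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as f:
        print(parse_proc_status(f.read()))
```

Values are returned as strings with their units, so callers decide how to convert them.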
Quiz
Test your understanding of core OS concepts.
**Answer**: TLB miss handling proceeds in 5 steps.
**Explanation**:
1. The CPU looks up the virtual address in the TLB but finds no entry (TLB miss)
2. The MMU reads the page table base address (PGD) from the CR3 register
3. It walks 4 levels: PGD → PUD → PMD → PTE, reading the physical address (4 memory accesses)
4. The virtual-to-physical mapping is loaded into the TLB
5. The original memory access is retried and completes
A low TLB hit rate causes significant slowdowns. Using Huge Pages (2MB or 1GB) reduces the number of TLB entries needed and improves the hit rate.
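The benefit of Huge Pages follows from simple arithmetic: the memory a TLB can cover is its entry count times the page size. A short Python sketch, where the 1536-entry TLB and the 8 GiB working set are hypothetical figures chosen for illustration:

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

def tlb_reach_bytes(entries, page_size):
    """Amount of memory that can be addressed without a TLB miss."""
    return entries * page_size

def pages_needed(working_set, page_size):
    """Number of pages (and thus translations) covering a working set, rounded up."""
    return -(-working_set // page_size)

entries = 1536                    # hypothetical TLB size
working_set = 8 * GIB             # hypothetical tensor working set

for label, page_size in [("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB)]:
    reach = tlb_reach_bytes(entries, page_size)
    print(f"{label} pages: reach {reach / MIB:.0f} MiB, "
          f"{pages_needed(working_set, page_size)} translations for the working set")
```

With the larger page size, the same number of TLB entries covers 512 times more memory.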
**Answer**: vruntime is a weighted virtual runtime that drives fair CPU allocation.
**Explanation**:
- Each task accumulates `vruntime`, which is actual CPU time scaled by priority weight
- Higher-priority tasks have their vruntime increase more slowly for the same execution time
- CFS always picks the task with the smallest vruntime (leftmost node in the Red-Black Tree)
- As a result, every task receives CPU time proportional to its weight, which is the sense in which the scheduler is "completely fair"
- New tasks start with `min_vruntime` as their initial value to prevent starvation of existing tasks
**Answer**: Mutual exclusion, hold and wait, no preemption, circular wait.
**Explanation**:
1. **Mutual Exclusion**: A resource can only be held by one process at a time (e.g., a printer or mutex)
2. **Hold and Wait**: A process already holding resources waits to acquire more
3. **No Preemption**: Resources can only be released voluntarily; they cannot be forcibly taken
4. **Circular Wait**: A cycle exists: P1 waits for P2's resource, P2 waits for P3's, P3 waits for P1's
Deadlock is impossible if any single condition is eliminated. Prevention strategies are designed to break exactly one of these four conditions.
**Answer**: io_uring outperforms epoll through minimized syscall overhead and shared ring buffers.
**Explanation**:
- **Fewer syscalls**: epoll requires separate syscalls for event notification and each read/write; io_uring submits many I/O operations with a single `io_uring_submit()`
- **Shared ring buffers**: SQ and CQ are memory shared between user space and kernel, so requests and completions are exchanged without being copied across the boundary
- **SQPOLL mode**: A kernel thread polls the SQ ring, so no `submit` syscall is needed while that thread is awake
- **Pre-registered buffers**: `io_uring_register_buffers()` pre-pins memory so per-I/O address resolution is skipped
- The result can reach millions of I/O operations per second on fast storage with low CPU overhead
**Answer**: Reduced memory bandwidth and increased latency degrade training throughput.
**Explanation**:
- Using the figures from the NUMA section, local memory access is approximately 100ns and remote access approximately 300ns, or 3x slower; the exact ratio depends on the platform
- AI training moves large tensor data to the GPU on every iteration, making memory bandwidth the critical bottleneck
- When DataLoader workers are bound to Socket 0 but the GPU is connected via Socket 1's PCIe lanes, all data must traverse the QPI interconnect
- **Optimization**: use `numactl --cpunodebind=N --membind=N` to pin processes and memory to the same NUMA node as the GPU; combine with `torch.cuda.set_device()` and a NUMA-aware DataLoader
Summary
The OS should not be a black box for AI engineers. Key takeaways:
- **Scheduling**: CFS vruntime ensures fair CPU sharing; watch out for priority inversion
- **Memory**: Virtual memory provides isolation; minimize TLB misses; use Huge Pages
- **Synchronization**: mutex/condvar prevent race conditions when used consistently; lock ordering prevents deadlock
- **I/O**: io_uring minimizes syscall overhead for high-throughput scenarios
- **AI Optimization**: NUMA-aware placement, GPU partitioning with MIG plus CPU/memory limits with cgroups, /proc for bottleneck diagnosis