Operating Systems: A Complete Guide — From Processes & Memory to AI Workload Optimization

Introduction

The operating system is the intermediary between hardware and applications. As an AI/ML engineer, you need to go beyond writing Python code and be able to answer:

  • Is slow PyTorch training caused by CPU scheduling or memory bandwidth?
  • Which is more efficient: multiprocessing or multithreading?
  • How do you implement isolation when multiple teams share GPU resources?

This guide answers those questions through core OS concepts with practical examples.


1. Processes and Threads

What Is a Process?

A process is a running instance of a program. Each process has its own independent virtual address space, file descriptors, and signal handlers.

Process State Diagram:

  New ──→ Ready ─────────→ Running ──→ Terminated
           ↑    dispatch      │
           │                  │ I/O request
           └──── Waiting ←────┘
            I/O complete

The PCB (Process Control Block) is the data structure the kernel maintains for each process.

// Linux task_struct (simplified)
struct task_struct {
    pid_t pid;              // Process ID
    int   state;            // Current state
    struct mm_struct *mm;   // Virtual memory mapping
    struct files_struct *files; // Open file table
    struct thread_info thread_info; // Low-level arch/thread state
    long prio;              // Scheduling priority
};
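The kernel exposes parts of each task's PCB through the `/proc` filesystem. A minimal Linux-only sketch that reads a process's own entry (field positions follow `proc(5)`):

```python
import os

# The kernel surfaces task_struct fields via /proc/<pid>/stat (Linux-specific)
with open(f"/proc/{os.getpid()}/stat") as f:
    fields = f.read().split()

pid, comm, state = int(fields[0]), fields[1], fields[2]
# state: R=running, S=sleeping, D=uninterruptible sleep, Z=zombie, T=stopped
print(pid, comm, state)
```

A process reading its own entry is on-CPU, so it reports state `R`.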

Process Creation with fork() + exec()

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork failed");
        return 1;
    } else if (pid == 0) {
        // Child process
        char *args[] = {"/bin/ls", "-la", NULL};
        execv("/bin/ls", args);
        perror("exec failed"); // Reached only if exec fails
        return 1;
    } else {
        // Parent process
        int status;
        waitpid(pid, &status, 0);
        printf("Child exit code: %d\n", WEXITSTATUS(status));
    }
    return 0;
}

fork() copies the parent's virtual address space using Copy-on-Write (CoW). Physical memory pages are only duplicated when a write actually occurs, making the operation efficient.
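The CoW semantics are observable from Python. A minimal sketch (Unix-only, since it relies on `os.fork`): a write in the child triggers a private page copy, leaving the parent's view untouched.

```python
import os

data = [0]  # lives in pages shared between parent and child right after fork

pid = os.fork()
if pid == 0:
    data[0] = 99   # this write triggers a private copy of the page in the child
    os._exit(0)
else:
    os.waitpid(pid, 0)
    # The parent still sees the original value: the address spaces are independent
    print("parent sees:", data[0])
```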

Threads vs Processes

Aspect           | Process               | Thread
-----------------|-----------------------|------------------------------
Address Space    | Independent           | Shared
Creation Cost    | High                  | Low
Communication    | IPC (pipes, sockets)  | Shared memory
Fault Isolation  | Strong                | Weak
Python GIL       | Not affected          | Bottleneck for CPU-bound work

Context Switching

When the CPU switches from process A to process B:

  1. Save A's register state to its PCB
  2. Replace virtual memory mapping (page table pointer)
  3. Flush TLB (cache invalidation)
  4. Restore B's register state

Context switching costs a few microseconds. In AI inference servers, excessive threads can actually degrade performance.
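You can watch a process's own context-switch counts with the standard `resource` module (Unix-only) to see whether it is blocking voluntarily or being preempted:

```python
import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
# Voluntary switches: the process blocked (I/O, lock wait)
# Involuntary switches: the scheduler preempted it
print("voluntary:", usage.ru_nvcsw, "involuntary:", usage.ru_nivcsw)
```

A high involuntary count on an inference server suggests too many runnable threads competing for cores.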


2. CPU Scheduling

Algorithm Comparison

FIFO (First-In, First-Out)

  • Simple but suffers from the Convoy Effect: short jobs wait behind long ones

SJF (Shortest Job First)

  • Minimizes average wait time, but execution time is hard to predict

Round Robin

  • Fixed time quantum assigned to each process
  • A quantum that is too short causes excessive context-switching overhead

Linux CFS (Completely Fair Scheduler)

  • vruntime-based: tracks how much CPU each process has used
  • Red-Black Tree selects the least-run process in O(log n)

// vruntime calculation (conceptual pseudocode)
void update_vruntime(struct task_struct *task, u64 delta_exec) {
    // Higher priority means vruntime grows more slowly
    u64 weight = prio_to_weight[task->nice + 20];
    task->vruntime += delta_exec * NICE_0_WEIGHT / weight;
}
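The fairness property can be simulated with a min-heap standing in for the red-black tree. A conceptual sketch, using a small subset of the kernel's nice-to-weight table (values 3121, 1024, and 335 are the real entries for nice −5, 0, and +5):

```python
import heapq

NICE_0_WEIGHT = 1024
# Subset of the kernel's prio_to_weight table: nice -5, 0, +5
WEIGHT = {-5: 3121, 0: 1024, 5: 335}

def simulate_cfs(tasks, slice_ns=1_000_000, steps=300):
    """tasks: list of (name, nice). Returns accumulated CPU time per task."""
    runqueue = [(0, name, nice) for name, nice in tasks]
    heapq.heapify(runqueue)
    cpu_time = {name: 0 for name, _ in tasks}
    for _ in range(steps):
        # Always run the task with the smallest vruntime (leftmost tree node)
        vruntime, name, nice = heapq.heappop(runqueue)
        cpu_time[name] += slice_ns
        # Higher weight (lower nice) → vruntime grows more slowly → runs more often
        vruntime += slice_ns * NICE_0_WEIGHT // WEIGHT[nice]
        heapq.heappush(runqueue, (vruntime, name, nice))
    return cpu_time

times = simulate_cfs([("hi", -5), ("mid", 0), ("lo", 5)])
```

Running this shows CPU time split roughly in proportion to the weights, which is exactly the "completely fair" guarantee.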

Priority Inversion Problem

High priority H ──────────────────────→ Waiting (needs lock)
Mid  priority M ──────────────────────→ Runs before H!
Low  priority L ──→ Holds lock ──→ Gets preempted

When L is preempted by M while holding a lock, H runs later than M. The solution is Priority Inheritance: L temporarily inherits H's priority while holding the lock.


3. Memory Management

Virtual Memory and Paging

Every process believes it owns all physical memory. The kernel uses page tables to translate virtual addresses to physical ones.

Virtual address (48-bit on x86-64):

┌──────────┬──────────┬──────────┬──────────┬────────────┐
│  PGD(9)  │  PUD(9)  │  PMD(9)  │  PTE(9)  │ offset(12) │
└──────────┴──────────┴──────────┴──────────┴────────────┘
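The 4-level split can be computed directly with shifts and masks. An illustrative helper (not a kernel API):

```python
def split_vaddr(vaddr):
    """Split a 48-bit x86-64 virtual address into 4 table indices + page offset."""
    offset = vaddr & 0xFFF           # low 12 bits: byte within the 4 KB page
    pte = (vaddr >> 12) & 0x1FF      # 9 bits per level → 512 entries per table
    pmd = (vaddr >> 21) & 0x1FF
    pud = (vaddr >> 30) & 0x1FF
    pgd = (vaddr >> 39) & 0x1FF
    return pgd, pud, pmd, pte, offset

# 9 + 9 + 9 + 9 + 12 = 48 bits total
```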

TLB (Translation Lookaside Buffer)

Page table walks require memory accesses and are slow. The TLB caches recent translations.

TLB miss handling steps:

  1. CPU looks up virtual address in TLB → miss
  2. CPU references page table base address from CR3 register
  3. 4-level page table walk (4 memory accesses)
  4. Extract physical address from PTE → store in TLB
  5. Retry the original memory access

Page Replacement Algorithms

LRU (Least Recently Used): Evicts the page that hasn't been used for the longest time. Linux uses the clock algorithm (an LRU approximation).

Clock Algorithm:
  Maintain a reference bit (R) per page
  Pointer cycles through pages; select R=0 pages for eviction
  If R=1, reset to R=0 and advance to next page
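The clock sweep above, sketched in Python. This is a toy model of the second-chance idea, not Linux's actual multi-list implementation:

```python
def clock_evict(pages, hand=0):
    """pages: list of [page_id, ref_bit]. Returns (evicted_id, new_hand)."""
    while True:
        page_id, ref = pages[hand]
        if ref == 0:
            return page_id, (hand + 1) % len(pages)
        pages[hand][1] = 0            # second chance: clear R and move on
        hand = (hand + 1) % len(pages)

# Page 10 was recently referenced (R=1), so the hand skips it and evicts page 11
evicted, hand = clock_evict([[10, 1], [11, 0], [12, 1]])
```

If every page has R=1, the hand clears all bits on the first sweep and evicts the first page on the second, which is why the algorithm approximates LRU.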

Memory-Mapped Files in Python

import mmap
import os

# Access large datasets as memory-mapped files
with open("large_dataset.bin", "r+b") as f:
    # Map entire file into virtual address space
    mm = mmap.mmap(f.fileno(), 0)

    # Slice to access only needed regions (actual I/O happens on demand)
    header = mm[:128]
    record = mm[128:256]

    mm.close()

# AI training: numpy memmap treats disk data like an array
import numpy as np
data = np.memmap("features.npy", dtype="float32", mode="r", shape=(1_000_000, 512))
batch = data[0:1024]  # Only loads the requested batch from disk

4. Synchronization

Mutex and Condition Variable

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define BUFFER_SIZE 10

int buffer[BUFFER_SIZE];
int count = 0;

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

void *producer(void *arg) {
    for (int i = 0; i < 50; i++) {
        pthread_mutex_lock(&mutex);
        while (count == BUFFER_SIZE)
            pthread_cond_wait(&not_full, &mutex);  // Wait if buffer full

        buffer[count++] = i;
        printf("Produced: %d (count=%d)\n", i, count);
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 50; i++) {
        pthread_mutex_lock(&mutex);
        while (count == 0)
            pthread_cond_wait(&not_empty, &mutex);  // Wait if buffer empty

        int val = buffer[--count];
        printf("Consumed: %d (count=%d)\n", val, count);
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main(void) {
    pthread_t prod, cons;
    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return 0;
}

Deadlock

Coffman Conditions — all four must hold simultaneously for deadlock:

  1. Mutual Exclusion: A resource can only be used by one process at a time
  2. Hold and Wait: A process holds resources while waiting for additional ones
  3. No Preemption: Resources cannot be forcibly taken; only voluntary release
  4. Circular Wait: P1 waits for P2, P2 waits for P3, P3 waits for P1

Prevention Strategies:

  • Fix a global resource ordering (eliminates circular wait)
  • Request all resources at once (eliminates hold-and-wait)
  • Use the Banker's Algorithm to maintain a safe state
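Global lock ordering in practice: a Python sketch in which two threads request the same pair of locks in opposite order, but acquire them in a single canonical order (here, sorted by `id()`), so circular wait can never form.

```python
import threading

a, b = threading.Lock(), threading.Lock()
counter = [0]

def worker(first, second, n):
    for _ in range(n):
        # Sort by id() so every thread uses one global acquisition order
        ordered = sorted((first, second), key=id)
        for lock in ordered:
            lock.acquire()
        counter[0] += 1
        for lock in reversed(ordered):
            lock.release()

t1 = threading.Thread(target=worker, args=(a, b, 1000))
t2 = threading.Thread(target=worker, args=(b, a, 1000))  # opposite request order
t1.start(); t2.start()
t1.join(); t2.join()
# Without a canonical order, this interleaving can deadlock; with it, both finish
```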

5. File Systems

ext4 and inodes

# View inode information
stat /etc/hostname
# File: /etc/hostname
# Size: 12        Blocks: 8   IO Block: 4096  regular file
# Inode: 131073   Links: 1
# Access: 2026-03-17 10:00:00

# Check remaining inodes (inode exhaustion has same effect as a full disk)
df -i /

Inode structure:

  • File metadata (permissions, owner, timestamps)
  • Data block pointers (direct / indirect / double-indirect)
  • Filename is NOT in the inode — it lives in directory entries
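The filename/inode split is directly observable with hard links: two directory entries, one inode. A minimal sketch using the standard library:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    orig = os.path.join(d, "orig")
    alias = os.path.join(d, "alias")
    with open(orig, "w") as f:
        f.write("hello")
    os.link(orig, alias)  # a second directory entry pointing at the same inode
    same_inode = os.stat(orig).st_ino == os.stat(alias).st_ino
    links = os.stat(orig).st_nlink  # link count incremented to 2
```

Deleting one name only decrements the link count; the data blocks survive until the last entry is gone.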

Journaling

ext4 uses journaling to guarantee recovery after unexpected power loss.

Before write: Record changes in journal first (Write-Ahead Log)
After write:  Apply changes to actual blocks
After commit: Remove journal entry
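The three-phase pattern can be sketched as application-level write-ahead logging. This is an illustration of the idea, not how ext4's block-level journal actually stores records:

```python
import json
import os
import tempfile

def journaled_set(dirpath, key, value):
    journal = os.path.join(dirpath, "journal")
    datafile = os.path.join(dirpath, "data.json")
    # 1. Write-ahead: persist the intended change durably before touching data
    with open(journal, "w") as j:
        json.dump({key: value}, j)
        j.flush()
        os.fsync(j.fileno())
    # 2. Apply the change to the actual data file
    data = {}
    if os.path.exists(datafile):
        with open(datafile) as f:
            data = json.load(f)
    data[key] = value
    with open(datafile, "w") as f:
        json.dump(data, f)
    # 3. Commit: the journal entry is no longer needed
    os.remove(journal)

with tempfile.TemporaryDirectory() as d:
    journaled_set(d, "state", "ok")
    with open(os.path.join(d, "data.json")) as f:
        result = json.load(f)
```

If a crash lands between steps 1 and 3, recovery simply replays any journal file left behind, so the data file is never half-written.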

VFS Abstraction Layer

User space:   open()  read()  write()
Kernel VFS:   vfs_open()  vfs_read()   ← common interface
File systems: ext4 | btrfs | tmpfs | procfs | nfs
Block device layer → actual hardware

6. I/O and Interrupts

DMA and Interrupt Handlers

Instead of the CPU copying data to memory directly, the DMA controller handles it.

1. CPU → DMA: "Copy disk block X to memory address Y"
2. DMA transfers independently (CPU does other work)
3. DMA completes → interrupt fires
4. CPU: finishes current instruction → looks up interrupt vector → runs ISR
5. ISR: marks I/O complete, wakes waiting process

epoll vs io_uring

epoll (event-driven I/O multiplexing):

// epoll: still requires multiple syscalls
int epfd = epoll_create1(0);
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
epoll_wait(epfd, events, MAX_EVENTS, -1);  // syscall
read(fd, buf, size);                        // another syscall

io_uring (Linux 5.1+, shared ring buffer):

// io_uring: submit I/O without per-operation syscalls
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, size, 0);
io_uring_submit(&ring);  // One syscall to submit many I/Os

// Wait for completion (communicates via shared memory)
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);

Why io_uring achieves higher performance:

  • SQ/CQ ring buffers shared between user and kernel → no data copying
  • Batch multiple I/O submissions → fewer syscalls
  • IORING_SETUP_SQPOLL mode: kernel thread polls SQ → zero syscalls
  • Pre-registered buffers: io_uring_register_buffers() avoids per-I/O address mapping

7. Operating Systems from an AI/ML Perspective

NUMA Architecture

In multi-socket servers, each CPU socket has its own local memory.

Socket 0               Socket 1
┌──────────┐          ┌──────────┐
│  CPU 0   │───QPI────│  CPU 1   │
│ 32GB RAM │          │ 32GB RAM │
└──────────┘          └──────────┘
Local access: ~100ns   Remote access: ~300ns

NUMA impact on AI training:

  • If DataLoader workers run on Socket 0 but GPU is on Socket 1's PCIe, all tensors must traverse QPI
  • Pin processes and memory to the correct NUMA node: numactl --cpunodebind=0 --membind=0 python train.py

# Check NUMA topology
numactl --hardware

# Monitor NUMA statistics for a Python process
numastat -p python

Python Multiprocessing vs Multithreading

import time
import threading
import multiprocessing

def cpu_bound_task(n):
    """Prime counting (CPU-intensive)"""
    count = 0
    for i in range(2, n):
        if all(i % j != 0 for j in range(2, int(i**0.5) + 1)):
            count += 1
    return count

N = 100_000

# Multithreading: GIL prevents true parallelism for CPU-bound tasks
start = time.time()
threads = [threading.Thread(target=cpu_bound_task, args=(N,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Threading: {time.time() - start:.2f}s")

# Multiprocessing: separate processes bypass the GIL
start = time.time()
with multiprocessing.Pool(4) as pool:
    pool.map(cpu_bound_task, [N] * 4)
print(f"Multiprocessing: {time.time() - start:.2f}s")
# Result: multiprocessing is ~4x faster

GPU Resource Isolation with cgroups

# cgroups v2 setup for per-team GPU resource limits

# Create a group for Team A
mkdir /sys/fs/cgroup/team_a

# CPU limit: use at most 25% of total CPU
echo "25000 100000" > /sys/fs/cgroup/team_a/cpu.max

# Memory limit: 16GB
echo $((16 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/team_a/memory.max

# Add current process to the group
echo $$ > /sys/fs/cgroup/team_a/cgroup.procs

# GPU isolation using NVIDIA MIG + cgroups
# MIG splits one A100 into multiple instances
nvidia-smi mig -cgi 3g.40gb -C  # Create a 40GB instance

# Limit GPU resources with Docker
docker run --gpus '"device=0,1"' \
  --cpuset-cpus="0-15" \
  --memory="32g" \
  pytorch/pytorch:latest python train.py

Exploring the /proc Filesystem

# View process memory mappings
cat /proc/$(pgrep python)/maps

# Check process status
cat /proc/$(pgrep python)/status | grep -E "VmRSS|VmPeak|Threads"

# CPU scheduling statistics
cat /proc/$(pgrep python)/schedstat

# Check NUMA memory binding
cat /proc/$(pgrep python)/numa_maps | head -20

# System-wide memory info
cat /proc/meminfo | grep -E "MemTotal|MemFree|Cached|HugePages"

Quiz

Test your understanding of core OS concepts.

Q1. Describe the steps taken when a TLB miss occurs in virtual memory.

Answer: A hardware page-table walk resolves the translation in 5 steps.

Explanation:

  1. The CPU looks up the virtual address in the TLB but finds no entry (TLB miss)
  2. The MMU reads the page table base address (PGD) from the CR3 register
  3. It walks 4 levels: PGD → PUD → PMD → PTE, reading the physical address (4 memory accesses)
  4. The virtual-to-physical mapping is loaded into the TLB
  5. The original memory access is retried and completes

A low TLB hit rate causes significant slowdowns. Using Huge Pages (2MB or 1GB) reduces the number of TLB entries needed and improves the hit rate.
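The huge-page benefit in back-of-the-envelope numbers. The 1,536-entry TLB capacity below is an assumption (typical of a modern L2 dTLB), used only to make the reach comparison concrete:

```python
ENTRIES = 1536                         # assumed dTLB capacity (illustrative)
reach_4k = ENTRIES * 4 * 1024          # memory covered with 4 KB pages
reach_2m = ENTRIES * 2 * 1024 * 1024   # memory covered with 2 MB huge pages

print(f"4 KB pages: {reach_4k / 2**20:.0f} MiB of TLB reach")
print(f"2 MB pages: {reach_2m / 2**30:.0f} GiB of TLB reach")
```

Going from 4 KB to 2 MB pages multiplies TLB reach by 512, which is why huge pages help workloads with large working sets, such as embedding tables.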

Q2. Explain how Linux CFS uses vruntime and how it ensures fairness.

Answer: vruntime is a weighted virtual runtime that drives fair CPU allocation.

Explanation:

  • Each task accumulates vruntime, which is actual CPU time scaled by priority weight
  • Higher-priority tasks have their vruntime increase more slowly for the same execution time
  • CFS always picks the task with the smallest vruntime (leftmost node in the Red-Black Tree)
  • As a result, every task receives CPU proportional to its priority — truly "completely fair"
  • New tasks start with min_vruntime as their initial value to prevent starvation of existing tasks

Q3. List and explain the four Coffman conditions required for deadlock.

Answer: Mutual exclusion, hold and wait, no preemption, circular wait.

Explanation:

  1. Mutual Exclusion: A resource can only be held by one process at a time (e.g., a printer or mutex)
  2. Hold and Wait: A process already holding resources waits to acquire more
  3. No Preemption: Resources can only be released voluntarily; they cannot be forcibly taken
  4. Circular Wait: A cycle exists: P1 waits for P2's resource, P2 waits for P3's, P3 waits for P1's

Deadlock is impossible if any single condition is eliminated. Prevention strategies are designed to break exactly one of these four conditions.

Q4. Explain why io_uring outperforms epoll for high-throughput I/O.

Answer: Minimized syscall overhead and zero-copy shared ring buffers.

Explanation:

  • Fewer syscalls: epoll requires separate syscalls for event notification and each read/write; io_uring submits many I/O operations with a single io_uring_submit()
  • Shared ring buffers: SQ and CQ are shared memory between user space and kernel, eliminating data copying
  • SQPOLL mode: A kernel thread polls the SQ ring, so no submit syscall is needed at all
  • Pre-registered buffers: io_uring_register_buffers() pre-pins memory so per-I/O address resolution is skipped
  • The result is millions of I/O operations per second with minimal CPU overhead

Q5. Explain how remote memory access in a NUMA architecture affects AI training performance.

Answer: Reduced memory bandwidth and increased latency degrade training throughput.

Explanation:

  • Local NUMA memory access is approximately 100ns; remote access is approximately 300ns — 3x slower
  • AI training moves large tensor data to the GPU on every iteration, making memory bandwidth the critical bottleneck
  • When DataLoader workers are bound to Socket 0 but the GPU is connected via Socket 1's PCIe lanes, all data must traverse the QPI interconnect
  • Optimization: use numactl --cpunodebind=N --membind=N to pin processes and memory to the same NUMA node as the GPU; combine with torch.cuda.set_device() and a NUMA-aware DataLoader

Summary

The OS should not be a black box for AI engineers. Key takeaways:

  • Scheduling: CFS vruntime ensures fair CPU sharing; watch out for priority inversion
  • Memory: Virtual memory provides isolation; minimize TLB misses; use Huge Pages
  • Synchronization: mutex/condvar eliminate race conditions; lock ordering prevents deadlock
  • I/O: io_uring minimizes syscall overhead for high-throughput scenarios
  • AI Optimization: NUMA-aware placement, GPU isolation with cgroups, /proc for bottleneck diagnosis