Linux Performance Engineering Complete Guide 2025: Profiling, System Tuning, eBPF, Bottleneck Analysis
1. Why Linux Performance Engineering
A significant portion of production incidents stem from performance issues: CPU spikes, memory leaks, disk I/O bottlenecks, network latency. Linux performance engineering is the discipline of systematically analyzing and resolving all of them.
Core topics covered in this guide:
- Performance analysis methodologies (USE, RED, TSA)
- CPU analysis (perf, mpstat, pidstat, Flame Graph)
- Memory analysis (vmstat, /proc/meminfo, slab, OOM Killer)
- Disk I/O (iostat, blktrace, I/O schedulers, fio)
- Network performance (sar, ss, iperf3, TCP tuning)
- eBPF deep dive (architecture, BCC, bpftrace, libbpf CO-RE)
- Flame Graphs (CPU, off-CPU, memory, I/O)
- sysctl tuning and cgroups v2
- NUMA, Huge Pages, process scheduling
- Production tuning checklist (20+ items)
2. Performance Analysis Methodologies
2.1 USE Method (Utilization, Saturation, Errors)
A systematic performance analysis framework by Brendan Gregg.
For every resource, check 3 things:
+-------------------+----------------+----------------------+-----------------------+
| Resource          | Utilization    | Saturation           | Errors                |
+-------------------+----------------+----------------------+-----------------------+
| CPU               | mpstat         | run queue latency    | perf, dmesg           |
| Memory            | free           | vmstat si/so         | dmesg (OOM)           |
| Network Interface | sar -n DEV     | ifconfig (overruns)  | ifconfig (errors)     |
| Storage I/O       | iostat         | iostat avgqu-sz      | iostat (errors)       |
| Storage Capacity  | df -h          | (none)               | stale mounts          |
| File Descriptors  | lsof           | (none)               | "Too many open files" |
+-------------------+----------------+----------------------+-----------------------+
2.2 RED Method (Rate, Errors, Duration)
A methodology well-suited for microservices.
Measure 3 things from a service perspective:
Rate: Requests per second (requests/sec)
Errors: Failed request ratio (error rate)
Duration: Request latency distribution (P50/P95/P99)
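The three RED signals can be computed from any request log. Below is a minimal, illustrative Python sketch (the record format and function name are assumptions, not part of any standard library):

```python
# Sketch: RED metrics from (status_code, latency_ms) request records
# collected over a time window. Data shape is illustrative.
def red_metrics(requests, window_sec):
    n = len(requests)
    rate = n / window_sec                                  # Rate: requests/sec
    errors = sum(1 for s, _ in requests if s >= 500) / n   # Errors: failure ratio
    lat = sorted(l for _, l in requests)
    def pct(p):                                            # Duration: percentile
        return lat[min(n - 1, int(p / 100 * n))]
    return rate, errors, {"p50": pct(50), "p95": pct(95), "p99": pct(99)}
```

For example, 100 successful requests over a 10-second window yields a rate of 10 req/s with a 0% error ratio.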
2.3 TSA Method (Thread State Analysis)
Thread State Classification:
On-CPU: Running (actively using CPU)
Runnable: Waiting to run (waiting for CPU)
Sleeping: Waiting for I/O, timer, lock, etc.
Idle: Nothing to do
Analysis Tools:
- perf record + Flame Graph (On-CPU analysis)
- offcputime (Off-CPU analysis)
- bpftrace (detailed thread analysis)
3. Linux Performance Tools Overview
3.1 Brendan Gregg's Linux Performance Tools Map
Applications
/ | \
/ | \
System Libs Runtime Compiler
| | |
v v v
+-----------------------------------------+
| System Call Interface |
+-----------------------------------------+
| VFS | Sockets | Scheduler | VM |
+-----------+---------+-----------+-------+
| File Sys | TCP/UDP | (sched) | (mm) |
+-----------+---------+-----------+-------+
| Volume Mgr| IP | | |
+-----------+---------+-----------+-------+
| Block Dev | Ethernet| | |
+-----------+---------+-----------+-------+
| Device Drivers |
+-----------------------------------------+
Observability Tools:
App: strace, ltrace, gdb
Sched: perf, mpstat, pidstat, runqlat
Memory: vmstat, slabtop, free, sar
FS: opensnoop, ext4slower, fileslower
Disk: iostat, biolatency, biotop
Net: sar, ss, tcpdump, nicstat
All: eBPF (bpftrace, BCC tools)
4. CPU Analysis
4.1 perf stat (Hardware Counters)
# Measure CPU counters for a process
perf stat -p PID sleep 10
# Example output:
# Performance counter stats for process id '12345':
#
# 10,234.56 msec task-clock # 1.023 CPUs utilized
# 2,345 context-switches # 229.12 /sec
# 12 cpu-migrations # 1.17 /sec
# 45,678 page-faults # 4.46K /sec
# 12,345,678,901 cycles # 1.206 GHz
# 9,876,543,210 instructions # 0.80 insn/cycle
# 1,234,567,890 branches # 120.63M /sec
# 12,345,678 branch-misses # 1.00% of all branches
# IPC (Instructions Per Cycle) is the key metric
# IPC < 1.0: likely memory-bound (stalled on loads/stores)
# IPC > 1.0: likely compute-bound
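Applying the heuristic above to the sample counters is simple arithmetic; this illustrative Python sketch (the function name is an assumption) reproduces the 0.80 insn/cycle figure from the perf stat output:

```python
# Sketch: classify a workload from perf stat counters, using the rough
# heuristic above (IPC < 1.0 suggests memory-bound, > 1.0 compute-bound).
def classify_ipc(instructions, cycles):
    ipc = instructions / cycles
    return ipc, ("likely memory-bound" if ipc < 1.0 else "likely compute-bound")

ipc, kind = classify_ipc(9_876_543_210, 12_345_678_901)  # numbers from above
```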
4.2 perf record + perf report
# Record CPU profile (30 seconds)
perf record -g -p PID sleep 30
# Or system-wide
perf record -g -a sleep 30
# Analyze profile
perf report --stdio
# Example output:
# Overhead Command Shared Object Symbol
# 23.45% nginx libc.so.6 [.] __memcpy_avx2
# 15.67% nginx nginx [.] ngx_http_parse_request_line
# 12.34% nginx [kernel.vmlinux] [k] copy_user_enhanced_fast_string
# 8.90% nginx libssl.so.3 [.] EVP_EncryptUpdate
4.3 Flame Graph Generation
# 1. Collect perf data
perf record -F 99 -g -a sleep 30
# 2. Generate Flame Graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu.svg
# Or as a one-liner:
perf record -F 99 -g -a -- sleep 30 && \
perf script | \
stackcollapse-perf.pl | \
flamegraph.pl > flamegraph.svg
How to Read Flame Graphs:
+-----------------------------------------------------------+
| main() |
+-------------------+---------------------------------------+
| process_request()| handle_connection() |
+--------+----------+------------------+--------------------+
| parse()| route() | read_file() | send_response() |
+--------+----+-----+------+-----------+----------+---------+
|sort| |read()| |write() |encrypt()|
+----+ +------+ +----------+---------+
- X-axis: CPU time proportion (frames are sorted alphabetically, so left-to-right is not time order)
- Y-axis: Call stack depth (bottom-to-top = call direction)
- Width: CPU time consumed by that function
- Find wide "plateaus" -- those are optimization targets!
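The width rule follows directly from how folded stacks are aggregated: a frame's width is the sample count of every stack that contains it. A minimal Python sketch of that aggregation (illustrative, not the actual stackcollapse-perf.pl implementation):

```python
# Sketch of folded-stack aggregation: each sample is a ";"-joined call
# stack; a frame's "width" is the total count of samples it appears in.
from collections import Counter

def frame_widths(samples):
    widths = Counter()
    for stack, count in Counter(samples).items():
        for frame in stack.split(";"):
            widths[frame] += count
    return widths

samples = ["main;parse;sort", "main;parse;sort", "main;route"]
w = frame_widths(samples)   # main is widest; sort is the top-of-stack plateau
```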
4.4 mpstat, pidstat
# Per-CPU utilization
mpstat -P ALL 1
# Example output:
# CPU %usr %nice %sys %iowait %irq %soft %steal %idle
# all 25.3 0.0 5.2 2.1 0.0 0.5 0.0 66.9
# 0 45.6 0.0 8.3 0.0 0.0 1.2 0.0 44.9 <- hot CPU
# 1 12.3 0.0 3.1 4.2 0.0 0.1 0.0 80.3
# 2 35.7 0.0 6.8 0.0 0.0 0.8 0.0 56.7
# 3 7.5 0.0 2.5 4.1 0.0 0.0 0.0 85.9
# Per-process CPU utilization
pidstat -p ALL 1
# Per-thread CPU utilization
pidstat -t -p PID 1
5. Memory Analysis
5.1 vmstat
# Memory/CPU statistics at 1-second intervals
vmstat 1
# Interpreting the output:
# procs --------memory-------- ---swap-- -----io---- -system-- ------cpu-----
# r b swpd free buff cache si so bi bo in cs us sy id wa
# 2 0 0 524288 65536 2097152 0 0 4 12 156 312 15 5 78 2
# 5 1 0 491520 65536 2097152 0 0 0 256 892 1543 45 12 38 5
# Key metrics:
# r: Run queue waiting processes (r > CPU count = saturated)
# b: Uninterruptible sleep (usually I/O wait)
# si/so: Swap in/out (non-zero = memory pressure)
# wa: I/O wait (high = disk bottleneck)
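The two saturation rules above (r versus CPU count, nonzero si/so) are easy to automate; here is an illustrative Python sketch (field names and function name are assumptions):

```python
# Sketch: flag CPU saturation and memory pressure from a vmstat-style
# sample, per the rules above: r > CPU count, or any swap in/out activity.
def check_vmstat(sample, ncpu):
    flags = []
    if sample["r"] > ncpu:
        flags.append("cpu-saturated")
    if sample["si"] or sample["so"]:
        flags.append("memory-pressure")
    return flags
```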
5.2 /proc/meminfo Details
cat /proc/meminfo
# Key entries:
# MemTotal: 16384000 kB Total physical memory
# MemFree: 1024000 kB Completely unused memory
# MemAvailable: 8192000 kB Actually available (including reclaimable cache)
# Buffers: 524288 kB Block device I/O buffers
# Cached: 6553600 kB Page cache
# SwapCached: 0 kB Cache read back from swap
# Active: 4096000 kB Recently accessed memory
# Inactive: 3072000 kB Old memory (reclaim candidate)
# Slab: 512000 kB Kernel data structure cache
# SReclaimable: 384000 kB Reclaimable slab
# SUnreclaim: 128000 kB Non-reclaimable slab
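The key insight in these fields is that MemAvailable, not MemFree, reflects usable memory; the gap between the two is reclaimable cache. A small Python sketch that parses meminfo-format text (the sample string mirrors the values above):

```python
# Sketch: parse /proc/meminfo-style text; the MemAvailable - MemFree gap
# is roughly the cache the kernel could reclaim under pressure.
def parse_meminfo(text):
    info = {}
    for line in text.splitlines():
        key, rest = line.split(":", 1)
        info[key.strip()] = int(rest.split()[0])   # values in kB
    return info

sample = "MemTotal: 16384000 kB\nMemFree: 1024000 kB\nMemAvailable: 8192000 kB"
mi = parse_meminfo(sample)
reclaimable_kb = mi["MemAvailable"] - mi["MemFree"]
```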
5.3 Page Cache and OOM Killer
# Page cache status
free -h
# total used free shared buff/cache available
# Mem: 16G 4.2G 1.0G 256M 10.8G 11.2G
# Swap: 4G 0B 4G
# Clear cache (use with caution in production; run sync first to write back dirty pages)
# echo 1 > /proc/sys/vm/drop_caches # Page cache only
# echo 2 > /proc/sys/vm/drop_caches # dentries + inodes
# echo 3 > /proc/sys/vm/drop_caches # All
# Check OOM Killer logs
dmesg | grep -i "oom\|out of memory\|killed process"
# Check per-process OOM score
cat /proc/PID/oom_score
# Adjust OOM score (-1000 to 1000)
echo -500 > /proc/PID/oom_score_adj # Protect from OOM
echo 1000 > /proc/PID/oom_score_adj # OOM kill priority target
5.4 slabtop
# Monitor kernel slab caches
slabtop -s c # Sort by cache size
# Example output:
# OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
# 65536 62000 94% 0.19K 3277 20 13108K dentry
# 32768 30000 91% 0.50K 4096 8 16384K inode_cache
# 16384 15000 91% 1.00K 4096 4 16384K ext4_inode_cache
6. Disk I/O Analysis
6.1 iostat
# Disk I/O statistics (extended mode, 1-second interval)
iostat -xz 1
# Interpreting the output:
# Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s await r_await w_await svctm %util
# sda 150.0 200.0 6000 8000 10.0 50.0 2.50 1.80 3.00 0.85 29.8
# nvme0 500.0 800.0 50000 80000 0.0 0.0 0.25 0.20 0.28 0.08 10.4
# Key metrics:
# await: Average I/O wait time (ms) - high = disk bottleneck
# %util: Device utilization - near 100% = saturated
# r_await, w_await: Separate read/write wait times
# rrqm/wrqm: Merged requests (high = sequential I/O)
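A rough triage of these metrics can be scripted; in this illustrative Python sketch, the thresholds (await above 10 ms, %util above 90%) are assumptions for a typical SSD, not fixed rules:

```python
# Sketch: rough disk-bottleneck triage from iostat fields. The default
# cutoffs (10 ms await, 90 %util) are assumptions; tune them per device.
def triage_disk(await_ms, util_pct, await_limit_ms=10, util_limit=90):
    issues = []
    if await_ms > await_limit_ms:
        issues.append("high-latency")
    if util_pct > util_limit:
        issues.append("saturated")
    return issues or ["ok"]
```

For the sda sample above (await 2.50 ms, 29.8 %util) this reports "ok".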
6.2 I/O Schedulers
# Check current I/O scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none
# Change scheduler
echo "bfq" > /sys/block/sda/queue/scheduler
I/O Scheduler Comparison:
+-------------+------------------+----------------------------------+
| Scheduler | Best For | Characteristics |
+-------------+------------------+----------------------------------+
| none | NVMe SSD | No scheduling (delegates to HW) |
| mq-deadline | SSD/HDD general | Request deadline guarantee, |
| | | good for databases |
| bfq | Desktop | I/O fairness, interactive work |
| kyber | High-perf SSD | Low latency, read/write queues |
+-------------+------------------+----------------------------------+
6.3 fio Benchmarking
# Sequential read benchmark
fio --name=seqread --rw=read --bs=1M --size=4G \
--numjobs=1 --runtime=60 --direct=1
# Random read (IOPS measurement)
fio --name=randread --rw=randread --bs=4k --size=4G \
--numjobs=8 --runtime=60 --direct=1 --iodepth=32
# Mixed workload (DB simulation)
fio --name=mixed --rw=randrw --rwmixread=70 \
--bs=8k --size=4G --numjobs=4 --runtime=60 \
--direct=1 --iodepth=16
# Interpreting results:
# read: IOPS=125000, BW=488MiB/s, lat avg=0.25ms, p99=0.50ms
# write: IOPS=53571, BW=209MiB/s, lat avg=0.45ms, p99=1.20ms
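A quick sanity check on fio results: bandwidth should equal IOPS times block size. This illustrative Python sketch reproduces the 488 MiB/s figure from the 4k randread output above:

```python
# Sketch: cross-check fio's reported bandwidth against IOPS x block size.
def bw_mib_s(iops, bs_bytes):
    return iops * bs_bytes / (1024 * 1024)

bw = bw_mib_s(125_000, 4096)   # 4k random reads, numbers from above
```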
7. Network Performance Analysis
7.1 sar
# Network interface statistics
sar -n DEV 1
# TCP statistics
sar -n TCP 1
# Example output:
# IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
# eth0 15000 12000 8500 6200 0.0 0.0 5.0 6.8
# TCP error statistics
sar -n ETCP 1
7.2 ss and TCP Tuning
# TCP connection state summary
ss -s
# Check receive/send queues (bottleneck diagnosis)
ss -tnp | awk '{print $2, $3, $5}'
# Congestion control and RTT info
ss -ti
# TCP memory usage
ss -tm
7.3 iperf3 Benchmarking
# Server side
iperf3 -s
# Client side (TCP bandwidth measurement)
iperf3 -c SERVER_IP -t 30 -P 4
# UDP bandwidth measurement
iperf3 -c SERVER_IP -u -b 10G -t 30
# Bidirectional test
iperf3 -c SERVER_IP -t 30 --bidir
# MTU optimization (Jumbo Frames)
# Standard: 1500 bytes
# Jumbo: 9000 bytes (within data centers)
ip link set eth0 mtu 9000
8. eBPF Deep Dive
8.1 eBPF Architecture
User Space Kernel Space
+--------------------+ +---------------------------+
| | | |
| BCC / bpftrace / | load | eBPF Virtual Machine |
| libbpf programs |------->| (JIT compiled) |
| | | |
| Maps (data sharing)|<------>| Hooks: |
| | read/ | - kprobes (func entry) |
| Output | write | - tracepoints (static) |
| (stdout, perf | | - XDP (network packets) |
| buffer, ringbuf) | | - LSM (security module) |
+--------------------+ | - cgroup (resource ctrl) |
+---------------------------+
eBPF Verifier:
- Prevents infinite loops
- Blocks out-of-bounds memory access
- Guarantees kernel stability
8.2 BCC Tools
# === CPU Related ===
# Trace new process execution
execsnoop
# Run queue latency histogram
runqlat # CPU scheduler delay analysis
# === Memory Related ===
# Trace page faults
drsnoop # Direct reclaim tracing
# Memory allocation tracing
memleak # Memory leak detection
# === Disk I/O ===
# Block I/O latency histogram
biolatency # I/O wait time distribution
# Top block I/O processes
biotop # Real-time I/O heavy processes
# === File System ===
# Slow filesystem operations
ext4slower 1 # ext4 operations taking more than 1ms
# File open tracing
opensnoop # Which process opens which files
# === Network ===
# TCP connection tracing
tcpconnect # New TCP connections
tcpaccept # Accepted TCP connections
tcpretrans # TCP retransmissions
8.3 bpftrace One-Liners
# Count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# read() return size histogram
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
@bytes = hist(args->ret);
}'
# Block I/O size by process
bpftrace -e 'tracepoint:block:block_rq_issue {
@[comm] = hist(args->bytes);
}'
# CPU scheduler switch tracing
bpftrace -e 'tracepoint:sched:sched_switch {
@[args->next_comm] = count();
}'
# TCP retransmission tracing
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb {
@[comm, args->daddr, args->dport] = count();
}'
# VFS read latency
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
@ns = hist(nsecs - @start[tid]);
delete(@start[tid]);
}'
8.4 libbpf CO-RE (Compile Once - Run Everywhere)
// Simple CO-RE eBPF program structure
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
struct event {
u32 pid;
u64 duration_ns;
char comm[16];
};
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024);
} events SEC(".maps");
SEC("kprobe/do_sys_openat2")
int BPF_KPROBE(trace_openat2, int dfd, const char *filename)
{
struct event *e;
e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e) return 0;
e->pid = bpf_get_current_pid_tgid() >> 32;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
9. Flame Graphs Deep Dive
9.1 CPU Flame Graph
# Collect CPU profile with perf
perf record -F 99 -g -a sleep 30
# Generate Flame Graph
perf script | \
stackcollapse-perf.pl | \
flamegraph.pl --title "CPU Flame Graph" > cpu_flame.svg
9.2 Off-CPU Flame Graph
Analyzes time when a process is NOT running on CPU (I/O, locks, sleeps, etc.).
# Using BCC offcputime
offcputime -df -p PID 30 | \
flamegraph.pl --color=io --title "Off-CPU" > offcpu.svg
# bpftrace off-CPU analysis
bpftrace -e '
tracepoint:sched:sched_switch {
@start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
@us[args->next_comm, ustack] =
hist((nsecs - @start[args->next_pid]) / 1000);
delete(@start[args->next_pid]);
}'
9.3 Memory Flame Graph
# Trace memory allocations
perf record -e kmem:kmalloc -g -a sleep 10
perf script | stackcollapse-perf.pl | \
flamegraph.pl --color=mem --title "Memory Allocations" > mem_flame.svg
# Or using BCC memleak (reports outstanding allocations with stacks;
# note its output is not folded-stack format, so it needs post-processing
# before it can be fed to flamegraph.pl)
memleak -p PID -a 30
10. sysctl Tuning
10.1 VM (Virtual Memory) Tuning
# === Swap Behavior ===
# vm.swappiness: Swap tendency (0-100)
# 0: Minimize swap (OOM risk)
# 10: Recommended for DB servers
# 60: Default
# 100: Aggressive swapping
sysctl -w vm.swappiness=10
# === Dirty Pages ===
# Dirty page ratio (write-back trigger threshold)
sysctl -w vm.dirty_ratio=15
# Background flush start threshold
sysctl -w vm.dirty_background_ratio=5
# Dirty page expiry (centiseconds)
sysctl -w vm.dirty_expire_centisecs=3000
# Dirty page writeback interval
sysctl -w vm.dirty_writeback_centisecs=500
# === Overcommit ===
# 0: Heuristic (default)
# 1: Always allow
# 2: Allow up to physical mem + swap * ratio
sysctl -w vm.overcommit_memory=0
sysctl -w vm.overcommit_ratio=50
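Under overcommit_memory=2 the kernel enforces a commit limit derived from the ratio. A minimal Python sketch of that arithmetic (simplified: the real kernel formula also excludes huge page memory):

```python
# Sketch: the CommitLimit enforced when vm.overcommit_memory=2, i.e.
# swap + RAM * overcommit_ratio / 100 (huge pages ignored for simplicity).
def commit_limit_kb(ram_kb, swap_kb, ratio):
    return swap_kb + ram_kb * ratio // 100

# 16 GB RAM, 4 GB swap, ratio 50 -> ~11.8 GiB committable
limit = commit_limit_kb(16_384_000, 4_194_304, 50)
```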
# === Minimum Free Memory ===
sysctl -w vm.min_free_kbytes=65536
10.2 Network Tuning
# === TCP Connections ===
# Max connection backlog
sysctl -w net.core.somaxconn=65535
# SYN backlog
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# Reuse sockets in TIME_WAIT for new outbound connections
sysctl -w net.ipv4.tcp_tw_reuse=1
# === TCP Buffers ===
# Receive buffer (min, default, max)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
# Send buffer
sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"
# Global network buffers
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
# === TCP Keepalive ===
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=3
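These three values determine how long a dead peer goes undetected: the idle time before probing starts, plus one interval per unanswered probe. An illustrative Python sketch using the settings above:

```python
# Sketch: worst-case dead-peer detection time from the keepalive knobs:
# idle time before the first probe + interval x number of failed probes.
def keepalive_detect_sec(time_s, intvl_s, probes):
    return time_s + intvl_s * probes

detect = keepalive_detect_sec(600, 60, 3)   # settings from above -> 13 min
```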
# === Congestion Control ===
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq
# === Miscellaneous ===
# FIN-WAIT-2 timeout
sysctl -w net.ipv4.tcp_fin_timeout=15
# Port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# SYN cookies (SYN Flood defense)
sysctl -w net.ipv4.tcp_syncookies=1
10.3 Filesystem Tuning
# System-wide file descriptor limit
sysctl -w fs.file-max=2097152
# inotify watch limit (file monitoring)
sysctl -w fs.inotify.max_user_watches=524288
# AIO max request count
sysctl -w fs.aio-max-nr=1048576
10.4 Kernel Scheduler Tuning
# Note: on kernels >= 5.13 these knobs moved from sysctl to /sys/kernel/debug/sched/
# CFS scheduler minimum runtime (ns)
sysctl -w kernel.sched_min_granularity_ns=1000000
# Scheduling latency (ns)
sysctl -w kernel.sched_latency_ns=6000000
# Migration cost (ns)
sysctl -w kernel.sched_migration_cost_ns=500000
# Automatic group scheduling
sysctl -w kernel.sched_autogroup_enabled=1
11. cgroups v2
11.1 cgroups v2 Basics
# Check cgroups v2 mount
mount | grep cgroup2
# cgroup2 on /sys/fs/cgroup type cgroup2
# Create a cgroup
mkdir /sys/fs/cgroup/myapp
# Assign a process
echo PID > /sys/fs/cgroup/myapp/cgroup.procs
# Check available controllers
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids
11.2 CPU Limits
# CPU max usage limit
# Format: QUOTA PERIOD (microseconds)
# 50ms out of 100ms = 50% CPU
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
# CPU weight (1-10000, default 100)
echo "200" > /sys/fs/cgroup/myapp/cpu.weight
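The cpu.max format is just quota and period in microseconds, so translating a desired core count is one multiplication. An illustrative Python sketch (function name assumed; 100000 us is the default period):

```python
# Sketch: build a cgroup v2 cpu.max string ("QUOTA PERIOD" in microseconds)
# from a desired number of CPUs; 0.5 cores matches the 50% example above.
def cpu_max(cores, period_us=100_000):
    return f"{int(cores * period_us)} {period_us}"

half_core = cpu_max(0.5)   # -> "50000 100000"
two_cores = cpu_max(2)     # -> "200000 100000"
```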
11.3 Memory Limits
# Memory hard limit
echo "1G" > /sys/fs/cgroup/myapp/memory.max
# Memory throttle limit (aggressive reclaim kicks in above this)
echo "512M" > /sys/fs/cgroup/myapp/memory.high
# Memory minimum guarantee
echo "256M" > /sys/fs/cgroup/myapp/memory.min
# Swap limit
echo "0" > /sys/fs/cgroup/myapp/memory.swap.max
# Current memory usage
cat /sys/fs/cgroup/myapp/memory.current
# Memory statistics
cat /sys/fs/cgroup/myapp/memory.stat
11.4 I/O Limits
# Per-device I/O bandwidth limit
# Format: MAJOR:MINOR TYPE=LIMIT
# Limit sda reads to 50MB/s
echo "8:0 rbps=52428800" > /sys/fs/cgroup/myapp/io.max
# Limit sda writes to 20MB/s
echo "8:0 wbps=20971520" > /sys/fs/cgroup/myapp/io.max
# IOPS limits
echo "8:0 riops=1000 wiops=500" > /sys/fs/cgroup/myapp/io.max
# I/O weight
echo "default 100" > /sys/fs/cgroup/myapp/io.weight
11.5 Docker/K8s Integration
# Docker cgroups v2 resource limits
# docker run --cpus=2 --memory=4g --memory-swap=4g myapp
# Kubernetes Pod resources (maps to cgroups v2)
# resources:
# requests:
# cpu: "500m" -> cpu.weight
# memory: "512Mi" -> memory.min
# limits:
# cpu: "2" -> cpu.max
# memory: "4Gi" -> memory.max
12. NUMA (Non-Uniform Memory Access)
12.1 NUMA Topology
# Check NUMA topology
numactl --hardware
# Example output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 32768 MB
# node 0 free: 16384 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 32768 MB
# node 1 free: 15360 MB
# node distances:
# node 0 1
# 0: 10 21 <- Same node: 10, different node: 21 (2.1x slower)
# 1: 21 10
# NUMA memory statistics
numastat -m
# Per-process NUMA memory stats
numastat -p PID
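The distance matrix is a relative latency table normalized so local access is 10, which is where the "2.1x slower" reading above comes from. A tiny illustrative Python sketch:

```python
# Sketch: interpret a numactl distance matrix; entries are relative
# latencies normalized to 10 for local access.
def remote_penalty(distances, local=0, remote=1):
    return distances[local][remote] / distances[local][local]

d = [[10, 21], [21, 10]]     # matrix from the output above
penalty = remote_penalty(d)  # cross-node access cost multiplier
```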
12.2 NUMA Memory Binding
# Run on a specific NUMA node
numactl --cpunodebind=0 --membind=0 ./myapp
# Bind both CPU and memory to node 0
numactl -N 0 -m 0 ./database_process
# Interleave mode (distribute evenly across nodes)
numactl --interleave=all ./myapp
# Check NUMA policy of existing process
cat /proc/PID/numa_maps
13. Huge Pages
13.1 Transparent Huge Pages (THP)
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
# Disable THP for databases (memory fragmentation + latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
13.2 Explicit Huge Pages
# Allocate Huge Pages (2MB pages)
sysctl -w vm.nr_hugepages=1024 # 1024 * 2MB = 2GB
# Check current status
cat /proc/meminfo | grep Huge
# HugePages_Total: 1024
# HugePages_Free: 512
# HugePages_Rsvd: 256
# HugePages_Surp: 0
# Hugepagesize: 2048 kB
# 1GB Huge Pages (kernel boot parameter)
# GRUB: hugepagesz=1G hugepages=16
13.3 Database Usage
# PostgreSQL: Use Huge Pages for shared_buffers
# postgresql.conf:
# huge_pages = try
# shared_buffers = 8GB
# Calculate Huge Pages needed:
# shared_buffers / Hugepagesize = required pages
# 8GB / 2MB = 4096 pages
sysctl -w vm.nr_hugepages=4096
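The sizing rule above is a single division; this illustrative Python sketch reproduces the 8 GB / 2 MB = 4096 result (in practice you would round up slightly to cover PostgreSQL's shared-memory overhead):

```python
# Sketch: number of 2 MB huge pages needed to back a shared_buffers
# setting, matching the 8 GB / 2 MB arithmetic above.
def hugepages_needed(shared_buffers_mb, hugepage_mb=2):
    return shared_buffers_mb // hugepage_mb

pages = hugepages_needed(8 * 1024)   # 8 GB shared_buffers -> 4096 pages
```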
14. Process Scheduling
14.1 CFS (Completely Fair Scheduler)
# CFS scheduler statistics
cat /proc/sched_debug | head -50
# Adjust nice value (-20 to 19, lower = higher priority)
nice -n -10 ./critical_process
renice -n -10 -p PID
# Check process scheduling policy
chrt -p PID
14.2 Real-Time Scheduling
# Set SCHED_FIFO (real-time, priority 1-99)
chrt -f 50 ./realtime_app
# SCHED_RR (round-robin real-time)
chrt -r 50 ./realtime_app
# SCHED_DEADLINE (deadline-based)
chrt -d --sched-runtime 5000000 --sched-deadline 10000000 \
--sched-period 10000000 0 ./deadline_app
14.3 CPU Affinity
# Bind process to specific CPUs
taskset -c 0,1 ./myapp # Run only on CPU 0, 1
taskset -c 0-3 ./myapp # Run on CPU 0-3
taskset -pc 0,1 PID # Apply to existing process
# IRQ affinity (interrupt distribution)
echo 2 > /proc/irq/IRQ_NUM/smp_affinity # Assign to CPU 1
# isolcpus (boot parameter)
# GRUB: isolcpus=4,5,6,7
# Exclude CPU 4-7 from general scheduling for dedicated process use
15. Production Tuning Checklist
+----+------------------------------+-------------------------------+
| # | Item | Recommended Setting |
+----+------------------------------+-------------------------------+
| 1 | File descriptor limit | ulimit -n 1048576 |
| | | fs.file-max = 2097152 |
+----+------------------------------+-------------------------------+
| 2 | TCP backlog | net.core.somaxconn = 65535 |
+----+------------------------------+-------------------------------+
| 3 | TCP buffers | Optimize tcp_rmem/wmem |
+----+------------------------------+-------------------------------+
| 4 | TCP congestion control | BBR or env-appropriate algo |
+----+------------------------------+-------------------------------+
| 5 | TCP Keepalive | 600/60/3 (adjust per app) |
+----+------------------------------+-------------------------------+
| 6 | TIME_WAIT management | tcp_tw_reuse = 1 |
+----+------------------------------+-------------------------------+
| 7 | FIN timeout | tcp_fin_timeout = 15 |
+----+------------------------------+-------------------------------+
| 8 | SYN cookies | tcp_syncookies = 1 |
+----+------------------------------+-------------------------------+
| 9 | Port range | ip_local_port_range 1024-65535|
+----+------------------------------+-------------------------------+
| 10 | Swap behavior | vm.swappiness = 10 (DB) |
+----+------------------------------+-------------------------------+
| 11 | Dirty pages | dirty_ratio = 15 |
| | | dirty_background_ratio = 5 |
+----+------------------------------+-------------------------------+
| 12 | I/O scheduler | NVMe: none, SSD: mq-deadline |
+----+------------------------------+-------------------------------+
| 13 | THP | Disable for DB servers |
+----+------------------------------+-------------------------------+
| 14 | Huge Pages | Configure for DB shared_buf |
+----+------------------------------+-------------------------------+
| 15 | NUMA | Bind DB to single node |
+----+------------------------------+-------------------------------+
| 16 | CPU affinity | Pin critical processes |
+----+------------------------------+-------------------------------+
| 17 | OOM score adjustment | Protect critical processes |
+----+------------------------------+-------------------------------+
| 18 | cgroups v2 | Resource isolation and limits |
+----+------------------------------+-------------------------------+
| 19 | Congestion control | BBR + fq qdisc |
+----+------------------------------+-------------------------------+
| 20 | inotify watches | max_user_watches = 524288 |
+----+------------------------------+-------------------------------+
| 21 | AIO requests | aio-max-nr = 1048576 |
+----+------------------------------+-------------------------------+
| 22 | Kernel log monitoring | Check dmesg periodically |
+----+------------------------------+-------------------------------+
16. Practice Quiz
Quiz 1: USE Method
If a server's CPU utilization is 90% but there is no performance degradation, what should you check next according to the USE method?
Answer: Saturation
High CPU utilization is not necessarily a problem. What matters is saturation. Check whether processes are waiting in the run queue.
# Check run queue length
vmstat 1 | awk '{print $1}' # r column
# Measure scheduler latency with BCC runqlat
runqlat
If there is no saturation (r is less than or equal to CPU count), the CPU is being efficiently used. Finally, check for Errors.
Quiz 2: Flame Graph Interpretation
In a Flame Graph, a function has a very wide width but its child functions fill up all the space above it. Is this function itself the bottleneck?
Answer: No
In a Flame Graph, a function's width represents the total CPU time of that function AND all its children. If child functions fill up the space, the actual CPU time is being consumed by the children.
The real bottleneck is found by looking for "plateaus" -- functions at the top of the stack that have wide width are the ones actually consuming CPU time.
Quiz 3: vm.swappiness
Does setting vm.swappiness to 0 completely disable swap?
Answer: No
vm.swappiness=0 does not completely disable swap. It instructs the kernel to minimize swap usage under memory pressure. Under extreme memory pressure, swapping can still occur.
To completely disable swap, use the swapoff -a command. However, this risks the OOM Killer terminating processes, so caution is required.
Quiz 4: THP and Databases
Why do Transparent Huge Pages (THP) negatively impact database performance (PostgreSQL, MySQL, MongoDB)?
Answer: THP allocates memory in 2MB chunks. Databases typically work with 8KB (PostgreSQL) or 16KB (MySQL) page sizes.
Issues:
- Memory fragmentation: Kernel memory compaction for THP allocation causes latency spikes
- Write amplification: Even modifying a small portion triggers Copy-on-Write of the entire 2MB
- Memory waste: Allocations are larger than what is actually used
- Unpredictable latency: Defrag process can stall processes
Instead, configure Explicit Huge Pages dedicated to shared_buffers.
Quiz 5: eBPF vs Traditional Tracing
Why is eBPF safer for production environments than strace?
Answer: strace uses the ptrace syscall to intercept every system call of the target process. This adds two context switches per syscall, potentially degrading performance by 50-100% or more.
eBPF:
- Runs inside the kernel: Minimizes user-kernel transitions
- JIT compiled: Native code-level performance
- Verifier: Guarantees programs cannot corrupt the kernel
- Selective tracing: Only traces events you need
- Minimal overhead: Typically under 5% performance impact
This makes it safe to use even in production environments.
17. References
- Systems Performance, 2nd Edition - Brendan Gregg (The Bible of performance engineering)
- BPF Performance Tools - Brendan Gregg (Complete eBPF tools guide)
- Brendan Gregg's Website - https://www.brendangregg.com/ (Extensive free resources)
- Linux Perf Wiki - https://perf.wiki.kernel.org/
- Flame Graphs - https://www.brendangregg.com/flamegraphs.html
- BCC Tools - https://github.com/iovisor/bcc
- bpftrace - https://github.com/bpftrace/bpftrace
- io_uring - https://kernel.dk/io_uring.pdf
- Linux kernel documentation: cgroups v2 - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
- NUMA documentation - https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html
- Red Hat Performance Tuning Guide - https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/
- Netflix Tech Blog - Linux Performance - https://netflixtechblog.com/
- Facebook/Meta Engineering Blog - BPF - https://engineering.fb.com/
- sysctl documentation - https://www.kernel.org/doc/html/latest/admin-guide/sysctl/