Linux Performance Engineering Complete Guide 2025: Profiling, System Tuning, eBPF, Bottleneck Analysis

1. Why Linux Performance Engineering

A significant portion of production incidents stems from performance issues. CPU spikes, memory leaks, disk I/O bottlenecks, network latency -- the ability to systematically analyze and resolve all of these is Linux performance engineering.

Core topics covered in this guide:

  • Performance analysis methodologies (USE, RED, TSA)
  • CPU analysis (perf, mpstat, pidstat, Flame Graph)
  • Memory analysis (vmstat, /proc/meminfo, slab, OOM Killer)
  • Disk I/O (iostat, blktrace, I/O schedulers, fio)
  • Network performance (sar, ss, iperf3, TCP tuning)
  • eBPF deep dive (architecture, BCC, bpftrace, libbpf CO-RE)
  • Flame Graphs (CPU, off-CPU, memory, I/O)
  • sysctl tuning and cgroups v2
  • NUMA, Huge Pages, process scheduling
  • Production tuning checklist (20+ items)

2. Performance Analysis Methodologies

2.1 USE Method (Utilization, Saturation, Errors)

A systematic performance analysis framework by Brendan Gregg.

For every resource, check 3 things:

+-------------------+-----------------+---------------------+-------------------+
| Resource          | U (Utilization) | S (Saturation)      | E (Errors)        |
+-------------------+-----------------+---------------------+-------------------+
| CPU               | mpstat          | run queue latency   | perf / dmesg      |
| Memory            | free            | vmstat si/so        | dmesg OOM         |
| Network Interface | sar -n DEV      | ifconfig (overruns) | ifconfig (errors) |
| Storage I/O       | iostat          | iostat avgqu-sz     | iostat (errors)   |
| Storage Capacity  | df -h           | (none)              | stale mounts      |
| File Descriptors  | lsof            | (none)              | "Too many open    |
|                   |                 |                     |  files" errors    |
+-------------------+-----------------+---------------------+-------------------+

2.2 RED Method (Rate, Errors, Duration)

A methodology well-suited for microservices.

Measure 3 things from a service perspective:

Rate:     Requests per second (requests/sec)
Errors:   Failed request ratio (error rate)
Duration: Request latency distribution (P50/P95/P99)
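The Duration percentiles above can be computed with the nearest-rank method. A minimal sketch (the `percentile` helper name is hypothetical), reading one latency value per line from stdin:

```shell
# Nearest-rank percentile: sort the samples, pick the ceil(n*p/100)-th one.
percentile() {  # $1 = percentile (e.g. 95); stdin = one latency per line
  sort -n | awk -v p="$1" '
    { a[NR] = $1 }
    END {
      idx = int(NR * p / 100 + 0.999999)   # ceiling
      if (idx < 1) idx = 1
      print a[idx]
    }'
}
```

Usage: `grep latency_ms app.log | awk '{print $2}' | percentile 99`.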

2.3 TSA Method (Thread State Analysis)

Thread State Classification:

On-CPU:      Running (actively using CPU)
Runnable:    Waiting to run (waiting for CPU)
Sleeping:    Waiting for I/O, timer, lock, etc.
Idle:        Nothing to do

Analysis Tools:
- perf record + Flame Graph (On-CPU analysis)
- offcputime (Off-CPU analysis)
- bpftrace (detailed thread analysis)

3. Linux Performance Tools Overview

3.1 Brendan Gregg's Linux Performance Tools Map

                        Applications
                     /      |      \
                   /        |        \
             System Libs  Runtime   Compiler
                  |         |          |
                  v         v          v
            +-----------------------------------------+
            |          System Call Interface           |
            +-----------------------------------------+
            |    VFS    | Sockets | Scheduler | VM    |
            +-----------+---------+-----------+-------+
            | File Sys  | TCP/UDP | (sched)   | (mm)  |
            +-----------+---------+-----------+-------+
            | Volume Mgr| IP      |           |       |
            +-----------+---------+-----------+-------+
            | Block Dev | Ethernet|           |       |
            +-----------+---------+-----------+-------+
            |  Device Drivers                         |
            +-----------------------------------------+

Observability Tools:
  App:    strace, ltrace, gdb
  Sched:  perf, mpstat, pidstat, runqlat
  Memory: vmstat, slabtop, free, sar
  FS:     opensnoop, ext4slower, fileslower
  Disk:   iostat, biolatency, biotop
  Net:    sar, ss, tcpdump, nicstat
  All:    eBPF (bpftrace, BCC tools)

4. CPU Analysis

4.1 perf stat (Hardware Counters)

# Measure CPU counters for a process
perf stat -p PID sleep 10

# Example output:
#  Performance counter stats for process id '12345':
#
#       10,234.56 msec task-clock           # 1.023 CPUs utilized
#           2,345      context-switches     # 229.12 /sec
#              12      cpu-migrations       # 1.17 /sec
#          45,678      page-faults          # 4.46K /sec
#  12,345,678,901      cycles               # 1.206 GHz
#   9,876,543,210      instructions         # 0.80 insn/cycle
#   1,234,567,890      branches             # 120.63M /sec
#      12,345,678      branch-misses        # 1.00% of all branches

# IPC (Instructions Per Cycle) is the key metric
# IPC < 1.0: many stall cycles -- often memory-bound
# IPC > 1.0: instruction-bound (compute-heavy)
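IPC is simply instructions divided by cycles. A small sketch of the arithmetic using the counter values from the example output above (the `ipc` helper name is hypothetical):

```shell
# IPC = retired instructions / CPU cycles, from perf stat counters.
ipc() {  # $1 = instructions, $2 = cycles
  awk -v i="$1" -v c="$2" 'BEGIN { printf "%.2f\n", i / c }'
}
```

With the example counters, `ipc 9876543210 12345678901` prints 0.80, matching the perf stat line.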

4.2 perf record + perf report

# Record CPU profile (30 seconds)
perf record -g -p PID sleep 30

# Or system-wide
perf record -g -a sleep 30

# Analyze profile
perf report --stdio

# Example output:
# Overhead  Command  Shared Object      Symbol
#   23.45%  nginx    libc.so.6          [.] __memcpy_avx2
#   15.67%  nginx    nginx              [.] ngx_http_parse_request_line
#   12.34%  nginx    [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
#    8.90%  nginx    libssl.so.3        [.] EVP_EncryptUpdate

4.3 Flame Graph Generation

# 1. Collect perf data
perf record -F 99 -g -a sleep 30

# 2. Generate Flame Graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu.svg

# Or as a one-liner:
perf record -F 99 -g -a -- sleep 30 && \
perf script | \
  stackcollapse-perf.pl | \
  flamegraph.pl > flamegraph.svg

How to Read Flame Graphs:

+-----------------------------------------------------------+
|                    main()                                  |
+-------------------+---------------------------------------+
|  process_request()|            handle_connection()         |
+--------+----------+------------------+--------------------+
| parse()|  route() |  read_file()     |  send_response()   |
+--------+----+-----+------+-----------+----------+---------+
         |sort|      |read()|          |write()   |encrypt()|
         +----+      +------+          +----------+---------+

- X-axis: sample/CPU-time proportion (frames are sorted alphabetically; left-to-right order is not chronological)
- Y-axis: Call stack depth (bottom-to-top = call direction)
- Width: CPU time consumed by that function
- Find wide "plateaus" -- those are optimization targets!
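Flame graph widths come from the folded-stack format that stackcollapse-perf.pl emits: one `frame1;frame2;leaf count` line per unique stack. A hypothetical `top_frames` helper that aggregates sample counts per leaf frame -- exactly the "plateaus" you should hunt for:

```shell
# Sum folded-stack sample counts by leaf frame (the function actually on-CPU).
top_frames() {  # stdin: folded stacks, "a;b;leaf count" per line
  awk '{ split($1, f, ";"); leaf[f[length(f)]] += $2 }
       END { for (k in leaf) print leaf[k], k }' | sort -rn
}
```

Usage: `perf script | stackcollapse-perf.pl | top_frames | head`.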

4.4 mpstat, pidstat

# Per-CPU utilization
mpstat -P ALL 1

# Example output:
# CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %idle
# all  25.3   0.0   5.2     2.1    0.0    0.5     0.0   66.9
#   0  45.6   0.0   8.3     0.0    0.0    1.2     0.0   44.9  <- hot CPU
#   1  12.3   0.0   3.1     4.2    0.0    0.1     0.0   80.3
#   2  35.7   0.0   6.8     0.0    0.0    0.8     0.0   56.7
#   3   7.5   0.0   2.5     4.1    0.0    0.0     0.0   85.9

# Per-process CPU utilization
pidstat -p ALL 1

# Per-thread CPU utilization
pidstat -t -p PID 1

5. Memory Analysis

5.1 vmstat

# Memory/CPU statistics at 1-second intervals
vmstat 1

# Interpreting the output:
# procs  --------memory--------  ---swap--  -----io----  -system-- ------cpu-----
#  r  b  swpd    free   buff  cache  si   so    bi    bo   in    cs  us sy id wa
#  2  0     0  524288  65536 2097152  0    0     4    12  156   312  15  5 78  2
#  5  1     0  491520  65536 2097152  0    0     0   256  892  1543  45 12 38  5

# Key metrics:
# r: Run queue waiting processes (r > CPU count = saturated)
# b: Uninterruptible sleep (usually I/O wait)
# si/so: Swap in/out (non-zero = memory pressure)
# wa: I/O wait (high = disk bottleneck)

5.2 /proc/meminfo Details

cat /proc/meminfo

# Key entries:
# MemTotal:       16384000 kB   Total physical memory
# MemFree:         1024000 kB   Completely unused memory
# MemAvailable:    8192000 kB   Actually available (including reclaimable cache)
# Buffers:          524288 kB   Block device I/O buffers
# Cached:          6553600 kB   Page cache
# SwapCached:            0 kB   Cache read back from swap
# Active:          4096000 kB   Recently accessed memory
# Inactive:        3072000 kB   Old memory (reclaim candidate)
# Slab:             512000 kB   Kernel data structure cache
# SReclaimable:     384000 kB   Reclaimable slab
# SUnreclaim:       128000 kB   Non-reclaimable slab
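For alerting, MemAvailable (not MemFree) is the figure that matters, since page cache is reclaimable. A hypothetical helper computing available memory as a percentage of total from /proc/meminfo-format input:

```shell
# Percent of physical memory actually available for new allocations.
mem_available_pct() {  # stdin: /proc/meminfo contents
  awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
       END { if (t) printf "%.1f\n", 100 * a / t }'
}
```

Usage: `mem_available_pct < /proc/meminfo`.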

5.3 Page Cache and OOM Killer

# Page cache status
free -h
#               total   used   free   shared  buff/cache  available
# Mem:           16G    4.2G   1.0G    256M      10.8G      11.2G
# Swap:           4G      0B    4G

# Clear cache (use with caution in production)
# echo 1 > /proc/sys/vm/drop_caches  # Page cache only
# echo 2 > /proc/sys/vm/drop_caches  # dentries + inodes
# echo 3 > /proc/sys/vm/drop_caches  # All

# Check OOM Killer logs
dmesg | grep -i "oom\|out of memory\|killed process"

# Check per-process OOM score
cat /proc/PID/oom_score

# Adjust OOM score (-1000 to 1000)
echo -500 > /proc/PID/oom_score_adj  # Protect from OOM
echo 1000 > /proc/PID/oom_score_adj  # OOM kill priority target

5.4 slabtop

# Monitor kernel slab caches
slabtop -s c  # Sort by cache size

# Example output:
#  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
# 65536  62000  94%    0.19K   3277       20     13108K dentry
# 32768  30000  91%    0.50K   4096        8     16384K inode_cache
# 16384  15000  91%    1.00K   4096        4     16384K ext4_inode_cache

6. Disk I/O Analysis

6.1 iostat

# Disk I/O statistics (extended mode, 1-second interval)
iostat -xz 1

# Interpreting the output:
# Device  r/s    w/s    rkB/s  wkB/s  rrqm/s  wrqm/s  await r_await w_await  svctm  %util
# sda    150.0  200.0  6000   8000    10.0    50.0    2.50   1.80    3.00    0.85   29.8
# nvme0  500.0  800.0 50000  80000     0.0     0.0    0.25   0.20    0.28    0.08   10.4

# Key metrics:
# await:  Average I/O wait time (ms) - high = disk bottleneck
# %util:  Device utilization - near 100% = saturated
# r_await, w_await: Separate read/write wait times
# rrqm/wrqm: Merged requests (high = sequential I/O)
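Assuming the column layout shown in the example above (await in column 8), a hypothetical `slow_devices` filter that flags devices whose average I/O wait exceeds a threshold:

```shell
# Print devices from iostat -xz output whose await exceeds a threshold (ms).
slow_devices() {  # $1 = await threshold in ms; stdin: iostat -xz device lines
  awk -v thr="$1" 'NR > 1 && $8 + 0 > thr { print $1, $8 "ms" }'
}
```

Field positions differ between sysstat versions, so verify the header before relying on `$8`.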

6.2 I/O Schedulers

# Check current I/O scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none

# Change scheduler
echo "bfq" > /sys/block/sda/queue/scheduler

I/O Scheduler Comparison:

+-------------+------------------+----------------------------------+
| Scheduler   | Best For         | Characteristics                  |
+-------------+------------------+----------------------------------+
| none        | NVMe SSD         | No scheduling (delegates to HW)  |
| mq-deadline | SSD/HDD general  | Request deadline guarantee,      |
|             |                  | good for databases               |
| bfq         | Desktop          | I/O fairness, interactive work   |
| kyber       | High-perf SSD    | Low latency, read/write queues   |
+-------------+------------------+----------------------------------+

6.3 fio Benchmarking

# Sequential read benchmark
fio --name=seqread --rw=read --bs=1M --size=4G \
    --numjobs=1 --runtime=60 --direct=1

# Random read (IOPS measurement)
fio --name=randread --rw=randread --bs=4k --size=4G \
    --numjobs=8 --runtime=60 --direct=1 --iodepth=32

# Mixed workload (DB simulation)
fio --name=mixed --rw=randrw --rwmixread=70 \
    --bs=8k --size=4G --numjobs=4 --runtime=60 \
    --direct=1 --iodepth=16

# Interpreting results:
# read:  IOPS=125000, BW=488MiB/s, lat avg=0.25ms, p99=0.50ms
# write: IOPS=53571, BW=209MiB/s, lat avg=0.45ms, p99=1.20ms
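Bandwidth and IOPS are linked by the block size (BW = IOPS x bs). A small sketch of that arithmetic, consistent with the 4k random-read numbers above (the helper name is hypothetical):

```shell
# Convert IOPS at a given block size to bandwidth in MiB/s.
iops_to_bw_mib() {  # $1 = IOPS, $2 = block size in KiB
  awk -v i="$1" -v b="$2" 'BEGIN { printf "%.0f\n", i * b / 1024 }'
}
```

With the example results: `iops_to_bw_mib 125000 4` prints 488, matching BW=488MiB/s.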

7. Network Performance Analysis

7.1 sar

# Network interface statistics
sar -n DEV 1

# TCP statistics
sar -n TCP 1

# Example output:
# IFACE   rxpck/s  txpck/s  rxkB/s  txkB/s  rxcmp/s txcmp/s rxmcst/s %ifutil
# eth0    15000    12000    8500    6200      0.0     0.0     5.0     6.8

# TCP error statistics
sar -n ETCP 1

7.2 ss and TCP Tuning

# TCP connection state summary
ss -s

# Check receive/send queues (bottleneck diagnosis)
ss -tn | awk 'NR > 1 && ($2 > 0 || $3 > 0)'  # connections with backed-up Recv-Q/Send-Q

# Congestion control and RTT info
ss -ti

# TCP memory usage
ss -tm

7.3 iperf3 Benchmarking

# Server side
iperf3 -s

# Client side (TCP bandwidth measurement)
iperf3 -c SERVER_IP -t 30 -P 4

# UDP bandwidth measurement
iperf3 -c SERVER_IP -u -b 10G -t 30

# Bidirectional test
iperf3 -c SERVER_IP -t 30 --bidir

# MTU optimization (Jumbo Frames)
# Standard: 1500 bytes
# Jumbo: 9000 bytes (within data centers)
ip link set eth0 mtu 9000

8. eBPF Deep Dive

8.1 eBPF Architecture

User Space                    Kernel Space
+--------------------+        +---------------------------+
|                    |        |                           |
| BCC / bpftrace /   | load   |      eBPF Virtual Machine |
| libbpf programs    |------->|      (JIT compiled)       |
|                    |        |                           |
| Maps (data sharing)|<------>|  Hooks:                   |
|                    |  read/ |  - kprobes (func entry)   |
| Output             |  write |  - tracepoints (static)   |
| (stdout, perf      |        |  - XDP (network packets)  |
|  buffer, ringbuf)  |        |  - LSM (security module)  |
+--------------------+        |  - cgroup (resource ctrl)  |
                              +---------------------------+

eBPF Verifier:
- Prevents infinite loops
- Blocks out-of-bounds memory access
- Guarantees kernel stability

8.2 BCC Tools

# === CPU Related ===
# Trace new process execution
execsnoop

# Run queue latency histogram
runqlat       # CPU scheduler delay analysis

# === Memory Related ===
# Trace page faults
drsnoop       # Direct reclaim tracing

# Memory allocation tracing
memleak       # Memory leak detection

# === Disk I/O ===
# Block I/O latency histogram
biolatency    # I/O wait time distribution

# Top block I/O processes
biotop        # Real-time I/O heavy processes

# === File System ===
# Slow filesystem operations
ext4slower 1  # ext4 operations taking more than 1ms

# File open tracing
opensnoop     # Which process opens which files

# === Network ===
# TCP connection tracing
tcpconnect    # New TCP connections
tcpaccept     # Accepted TCP connections
tcpretrans    # TCP retransmissions

8.3 bpftrace One-Liners

# Count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# read() return size histogram
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
  @bytes = hist(args->ret);
}'

# Block I/O size by process
bpftrace -e 'tracepoint:block:block_rq_issue {
  @[comm] = hist(args->bytes);
}'

# CPU scheduler switch tracing
bpftrace -e 'tracepoint:sched:sched_switch {
  @[args->next_comm] = count();
}'

# TCP retransmission tracing
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb {
  @[comm, args->daddr, args->dport] = count();
}'

# VFS read latency
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
  @ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'

8.4 libbpf CO-RE (Compile Once - Run Everywhere)

// Simple CO-RE eBPF program structure
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

struct event {
    u32 pid;
    u64 duration_ns;
    char comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("kprobe/do_sys_openat2")
int BPF_KPROBE(trace_openat2, int dfd, const char *filename)
{
    struct event *e;
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;
    
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

9. Flame Graphs Deep Dive

9.1 CPU Flame Graph

# Collect CPU profile with perf
perf record -F 99 -g -a sleep 30

# Generate Flame Graph
perf script | \
  stackcollapse-perf.pl | \
  flamegraph.pl --title "CPU Flame Graph" > cpu_flame.svg

9.2 Off-CPU Flame Graph

Analyzes time when a process is NOT running on CPU (I/O, locks, sleeps, etc.).

# Using BCC offcputime
offcputime -df -p PID 30 | \
  flamegraph.pl --color=io --title "Off-CPU" > offcpu.svg

# bpftrace off-CPU analysis (per-process off-CPU time histogram, us)
bpftrace -e '
tracepoint:sched:sched_switch {
  @start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
  @us[args->next_comm] =
    hist((nsecs - @start[args->next_pid]) / 1000);
  delete(@start[args->next_pid]);
}'

9.3 Memory Flame Graph

# Trace memory allocations
perf record -e kmem:kmalloc -g -a sleep 10
perf script | stackcollapse-perf.pl | \
  flamegraph.pl --color=mem --title "Memory Allocations" > mem_flame.svg

# Or using BCC memleak (prints outstanding allocation stacks directly;
# its output is not folded format, so it cannot be piped into flamegraph.pl)
memleak -p PID -a 30

10. sysctl Tuning

10.1 VM (Virtual Memory) Tuning

# === Swap Behavior ===
# vm.swappiness: Swap tendency (0-100)
# 0: Minimize swap (OOM risk)
# 10: Recommended for DB servers
# 60: Default
# 100: Aggressive swapping
sysctl -w vm.swappiness=10

# === Dirty Pages ===
# Dirty page ratio (write-back trigger threshold)
sysctl -w vm.dirty_ratio=15

# Background flush start threshold
sysctl -w vm.dirty_background_ratio=5

# Dirty page expiry (centiseconds)
sysctl -w vm.dirty_expire_centisecs=3000

# Dirty page writeback interval
sysctl -w vm.dirty_writeback_centisecs=500

# === Overcommit ===
# 0: Heuristic (default)
# 1: Always allow
# 2: Allow up to physical mem + swap * ratio
sysctl -w vm.overcommit_memory=0
sysctl -w vm.overcommit_ratio=50

# === Minimum Free Memory ===
sysctl -w vm.min_free_kbytes=65536
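sysctl -w changes are lost on reboot; persistence goes through a drop-in file under /etc/sysctl.d/. A sketch writing the VM settings above to a destination path (the file name 99-vm-tuning.conf is an assumption, not a convention the system requires):

```shell
# Write the VM tuning values above to a sysctl drop-in file.
write_vm_tuning() {  # $1 = destination, e.g. /etc/sysctl.d/99-vm-tuning.conf
  cat > "$1" <<'EOF'
vm.swappiness = 10
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
vm.min_free_kbytes = 65536
EOF
}
```

After writing the file, apply it without a reboot using `sysctl --system`.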

10.2 Network Tuning

# === TCP Connections ===
# Max connection backlog
sysctl -w net.core.somaxconn=65535

# SYN backlog
sysctl -w net.ipv4.tcp_max_syn_backlog=65535

# TCP connection reuse
sysctl -w net.ipv4.tcp_tw_reuse=1

# === TCP Buffers ===
# Receive buffer (min, default, max)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"

# Send buffer
sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"

# Global network buffers
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# === TCP Keepalive ===
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=3

# === Congestion Control ===
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq

# === Miscellaneous ===
# FIN-WAIT-2 timeout
sysctl -w net.ipv4.tcp_fin_timeout=15

# Port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# SYN cookies (SYN Flood defense)
sysctl -w net.ipv4.tcp_syncookies=1

10.3 Filesystem Tuning

# System-wide file descriptor limit
sysctl -w fs.file-max=2097152

# inotify watch limit (file monitoring)
sysctl -w fs.inotify.max_user_watches=524288

# AIO max request count
sysctl -w fs.aio-max-nr=1048576

10.4 Kernel Scheduler Tuning

# CFS scheduler minimum runtime (ns)
# (on kernels >= 5.13 these kernel.sched_* knobs moved to /sys/kernel/debug/sched/)
sysctl -w kernel.sched_min_granularity_ns=1000000

# Scheduling latency (ns)
sysctl -w kernel.sched_latency_ns=6000000

# Migration cost (ns)
sysctl -w kernel.sched_migration_cost_ns=500000

# Automatic group scheduling
sysctl -w kernel.sched_autogroup_enabled=1

11. cgroups v2

11.1 cgroups v2 Basics

# Check cgroups v2 mount
mount | grep cgroup2
# cgroup2 on /sys/fs/cgroup type cgroup2

# Create a cgroup
mkdir /sys/fs/cgroup/myapp

# Assign a process
echo PID > /sys/fs/cgroup/myapp/cgroup.procs

# Check available controllers
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids

11.2 CPU Limits

# CPU max usage limit
# Format: QUOTA PERIOD (microseconds)
# 50ms out of 100ms = 50% CPU
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max

# CPU weight (1-10000, default 100)
echo "200" > /sys/fs/cgroup/myapp/cpu.weight
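The cpu.max quota is just the desired CPU percentage times the period, divided by 100. A hypothetical helper producing the "QUOTA PERIOD" string (values above 100 grant more than one CPU):

```shell
# Build a cgroup v2 cpu.max string from a CPU percentage.
cpu_max_for_pct() {  # $1 = CPU percent (may exceed 100), $2 = period in us (default 100000)
  awk -v p="$1" -v per="${2:-100000}" 'BEGIN { printf "%d %d\n", p * per / 100, per }'
}
```

Usage: `cpu_max_for_pct 50 > /sys/fs/cgroup/myapp/cpu.max` caps the group at half a CPU.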

11.3 Memory Limits

# Memory hard limit
echo "1G" > /sys/fs/cgroup/myapp/memory.max

# Memory throttle limit (above this the cgroup is aggressively reclaimed
# and its allocations are slowed, but processes are not killed)
echo "512M" > /sys/fs/cgroup/myapp/memory.high

# Memory minimum guarantee
echo "256M" > /sys/fs/cgroup/myapp/memory.min

# Swap limit
echo "0" > /sys/fs/cgroup/myapp/memory.swap.max

# Current memory usage
cat /sys/fs/cgroup/myapp/memory.current

# Memory statistics
cat /sys/fs/cgroup/myapp/memory.stat

11.4 I/O Limits

# Per-device I/O bandwidth limit
# Format: MAJOR:MINOR TYPE=LIMIT
# Limit sda reads to 50MB/s
echo "8:0 rbps=52428800" > /sys/fs/cgroup/myapp/io.max

# Limit sda writes to 20MB/s
echo "8:0 wbps=20971520" > /sys/fs/cgroup/myapp/io.max

# IOPS limits
echo "8:0 riops=1000 wiops=500" > /sys/fs/cgroup/myapp/io.max

# I/O weight
echo "default 100" > /sys/fs/cgroup/myapp/io.weight
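The rbps/wbps values in io.max are plain bytes per second, which is why the examples above use awkward numbers like 52428800. A one-line hypothetical converter from MiB/s:

```shell
# Convert MiB/s to the bytes-per-second value io.max expects.
mib_to_bps() {  # $1 = MiB/s
  echo $(( $1 * 1024 * 1024 ))
}
```

Usage: `echo "8:0 rbps=$(mib_to_bps 50)" > /sys/fs/cgroup/myapp/io.max`.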

11.5 Docker/K8s Integration

# Docker cgroups v2 resource limits
# docker run --cpus=2 --memory=4g --memory-swap=4g myapp

# Kubernetes Pod resources (maps to cgroups v2)
# resources:
#   requests:
#     cpu: "500m"       -> cpu.weight
#     memory: "512Mi"   -> memory.min
#   limits:
#     cpu: "2"          -> cpu.max
#     memory: "4Gi"     -> memory.max

12. NUMA (Non-Uniform Memory Access)

12.1 NUMA Topology

# Check NUMA topology
numactl --hardware

# Example output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 32768 MB
# node 0 free: 16384 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 32768 MB
# node 1 free: 15360 MB
# node distances:
# node   0   1
#   0:  10  21    <- Same node: 10, different node: 21 (2.1x slower)
#   1:  21  10

# NUMA memory statistics
numastat -m

# Per-process NUMA memory stats
numastat -p PID

12.2 NUMA Memory Binding

# Run on a specific NUMA node
numactl --cpunodebind=0 --membind=0 ./myapp

# Bind both CPU and memory to node 0
numactl -N 0 -m 0 ./database_process

# Interleave mode (distribute evenly across nodes)
numactl --interleave=all ./myapp

# Check NUMA policy of existing process
cat /proc/PID/numa_maps

13. Huge Pages

13.1 Transparent Huge Pages (THP)

# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

# Disable THP for databases (memory fragmentation + latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

13.2 Explicit Huge Pages

# Allocate Huge Pages (2MB pages)
sysctl -w vm.nr_hugepages=1024  # 1024 * 2MB = 2GB

# Check current status
cat /proc/meminfo | grep Huge
# HugePages_Total:    1024
# HugePages_Free:      512
# HugePages_Rsvd:      256
# HugePages_Surp:        0
# Hugepagesize:       2048 kB

# 1GB Huge Pages (kernel boot parameter)
# GRUB: hugepagesz=1G hugepages=16

13.3 Database Usage

# PostgreSQL: Use Huge Pages for shared_buffers
# postgresql.conf:
# huge_pages = try
# shared_buffers = 8GB

# Calculate Huge Pages needed:
# shared_buffers / Hugepagesize = required pages
# 8GB / 2MB = 4096 pages
sysctl -w vm.nr_hugepages=4096
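The page arithmetic above, as a hypothetical helper that rounds up so a buffer that is not an exact multiple of the huge page size still fits:

```shell
# Huge pages needed for a buffer: ceil(buffer_MiB / hugepage_MiB).
hugepages_needed() {  # $1 = buffer size in MiB, $2 = huge page size in MiB
  echo $(( ($1 + $2 - 1) / $2 ))
}
```

Example: `hugepages_needed 8192 2` gives 4096, the value used for an 8GB shared_buffers with 2MB pages.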

14. Process Scheduling

14.1 CFS (Completely Fair Scheduler)

# CFS scheduler statistics
cat /proc/sched_debug | head -50

# Adjust nice value (-20 to 19, lower = higher priority)
nice -n -10 ./critical_process
renice -n -10 -p PID

# Check process scheduling policy
chrt -p PID

14.2 Real-Time Scheduling

# Set SCHED_FIFO (real-time, priority 1-99)
chrt -f 50 ./realtime_app

# SCHED_RR (round-robin real-time)
chrt -r 50 ./realtime_app

# SCHED_DEADLINE (deadline-based)
chrt -d --sched-runtime 5000000 --sched-deadline 10000000 \
    --sched-period 10000000 0 ./deadline_app

14.3 CPU Affinity

# Bind process to specific CPUs
taskset -c 0,1 ./myapp          # Run only on CPU 0, 1
taskset -c 0-3 ./myapp          # Run on CPU 0-3
taskset -pc 0,1 PID             # Apply to existing process

# IRQ affinity (interrupt distribution)
echo 2 > /proc/irq/IRQ_NUM/smp_affinity  # Assign to CPU 1
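smp_affinity takes a hexadecimal CPU bitmask, which is why CPU 1 is written as 2 (bit 1 set). A hypothetical helper converting a comma-separated CPU list into that mask:

```shell
# Build the hex bitmask smp_affinity expects from a CPU list like "0,2,3".
cpu_mask() {  # $1 = comma-separated CPU list
  mask=0
  for c in $(printf '%s' "$1" | tr ',' ' '); do
    mask=$(( mask | (1 << c) ))
  done
  printf '%x\n' "$mask"
}
```

Usage: `echo "$(cpu_mask 1)" > /proc/irq/IRQ_NUM/smp_affinity`.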

# isolcpus (boot parameter)
# GRUB: isolcpus=4,5,6,7
# Exclude CPU 4-7 from general scheduling for dedicated process use

15. Production Tuning Checklist

+----+------------------------------+-------------------------------+
| #  | Item                         | Recommended Setting           |
+----+------------------------------+-------------------------------+
| 1  | File descriptor limit        | ulimit -n 1048576             |
|    |                              | fs.file-max = 2097152         |
+----+------------------------------+-------------------------------+
| 2  | TCP backlog                  | net.core.somaxconn = 65535    |
+----+------------------------------+-------------------------------+
| 3  | TCP buffers                  | Optimize tcp_rmem/wmem        |
+----+------------------------------+-------------------------------+
| 4  | TCP congestion control       | BBR or env-appropriate algo   |
+----+------------------------------+-------------------------------+
| 5  | TCP Keepalive                | 600/60/3 (adjust per app)     |
+----+------------------------------+-------------------------------+
| 6  | TIME_WAIT management         | tcp_tw_reuse = 1              |
+----+------------------------------+-------------------------------+
| 7  | FIN timeout                  | tcp_fin_timeout = 15          |
+----+------------------------------+-------------------------------+
| 8  | SYN cookies                  | tcp_syncookies = 1            |
+----+------------------------------+-------------------------------+
| 9  | Port range                   | ip_local_port_range 1024-65535|
+----+------------------------------+-------------------------------+
| 10 | Swap behavior                | vm.swappiness = 10 (DB)       |
+----+------------------------------+-------------------------------+
| 11 | Dirty pages                  | dirty_ratio = 15              |
|    |                              | dirty_background_ratio = 5    |
+----+------------------------------+-------------------------------+
| 12 | I/O scheduler                | NVMe: none, SSD: mq-deadline  |
+----+------------------------------+-------------------------------+
| 13 | THP                          | Disable for DB servers        |
+----+------------------------------+-------------------------------+
| 14 | Huge Pages                   | Configure for DB shared_buf   |
+----+------------------------------+-------------------------------+
| 15 | NUMA                         | Bind DB to single node        |
+----+------------------------------+-------------------------------+
| 16 | CPU affinity                 | Pin critical processes        |
+----+------------------------------+-------------------------------+
| 17 | OOM score adjustment         | Protect critical processes    |
+----+------------------------------+-------------------------------+
| 18 | cgroups v2                   | Resource isolation and limits |
+----+------------------------------+-------------------------------+
| 19 | Congestion control           | BBR + fq qdisc                |
+----+------------------------------+-------------------------------+
| 20 | inotify watches              | max_user_watches = 524288     |
+----+------------------------------+-------------------------------+
| 21 | AIO requests                 | aio-max-nr = 1048576          |
+----+------------------------------+-------------------------------+
| 22 | Kernel log monitoring        | Check dmesg periodically      |
+----+------------------------------+-------------------------------+

16. Practice Quiz

Quiz 1: USE Method

If a server's CPU utilization is 90% but there is no performance degradation, what should you check next according to the USE method?

Answer: Saturation

High CPU utilization is not necessarily a problem. What matters is saturation. Check whether processes are waiting in the run queue.

# Check run queue length
vmstat 1 | awk '{print $1}'  # r column

# Measure scheduler latency with BCC runqlat
runqlat

If there is no saturation (r is less than or equal to the CPU count), the CPU is simply being used efficiently. To complete the USE pass, also check for Errors.

Quiz 2: Flame Graph Interpretation

In a Flame Graph, a function has a very wide width but its child functions fill up all the space above it. Is this function itself the bottleneck?

Answer: No

In a Flame Graph, a function's width represents the total CPU time of that function AND all its children. If child functions fill up the space, the actual CPU time is being consumed by the children.

The real bottleneck is found by looking for "plateaus" -- functions at the top of the stack that have wide width are the ones actually consuming CPU time.

Quiz 3: vm.swappiness

Does setting vm.swappiness to 0 completely disable swap?

Answer: No

vm.swappiness=0 does not completely disable swap. It instructs the kernel to minimize swap usage under memory pressure. Under extreme memory pressure, swapping can still occur.

To completely disable swap, use the swapoff -a command. However, this risks the OOM Killer terminating processes, so caution is required.

Quiz 4: THP and Databases

Why do Transparent Huge Pages (THP) negatively impact database performance (PostgreSQL, MySQL, MongoDB)?

Answer: THP allocates memory in 2MB chunks. Databases typically work with 8KB (PostgreSQL) or 16KB (MySQL) page sizes.

Issues:

  • Memory fragmentation: Kernel memory compaction for THP allocation causes latency spikes
  • Write amplification: Even modifying a small portion triggers Copy-on-Write of the entire 2MB
  • Memory waste: Allocations are larger than what is actually used
  • Unpredictable latency: Defrag process can stall processes

Instead, configure Explicit Huge Pages dedicated to shared_buffers.

Quiz 5: eBPF vs Traditional Tracing

Why is eBPF safer for production environments than strace?

Answer: strace uses the ptrace syscall to intercept every system call of the target process. Each intercepted syscall incurs extra context switches, which can slow a syscall-heavy workload dramatically -- often by an order of magnitude or more.

eBPF:

  • Runs inside the kernel: Minimizes user-kernel transitions
  • JIT compiled: Native code-level performance
  • Verifier: Guarantees programs cannot corrupt the kernel
  • Selective tracing: Only traces events you need
  • Minimal overhead: Typically under 5% performance impact

This makes it safe to use even in production environments.


17. References

  1. Systems Performance, 2nd Edition - Brendan Gregg (The Bible of performance engineering)
  2. BPF Performance Tools - Brendan Gregg (Complete eBPF tools guide)
  3. Brendan Gregg's Website - https://www.brendangregg.com/ (Extensive free resources)
  4. Linux Perf Wiki - https://perf.wiki.kernel.org/
  5. Flame Graphs - https://www.brendangregg.com/flamegraphs.html
  6. BCC Tools - https://github.com/iovisor/bcc
  7. bpftrace - https://github.com/bpftrace/bpftrace
  8. io_uring - https://kernel.dk/io_uring.pdf
  9. Linux kernel documentation: cgroups v2 - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
  10. NUMA documentation - https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html
  11. Red Hat Performance Tuning Guide - https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/
  12. Netflix Tech Blog - Linux Performance - https://netflixtechblog.com/
  13. Facebook/Meta Engineering Blog - BPF - https://engineering.fb.com/
  14. sysctl documentation - https://www.kernel.org/doc/html/latest/admin-guide/sysctl/