Linux Performance Engineering Complete Guide 2025: Profiling, System Tuning, eBPF, Bottleneck Analysis
Table of Contents
1. Why Linux Performance Engineering
A significant share of production incidents stems from performance problems. CPU spikes, memory leaks, disk I/O bottlenecks, network latency -- Linux performance engineering is the discipline of systematically analyzing and resolving all of these.
Core topics covered in this guide:
- Performance analysis methodologies (USE, RED, TSA)
- CPU analysis (perf, mpstat, pidstat, Flame Graphs)
- Memory analysis (vmstat, /proc/meminfo, slab, OOM Killer)
- Disk I/O (iostat, blktrace, I/O schedulers, fio)
- Network performance (sar, ss, iperf3, TCP tuning)
- eBPF deep dive (architecture, BCC, bpftrace, libbpf CO-RE)
- Flame Graphs (CPU, off-CPU, memory, I/O)
- sysctl tuning and cgroups v2
- NUMA, Huge Pages, process scheduling
- Production tuning checklist (20+ items)
2. Performance Analysis Methodologies
2.1 USE Method (Utilization, Saturation, Errors)
A systematic performance analysis framework proposed by Brendan Gregg.
For every resource, check three things:
+--------------------+-----------------+-----------------------+-----------------------+
| Resource           | U (Utilization) | S (Saturation)        | E (Errors)            |
+--------------------+-----------------+-----------------------+-----------------------+
| CPU                | mpstat          | runq latency          | perf/dmesg            |
| Memory             | free            | vmstat si/so          | dmesg OOM             |
| Network interface  | sar -n DEV      | ip -s link (overruns) | ip -s link (errors)   |
| Storage I/O        | iostat          | iostat avgqu-sz       | iostat (errors)       |
| Storage capacity   | df -h           | (none)                | stale mounts          |
| File descriptors   | lsof            | (none)                | "Too many open files" |
+--------------------+-----------------+-----------------------+-----------------------+
2.2 RED Method (Rate, Errors, Duration)
A methodology well suited to microservices.
Measure three things from the service's perspective:
Rate: requests per second (requests/sec)
Errors: failed request ratio (error rate)
Duration: request latency distribution (P50/P95/P99)
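The Duration percentiles above can be computed with the nearest-rank method; a small Python sketch (the latency samples are invented for illustration):

```python
import math

def percentile(sorted_vals, p):
    # Nearest-rank percentile: the smallest value covering at least p% of samples.
    k = math.ceil(p / 100 * len(sorted_vals)) - 1
    return sorted_vals[max(0, min(k, len(sorted_vals) - 1))]

latencies_ms = sorted([12, 15, 11, 240, 13, 14, 16, 12, 980, 15])
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
# P50: 14 ms  P95: 980 ms  P99: 980 ms
```

Note how two slow outliers leave P50 untouched but dominate P95/P99 -- which is why RED tracks the distribution, not the average.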
2.3 TSA Method (Thread State Analysis)
Thread state classification:
On-CPU: running (actively using a CPU)
Runnable: waiting to run (waiting for a CPU)
Sleeping: waiting on I/O, timers, locks, etc.
Idle: nothing to do
Analysis tools:
- perf record + Flame Graph (on-CPU analysis)
- offcputime (off-CPU analysis)
- bpftrace (detailed thread analysis)
3. Linux Performance Tools Overview
3.1 Brendan Gregg's Linux Performance Tools Map
Applications
/ | \
/ | \
System Libs Runtime Compiler
| | |
v v v
+-----------------------------------------+
| System Call Interface |
+-----------------------------------------+
| VFS | Sockets | Scheduler | VM |
+-----------+---------+-----------+-------+
| File Sys | TCP/UDP | (sched) | (mm) |
+-----------+---------+-----------+-------+
| Volume Mgr| IP | | |
+-----------+---------+-----------+-------+
| Block Dev | Ethernet| | |
+-----------+---------+-----------+-------+
| Device Drivers |
+-----------------------------------------+
Observability tools:
App: strace, ltrace, gdb
Sched: perf, mpstat, pidstat, runqlat
Memory: vmstat, slabtop, free, sar
FS: opensnoop, ext4slower, fileslower
Disk: iostat, biolatency, biotop
Net: sar, ss, tcpdump, nicstat
All: eBPF (bpftrace, BCC tools)
4. CPU Analysis
4.1 perf stat (Hardware Counters)
# Measure CPU counters for a process
perf stat -p PID sleep 10
# Example output:
# Performance counter stats for process id '12345':
#
# 10,234.56 msec task-clock # 1.023 CPUs utilized
# 2,345 context-switches # 229.12 /sec
# 12 cpu-migrations # 1.17 /sec
# 45,678 page-faults # 4.46K /sec
# 12,345,678,901 cycles # 1.206 GHz
# 9,876,543,210 instructions # 0.80 insn/cycle
# 1,234,567,890 branches # 120.63M /sec
# 12,345,678 branch-misses # 1.00% of all branches
# IPC (Instructions Per Cycle) is the key metric
# IPC < 1.0: likely memory-bound
# IPC > 1.0: likely compute-bound
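Since IPC is just instructions divided by cycles, the heuristic above can be checked against the sample counters; a trivial Python sketch (numbers copied from the example output):

```python
# Counts taken from the perf stat sample above.
cycles = 12_345_678_901
instructions = 9_876_543_210

ipc = instructions / cycles
print(f"IPC = {ipc:.2f}")  # IPC = 0.80
if ipc < 1.0:
    print("likely memory-bound (cores stalling on loads/stores)")
else:
    print("likely compute-bound")
```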
4.2 perf record + perf report
# Record a CPU profile (30 seconds)
perf record -g -p PID sleep 30
# Or system-wide
perf record -g -a sleep 30
# Analyze the profile
perf report --stdio
# Example output:
# Overhead Command Shared Object Symbol
# 23.45% nginx libc.so.6 [.] __memcpy_avx2
# 15.67% nginx nginx [.] ngx_http_parse_request_line
# 12.34% nginx [kernel.vmlinux] [k] copy_user_enhanced_fast_string
# 8.90% nginx libssl.so.3 [.] EVP_EncryptUpdate
4.3 Flame Graph Generation
# 1. Collect perf data
perf record -F 99 -g -a sleep 30
# 2. Generate the Flame Graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu.svg
# Or as a one-liner:
perf record -F 99 -g -a -- sleep 30 && \
perf script | \
stackcollapse-perf.pl | \
flamegraph.pl > flamegraph.svg
How to read a Flame Graph:
+-----------------------------------------------------------+
| main() |
+-------------------+---------------------------------------+
| process_request()| handle_connection() |
+--------+----------+------------------+--------------------+
| parse()| route() | read_file() | send_response() |
+--------+----+-----+------+-----------+----------+---------+
|sort| |read()| |write() |encrypt()|
+----+ +------+ +----------+---------+
- X-axis: share of CPU time (left-to-right order is meaningless)
- Y-axis: call stack depth (calls go bottom to top)
- Width: CPU time attributed to that function and its children
- Look for wide "plateaus" -- those are the optimization targets!
4.4 mpstat, pidstat
# Per-CPU utilization
mpstat -P ALL 1
# Example output:
# CPU %usr %nice %sys %iowait %irq %soft %steal %idle
# all 25.3 0.0 5.2 2.1 0.0 0.5 0.0 66.9
# 0 45.6 0.0 8.3 0.0 0.0 1.2 0.0 44.9 <- hot CPU
# 1 12.3 0.0 3.1 4.2 0.0 0.1 0.0 80.3
# 2 35.7 0.0 6.8 0.0 0.0 0.8 0.0 56.7
# 3 7.5 0.0 2.5 4.1 0.0 0.0 0.0 85.9
# Per-process CPU utilization
pidstat -p ALL 1
# Per-thread CPU utilization
pidstat -t -p PID 1
5. Memory Analysis
5.1 vmstat
# Memory/CPU statistics at 1-second intervals
vmstat 1
# Interpreting the output:
# procs --------memory-------- ---swap-- -----io---- -system-- ------cpu-----
# r b swpd free buff cache si so bi bo in cs us sy id wa
# 2 0 0 524288 65536 2097152 0 0 4 12 156 312 15 5 78 2
# 5 1 0 491520 65536 2097152 0 0 0 256 892 1543 45 12 38 5
# Key metrics:
# r: processes waiting in the run queue (r > CPU count = saturated)
# b: uninterruptible sleep (usually I/O wait)
# si/so: swap in/out (non-zero = memory pressure)
# wa: I/O wait (high = disk bottleneck)
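The r-column rule above can be applied mechanically; a Python sketch (the sample line and the 4-vCPU host are assumptions for illustration):

```python
# Second sample line from the vmstat output above.
line = " 5  1      0 491520  65536 2097152  0  0  0  256  892 1543 45 12 38  5"
r = int(line.split()[0])   # run-queue length is the first column
ncpu = 4                   # assume a 4-vCPU host for this example
print("CPU saturated" if r > ncpu else "CPU not saturated")
# CPU saturated
```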
5.2 /proc/meminfo Details
cat /proc/meminfo
# Key entries:
# MemTotal:     16384000 kB   Total physical memory
# MemFree:       1024000 kB   Completely unused memory
# MemAvailable:  8192000 kB   Actually available memory (including reclaimable cache)
# Buffers:        524288 kB   Block device I/O buffers
# Cached:        6553600 kB   Page cache
# SwapCached:          0 kB   Cache read back from swap
# Active:        4096000 kB   Recently accessed memory
# Inactive:      3072000 kB   Older memory (reclaim candidate)
# Slab:           512000 kB   Kernel data structure caches
# SReclaimable:   384000 kB   Reclaimable slab
# SUnreclaim:     128000 kB   Non-reclaimable slab
5.3 Page Cache and the OOM Killer
# Page cache status
free -h
# total used free shared buff/cache available
# Mem: 16G 4.2G 1.0G 256M 10.8G 11.2G
# Swap: 4G 0B 4G
# Drop caches (use with caution in production)
# echo 1 > /proc/sys/vm/drop_caches   # page cache only
# echo 2 > /proc/sys/vm/drop_caches   # dentries + inodes
# echo 3 > /proc/sys/vm/drop_caches   # both
# Check OOM Killer logs
dmesg | grep -i "oom\|out of memory\|killed process"
# Check a process's OOM score
cat /proc/PID/oom_score
# Adjust the OOM score (-1000 to 1000)
echo -500 > /proc/PID/oom_score_adj   # protect from the OOM Killer
echo 1000 > /proc/PID/oom_score_adj   # preferred OOM kill target
5.4 slabtop
# Monitor kernel slab caches
slabtop -s c   # sort by cache size
# Example output:
# OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
# 65536 62000 94% 0.19K 3277 20 13108K dentry
# 32768 30000 91% 0.50K 4096 8 16384K inode_cache
# 16384 15000 91% 1.00K 4096 4 16384K ext4_inode_cache
6. Disk I/O Analysis
6.1 iostat
# Disk I/O statistics (extended mode, 1-second interval)
iostat -xz 1
# Interpreting the output:
# Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s await r_await w_await svctm %util
# sda 150.0 200.0 6000 8000 10.0 50.0 2.50 1.80 3.00 0.85 29.8
# nvme0 500.0 800.0 50000 80000 0.0 0.0 0.25 0.20 0.28 0.08 10.4
# Key metrics:
# await: average I/O wait time (ms) - high = disk bottleneck
# %util: device utilization - near 100% = saturated (svctm is deprecated; ignore it)
# r_await, w_await: separate read/write wait times
# rrqm/wrqm: merged requests per second (high = sequential I/O)
6.2 I/O Schedulers
# Check the current I/O scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none
# Change the scheduler
echo "bfq" > /sys/block/sda/queue/scheduler
I/O scheduler comparison:
+-------------+-----------------+-----------------------------------------+
| Scheduler   | Best for        | Characteristics                         |
+-------------+-----------------+-----------------------------------------+
| none        | NVMe SSD        | No scheduling (delegates to hardware)   |
| mq-deadline | SSD/HDD general | Request deadline guarantees, good for   |
|             |                 | databases                               |
| bfq         | Desktop         | I/O fairness, good for interactive work |
| kyber       | High-perf SSD   | Low-latency targets, split read/write   |
|             |                 | queues                                  |
+-------------+-----------------+-----------------------------------------+
6.3 fio Benchmarking
# Sequential read benchmark
fio --name=seqread --rw=read --bs=1M --size=4G \
    --numjobs=1 --runtime=60 --direct=1
# Random read (IOPS measurement)
fio --name=randread --rw=randread --bs=4k --size=4G \
    --numjobs=8 --runtime=60 --direct=1 --iodepth=32
# Mixed workload (DB simulation)
fio --name=mixed --rw=randrw --rwmixread=70 \
    --bs=8k --size=4G --numjobs=4 --runtime=60 \
    --direct=1 --iodepth=16
# Interpreting the results:
# read: IOPS=125000, BW=488MiB/s, lat avg=0.25ms, p99=0.50ms
# write: IOPS=53571, BW=209MiB/s, lat avg=0.45ms, p99=1.20ms
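Bandwidth and IOPS are linked by bandwidth = IOPS x block size, which is a handy sanity check on fio output; a quick Python check against the 4k random-read figures above:

```python
# Numbers from the fio read line above: 4k blocks at 125,000 IOPS.
iops = 125_000
bs = 4 * 1024                       # block size in bytes
bw_mib = iops * bs / (1024 * 1024)  # bandwidth in MiB/s
print(f"{bw_mib:.0f} MiB/s")        # 488 MiB/s -- matches the reported BW
```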
7. Network Performance Analysis
7.1 sar
# Network interface statistics
sar -n DEV 1
# TCP statistics
sar -n TCP 1
# Example output:
# IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
# eth0 15000 12000 8500 6200 0.0 0.0 5.0 6.8
# TCP error statistics
sar -n ETCP 1
7.2 ss and TCP Tuning
# TCP connection state summary
ss -s
# Check receive/send queues (bottleneck diagnosis)
ss -tnp | awk '{print $2, $3, $5}'
# Congestion control and RTT info
ss -ti
# TCP memory usage
ss -tm
7.3 iperf3 Benchmarking
# Server side
iperf3 -s
# Client side (TCP bandwidth measurement)
iperf3 -c SERVER_IP -t 30 -P 4
# UDP bandwidth measurement
iperf3 -c SERVER_IP -u -b 10G -t 30
# Bidirectional test
iperf3 -c SERVER_IP -t 30 --bidir
# MTU optimization (Jumbo Frames)
# Standard: 1500 bytes
# Jumbo: 9000 bytes (within data centers)
ip link set eth0 mtu 9000
8. eBPF Deep Dive
8.1 eBPF Architecture
User Space                      Kernel Space
+--------------------+          +---------------------------+
|                    |          |                           |
| BCC / bpftrace /   |  load    | eBPF Virtual Machine      |
| libbpf programs    |--------->| (JIT compiled)            |
|                    |          |                           |
| Maps (data sharing)|<-------->| Hooks:                    |
|                    |  read/   | - kprobes (func entry)    |
| Output             |  write   | - tracepoints (static)    |
| (stdout, perf      |          | - XDP (network packets)   |
|  buffer, ringbuf)  |          | - LSM (security modules)  |
+--------------------+          | - cgroup (resource ctrl)  |
                                +---------------------------+
eBPF Verifier:
- Prevents unbounded loops
- Blocks out-of-bounds memory access
- Guarantees kernel stability
8.2 BCC Tools
# === CPU ===
# Trace new process execution
execsnoop     # traces exec() of new processes
# Run queue wait time histogram
runqlat       # CPU scheduler latency analysis
# === Memory ===
# Trace direct reclaim events
drsnoop       # direct reclaim tracing (memory pressure indicator)
# Trace outstanding allocations
memleak       # memory leak detection
# === Disk I/O ===
# Block I/O latency histogram
biolatency    # I/O wait time distribution
# Top processes by block I/O
biotop        # real-time view of I/O-heavy processes
# === File systems ===
# Slow file system operations
ext4slower 1  # ext4 operations taking longer than 1 ms
# Trace file opens
opensnoop     # which process opens which files
# === Network ===
# TCP connection tracing
tcpconnect    # new outbound TCP connections
tcpaccept     # accepted inbound TCP connections
tcpretrans    # TCP retransmissions
8.3 bpftrace One-Liners
# Syscall count per process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Histogram of read() sizes (bytes returned)
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
    @bytes = hist(args->ret);
}'
# Block I/O size per process
bpftrace -e 'tracepoint:block:block_rq_issue {
    @[comm] = hist(args->bytes);
}'
# Count scheduler switches per incoming task
bpftrace -e 'tracepoint:sched:sched_switch {
    @[args->next_comm] = count();
}'
# Trace TCP retransmissions per destination
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb {
    @[ntop(args->daddr), args->dport] = count();
}'
# VFS read latency
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
    @ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'
8.4 libbpf CO-RE (Compile Once - Run Everywhere)
// Minimal CO-RE eBPF program structure
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
struct event {
    u32 pid;
    u64 duration_ns;
    char comm[16];
};
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");
SEC("kprobe/do_sys_openat2")
int BPF_KPROBE(trace_openat2, int dfd, const char *filename)
{
    struct event *e;
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0; /* ring buffer full: drop the event */
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);
    return 0;
}
char LICENSE[] SEC("license") = "GPL";
9. Flame Graphs in Depth
9.1 CPU Flame Graph
# Collect a CPU profile with perf
perf record -F 99 -g -a sleep 30
# Generate the Flame Graph
perf script | \
stackcollapse-perf.pl | \
flamegraph.pl --title "CPU Flame Graph" > cpu_flame.svg
9.2 Off-CPU Flame Graph
Analyzes the time a process spends not running on a CPU (I/O, locks, sleeps, etc.).
# Using BCC offcputime
offcputime -df -p PID 30 | \
flamegraph.pl --color=io --title "Off-CPU" > offcpu.svg
# Approximate off-CPU timing with bpftrace
bpftrace -e '
tracepoint:sched:sched_switch {
    @start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
    @us[args->next_comm, ustack] =
        hist((nsecs - @start[args->next_pid]) / 1000);
    delete(@start[args->next_pid]);
}'
9.3 Memory Flame Graph
# Trace memory allocations
perf record -e kmem:kmalloc -g -a sleep 10
perf script | stackcollapse-perf.pl | \
flamegraph.pl --color=mem --title "Memory Allocations" > mem_flame.svg
# Or BCC memleak (prints outstanding allocation stacks; its output
# must be folded into collapsed format before feeding flamegraph.pl)
memleak -p PID -a 30
10. sysctl Tuning
10.1 VM (Virtual Memory) Tuning
# === Swap behavior ===
# vm.swappiness: tendency to swap (0-100)
# 0: minimize swapping (OOM risk)
# 10: recommended for DB servers
# 60: default
# 100: swap aggressively
sysctl -w vm.swappiness=10
# === Dirty pages ===
# Dirty page ratio vs. total memory (threshold where writers block on writeback)
sysctl -w vm.dirty_ratio=15
# Threshold for starting background flushing
sysctl -w vm.dirty_background_ratio=5
# Dirty page expiry time (centiseconds)
sysctl -w vm.dirty_expire_centisecs=3000
# Dirty page writeback interval
sysctl -w vm.dirty_writeback_centisecs=500
# === Overcommit ===
# 0: heuristic (default)
# 1: always allow
# 2: allow only up to swap + physical memory * overcommit_ratio
sysctl -w vm.overcommit_memory=0
sysctl -w vm.overcommit_ratio=50
# === Minimum free memory ===
sysctl -w vm.min_free_kbytes=65536
10.2 Network Tuning
# === TCP connections ===
# Maximum accept backlog
sysctl -w net.core.somaxconn=65535
# SYN backlog
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# Reuse TIME_WAIT sockets (outbound connections only)
sysctl -w net.ipv4.tcp_tw_reuse=1
# === TCP buffers ===
# Receive buffer (min, default, max)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
# Send buffer
sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"
# Global socket buffer caps
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
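A common way to choose the buffer maximums above is the bandwidth-delay product (BDP): the amount of data in flight on the path. A Python sketch, assuming a hypothetical 10 Gbit/s link with a 10 ms RTT:

```python
# BDP = bandwidth (bytes/s) * round-trip time (s)
bandwidth_bps = 10 * 10**9   # 10 Gbit/s link (assumed)
rtt_s = 0.010                # 10 ms RTT (assumed)
bdp_bytes = int(bandwidth_bps / 8 * rtt_s)
print(bdp_bytes)  # 12500000 -> ~12.5 MB in flight
```

The 16 MiB (16777216) maximum used above comfortably covers this BDP; a longer RTT or faster link would call for a larger cap.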
# === TCP keepalive ===
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=3
# === Congestion control ===
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq
# === Miscellaneous ===
# FIN-WAIT-2 timeout
sysctl -w net.ipv4.tcp_fin_timeout=15
# Ephemeral port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# SYN cookies (SYN flood defense)
sysctl -w net.ipv4.tcp_syncookies=1
10.3 File System Tuning
# System-wide file descriptor limit
sysctl -w fs.file-max=2097152
# inotify watch limit (file watching)
sysctl -w fs.inotify.max_user_watches=524288
# Maximum outstanding AIO requests
sysctl -w fs.aio-max-nr=1048576
10.4 Kernel Scheduler Tuning
# Note: on newer kernels (>= 5.13) these knobs moved under /sys/kernel/debug/sched/
# CFS minimum preemption granularity (ns)
sysctl -w kernel.sched_min_granularity_ns=1000000
# Scheduling latency target (ns)
sysctl -w kernel.sched_latency_ns=6000000
# Task migration cost (ns)
sysctl -w kernel.sched_migration_cost_ns=500000
# Automatic group scheduling
sysctl -w kernel.sched_autogroup_enabled=1
11. cgroups v2
11.1 cgroups v2 Basics
# Verify the cgroups v2 mount
mount | grep cgroup2
# cgroup2 on /sys/fs/cgroup type cgroup2
# Create a cgroup
mkdir /sys/fs/cgroup/myapp
# Assign a process
echo PID > /sys/fs/cgroup/myapp/cgroup.procs
# Check available controllers
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids
11.2 CPU Limits
# Cap CPU usage
# Format: QUOTA PERIOD (microseconds)
# 50 ms out of every 100 ms period = 50% of one CPU
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
# CPU weight (1-10000, default 100)
echo "200" > /sys/fs/cgroup/myapp/cpu.weight
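The quota/period arithmetic above generalizes to quota = cpus x period; a tiny Python sketch (the helper name is ours, not a kernel interface):

```python
# Build a cpu.max string for a target number of CPUs.
# period is in microseconds; quota = cpus * period.
def cpu_max(cpus, period_us=100_000):
    return f"{int(cpus * period_us)} {period_us}"

print(cpu_max(0.5))  # 50000 100000  -> half a CPU, as in the example above
print(cpu_max(2))    # 200000 100000 -> two full CPUs
```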
11.3 Memory Limits
# Hard memory limit
echo "1G" > /sys/fs/cgroup/myapp/memory.max
# Memory throttle threshold (reclaim pressure kicks in above this)
echo "512M" > /sys/fs/cgroup/myapp/memory.high
# Guaranteed minimum memory
echo "256M" > /sys/fs/cgroup/myapp/memory.min
# Swap limit
echo "0" > /sys/fs/cgroup/myapp/memory.swap.max
# Current memory usage
cat /sys/fs/cgroup/myapp/memory.current
# Memory statistics
cat /sys/fs/cgroup/myapp/memory.stat
11.4 I/O Limits
# Per-device I/O bandwidth limits
# Format: MAJOR:MINOR TYPE=LIMIT
# Limit reads on sda to 50 MB/s
echo "8:0 rbps=52428800" > /sys/fs/cgroup/myapp/io.max
# Limit writes on sda to 20 MB/s
echo "8:0 wbps=20971520" > /sys/fs/cgroup/myapp/io.max
# IOPS limits
echo "8:0 riops=1000 wiops=500" > /sys/fs/cgroup/myapp/io.max
# I/O weight
echo "default 100" > /sys/fs/cgroup/myapp/io.weight
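The byte values in the io.max examples above are just megabytes converted to bytes (50 MB/s = 52428800); a small Python sketch that builds such strings (the helper is illustrative, not a kernel API):

```python
# Build an io.max line from human-friendly MB/s limits.
def io_max(dev, rbps_mb=None, wbps_mb=None):
    parts = [dev]
    if rbps_mb is not None:
        parts.append(f"rbps={rbps_mb * 1024 * 1024}")
    if wbps_mb is not None:
        parts.append(f"wbps={wbps_mb * 1024 * 1024}")
    return " ".join(parts)

print(io_max("8:0", rbps_mb=50))              # 8:0 rbps=52428800
print(io_max("8:0", rbps_mb=50, wbps_mb=20))  # 8:0 rbps=52428800 wbps=20971520
```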
11.5 Docker/K8s Integration
# Resource limits with cgroups v2 in Docker
# docker run --cpus=2 --memory=4g --memory-swap=4g myapp
# Kubernetes Pod resources (mapped onto cgroups v2)
# resources:
#   requests:
#     cpu: "500m"      -> cpu.weight
#     memory: "512Mi"  -> memory.min
#   limits:
#     cpu: "2"         -> cpu.max
#     memory: "4Gi"    -> memory.max
12. NUMA (Non-Uniform Memory Access)
12.1 NUMA Topology
# Inspect NUMA topology
numactl --hardware
# Example output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 32768 MB
# node 0 free: 16384 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 32768 MB
# node 1 free: 15360 MB
# node distances:
# node 0 1
# 0: 10 21   <- same node: 10, remote node: 21 (~2.1x the access cost)
# 1: 21 10
# NUMA memory statistics
numastat -m
# Per-process NUMA memory statistics
numastat -p PID
12.2 NUMA Memory Binding
# Run on a specific NUMA node
numactl --cpunodebind=0 --membind=0 ./myapp
# Bind both CPU and memory to node 0
numactl -N 0 -m 0 ./database_process
# Interleave mode (spread evenly across nodes)
numactl --interleave=all ./myapp
# Inspect an existing process's NUMA policy
cat /proc/PID/numa_maps
13. Huge Pages
13.1 Transparent Huge Pages (THP)
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
# Disabling THP is recommended for databases (memory fragmentation + latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
13.2 Explicit Huge Pages
# Allocate huge pages (2 MB pages)
sysctl -w vm.nr_hugepages=1024   # 1024 * 2MB = 2GB
# Check current state
cat /proc/meminfo | grep Huge
# HugePages_Total: 1024
# HugePages_Free: 512
# HugePages_Rsvd: 256
# HugePages_Surp: 0
# Hugepagesize: 2048 kB
# 1 GB huge pages (kernel boot parameters)
# GRUB: hugepagesz=1G hugepages=16
13.3 Database Usage
# PostgreSQL: use huge pages for shared_buffers
# postgresql.conf:
#   huge_pages = try
#   shared_buffers = 8GB
# Sizing huge pages:
# shared_buffers / Hugepagesize = number of huge pages needed
# 8GB / 2MB = 4096 pages
sysctl -w vm.nr_hugepages=4096
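The sizing rule above is plain division; a trivial Python sketch of the same arithmetic:

```python
# Huge pages needed for a given shared_buffers size (2048 kB pages by default).
def hugepages_needed(shared_buffers_mb, hugepage_kb=2048):
    return shared_buffers_mb * 1024 // hugepage_kb

print(hugepages_needed(8 * 1024))  # 8 GB / 2 MB = 4096
```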
14. Process Scheduling
14.1 CFS (Completely Fair Scheduler)
# CFS scheduler statistics
cat /proc/sched_debug | head -50
# Adjust the nice value (-20 to 19; lower = higher priority)
nice -n -10 ./critical_process
renice -n -10 -p PID
# Check a process's scheduling policy
chrt -p PID
14.2 Real-Time Scheduling
# SCHED_FIFO (real-time, priority 1-99)
chrt -f 50 ./realtime_app
# SCHED_RR (round-robin real-time)
chrt -r 50 ./realtime_app
# SCHED_DEADLINE (deadline-based)
chrt -d --sched-runtime 5000000 --sched-deadline 10000000 \
     --sched-period 10000000 0 ./deadline_app
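SCHED_DEADLINE admits a task set only while total utilization (runtime/period, summed over tasks) stays within capacity; a simplified sketch of that check for the chrt example above (the real kernel admission test is per-CPU and caps real-time bandwidth below 100%):

```python
# (runtime_ns, period_ns) pairs; one task from the chrt example above.
tasks = [(5_000_000, 10_000_000)]
util = sum(runtime / period for runtime, period in tasks)
print(f"utilization = {util:.0%}")  # utilization = 50%
```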
14.3 CPU Affinity
# Pin a process to specific CPUs
taskset -c 0,1 ./myapp   # run only on CPUs 0 and 1
taskset -c 0-3 ./myapp   # run on CPUs 0-3
taskset -pc 0,1 PID      # apply to a running process
# IRQ affinity (interrupt distribution)
echo 2 > /proc/irq/IRQ_NUM/smp_affinity   # assign to CPU 1
# isolcpus (boot parameter)
# GRUB: isolcpus=4,5,6,7
# Removes CPUs 4-7 from general scheduling, dedicating them to chosen processes
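smp_affinity takes a hexadecimal CPU bitmask; a Python sketch that derives the mask from a CPU list (matching the "echo 2" example above):

```python
# Build an smp_affinity hex mask: bit N set = CPU N allowed.
def affinity_mask(cpus):
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return format(mask, "x")

print(affinity_mask([1]))           # 2  -> the "echo 2" example (CPU 1)
print(affinity_mask([0, 1]))        # 3
print(affinity_mask([4, 5, 6, 7]))  # f0 -> the isolcpus=4-7 set
```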
15. Production Tuning Checklist
+----+------------------------------+--------------------------------+
| #  | Item                         | Recommended setting / check    |
+----+------------------------------+--------------------------------+
| 1  | File descriptor limits       | ulimit -n 1048576              |
|    |                              | fs.file-max = 2097152          |
+----+------------------------------+--------------------------------+
| 2  | TCP backlog                  | net.core.somaxconn = 65535     |
+----+------------------------------+--------------------------------+
| 3  | TCP buffers                  | tune tcp_rmem/wmem             |
+----+------------------------------+--------------------------------+
| 4  | TCP congestion control       | BBR, or whatever fits the env  |
+----+------------------------------+--------------------------------+
| 5  | TCP keepalive                | 600/60/3 (per app needs)       |
+----+------------------------------+--------------------------------+
| 6  | TIME_WAIT management         | tcp_tw_reuse = 1               |
+----+------------------------------+--------------------------------+
| 7  | FIN timeout                  | tcp_fin_timeout = 15           |
+----+------------------------------+--------------------------------+
| 8  | SYN cookies                  | tcp_syncookies = 1             |
+----+------------------------------+--------------------------------+
| 9  | Port range                   | ip_local_port_range 1024-65535 |
+----+------------------------------+--------------------------------+
| 10 | Swap behavior                | vm.swappiness = 10 (DB server) |
+----+------------------------------+--------------------------------+
| 11 | Dirty pages                  | dirty_ratio = 15               |
|    |                              | dirty_background_ratio = 5     |
+----+------------------------------+--------------------------------+
| 12 | I/O scheduler                | NVMe: none, SSD: mq-deadline   |
+----+------------------------------+--------------------------------+
| 13 | THP                          | DB servers: disable            |
+----+------------------------------+--------------------------------+
| 14 | Huge Pages                   | size for DB shared_buffers     |
+----+------------------------------+--------------------------------+
| 15 | NUMA                         | bind DB to a single node       |
+----+------------------------------+--------------------------------+
| 16 | CPU affinity                 | pin critical processes         |
+----+------------------------------+--------------------------------+
| 17 | OOM score adjustment         | protect critical processes     |
+----+------------------------------+--------------------------------+
| 18 | cgroups v2                   | isolate and cap resources      |
+----+------------------------------+--------------------------------+
| 19 | Congestion control           | BBR + fq qdisc                 |
+----+------------------------------+--------------------------------+
| 20 | inotify watches              | max_user_watches = 524288      |
+----+------------------------------+--------------------------------+
| 21 | AIO requests                 | aio-max-nr = 1048576           |
+----+------------------------------+--------------------------------+
| 22 | Kernel log monitoring        | check dmesg regularly          |
+----+------------------------------+--------------------------------+
16. Hands-On Quiz
Quiz 1: USE Method
A server's CPU utilization is at 90% but there is no performance degradation. Under the USE method, what should you check next?
Answer: Saturation
High CPU utilization is not necessarily a problem; what matters is saturation. Check whether processes are waiting in the run queue.
# Check run queue length (r column, skipping the header lines)
vmstat 1 | awk 'NR > 2 {print $1}'
# Measure scheduler latency with BCC runqlat
runqlat
If there is no saturation (r at or below the CPU count), the CPUs are simply being used efficiently. Finally, check Errors.
Quiz 2: Flame Graph Interpretation
In a Flame Graph, a function occupies a very wide frame, but the frames above it are filled with child functions. Is the function itself the bottleneck?
Answer: No
A function's width in a Flame Graph is the total CPU time of the function plus its children. If children fill the space above it, the CPU time is actually being spent in those children.
To find the real bottleneck, look for "plateaus" -- wide frames at the top of the stack are the functions actually consuming CPU.
Quiz 3: vm.swappiness
Does setting vm.swappiness to 0 disable swapping entirely?
Answer: No
vm.swappiness=0 does not disable swap; it tells the kernel to minimize swapping. Under extreme memory pressure, swapping can still occur.
To disable swap completely, use swapoff -a -- but beware that this raises the risk of the OOM Killer terminating processes.
Quiz 4: THP and Databases
Why does Transparent Huge Pages (THP) hurt database (PostgreSQL, MySQL, MongoDB) performance?
Answer: THP allocates memory in 2 MB units, while databases typically work in 8 KB (PostgreSQL) or 16 KB (MySQL) pages.
Problems:
- Memory fragmentation: kernel compaction for THP allocations causes latency spikes
- Write amplification: changing part of a 2 MB page can copy the whole 2 MB (copy-on-write)
- Memory waste: allocations are larger than actual use
- Unpredictable latency: processes can stall during defrag
Instead, configuring Explicit Huge Pages dedicated to shared_buffers is recommended.
Quiz 5: eBPF vs. Traditional Tracing
Why is eBPF safer than strace in production?
Answer: strace uses the ptrace system call to intercept every system call of the target process, adding two extra context switches per syscall; this can degrade performance by 50-100% or more.
eBPF, by contrast:
- Runs inside the kernel: minimal user-space/kernel transitions
- JIT compiled: near-native performance
- Verifier: guarantees the program cannot corrupt the kernel
- Selective tracing: trace only the events you need
- Minimal overhead: typically under 5% performance impact
It can therefore be used safely even in production.
17. References
- Systems Performance, 2nd Edition - Brendan Gregg (the bible of performance engineering)
- BPF Performance Tools - Brendan Gregg (the complete guide to eBPF tooling)
- Brendan Gregg's Website - https://www.brendangregg.com/ (many free resources)
- Linux Perf Wiki - https://perf.wiki.kernel.org/
- Flame Graphs - https://www.brendangregg.com/flamegraphs.html
- BCC Tools - https://github.com/iovisor/bcc
- bpftrace - https://github.com/bpftrace/bpftrace
- io_uring - https://kernel.dk/io_uring.pdf
- Linux kernel documentation: cgroups v2 - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
- NUMA documentation - https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html
- Red Hat Performance Tuning Guide - https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/
- Netflix Tech Blog - Linux Performance - https://netflixtechblog.com/
- Facebook/Meta Engineering Blog - BPF - https://engineering.fb.com/
- sysctl documentation - https://www.kernel.org/doc/html/latest/admin-guide/sysctl/
Linux Performance Engineering Complete Guide 2025: Profiling, System Tuning, eBPF, Bottleneck Analysis
Table of Contents
1. Why Linux Performance Engineering
A significant portion of production incidents stem from performance issues. CPU spikes, memory leaks, disk I/O bottlenecks, network latency -- the ability to systematically analyze and resolve all of these is Linux performance engineering.
Core topics covered in this guide:
- Performance analysis methodologies (USE, RED, TSA)
- CPU analysis (perf, mpstat, pidstat, Flame Graph)
- Memory analysis (vmstat, /proc/meminfo, slab, OOM Killer)
- Disk I/O (iostat, blktrace, I/O schedulers, fio)
- Network performance (sar, ss, iperf3, TCP tuning)
- eBPF deep dive (architecture, BCC, bpftrace, libbpf CO-RE)
- Flame Graphs (CPU, off-CPU, memory, I/O)
- sysctl tuning and cgroups v2
- NUMA, Huge Pages, process scheduling
- Production tuning checklist (20+ items)
2. Performance Analysis Methodologies
2.1 USE Method (Utilization, Saturation, Errors)
A systematic performance analysis framework by Brendan Gregg.
For every resource, check 3 things:
+-------------------+--------------------------------------------+
| Resource | U (Utilization) | S (Saturation) | E (Errors)|
+-------------------+----------------+---------------+----------+
| CPU | mpstat | runq latency | perf/dmesg|
| Memory | free | vmstat si/so | dmesg OOM |
| Network Interface | sar -n DEV | ifconfig (overruns) | ifconfig (errors) |
| Storage I/O | iostat | iostat avgqu | iostat (errors) |
| Storage Capacity | df -h | (none) | stale mounts |
| File Descriptors | lsof | (none) | "Too many |
| | | | open files"|
+-------------------+----------------+---------------+----------+
2.2 RED Method (Rate, Errors, Duration)
A methodology well-suited for microservices.
Measure 3 things from a service perspective:
Rate: Requests per second (requests/sec)
Errors: Failed request ratio (error rate)
Duration: Request latency distribution (P50/P95/P99)
2.3 TSA Method (Thread State Analysis)
Thread State Classification:
On-CPU: Running (actively using CPU)
Runnable: Waiting to run (waiting for CPU)
Sleeping: Waiting for I/O, timer, lock, etc.
Idle: Nothing to do
Analysis Tools:
- perf record + Flame Graph (On-CPU analysis)
- offcputime (Off-CPU analysis)
- bpftrace (detailed thread analysis)
3. Linux Performance Tools Overview
3.1 Brendan Gregg's Linux Performance Tools Map
Applications
/ | \
/ | \
System Libs Runtime Compiler
| | |
v v v
+-----------------------------------------+
| System Call Interface |
+-----------------------------------------+
| VFS | Sockets | Scheduler | VM |
+-----------+---------+-----------+-------+
| File Sys | TCP/UDP | (sched) | (mm) |
+-----------+---------+-----------+-------+
| Volume Mgr| IP | | |
+-----------+---------+-----------+-------+
| Block Dev | Ethernet| | |
+-----------+---------+-----------+-------+
| Device Drivers |
+-----------------------------------------+
Observability Tools:
App: strace, ltrace, gdb
Sched: perf, mpstat, pidstat, runqlat
Memory: vmstat, slabtop, free, sar
FS: opensnoop, ext4slower, fileslower
Disk: iostat, biolatency, biotop
Net: sar, ss, tcpdump, nicstat
All: eBPF (bpftrace, BCC tools)
4. CPU Analysis
4.1 perf stat (Hardware Counters)
# Measure CPU counters for a process
perf stat -p PID sleep 10
# Example output:
# Performance counter stats for process id '12345':
#
# 10,234.56 msec task-clock # 1.023 CPUs utilized
# 2,345 context-switches # 229.12 /sec
# 12 cpu-migrations # 1.17 /sec
# 45,678 page-faults # 4.46K /sec
# 12,345,678,901 cycles # 1.206 GHz
# 9,876,543,210 instructions # 0.80 insn/cycle
# 1,234,567,890 branches # 120.63M /sec
# 12,345,678 branch-misses # 1.00% of all branches
# IPC (Instructions Per Cycle) is the key metric
# IPC < 1.0: likely memory-bound
# IPC > 1.0: compute-bound
4.2 perf record + perf report
# Record CPU profile (30 seconds)
perf record -g -p PID sleep 30
# Or system-wide
perf record -g -a sleep 30
# Analyze profile
perf report --stdio
# Example output:
# Overhead Command Shared Object Symbol
# 23.45% nginx libc.so.6 [.] __memcpy_avx2
# 15.67% nginx nginx [.] ngx_http_parse_request_line
# 12.34% nginx [kernel.vmlinux] [k] copy_user_enhanced_fast_string
# 8.90% nginx libssl.so.3 [.] EVP_EncryptUpdate
4.3 Flame Graph Generation
# 1. Collect perf data
perf record -F 99 -g -a sleep 30
# 2. Generate Flame Graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu.svg
# Or as a one-liner:
perf record -F 99 -g -a -- sleep 30 && \
perf script | \
stackcollapse-perf.pl | \
flamegraph.pl > flamegraph.svg
How to Read Flame Graphs:
+-----------------------------------------------------------+
| main() |
+-------------------+---------------------------------------+
| process_request()| handle_connection() |
+--------+----------+------------------+--------------------+
| parse()| route() | read_file() | send_response() |
+--------+----+-----+------+-----------+----------+---------+
|sort| |read()| |write() |encrypt()|
+----+ +------+ +----------+---------+
- X-axis: CPU time proportion (left-to-right order is meaningless)
- Y-axis: Call stack depth (bottom-to-top = call direction)
- Width: CPU time consumed by that function
- Find wide "plateaus" -- those are optimization targets!
4.4 mpstat, pidstat
# Per-CPU utilization
mpstat -P ALL 1
# Example output:
# CPU %usr %nice %sys %iowait %irq %soft %steal %idle
# all 25.3 0.0 5.2 2.1 0.0 0.5 0.0 66.9
# 0 45.6 0.0 8.3 0.0 0.0 1.2 0.0 44.9 <- hot CPU
# 1 12.3 0.0 3.1 4.2 0.0 0.1 0.0 80.3
# 2 35.7 0.0 6.8 0.0 0.0 0.8 0.0 56.7
# 3 7.5 0.0 2.5 4.1 0.0 0.0 0.0 85.9
# Per-process CPU utilization
pidstat -p ALL 1
# Per-thread CPU utilization
pidstat -t -p PID 1
5. Memory Analysis
5.1 vmstat
# Memory/CPU statistics at 1-second intervals
vmstat 1
# Interpreting the output:
# procs --------memory-------- ---swap-- -----io---- -system-- ------cpu-----
# r b swpd free buff cache si so bi bo in cs us sy id wa
# 2 0 0 524288 65536 2097152 0 0 4 12 156 312 15 5 78 2
# 5 1 0 491520 65536 2097152 0 0 0 256 892 1543 45 12 38 5
# Key metrics:
# r: Run queue waiting processes (r > CPU count = saturated)
# b: Uninterruptible sleep (usually I/O wait)
# si/so: Swap in/out (non-zero = memory pressure)
# wa: I/O wait (high = disk bottleneck)
5.2 /proc/meminfo Details
cat /proc/meminfo
# Key entries:
# MemTotal: 16384000 kB Total physical memory
# MemFree: 1024000 kB Completely unused memory
# MemAvailable: 8192000 kB Actually available (including reclaimable cache)
# Buffers: 524288 kB Block device I/O buffers
# Cached: 6553600 kB Page cache
# SwapCached: 0 kB Cache read back from swap
# Active: 4096000 kB Recently accessed memory
# Inactive: 3072000 kB Old memory (reclaim candidate)
# Slab: 512000 kB Kernel data structure cache
# SReclaimable: 384000 kB Reclaimable slab
# SUnreclaim: 128000 kB Non-reclaimable slab
5.3 Page Cache and OOM Killer
# Page cache status
free -h
# total used free shared buff/cache available
# Mem: 16G 4.2G 1.0G 256M 10.8G 11.2G
# Swap: 4G 0B 4G
# Clear cache (use with caution in production)
# echo 1 > /proc/sys/vm/drop_caches # Page cache only
# echo 2 > /proc/sys/vm/drop_caches # dentries + inodes
# echo 3 > /proc/sys/vm/drop_caches # All
# Check OOM Killer logs
dmesg | grep -i "oom\|out of memory\|killed process"
# Check per-process OOM score
cat /proc/PID/oom_score
# Adjust OOM score (-1000 to 1000)
echo -500 > /proc/PID/oom_score_adj # Protect from OOM
echo 1000 > /proc/PID/oom_score_adj # OOM kill priority target
5.4 slabtop
# Monitor kernel slab caches
slabtop -s c # Sort by cache size
# Example output:
# OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
# 65536 62000 94% 0.19K 3277 20 13108K dentry
# 32768 30000 91% 0.50K 4096 8 16384K inode_cache
# 16384 15000 91% 1.00K 4096 4 16384K ext4_inode_cache
6. Disk I/O Analysis
6.1 iostat
# Disk I/O statistics (extended mode, 1-second interval)
iostat -xz 1
# Interpreting the output:
# Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s await r_await w_await svctm %util
# sda 150.0 200.0 6000 8000 10.0 50.0 2.50 1.80 3.00 0.85 29.8
# nvme0 500.0 800.0 50000 80000 0.0 0.0 0.25 0.20 0.28 0.08 10.4
# Key metrics:
# await: Average I/O wait time (ms) - high = disk bottleneck
# %util: Device utilization - near 100% = saturated
# r_await, w_await: Separate read/write wait times
# rrqm/wrqm: Merged requests (high = sequential I/O)
6.2 I/O Schedulers
# Check current I/O scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none
# Change scheduler
echo "bfq" > /sys/block/sda/queue/scheduler
I/O Scheduler Comparison:
+-------------+------------------+----------------------------------+
| Scheduler | Best For | Characteristics |
+-------------+------------------+----------------------------------+
| none | NVMe SSD | No scheduling (delegates to HW) |
| mq-deadline | SSD/HDD general | Request deadline guarantee, |
| | | good for databases |
| bfq | Desktop | I/O fairness, interactive work |
| kyber | High-perf SSD | Low latency, read/write queues |
+-------------+------------------+----------------------------------+
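The table above can be applied mechanically using each device's rotational flag. A sketch only: the none/mq-deadline policy is a common default rather than a universal rule, and writing the scheduler requires root:

```shell
# Choose a scheduler from the rotational flag: 0 (NVMe/SSD) -> none,
# 1 (HDD) -> mq-deadline
pick_sched() {
    if [ "$1" = "0" ]; then echo none; else echo mq-deadline; fi
}
for q in /sys/block/*/queue; do
    rot=$(cat "$q/rotational" 2>/dev/null) || continue
    sched=$(pick_sched "$rot")
    # Only apply if the kernel offers that scheduler for this queue
    grep -qw "$sched" "$q/scheduler" &&
        { echo "$sched" > "$q/scheduler"; } 2>/dev/null
done
true  # the loop may end non-zero on read-only sysfs; not an error here
```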
6.3 fio Benchmarking
# Sequential read benchmark
fio --name=seqread --rw=read --bs=1M --size=4G \
--numjobs=1 --runtime=60 --direct=1
# Random read (IOPS measurement)
fio --name=randread --rw=randread --bs=4k --size=4G \
--numjobs=8 --runtime=60 --direct=1 --iodepth=32
# Mixed workload (DB simulation)
fio --name=mixed --rw=randrw --rwmixread=70 \
--bs=8k --size=4G --numjobs=4 --runtime=60 \
--direct=1 --iodepth=16
# Interpreting results:
# read: IOPS=125000, BW=488MiB/s, lat avg=0.25ms, p99=0.50ms
# write: IOPS=53571, BW=209MiB/s, lat avg=0.45ms, p99=1.20ms
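For scripted comparisons across runs, the IOPS figures can be pulled out of the summary lines. A sketch: fio's exact wording varies between versions (values may carry a "k" suffix), so for serious automation fio's --output-format=json plus a JSON parser is sturdier:

```shell
# fio_iops: print the IOPS value from fio summary lines on stdin
fio_iops() {
    awk -F'IOPS=' '/IOPS=/ { split($2, a, ","); print a[1] }'
}
# Usage: fio --name=randread ... | fio_iops
```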
7. Network Performance Analysis
7.1 sar
# Network interface statistics
sar -n DEV 1
# TCP statistics
sar -n TCP 1
# Example output:
# IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
# eth0 15000 12000 8500 6200 0.0 0.0 5.0 6.8
# TCP error statistics
sar -n ETCP 1
7.2 ss and TCP Tuning
# TCP connection state summary
ss -s
# Show sockets with backed-up receive/send queues (bottleneck diagnosis)
ss -tn | awk 'NR > 1 && ($2 > 0 || $3 > 0)'
# Congestion control and RTT info
ss -ti
# TCP memory usage
ss -tm
7.3 iperf3 Benchmarking
# Server side
iperf3 -s
# Client side (TCP bandwidth measurement)
iperf3 -c SERVER_IP -t 30 -P 4
# UDP bandwidth measurement
iperf3 -c SERVER_IP -u -b 10G -t 30
# Bidirectional test
iperf3 -c SERVER_IP -t 30 --bidir
# MTU optimization (Jumbo Frames)
# Standard: 1500 bytes
# Jumbo: 9000 bytes (within data centers)
ip link set eth0 mtu 9000
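Before relying on jumbo frames, verify the whole path actually carries them: a don't-fragment ping sized for the target MTU fails with "message too long" if any hop's MTU is smaller. ICMP payload is MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header:

```shell
# mtu_payload: ICMP payload size for a given MTU (IPv4)
mtu_payload() { echo $(( $1 - 28 )); }
mtu_payload 9000   # 8972
# Don't-fragment probe (SERVER_IP is a placeholder):
#   ping -M do -s "$(mtu_payload 9000)" -c 3 SERVER_IP
```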
8. eBPF Deep Dive
8.1 eBPF Architecture
User Space Kernel Space
+--------------------+ +---------------------------+
| | | |
| BCC / bpftrace / | load | eBPF Virtual Machine |
| libbpf programs |------->| (JIT compiled) |
| | | |
| Maps (data sharing)|<------>| Hooks: |
| | read/ | - kprobes (func entry) |
| Output | write | - tracepoints (static) |
| (stdout, perf | | - XDP (network packets) |
| buffer, ringbuf) | | - LSM (security module) |
+--------------------+ | - cgroup (resource ctrl) |
+---------------------------+
eBPF Verifier:
- Prevents infinite loops
- Blocks out-of-bounds memory access
- Guarantees kernel stability
8.2 BCC Tools
# === CPU Related ===
# Trace new process execution
execsnoop
# Run queue latency histogram
runqlat # CPU scheduler delay analysis
# === Memory Related ===
# Direct reclaim tracing
drsnoop # How long processes stall in direct reclaim
# Memory allocation tracing
memleak # Memory leak detection
# === Disk I/O ===
# Block I/O latency histogram
biolatency # I/O wait time distribution
# Top block I/O processes
biotop # Real-time I/O heavy processes
# === File System ===
# Slow filesystem operations
ext4slower 1 # ext4 operations taking more than 1ms
# File open tracing
opensnoop # Which process opens which files
# === Network ===
# TCP connection tracing
tcpconnect # New TCP connections
tcpaccept # Accepted TCP connections
tcpretrans # TCP retransmissions
8.3 bpftrace One-Liners
# Count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# read() return size histogram
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
@bytes = hist(args->ret);
}'
# Block I/O size by process
bpftrace -e 'tracepoint:block:block_rq_issue {
@[comm] = hist(args->bytes);
}'
# CPU scheduler switch tracing
bpftrace -e 'tracepoint:sched:sched_switch {
@[args->next_comm] = count();
}'
# TCP retransmission tracing (count by process)
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @[comm] = count(); }'
# VFS read latency
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
@ns = hist(nsecs - @start[tid]);
delete(@start[tid]);
}'
8.4 libbpf CO-RE (Compile Once - Run Everywhere)
// Simple CO-RE eBPF program structure
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
struct event {
    u32 pid;
    char comm[16];
};
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024);
} events SEC(".maps");
SEC("kprobe/do_sys_openat2")
int BPF_KPROBE(trace_openat2, int dfd, const char *filename)
{
struct event *e;
e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e) return 0;
e->pid = bpf_get_current_pid_tgid() >> 32;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
9. Flame Graphs Deep Dive
9.1 CPU Flame Graph
# Collect CPU profile with perf
perf record -F 99 -g -a sleep 30
# Generate Flame Graph
perf script | \
stackcollapse-perf.pl | \
flamegraph.pl --title "CPU Flame Graph" > cpu_flame.svg
9.2 Off-CPU Flame Graph
Analyzes time when a process is NOT running on CPU (I/O, locks, sleeps, etc.).
# Using BCC offcputime
offcputime -df -p PID 30 | \
flamegraph.pl --color=io --title "Off-CPU" > offcpu.svg
# bpftrace off-CPU analysis (blocked time per thread, microseconds)
bpftrace -e '
tracepoint:sched:sched_switch {
  @start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
  @us[args->next_comm] =
      hist((nsecs - @start[args->next_pid]) / 1000);
  delete(@start[args->next_pid]);
}'
9.3 Memory Flame Graph
# Trace memory allocations
perf record -e kmem:kmalloc -g -a sleep 10
perf script | stackcollapse-perf.pl | \
flamegraph.pl --color=mem --title "Memory Allocations" > mem_flame.svg
# Or using BCC memleak (prints outstanding allocation stacks directly;
# its output is not in folded format, so it cannot be piped straight
# into flamegraph.pl -- fold the stacks first)
memleak -p PID -a 30
10. sysctl Tuning
10.1 VM (Virtual Memory) Tuning
# === Swap Behavior ===
# vm.swappiness: Swap tendency (0-100)
# 0: Minimize swap (OOM risk)
# 10: Recommended for DB servers
# 60: Default
# 100: Aggressive swapping
sysctl -w vm.swappiness=10
# === Dirty Pages ===
# Dirty page ratio (write-back trigger threshold)
sysctl -w vm.dirty_ratio=15
# Background flush start threshold
sysctl -w vm.dirty_background_ratio=5
# Dirty page expiry (centiseconds)
sysctl -w vm.dirty_expire_centisecs=3000
# Dirty page writeback interval
sysctl -w vm.dirty_writeback_centisecs=500
# === Overcommit ===
# 0: Heuristic (default)
# 1: Always allow
# 2: Allow up to physical mem + swap * ratio
sysctl -w vm.overcommit_memory=0
sysctl -w vm.overcommit_ratio=50
# === Minimum Free Memory ===
sysctl -w vm.min_free_kbytes=65536
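Every sysctl -w above is lost on reboot. The conventional way to persist them is a drop-in file under /etc/sysctl.d (the filename below is our choice, not mandated), reloaded with sysctl --system. A sketch, to be run as root:

```shell
# Persist VM tuning across reboots
cat > /etc/sysctl.d/99-perf-tuning.conf <<'EOF'
vm.swappiness = 10
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
EOF
sysctl --system   # reload all sysctl configuration files
```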
10.2 Network Tuning
# === TCP Connections ===
# Max connection backlog
sysctl -w net.core.somaxconn=65535
# SYN backlog
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# Reuse TIME-WAIT sockets (outbound connections only)
sysctl -w net.ipv4.tcp_tw_reuse=1
# === TCP Buffers ===
# Receive buffer (min, default, max)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
# Send buffer
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Global network buffers
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
# === TCP Keepalive ===
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=3
# === Congestion Control ===
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq
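Selecting an algorithm the kernel doesn't offer simply fails, so check availability first. These are read-only checks (sysctl -n on the same keys works equally well):

```shell
# Which congestion control algorithms are available, and which is active
cat /proc/sys/net/ipv4/tcp_available_congestion_control
cat /proc/sys/net/ipv4/tcp_congestion_control
# If bbr is missing from the available list, load the module: modprobe tcp_bbr
```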
# === Miscellaneous ===
# FIN-WAIT-2 timeout
sysctl -w net.ipv4.tcp_fin_timeout=15
# Port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# SYN cookies (SYN Flood defense)
sysctl -w net.ipv4.tcp_syncookies=1
10.3 Filesystem Tuning
# System-wide file descriptor limit
sysctl -w fs.file-max=2097152
# inotify watch limit (file monitoring)
sysctl -w fs.inotify.max_user_watches=524288
# AIO max request count
sysctl -w fs.aio-max-nr=1048576
10.4 Kernel Scheduler Tuning
# CFS scheduler minimum runtime (ns)
# Note: from kernel 5.13 these knobs moved to debugfs (/sys/kernel/debug/sched/),
# and the EEVDF scheduler (6.6+) removed several of them
sysctl -w kernel.sched_min_granularity_ns=1000000
# Scheduling latency (ns)
sysctl -w kernel.sched_latency_ns=6000000
# Migration cost (ns)
sysctl -w kernel.sched_migration_cost_ns=500000
# Automatic group scheduling
sysctl -w kernel.sched_autogroup_enabled=1
11. cgroups v2
11.1 cgroups v2 Basics
# Check cgroups v2 mount
mount | grep cgroup2
# cgroup2 on /sys/fs/cgroup type cgroup2
# Enable controllers for child cgroups, then create one
echo "+cpu +memory +io" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/myapp
# Assign a process
echo PID > /sys/fs/cgroup/myapp/cgroup.procs
# Check available controllers
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids
11.2 CPU Limits
# CPU max usage limit
# Format: QUOTA PERIOD (microseconds)
# 50ms out of 100ms = 50% CPU
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
# CPU weight (1-10000, default 100)
echo "200" > /sys/fs/cgroup/myapp/cpu.weight
11.3 Memory Limits
# Memory hard limit
echo "1G" > /sys/fs/cgroup/myapp/memory.max
# Memory soft limit (reclaim priority target)
echo "512M" > /sys/fs/cgroup/myapp/memory.high
# Memory minimum guarantee
echo "256M" > /sys/fs/cgroup/myapp/memory.min
# Swap limit
echo "0" > /sys/fs/cgroup/myapp/memory.swap.max
# Current memory usage
cat /sys/fs/cgroup/myapp/memory.current
# Memory statistics
cat /sys/fs/cgroup/myapp/memory.stat
11.4 I/O Limits
# Per-device I/O bandwidth limit
# Format: MAJOR:MINOR TYPE=LIMIT
# Limit sda reads to 50MB/s
echo "8:0 rbps=52428800" > /sys/fs/cgroup/myapp/io.max
# Limit sda writes to 20MB/s
echo "8:0 wbps=20971520" > /sys/fs/cgroup/myapp/io.max
# IOPS limits
echo "8:0 riops=1000 wiops=500" > /sys/fs/cgroup/myapp/io.max
# I/O weight
echo "default 100" > /sys/fs/cgroup/myapp/io.weight
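Putting sections 11.2 through 11.4 together, a minimal end-to-end sketch (run as root; values are illustrative and PID is a placeholder for the target process):

```shell
# Enable controllers in the parent, create the group, apply limits, move a process in
CG=/sys/fs/cgroup/myapp
echo "+cpu +memory +io" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p "$CG"
echo "50000 100000" > "$CG/cpu.max"                      # 50% of one CPU
echo "1G" > "$CG/memory.max"                             # hard memory cap
echo "8:0 rbps=52428800 wbps=20971520" > "$CG/io.max"    # read/write caps on one line
echo PID > "$CG/cgroup.procs"
```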
11.5 Docker/K8s Integration
# Docker cgroups v2 resource limits
# docker run --cpus=2 --memory=4g --memory-swap=4g myapp
# Kubernetes Pod resources (maps to cgroups v2)
# resources:
# requests:
# cpu: "500m" -> cpu.weight
# memory: "512Mi" -> memory.min
# limits:
# cpu: "2" -> cpu.max
# memory: "4Gi" -> memory.max
12. NUMA (Non-Uniform Memory Access)
12.1 NUMA Topology
# Check NUMA topology
numactl --hardware
# Example output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 32768 MB
# node 0 free: 16384 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 32768 MB
# node 1 free: 15360 MB
# node distances:
# node 0 1
# 0: 10 21 <- SLIT relative cost: remote access ~2.1x local
# 1: 21 10
# NUMA memory statistics
numastat -m
# Per-process NUMA memory stats
numastat -p PID
12.2 NUMA Memory Binding
# Run on a specific NUMA node
numactl --cpunodebind=0 --membind=0 ./myapp
# Bind both CPU and memory to node 0
numactl -N 0 -m 0 ./database_process
# Interleave mode (distribute evenly across nodes)
numactl --interleave=all ./myapp
# Check NUMA policy of existing process
cat /proc/PID/numa_maps
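A quick system-wide check for cross-node traffic: a steadily growing numa_miss counter on a node means allocations intended for it are spilling onto another node. A read-only sketch over the per-node sysfs counters:

```shell
# Per-node NUMA hit/miss counters
for n in /sys/devices/system/node/node*; do
    echo "== ${n##*/} =="
    grep -E '^numa_(hit|miss|foreign)' "$n/numastat"
done
```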
13. Huge Pages
13.1 Transparent Huge Pages (THP)
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
# Disable THP for databases (memory fragmentation + latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
13.2 Explicit Huge Pages
# Allocate Huge Pages (2MB pages)
sysctl -w vm.nr_hugepages=1024 # 1024 * 2MB = 2GB
# Check current status
grep Huge /proc/meminfo
# HugePages_Total: 1024
# HugePages_Free: 512
# HugePages_Rsvd: 256
# HugePages_Surp: 0
# Hugepagesize: 2048 kB
# 1GB Huge Pages (kernel boot parameter)
# GRUB: hugepagesz=1G hugepages=16
13.3 Database Usage
# PostgreSQL: Use Huge Pages for shared_buffers
# postgresql.conf:
# huge_pages = try
# shared_buffers = 8GB
# Calculate Huge Pages needed:
# shared_buffers / Hugepagesize = required pages
# 8GB / 2MB = 4096 pages
sysctl -w vm.nr_hugepages=4096
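The page-count arithmetic above generalizes to any buffer size; the only care needed is rounding up. A sketch (the function name is ours; read the real page size from the Hugepagesize line of /proc/meminfo):

```shell
# hugepages_needed BYTES HUGEPAGE_KB: round-up page count for a buffer
hugepages_needed() {
    echo $(( ($1 + $2 * 1024 - 1) / ($2 * 1024) ))
}
hugepages_needed $((8 * 1024 * 1024 * 1024)) 2048   # 8 GiB / 2 MB = 4096
```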
14. Process Scheduling
14.1 CFS (Completely Fair Scheduler)
# CFS scheduler statistics
cat /proc/sched_debug | head -50
# Adjust nice value (-20 to 19, lower = higher priority)
nice -n -10 ./critical_process
renice -n -10 -p PID
# Check process scheduling policy
chrt -p PID
14.2 Real-Time Scheduling
# Set SCHED_FIFO (real-time, priority 1-99)
chrt -f 50 ./realtime_app
# SCHED_RR (round-robin real-time)
chrt -r 50 ./realtime_app
# SCHED_DEADLINE (deadline-based)
chrt -d --sched-runtime 5000000 --sched-deadline 10000000 \
--sched-period 10000000 0 ./deadline_app
14.3 CPU Affinity
# Bind process to specific CPUs
taskset -c 0,1 ./myapp # Run only on CPU 0, 1
taskset -c 0-3 ./myapp # Run on CPU 0-3
taskset -pc 0,1 PID # Apply to existing process
# IRQ affinity (interrupt distribution)
echo 2 > /proc/irq/IRQ_NUM/smp_affinity # Assign to CPU 1
# isolcpus (boot parameter)
# GRUB: isolcpus=4,5,6,7
# Exclude CPU 4-7 from general scheduling for dedicated process use
15. Production Tuning Checklist
+----+------------------------------+-------------------------------+
| # | Item | Recommended Setting |
+----+------------------------------+-------------------------------+
| 1 | File descriptor limit | ulimit -n 1048576 |
| | | fs.file-max = 2097152 |
+----+------------------------------+-------------------------------+
| 2 | TCP backlog | net.core.somaxconn = 65535 |
+----+------------------------------+-------------------------------+
| 3 | TCP buffers | Optimize tcp_rmem/wmem |
+----+------------------------------+-------------------------------+
| 4 | TCP congestion control | BBR or env-appropriate algo |
+----+------------------------------+-------------------------------+
| 5 | TCP Keepalive | 600/60/3 (adjust per app) |
+----+------------------------------+-------------------------------+
| 6 | TIME_WAIT management | tcp_tw_reuse = 1 |
+----+------------------------------+-------------------------------+
| 7 | FIN timeout | tcp_fin_timeout = 15 |
+----+------------------------------+-------------------------------+
| 8 | SYN cookies | tcp_syncookies = 1 |
+----+------------------------------+-------------------------------+
| 9 | Port range | ip_local_port_range 1024-65535|
+----+------------------------------+-------------------------------+
| 10 | Swap behavior | vm.swappiness = 10 (DB) |
+----+------------------------------+-------------------------------+
| 11 | Dirty pages | dirty_ratio = 15 |
| | | dirty_background_ratio = 5 |
+----+------------------------------+-------------------------------+
| 12 | I/O scheduler | NVMe: none, SSD: mq-deadline |
+----+------------------------------+-------------------------------+
| 13 | THP | Disable for DB servers |
+----+------------------------------+-------------------------------+
| 14 | Huge Pages | Configure for DB shared_buf |
+----+------------------------------+-------------------------------+
| 15 | NUMA | Bind DB to single node |
+----+------------------------------+-------------------------------+
| 16 | CPU affinity | Pin critical processes |
+----+------------------------------+-------------------------------+
| 17 | OOM score adjustment | Protect critical processes |
+----+------------------------------+-------------------------------+
| 18 | cgroups v2 | Resource isolation and limits |
+----+------------------------------+-------------------------------+
| 19 | Congestion control | BBR + fq qdisc |
+----+------------------------------+-------------------------------+
| 20 | inotify watches | max_user_watches = 524288 |
+----+------------------------------+-------------------------------+
| 21 | AIO requests | aio-max-nr = 1048576 |
+----+------------------------------+-------------------------------+
| 22 | Kernel log monitoring | Check dmesg periodically |
+----+------------------------------+-------------------------------+
16. Practice Quiz
Quiz 1: USE Method
If a server's CPU utilization is 90% but there is no performance degradation, what should you check next according to the USE method?
Answer: Saturation
High CPU utilization is not necessarily a problem. What matters is saturation. Check whether processes are waiting in the run queue.
# Check run queue length (r column; skip the two header lines)
vmstat 1 | awk 'NR > 2 {print $1}'
# Measure scheduler latency with BCC runqlat
runqlat
If there is no saturation (r is less than or equal to CPU count), the CPU is being efficiently used. Finally, check for Errors.
Quiz 2: Flame Graph Interpretation
In a Flame Graph, a function has a very wide width but its child functions fill up all the space above it. Is this function itself the bottleneck?
Answer: No
In a Flame Graph, a function's width represents the total CPU time of that function AND all its children. If child functions fill up the space, the actual CPU time is being consumed by the children.
The real bottleneck is found by looking for "plateaus" -- functions at the top of the stack that have wide width are the ones actually consuming CPU time.
Quiz 3: vm.swappiness
Does setting vm.swappiness to 0 completely disable swap?
Answer: No
vm.swappiness=0 does not completely disable swap. It instructs the kernel to minimize swap usage under memory pressure. Under extreme memory pressure, swapping can still occur.
To completely disable swap, use the swapoff -a command. However, this risks the OOM Killer terminating processes, so caution is required.
Quiz 4: THP and Databases
Why do Transparent Huge Pages (THP) negatively impact database performance (PostgreSQL, MySQL, MongoDB)?
Answer: THP allocates memory in 2MB chunks. Databases typically work with 8KB (PostgreSQL) or 16KB (MySQL) page sizes.
Issues:
- Memory fragmentation: Kernel memory compaction for THP allocation causes latency spikes
- Write amplification: Even modifying a small portion triggers Copy-on-Write of the entire 2MB
- Memory waste: Allocations are larger than what is actually used
- Unpredictable latency: Defrag process can stall processes
Instead, configure Explicit Huge Pages dedicated to shared_buffers.
Quiz 5: eBPF vs Traditional Tracing
Why is eBPF safer for production environments than strace?
Answer: strace uses the ptrace syscall to intercept every system call of the target process. Each syscall stops the tracee and wakes the tracer on both entry and exit, adding multiple context switches per syscall; syscall-heavy workloads can slow down by orders of magnitude.
eBPF:
- Runs inside the kernel: Minimizes user-kernel transitions
- JIT compiled: Native code-level performance
- Verifier: Guarantees programs cannot corrupt the kernel
- Selective tracing: Only traces events you need
- Minimal overhead: Typically under 5% performance impact
This makes it safe to use even in production environments.
17. References
- Systems Performance, 2nd Edition - Brendan Gregg (The Bible of performance engineering)
- BPF Performance Tools - Brendan Gregg (Complete eBPF tools guide)
- Brendan Gregg's Website - https://www.brendangregg.com/ (Extensive free resources)
- Linux Perf Wiki - https://perf.wiki.kernel.org/
- Flame Graphs - https://www.brendangregg.com/flamegraphs.html
- BCC Tools - https://github.com/iovisor/bcc
- bpftrace - https://github.com/bpftrace/bpftrace
- io_uring - https://kernel.dk/io_uring.pdf
- Linux kernel documentation: cgroups v2 - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
- NUMA documentation - https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html
- Red Hat Performance Tuning Guide - https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/
- Netflix Tech Blog - Linux Performance - https://netflixtechblog.com/
- Facebook/Meta Engineering Blog - BPF - https://engineering.fb.com/
- sysctl documentation - https://www.kernel.org/doc/html/latest/admin-guide/sysctl/