[Linux] Complete Guide to cgroups: The Core of Container Resource Control
Table of Contents
- What are cgroups?
- cgroup v1 vs v2
- Deep Dive into Major Controllers
- Docker and cgroups
- Kubernetes and cgroups
- Real-World Troubleshooting
- cgroups and Security
- Summary
1. What are cgroups?
The Birth of Control Groups
cgroup (Control Groups) is a process-group-level resource limiting and isolation mechanism provided by the Linux kernel. Google engineers Paul Menage and Rohit Seth began development in 2006 under the name "process containers." The work was merged into the mainline kernel in 2007 and shipped in Linux 2.6.24, renamed to cgroup along the way to avoid confusion with the existing kernel term "container."
Problems cgroups Solve
Before cgroups, it was difficult to prevent a single process from monopolizing all CPU or memory resources on a system. cgroups provide the following capabilities:
- Resource Limiting: Set upper bounds on CPU, memory, IO, and network bandwidth usage
- Prioritization: Adjust resource allocation ratios between process groups
- Accounting: Measure and report resource usage per group
- Control: Suspend, resume, and checkpoint process groups
The Relationship Between Namespaces and cgroups
The two pillars of container technology are Namespaces and cgroups.
+-------------------------------------------+
| Linux Container Isolation Model |
+-------------------------------------------+
| |
| Namespace (Isolation - Visibility) |
| ┌─────────────────────────────────────┐ |
| │ PID NS : Process ID isolation │ |
| │ NET NS : Network stack isolation │ |
| │ MNT NS : Filesystem mount isolation│ |
| │ UTS NS : Hostname isolation │ |
| │ IPC NS : IPC resource isolation │ |
| │ USER NS : UID/GID mapping isolation │ |
| └─────────────────────────────────────┘ |
| |
| cgroup (Limits - Resource Limits) |
| ┌─────────────────────────────────────┐ |
| │ CPU : CPU time limits │ |
| │ Memory : Memory usage limits │ |
| │ IO : Block device IO limits │ |
| │ PIDs : Process count limits │ |
| │ Devices : Device access control │ |
| └─────────────────────────────────────┘ |
| |
+-------------------------------------------+
The key difference in a nutshell:
- Namespace = "What can you see?" (isolation)
- cgroup = "How much can you use?" (limits)
cgroup Filesystem Basics
cgroups are managed through a virtual filesystem (VFS). All configuration is done through file reads and writes.
# Check cgroup mounts
mount | grep cgroup
# When cgroup v2 is mounted
# cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
# Check current process cgroup
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/session-1.scope
2. cgroup v1 vs v2
cgroup v1 Structure
In cgroup v1, each controller (CPU, memory, blkio, etc.) has an independent hierarchy.
cgroup v1 structure:
/sys/fs/cgroup/
├── cpu/ # CPU controller
│ ├── docker/
│ │ ├── container-abc/
│ │ │ ├── cpu.cfs_quota_us
│ │ │ ├── cpu.cfs_period_us
│ │ │ └── cpu.shares
│ │ └── container-xyz/
│ └── tasks
├── memory/ # Memory controller
│ ├── docker/
│ │ ├── container-abc/
│ │ │ ├── memory.limit_in_bytes
│ │ │ └── memory.usage_in_bytes
│ │ └── container-xyz/
│ └── tasks
├── blkio/ # Block IO controller
├── pids/ # PID controller
├── devices/ # Device controller
└── freezer/ # Freezer controller
Key v1 characteristics:
- Each controller has its own separate directory tree
- A single process can belong to different groups in each controller
- Flexible but complex to manage
cgroup v2 Structure (Unified Hierarchy)
cgroup v2 uses a single unified hierarchy.
cgroup v2 structure (Unified):
/sys/fs/cgroup/ # Root cgroup
├── cgroup.controllers # Available controllers list
├── cgroup.subtree_control # Controllers to enable for children
├── system.slice/ # systemd system services
│ ├── docker.service/
│ └── sshd.service/
├── user.slice/ # User sessions
│ └── user-1000.slice/
└── kubepods.slice/ # Kubernetes pods
├── kubepods-burstable.slice/
└── kubepods-besteffort.slice/
v1 vs v2 Comparison Table
| Feature | cgroup v1 | cgroup v2 |
|---|---|---|
| Hierarchy | Independent per controller | Single unified |
| Mount points | /sys/fs/cgroup/cpu/, /sys/fs/cgroup/memory/, etc. | /sys/fs/cgroup/ only |
| Process placement | Different group per controller possible | Same group for all controllers |
| PSI (Pressure Stall Info) | Not supported | Supported |
| Threaded mode | Not supported | Supported |
| Subtree delegation | Limited | Full delegation model |
| Memory QoS | memory.limit only | 4-tier: memory.min/low/high/max |
| IO control | blkio (Direct IO only) | io (Buffered IO supported) |
| CPU burst | Not supported (needs kernel patch) | Supported (kernel 5.14+) |
v2 Migration Compatibility
| Software | cgroup v2 Support Since |
|---|---|
| systemd | 236+ |
| Docker | 20.10+ |
| containerd | 1.4+ |
| Kubernetes | 1.25+ (GA) |
| Podman | Native support |
| RHEL | 9+ (default) |
| Ubuntu | 21.10+ (default) |
Checking cgroup v2 Activation
# Check if cgroup v2 is in use
stat -fc %T /sys/fs/cgroup/
# cgroup2fs -> v2
# tmpfs -> v1
# Or check mount info
grep cgroup /proc/mounts
# Force v2 via kernel boot parameters
# GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"
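The same check can be automated by parsing /proc/mounts. The helper below is an illustrative sketch (the function name is not from any standard tool); it also flags the "hybrid" layout some distributions use, where a cgroup2 mount coexists with per-controller v1 mounts.

```python
def detect_cgroup_version(mounts_text: str) -> str:
    """Return 'v2', 'v1', or 'hybrid' based on /proc/mounts contents.

    A cgroup2 filesystem mount means v2; per-controller 'cgroup'
    (v1) mounts mean v1; both at once is the hybrid layout.
    """
    has_v2 = False
    has_v1 = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        fstype = fields[2]
        if fstype == "cgroup2":
            has_v2 = True
        elif fstype == "cgroup":
            has_v1 = True
    if has_v1 and has_v2:
        return "hybrid"
    return "v2" if has_v2 else "v1"
```

On a live system, pass it `open("/proc/mounts").read()`.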
3. Deep Dive into Major Controllers
3.1 CPU Controller
CFS Bandwidth Control
The Linux CFS (Completely Fair Scheduler) distributes CPU time fairly. The cgroup CPU controller adds bandwidth throttling on top of CFS.
CFS Bandwidth Control Principle:
Period = 100ms (default)
Quota = 50ms (allocation)
0ms 50ms 100ms 150ms 200ms
|----------|----------|----------|----------|
[##########]...........[##########]...........
^- running -^ ^throttled^ ^- running -^ ^throttled^
For 1 CPU: quota/period = 50ms/100ms = 0.5 CPU (50%)
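The quota/period arithmetic above is simple enough to wrap in a small parser for the v2 `cpu.max` format ("<quota> <period>" in microseconds, or "max <period>" when unthrottled). The function below is a sketch, not part of any library:

```python
from typing import Optional


def effective_cpus(cpu_max: str) -> Optional[float]:
    """Parse a cgroup v2 cpu.max string into an effective CPU count.

    "50000 100000" -> 0.5 CPU; "max 100000" means unlimited (None).
    """
    quota_s, period_s = cpu_max.split()
    if quota_s == "max":
        return None  # no bandwidth limit
    return int(quota_s) / int(period_s)
```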
v1 vs v2 CPU Parameters
# ===== cgroup v1 =====
# CFS quota: in microseconds, -1 means unlimited
cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.cfs_quota_us
# 50000 (50ms)
cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.cfs_period_us
# 100000 (100ms)
# CPU shares: relative weight (default 1024)
cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.shares
# 512
# ===== cgroup v2 =====
# cpu.max: "quota period" format
cat /sys/fs/cgroup/kubepods.slice/.../cpu.max
# 50000 100000 (50ms quota, 100ms period)
# "max 100000" -> unlimited
# cpu.weight: 1-10000 (default 100)
cat /sys/fs/cgroup/kubepods.slice/.../cpu.weight
# 100
CPU shares/weight Conversion
v1 cpu.shares -> v2 cpu.weight conversion:
v2_weight = 1 + ((v1_shares - 2) * 9999) / 262142   (integer division)
Examples:
v1 shares=1024 (default) -> v2 weight = 39
v1 shares=512            -> v2 weight = 20
v1 shares=2048           -> v2 weight = 79
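This is the linear mapping used by container runtimes when translating v1 shares to v2 weight; note the integer arithmetic. A minimal sketch:

```python
def shares_to_weight(shares: int) -> int:
    """Convert cgroup v1 cpu.shares (2..262144) to v2 cpu.weight (1..10000).

    weight = 1 + ((shares - 2) * 9999) / 262142, with integer division,
    so shares=2 maps to weight=1 and shares=262144 to weight=10000.
    """
    shares = max(2, min(shares, 262144))  # clamp to the valid v1 range
    return 1 + (shares - 2) * 9999 // 262142
```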
cpuset: CPU Pinning
Pin processes to specific CPU cores.
# v2: Specify CPU cores to use
echo "0-3" > /sys/fs/cgroup/mygroup/cpuset.cpus
echo "0" > /sys/fs/cgroup/mygroup/cpuset.mems
# Check current settings
cat /sys/fs/cgroup/mygroup/cpuset.cpus.effective
Hands-on: Creating a CPU-Limited cgroup
# Hands-on in a cgroup v2 environment
# 1. Create a new cgroup
sudo mkdir /sys/fs/cgroup/cpu-test
# 2. Enable CPU controller (from root)
echo "+cpu" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
# 3. Limit to 50% CPU (50ms quota / 100ms period)
echo "50000 100000" | sudo tee /sys/fs/cgroup/cpu-test/cpu.max
# 4. Add current shell to cgroup
echo $$ | sudo tee /sys/fs/cgroup/cpu-test/cgroup.procs
# 5. Generate CPU load and verify the limit
stress --cpu 1 --timeout 10 &
# 6. Check throttling statistics
cat /sys/fs/cgroup/cpu-test/cpu.stat
# usage_usec 4500000
# user_usec 4500000
# system_usec 0
# nr_periods 100
# nr_throttled 50
# throttled_usec 5000000
# 7. Cleanup
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/cpu-test
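The `cpu.stat` counters from step 6 reduce to a single throttle ratio (nr_throttled / nr_periods). The parsing helper below is an illustrative sketch:

```python
def throttle_ratio(cpu_stat: str) -> float:
    """Compute nr_throttled / nr_periods from cgroup v2 cpu.stat text.

    Returns 0.0 when no CFS periods have elapsed yet.
    """
    stats = {}
    for line in cpu_stat.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods
```

Applied to the sample output above (nr_periods 100, nr_throttled 50), this yields 0.5, i.e. the group was throttled in half of all periods.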
CPU Throttling Issues and Burst
CPU throttling can cause severe latency spikes in bursty workloads.
Problem scenario: Web server with 30% average CPU usage (limit: 1 CPU)
Normal: [###-------][###-------][###-------] -> Response 10ms
Burst: [##########][..........][###-------] -> Response 110ms!
^- used all 100ms -^ ^- throttled -^
Solution: CPU burst (kernel 5.14+, cgroup v2)
cpu.max.burst = allowed burst in microseconds
Saves unused quota from previous periods for burst usage
# Set CPU burst (cgroup v2, kernel 5.14+)
echo 20000 > /sys/fs/cgroup/mygroup/cpu.max.burst
# Allow up to 20ms burst
3.2 Memory Controller
v2's 4-Tier Memory Limits
cgroup v2 provides more granular 4-tier memory control compared to v1.
Memory Control 4 Tiers (cgroup v2):
memory.min : Absolute protection (no reclaim below this)
^ Minimum guaranteed zone
memory.low : Soft protection (best-effort reclaim prevention)
^ Preferred zone
memory.high : Soft limit (throttle on exceed, no OOM)
^ Warning zone
memory.max : Hard limit (OOM Killer triggers on exceed)
^ Forbidden zone
0 MB 1024 MB
|=====[min]==[low]==========[high]====[max]========|
|<protected>|<prefer>|<normal use>|<throttle>|<OOM>|
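The four thresholds form ordered zones on a number line. A toy classifier (our own illustration; the kernel exposes no such "zone", it simply applies reclaim protection below min/low, throttling above high, and OOM at max) makes the ordering explicit:

```python
def memory_zone(usage: int, min_: int, low: int, high: int, max_: int) -> str:
    """Classify current usage against the four cgroup v2 memory thresholds.
    Illustrative only -- not a kernel interface."""
    assert min_ <= low <= high <= max_, "thresholds must be ordered"
    if usage < min_:
        return "protected"   # never reclaimed
    if usage < low:
        return "preferred"   # reclaimed only as a last resort
    if usage < high:
        return "normal"
    if usage < max_:
        return "throttled"   # allocations slowed, no OOM yet
    return "oom"             # memory.max reached -> OOM Killer

MiB = 1024 * 1024
assert memory_zone(300 * MiB, 64 * MiB, 128 * MiB, 800 * MiB, 1024 * MiB) == "normal"
assert memory_zone(900 * MiB, 64 * MiB, 128 * MiB, 800 * MiB, 1024 * MiB) == "throttled"
```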
v1 vs v2 Memory Parameters
# ===== cgroup v1 =====
# Hard limit
echo 536870912 > memory.limit_in_bytes # 512MB
# Soft limit (reclaim pressure)
echo 268435456 > memory.soft_limit_in_bytes # 256MB
# Swap inclusive limit
echo 1073741824 > memory.memsw.limit_in_bytes # 1GB (mem+swap)
# OOM control
echo 1 > memory.oom_control # Disable OOM Killer
# ===== cgroup v2 =====
echo 512M > memory.max # Hard limit
echo 256M > memory.high # Soft limit (throttle)
echo 128M > memory.low # Best-effort protection
echo 64M > memory.min # Absolute protection
echo 256M > memory.swap.max # Swap limit
Analyzing memory.stat
cat /sys/fs/cgroup/kubepods.slice/.../memory.stat
# anon 104857600 # Anonymous memory (heap, stack)
# file 52428800 # File cache
# kernel 8388608 # Kernel memory (slab, etc.)
# shmem 0 # Shared memory
# pgfault 250000 # Page fault count
# pgmajfault 10 # Major page faults
# workingset_refault 500 # Refaults after eviction from working set
# oom_kill 0 # OOM kill count
Key metric interpretation:
- High anon: Application heap memory usage is large
- High file: Lots of file cache (usually normal)
- High pgmajfault: Frequent disk reads (sign of memory pressure)
- oom_kill > 0: OOM events have occurred
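memory.stat is a simple "key value" file, so these checks are easy to script. A minimal parser (helper name is ours):

```python
def parse_memory_stat(text: str) -> dict:
    """Parse cgroup v2 memory.stat content ('key value' per line) into a dict."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()[:2]
        stats[key] = int(value)
    return stats

# Sample content matching the output shown above
sample = """\
anon 104857600
file 52428800
pgmajfault 10
oom_kill 0
"""
stats = parse_memory_stat(sample)
assert stats["anon"] == 104857600   # ~100MB of heap/stack
assert stats["oom_kill"] == 0       # no OOM kills so far
```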
Hands-on: Memory Limits and OOM Testing
# 1. Create a memory-limited cgroup
sudo mkdir /sys/fs/cgroup/mem-test
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
# 2. Set 100MB hard limit
echo 104857600 | sudo tee /sys/fs/cgroup/mem-test/memory.max
# 3. Set 80MB soft limit (throttle on exceed)
echo 83886080 | sudo tee /sys/fs/cgroup/mem-test/memory.high
# 4. Add process to cgroup and allocate memory
echo $$ | sudo tee /sys/fs/cgroup/mem-test/cgroup.procs
# 5. Attempt to allocate over 100MB
python3 -c "
data = []
for i in range(200):
data.append(bytearray(1024 * 1024)) # 1MB each
print(f'Allocated {i+1} MB')
"
# -> OOM Killed at around 100MB
# 6. Check OOM events
cat /sys/fs/cgroup/mem-test/memory.events
# oom 1
# oom_kill 1
# high 150
# 7. Check OOM logs in dmesg
dmesg | grep -i "killed process"
Swap Control
# Swap limits in cgroup v2
echo 0 > /sys/fs/cgroup/mygroup/memory.swap.max # Disable swap
echo max > /sys/fs/cgroup/mygroup/memory.swap.max # Unlimited swap
# v1 uses combined memory + swap limit
echo 1073741824 > memory.memsw.limit_in_bytes # mem + swap = 1GB
3.3 IO Controller
v1 (blkio) vs v2 (io)
The v1 blkio controller could only control Direct IO. Buffered IO goes through the page cache and was invisible to blkio. The v2 io controller handles Buffered IO as well through the writeback mechanism.
# ===== cgroup v2: io.max =====
# Format: MAJ:MIN rbps=NUM wbps=NUM riops=NUM wiops=NUM
# Check device numbers
lsblk -o NAME,MAJ:MIN
# sda 8:0
# Limit read 10MB/s, write 5MB/s
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/mygroup/io.max
# IOPS limit: read 1000 IOPS, write 500 IOPS
echo "8:0 riops=1000 wiops=500" > /sys/fs/cgroup/mygroup/io.max
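Composing the io.max string by hand invites byte-arithmetic mistakes. A small formatter (our own helper, not part of any tool) shows the conversion; note that keys omitted from a write keep their current value, while writing `key=max` resets them:

```python
def io_max_line(dev, rbps_mb=None, wbps_mb=None, riops=None, wiops=None):
    """Build a cgroup v2 io.max line ('MAJ:MIN key=value ...').
    MB/s values are converted to bytes per second."""
    parts = [dev]
    if rbps_mb is not None:
        parts.append(f"rbps={int(rbps_mb * 1024 * 1024)}")
    if wbps_mb is not None:
        parts.append(f"wbps={int(wbps_mb * 1024 * 1024)}")
    if riops is not None:
        parts.append(f"riops={riops}")
    if wiops is not None:
        parts.append(f"wiops={wiops}")
    return " ".join(parts)

# Matches the echo example above: read 10MB/s, write 5MB/s on sda (8:0)
assert io_max_line("8:0", rbps_mb=10, wbps_mb=5) == "8:0 rbps=10485760 wbps=5242880"
assert io_max_line("8:0", riops=1000, wiops=500) == "8:0 riops=1000 wiops=500"
```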
io.weight (Relative Weight)
# io.weight: 1-10000 (default 100)
echo "default 200" > /sys/fs/cgroup/mygroup/io.weight
# Set weight for a specific device
echo "8:0 500" > /sys/fs/cgroup/mygroup/io.weight
io.latency (v2 Only)
Set latency targets to guarantee IO response times.
# Set 5ms latency target
echo "8:0 target=5000" > /sys/fs/cgroup/mygroup/io.latency
3.4 PID Controller
Fork Bomb Prevention
# Set PID limit
echo 100 > /sys/fs/cgroup/mygroup/pids.max
# Check current PID count
cat /sys/fs/cgroup/mygroup/pids.current
# 5
# Fork bomb test (safe environments only!)
# Without limits, the entire system can freeze
# With pids.max set, fork() fails at 100 processes
Fork Bomb Prevention Principle:
pids.max = 100
Process Tree:
init(1)
├── bash(2)
│ ├── worker(3)
│ ├── worker(4)
│ │ ├── child(5)
│ │ └── child(6)
│ ...
│ └── worker(100) <- pids.max reached
│ └── fork() -> EAGAIN (fails!)
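Scripts that watch these files need to handle the "max" sentinel, since pids.max reads `max` when unlimited. A small sketch (helper names are ours); when the limit is actually hit, `fork()` fails with EAGAIN, which Python surfaces as `OSError: Resource temporarily unavailable`:

```python
def parse_pids_max(raw: str):
    """pids.max contains either 'max' (unlimited) or a decimal number."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)

def remaining_pids(current: int, maximum) -> int:
    """How many more processes the cgroup may create; -1 means unlimited."""
    if maximum is None:
        return -1
    return max(0, maximum - current)

assert parse_pids_max("max\n") is None
assert parse_pids_max("100\n") == 100
assert remaining_pids(5, 100) == 95      # 5 running out of a 100-PID budget
assert remaining_pids(5, None) == -1     # no limit set
```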
3.5 Other Controllers
devices Controller
Control device access via whitelist/blacklist.
# v1: devices.allow / devices.deny
echo 'c 1:3 rmw' > devices.allow # Allow /dev/null access
echo 'b 8:0 r' > devices.allow # Allow /dev/sda read
echo 'a' > devices.deny # Deny all devices
# GPU access control (NVIDIA)
echo 'c 195:* rmw' > devices.allow # Allow NVIDIA devices
freezer Controller
Freeze and thaw process groups.
# v2: cgroup.freeze
echo 1 > /sys/fs/cgroup/mygroup/cgroup.freeze # Freeze
echo 0 > /sys/fs/cgroup/mygroup/cgroup.freeze # Thaw
# Check status
cat /sys/fs/cgroup/mygroup/cgroup.events
# frozen 1
hugetlb Controller
Limit huge page usage.
# 2MB huge pages limit
echo 1073741824 > hugetlb.2MB.limit_in_bytes # 1GB
cat hugetlb.2MB.usage_in_bytes
4. Docker and cgroups
Docker Resource Limit Options
# CPU limits
docker run -d \
--cpus="1.5" \ # 1.5 CPUs (quota=150000, period=100000)
--cpu-shares=512 \ # CPU shares (relative weight)
--cpuset-cpus="0,1" \ # Use CPU cores 0 and 1 only
nginx
# Memory limits
docker run -d \
--memory=512m \ # Hard memory limit 512MB
--memory-swap=1g \ # Memory+swap total 1GB (512MB swap)
--memory-reservation=256m \ # Soft limit
--oom-kill-disable \ # Disable OOM Killer (use with caution!)
nginx
# IO limits
docker run -d \
--device-read-bps /dev/sda:10mb \ # Read 10MB/s
--device-write-bps /dev/sda:5mb \ # Write 5MB/s
--device-read-iops /dev/sda:1000 \ # Read 1000 IOPS
nginx
# PID limit
docker run -d \
--pids-limit=100 \ # Max 100 processes
nginx
cgroup Paths Created by Docker
# With systemd cgroup driver
# /sys/fs/cgroup/system.slice/docker-CONTAINER_ID.scope/
# Get container ID
CONTAINER_ID=$(docker ps -q --filter name=my-nginx)
# Check cgroup path
docker inspect --format='{{.HostConfig.CgroupParent}}' $CONTAINER_ID
# Directly inspect cgroup files (v2)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.max
Real-Time Monitoring with docker stats
# Resource usage for all containers
docker stats
# CONTAINER ID NAME CPU % MEM USAGE/LIMIT MEM % NET I/O BLOCK I/O
# abc123 nginx 0.50% 50MiB / 512MiB 9.77% 1.2kB/0B 0B/0B
# xyz789 redis 1.20% 30MiB / 256MiB 11.7% 500B/200B 4kB/0B
# Specific container only
docker stats my-nginx --no-stream
Docker cgroup Driver: cgroupfs vs systemd
+---------------------------+---------------------+---------------------+
| | cgroupfs | systemd |
+---------------------------+---------------------+---------------------+
| cgroup management | Docker manages | Delegated to |
| | directly | systemd |
| Conflict with init system | Possible | None (integrated) |
| Kubernetes recommendation | Not recommended | Recommended |
| | | (default) |
| Config location | Docker creates | systemd scope/slice |
| | directly | |
| cgroup v2 compatibility | Limited | Full support |
+---------------------------+---------------------+---------------------+
// /etc/docker/daemon.json
// systemd cgroup driver configuration
{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2"
}
5. Kubernetes and cgroups
kubelet cgroup Driver Configuration
# kubelet config (/var/lib/kubelet/config.yaml)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd # Recommended
cgroupsPerQOS: true # Create QoS-based cgroup hierarchy
enforceNodeAllocatable:
- pods
- system-reserved
- kube-reserved
kubeReserved:
cpu: '500m'
memory: '1Gi'
systemReserved:
cpu: '500m'
memory: '1Gi'
Pod QoS Classes and cgroups
Kubernetes assigns one of three QoS classes based on the Pod's requests/limits configuration.
Pod QoS Class Decision Logic:
1. Guaranteed: All containers have requests == limits
- Highest priority, OOM score = -997
2. Burstable: requests < limits (for any container)
- Medium priority, OOM score = 2~999
3. BestEffort: No requests/limits set at all
- Lowest priority, OOM score = 1000
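The decision logic above can be sketched in Python. This mirrors the documented rules only; it is not kubelet's actual code, containers are modeled as plain dicts, and it ignores Kubernetes' defaulting of requests from limits:

```python
def pod_qos_class(containers):
    """Each container: {'requests': {...}, 'limits': {...}} with optional
    'cpu'/'memory' keys. Returns 'Guaranteed', 'Burstable' or 'BestEffort'."""
    any_set = False
    guaranteed = True
    for c in containers:
        req = c.get("requests", {})
        lim = c.get("limits", {})
        if req or lim:
            any_set = True
        for resource in ("cpu", "memory"):
            # Guaranteed requires requests == limits for BOTH cpu and memory,
            # in EVERY container
            if req.get(resource) is None or req.get(resource) != lim.get(resource):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

assert pod_qos_class([{"requests": {"cpu": "500m", "memory": "256Mi"},
                       "limits":   {"cpu": "500m", "memory": "256Mi"}}]) == "Guaranteed"
assert pod_qos_class([{"requests": {"cpu": "250m"},
                       "limits":   {"cpu": "500m"}}]) == "Burstable"
assert pod_qos_class([{}]) == "BestEffort"
```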
# Guaranteed Pod example
apiVersion: v1
kind: Pod
metadata:
name: guaranteed-pod
spec:
containers:
- name: app
image: nginx
resources:
requests:
cpu: '500m'
memory: '256Mi'
limits:
cpu: '500m' # requests == limits
memory: '256Mi' # requests == limits
# Burstable Pod example
apiVersion: v1
kind: Pod
metadata:
name: burstable-pod
spec:
containers:
- name: app
image: nginx
resources:
requests:
cpu: '250m'
memory: '128Mi'
limits:
cpu: '500m' # requests < limits
memory: '256Mi'
Kubernetes cgroup Hierarchy
/sys/fs/cgroup/
└── kubepods.slice/ # All Pods
├── kubepods-burstable.slice/ # Burstable QoS
│ └── kubepods-burstable-podABCD.slice/ # Per-Pod
│ ├── cri-containerd-XXXX.scope # Per-container
│ │ ├── cpu.max # limits.cpu
│ │ ├── cpu.weight # Based on requests.cpu
│ │ ├── memory.max # limits.memory
│ │ ├── memory.min # requests.memory (v2)
│ │ └── pids.max # Pod pid limit
│ └── cri-containerd-YYYY.scope
├── kubepods-besteffort.slice/ # BestEffort QoS
│ └── kubepods-besteffort-podEFGH.slice/
└── kubepods-podIJKL.slice/ # Guaranteed QoS
└── cri-containerd-ZZZZ.scope
How Kubernetes Resource Requests Map to cgroups
Kubernetes -> cgroup v2 Mapping:
resources.requests.cpu: 500m
-> cpu.weight = proportional to (500m / total allocatable CPU)
-> Guaranteed CPU ratio under contention
resources.limits.cpu: 1000m
-> cpu.max = "100000 100000" (100ms/100ms = 1 CPU)
resources.requests.memory: 256Mi
-> memory.min = 268435456 (for Guaranteed QoS in v2)
resources.limits.memory: 512Mi
-> memory.max = 536870912
pid limit (kubelet setting):
-> pids.max = podPidsLimit (default -1, unlimited)
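The mapping above can be reproduced numerically. A sketch under the assumption that kubelet uses a 100ms CFS period and the standard shares-to-weight conversion (helper names are ours):

```python
PERIOD_US = 100_000  # 100ms CFS period, as in the examples above

def cpu_limit_to_cpu_max(limit_milli: int) -> str:
    """limits.cpu (millicores) -> cgroup v2 cpu.max string ('quota period')."""
    quota = limit_milli * PERIOD_US // 1000
    return f"{quota} {PERIOD_US}"

def cpu_request_to_weight(request_milli: int) -> int:
    """requests.cpu -> v1 cpu.shares -> v2 cpu.weight (same conversion formula
    as in section 3.1)."""
    shares = max(2, request_milli * 1024 // 1000)
    return 1 + ((shares - 2) * 9999) // 262142

def mem_mi_to_bytes(mi: int) -> int:
    """Mi -> bytes, for memory.max / memory.min."""
    return mi * 1024 * 1024

assert cpu_limit_to_cpu_max(1000) == "100000 100000"   # 1 CPU
assert cpu_limit_to_cpu_max(500) == "50000 100000"     # 0.5 CPU
assert cpu_request_to_weight(500) == 20                # 500m -> shares 512 -> weight 20
assert mem_mi_to_bytes(512) == 536870912               # limits.memory: 512Mi
assert mem_mi_to_bytes(256) == 268435456               # requests.memory: 256Mi
```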
Node Resource Reservation
Node Resource Distribution:
Total node resources (e.g., 16 CPU, 64Gi Memory)
├── kube-reserved : kubelet, kube-proxy, etc. (0.5 CPU, 1Gi)
├── system-reserved : sshd, journald, etc. (0.5 CPU, 1Gi)
├── eviction-threshold: Reclaim threshold (memory.available=100Mi)
└── allocatable : Resources available for Pods (15 CPU, 61.9Gi)
allocatable = total - kube-reserved - system-reserved - eviction-threshold
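Plugging the example node into that formula (units chosen by us: millicores for CPU, MiB for memory):

```python
def allocatable(total, kube_reserved, system_reserved, eviction):
    """allocatable = total - kube-reserved - system-reserved - eviction-threshold.
    All arguments must use the same unit."""
    return total - kube_reserved - system_reserved - eviction

# The 16 CPU / 64Gi node from the example above
cpu = allocatable(16_000, 500, 500, 0)          # no CPU eviction threshold
mem = allocatable(64 * 1024, 1024, 1024, 100)   # memory.available=100Mi
assert cpu == 15_000                 # 15 CPU
assert round(mem / 1024, 1) == 61.9  # ~61.9Gi
```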
cgroup v2 + Kubernetes: MemoryQoS
Kubernetes 1.22+ MemoryQoS feature (alpha/beta) leverages cgroup v2's memory.high.
MemoryQoS Behavior:
Burstable Pod (requests: 256Mi, limits: 512Mi)
cgroup v2 settings:
memory.min = 0 (no absolute protection configured)
memory.low = 0 (default)
memory.high = 268435456 (based on requests, throttle point)
memory.max = 536870912 (based on limits, OOM point)
When memory usage exceeds requests (256Mi):
-> Throttled by memory.high (allocation speed slows)
-> NOT immediately OOM killed
-> OOM kill when exceeding limits (512Mi)
6. Real-World Troubleshooting
6.1 Checking CPU Throttling
# Check CPU throttling for a container
# Find the Pod cgroup path (PID = host PID of the container's main process)
CGROUP_PATH=$(grep -oP '(?<=::).*' /proc/PID/cgroup)
# Check cpu.stat
cat /sys/fs/cgroup${CGROUP_PATH}/cpu.stat
# usage_usec 45000000
# user_usec 40000000
# system_usec 5000000
# nr_periods 1000
# nr_throttled 350 <- 35% throttle ratio!
# throttled_usec 15000000
# Calculate throttle ratio
# throttle_ratio = nr_throttled / nr_periods
# 350 / 1000 = 35% -> High! Consider increasing limits
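The same ratio calculation, scripted against the cpu.stat format (a small sketch; the helper name is ours):

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Compute nr_throttled / nr_periods from cgroup v2 cpu.stat content."""
    stats = dict(line.split() for line in cpu_stat_text.strip().splitlines())
    periods = int(stats["nr_periods"])
    return int(stats["nr_throttled"]) / periods if periods else 0.0

# Sample matching the output shown above
sample = """\
usage_usec 45000000
user_usec 40000000
system_usec 5000000
nr_periods 1000
nr_throttled 350
throttled_usec 15000000
"""
assert throttle_ratio(sample) == 0.35   # 35% -> high, consider raising limits
```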
CPU Throttling Resolution Strategy
CPU Throttling Diagnostic Flow:
1. Is nr_throttled / nr_periods > 5%?
├── Yes -> Proceed to step 2
└── No -> CPU throttling is not the issue
2. Is average CPU usage close to limits?
├── Yes -> Need to increase limits
└── No -> Burst pattern issue -> Proceed to step 3
3. Are there short CPU bursts?
├── Yes -> Set cpu.max.burst (kernel 5.14+)
│ or consider removing CPU limits
└── No -> Consider period adjustment
6.2 OOM Killer Troubleshooting
# 1. Check OOM events in dmesg
dmesg | grep -i "killed process"
# [12345.678] Killed process 1234 (java) total-vm:4096000kB,
# anon-rss:524288kB, file-rss:8192kB, shmem-rss:0kB,
# UID:1000 pgrp:1234
# 2. Check cgroup memory events
cat /sys/fs/cgroup/kubepods.slice/.../memory.events
# low 0
# high 500 # memory.high exceeded count
# max 10 # memory.max exceeded count
# oom 2 # OOM event count
# oom_kill 2 # OOM kill count
# 3. Current memory usage vs limit
cat /sys/fs/cgroup/kubepods.slice/.../memory.current
# 524288000 (500MB)
cat /sys/fs/cgroup/kubepods.slice/.../memory.max
# 536870912 (512MB) <- Nearly reached!
# 4. Detailed memory usage analysis
cat /sys/fs/cgroup/kubepods.slice/.../memory.stat | head -10
OOM Resolution Strategy
OOM Kill Diagnostic Flow:
1. Verify oom_kill > 0 in memory.events
├── oom_kill increasing -> Proceed to step 2
└── oom_kill = 0 -> Not an OOM issue
2. Compare memory.current vs memory.max
├── current >= max * 0.9 -> Memory shortage
│ ├── High anon -> Possible heap memory leak -> Profile
│ ├── High file -> Excessive file cache -> May be normal
│ └── High shmem -> Check shared memory usage
└── current << max -> Check sudden allocation patterns
3. Solutions:
a) Increase memory limits
b) Fix application memory leaks
c) JVM: Set -XX:MaxRAMPercentage=75
d) Set memory.high for soft throttling
6.3 Checking Process cgroup Membership
# Check cgroup for a specific process
cat /proc/PID/cgroup
# v2 output: 0::/kubepods.slice/kubepods-burstable.slice/...
# v1 output:
# 12:pids:/docker/abc123
# 11:memory:/docker/abc123
# 10:cpu,cpuacct:/docker/abc123
# systemd-cgls: Display cgroup tree
systemd-cgls
# Control group /:
# -.slice
# ├─user.slice
# │ └─user-1000.slice
# ├─system.slice
# │ ├─docker.service
# │ └─sshd.service
# └─kubepods.slice
# ├─kubepods-burstable.slice
# └─kubepods-besteffort.slice
6.4 systemd-cgtop: Real-Time cgroup Resource Monitoring
# systemd-cgtop: top-like cgroup monitoring
systemd-cgtop
# Control Group Tasks %CPU Memory Input/s Output/s
# / 235 15.2 3.5G - -
# /system.slice 45 5.1 1.2G - -
# /kubepods.slice 120 8.3 2.0G - -
# /kubepods.slice/burstable 80 6.1 1.5G - -
6.5 cAdvisor and Prometheus Metrics
Key cAdvisor Prometheus Metrics:
CPU:
container_cpu_usage_seconds_total # Cumulative CPU time
container_cpu_cfs_throttled_seconds_total # Cumulative throttle time
container_cpu_cfs_periods_total # Total CFS periods
container_cpu_cfs_throttled_periods_total # Throttled periods
Memory:
container_memory_usage_bytes # Current memory usage
container_memory_working_set_bytes # Working set (OOM criterion)
container_memory_rss # RSS (actual physical memory)
container_memory_cache # Page cache
OOM:
container_oom_events_total # OOM event count
kube_pod_container_status_last_terminated_reason # OOMKilled, etc.
Useful PromQL Queries
# CPU throttle ratio (5-minute average)
rate(container_cpu_cfs_throttled_periods_total[5m])
/ rate(container_cpu_cfs_periods_total[5m])
# Memory utilization (relative to limits)
container_memory_working_set_bytes
/ container_spec_memory_limit_bytes
# Pods with OOM Kill events
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
7. cgroups and Security
The Three Pillars of Container Isolation
Container Security Isolation Layers:
+--------------------------------------------------+
| Host Kernel |
| |
| ┌──────────┐ ┌──────────┐ ┌──────────┐ |
| │ Namespace│ │ cgroup │ │ Seccomp/ │ |
| │(Isolate) │ │ (Limit) │ │ AppArmor │ |
| │ │ │ │ │(Security)│ |
| │ - PID │ │ - CPU │ │ - syscall│ |
| │ - NET │ │ - Memory │ │ filter │ |
| │ - MNT │ │ - IO │ │ - MAC │ |
| │ - USER │ │ - PIDs │ │ policy │ |
| └──────────┘ └──────────┘ └──────────┘ |
| |
+--------------------------------------------------+
All three must be in place for secure isolation!
CVE-2022-0492: cgroup Escape Vulnerability
CVE-2022-0492 is a container escape vulnerability exploiting the cgroup v1 release_agent mechanism.
Attack Principle:
1. release_agent runs a program when the last process in a cgroup exits
2. Vulnerability: Missing permission check in cgroup_release_agent_write()
3. Attacker uses unshare() to create new user/cgroup namespaces
4. Mounts a writable cgroupfs
5. Sets release_agent to a binary to execute on the host
6. Achieves arbitrary code execution with host privileges
Defense:
- Seccomp + AppArmor/SELinux blocks the attack
- Block unshare() syscall, or
- Restrict cgroupfs mounting
Rootless Containers and cgroup v2 Delegation
# For rootless containers to work with cgroup v2,
# you must delegate a cgroup subtree to the user
# Configure delegation in systemd
sudo systemctl edit user@1000.service
# [Service]
# Delegate=cpu memory pids io
# Or configure per-user
mkdir -p /etc/systemd/system/user@.service.d/
cat > /etc/systemd/system/user@.service.d/delegate.conf << 'CONF'
[Service]
Delegate=cpu cpuset io memory pids
CONF
systemctl daemon-reload
Kubernetes Pod Security Standards and cgroups
# Pod Security Standard: restricted
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
spec:
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: nginx
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ['ALL']
readOnlyRootFilesystem: true
resources:
limits: # cgroup resource limits are mandatory
cpu: '500m'
memory: '256Mi'
requests:
cpu: '250m'
memory: '128Mi'
8. Summary
cgroup v1 to v2 Migration Checklist
- Check kernel version: 5.8+ recommended (PSI, CPU burst, etc.)
- Check systemd version: 236+
- Check container runtime: Docker 20.10+, containerd 1.4+
- Check Kubernetes version: 1.25+ (GA)
- Configure GRUB boot parameters
- Change kubelet cgroup driver to systemd
- Verify monitoring tool compatibility (cAdvisor, Prometheus)
- Update custom cgroup scripts to v2 interface
- Thoroughly test in staging before production deployment
Recommended Production Settings
# Kubernetes node recommended settings
# kubelet config
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
cgroupsPerQOS: true
podPidsLimit: 1024 # Fork bomb prevention
enforceNodeAllocatable:
- pods
- system-reserved
- kube-reserved
kubeReserved:
cpu: '500m'
memory: '1Gi'
ephemeral-storage: '1Gi'
systemReserved:
cpu: '500m'
memory: '1Gi'
ephemeral-storage: '1Gi'
evictionHard:
memory.available: '100Mi'
nodefs.available: '10%'
imagefs.available: '15%'
Essential Command Cheat Sheet
# === Information ===
cat /proc/self/cgroup # Current process cgroup
stat -fc %T /sys/fs/cgroup/ # v1(tmpfs) vs v2(cgroup2fs)
cat /sys/fs/cgroup/cgroup.controllers # Available controllers
systemd-cgls # Display cgroup tree
systemd-cgtop # Real-time cgroup monitoring
# === CPU ===
cat /sys/fs/cgroup/PATH/cpu.max # CPU quota/period
cat /sys/fs/cgroup/PATH/cpu.weight # CPU weight
cat /sys/fs/cgroup/PATH/cpu.stat # Throttle statistics
# === Memory ===
cat /sys/fs/cgroup/PATH/memory.max # Hard limit
cat /sys/fs/cgroup/PATH/memory.current # Current usage
cat /sys/fs/cgroup/PATH/memory.stat # Detailed statistics
cat /sys/fs/cgroup/PATH/memory.events # OOM events
# === IO ===
cat /sys/fs/cgroup/PATH/io.max # IO limits
cat /sys/fs/cgroup/PATH/io.stat # IO statistics
# === PID ===
cat /sys/fs/cgroup/PATH/pids.max # PID limit
cat /sys/fs/cgroup/PATH/pids.current # Current PID count
# === Docker ===
docker stats # Real-time resource monitoring
docker inspect CONTAINER | grep -i cgroup # cgroup config check
# === Troubleshooting ===
dmesg | grep -i "killed process" # OOM kill logs
cat /proc/PID/cgroup # Process cgroup check
cat /proc/PID/oom_score # OOM score check
Quiz: Test Your cgroup Knowledge
Q1. What is the biggest structural difference between cgroup v1 and v2?
A: v1 has independent hierarchies per controller, while v2 uses a single unified hierarchy. In v2, a process belongs to the same cgroup across all controllers.
Q2. What conditions must be met for a Kubernetes Pod to get "Guaranteed" QoS class?
A: All containers must have CPU and memory requests set equal to their limits.
Q3. What is the difference between memory.high and memory.max in cgroup v2?
A: memory.high is a soft limit that throttles the process when exceeded. memory.max is a hard limit that triggers the OOM Killer when exceeded.
Q4. What does a 35% CPU throttle ratio mean?
A: It means that in 35% of all CFS periods, the CPU quota was fully consumed and the process could not run. This indicates you should consider increasing CPU limits or configuring CPU burst.
Q5. Why is the systemd cgroup driver recommended over cgroupfs in Docker?
A: When systemd is the init system and cgroupfs is used, two cgroup managers operate simultaneously, which can make the system unstable under resource pressure. Using the systemd driver unifies cgroup management under a single manager.
Q6. What cgroup mechanism does CVE-2022-0492 exploit?
A: It exploits the release_agent mechanism in cgroup v1. The release_agent executes a program when the last process in a cgroup exits. A missing permission check allowed attackers to execute arbitrary code with host privileges from within a container.
References
- Linux Kernel Documentation: Control Group v2
- Kubernetes Documentation: About cgroup v2
- Red Hat: Migrating from CGroups V1 to V2
- CFS Bandwidth Control - Linux Kernel Documentation