[Linux] Complete Guide to cgroups: The Core of Container Resource Control

Table of Contents

  1. What are cgroups?
  2. cgroup v1 vs v2
  3. Deep Dive into Major Controllers
  4. Docker and cgroups
  5. Kubernetes and cgroups
  6. Real-World Troubleshooting
  7. cgroups and Security
  8. Summary

1. What are cgroups?

The Birth of Control Groups

cgroups (Control Groups) are a Linux kernel mechanism for limiting, accounting for, and isolating resource usage at the process-group level. Google engineers Paul Menage and Rohit Seth began development in 2006 under the name "process containers." In 2007 the feature was renamed to cgroups, to avoid confusion with the already overloaded kernel term "container," and it was merged into Linux kernel 2.6.24 (released January 2008).

Problems cgroups Solve

Before cgroups, it was difficult to prevent a single process from monopolizing all CPU or memory resources on a system. cgroups provide the following capabilities:

  • Resource Limiting: Set upper bounds on CPU, memory, IO, and network bandwidth usage
  • Prioritization: Adjust resource allocation ratios between process groups
  • Accounting: Measure and report resource usage per group
  • Control: Suspend, resume, and checkpoint process groups

The Relationship Between Namespaces and cgroups

The two pillars of container technology are Namespaces and cgroups.

+-------------------------------------------+
|      Linux Container Isolation Model      |
+-------------------------------------------+
|                                           |
|  Namespace (Isolation - Visibility)       |
|  ┌─────────────────────────────────────┐  |
|  │ PID NS  : Process ID isolation      │  |
|  │ NET NS  : Network stack isolation   │  |
|  │ MNT NS  : Filesystem mount isolation│  |
|  │ UTS NS  : Hostname isolation        │  |
|  │ IPC NS  : IPC resource isolation    │  |
|  │ USER NS : UID/GID mapping isolation │  |
|  └─────────────────────────────────────┘  |
|                                           |
|  cgroup (Limits - Resource Usage)         |
|  ┌─────────────────────────────────────┐  |
|  │ CPU     : CPU time limits           │  |
|  │ Memory  : Memory usage limits       │  |
|  │ IO      : Block device IO limits    │  |
|  │ PIDs    : Process count limits     │  |
|  │ Devices : Device access control     │  |
|  └─────────────────────────────────────┘  |
|                                           |
+-------------------------------------------+

The key difference in a nutshell:

  • Namespace = "What can you see?" (isolation)
  • cgroup = "How much can you use?" (limits)

cgroup Filesystem Basics

cgroups are managed through a virtual filesystem (VFS). All configuration is done through file reads and writes.

# Check cgroup mounts
mount | grep cgroup

# When cgroup v2 is mounted
# cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

# Check current process cgroup
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/session-1.scope

2. cgroup v1 vs v2

cgroup v1 Structure

In cgroup v1, each controller (CPU, memory, blkio, etc.) has an independent hierarchy.

cgroup v1 structure:

/sys/fs/cgroup/
├── cpu/                    # CPU controller
│   ├── docker/
│   │   ├── container-abc/
│   │   │   ├── cpu.cfs_quota_us
│   │   │   ├── cpu.cfs_period_us
│   │   │   └── cpu.shares
│   │   └── container-xyz/
│   └── tasks
├── memory/                 # Memory controller
│   ├── docker/
│   │   ├── container-abc/
│   │   │   ├── memory.limit_in_bytes
│   │   │   └── memory.usage_in_bytes
│   │   └── container-xyz/
│   └── tasks
├── blkio/                  # Block IO controller
├── pids/                   # PID controller
├── devices/                # Device controller
└── freezer/                # Freezer controller

Key v1 characteristics:

  • Each controller has its own separate directory tree
  • A single process can belong to different groups in each controller
  • Flexible but complex to manage

cgroup v2 Structure (Unified Hierarchy)

cgroup v2 uses a single unified hierarchy.

cgroup v2 structure (Unified):

/sys/fs/cgroup/                     # Root cgroup
├── cgroup.controllers              # Available controllers list
├── cgroup.subtree_control          # Controllers to enable for children
├── system.slice/                   # systemd system services
│   ├── docker.service/
│   └── sshd.service/
├── user.slice/                     # User sessions
│   └── user-1000.slice/
└── kubepods.slice/                 # Kubernetes pods
    ├── kubepods-burstable.slice/
    └── kubepods-besteffort.slice/

v1 vs v2 Comparison Table

Feature                   | cgroup v1                                         | cgroup v2
--------------------------+---------------------------------------------------+--------------------------------
Hierarchy                 | Independent per controller                        | Single unified
Mount points              | /sys/fs/cgroup/cpu/, /sys/fs/cgroup/memory/, etc. | /sys/fs/cgroup/ only
Process placement         | Different group per controller possible           | Same group for all controllers
PSI (Pressure Stall Info) | Not supported                                     | Supported
Threaded mode             | Not supported                                     | Supported
Subtree delegation        | Limited                                           | Full delegation model
Memory QoS                | memory.limit only                                 | 4-tier: memory.min/low/high/max
IO control                | blkio (Direct IO only)                            | io (Buffered IO supported)
CPU burst                 | Not supported (needs kernel patch)                | Supported (kernel 5.14+)

v2 Migration Compatibility

Software   | cgroup v2 Support Since
-----------+------------------------
systemd    | 236+
Docker     | 20.10+
containerd | 1.4+
Kubernetes | 1.25+ (GA)
Podman     | Native support
RHEL       | 9+ (default)
Ubuntu     | 21.10+ (default)

Checking cgroup v2 Activation

# Check if cgroup v2 is in use
stat -fc %T /sys/fs/cgroup/
# cgroup2fs  -> v2
# tmpfs      -> v1

# Or check mount info
grep cgroup /proc/mounts

# Force v2 via kernel boot parameters
# GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"

3. Deep Dive into Major Controllers

3.1 CPU Controller

CFS Bandwidth Control

The Linux CFS (Completely Fair Scheduler) distributes CPU time fairly. The cgroup CPU controller adds bandwidth throttling on top of CFS.

CFS Bandwidth Control Principle:

Period = 100ms (default)
Quota  = 50ms  (allocation)

 0ms        50ms       100ms       150ms       200ms
  |----------|----------|----------|----------|
  [##########]...........[##########]...........
  ^- running -^ ^throttled^ ^- running -^ ^throttled^

  For 1 CPU: quota/period = 50ms/100ms = 0.5 CPU (50%)
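As a quick sanity check of the quota/period arithmetic, here is a minimal Python sketch (my illustration, not from a kernel tool) that turns a cgroup v2 `cpu.max` string into an effective CPU count:

```python
def effective_cpus(cpu_max: str):
    """Parse a cgroup v2 cpu.max value: "<quota_us> <period_us>",
    where a quota of "max" means unlimited (returns None)."""
    quota, period = cpu_max.split()
    if quota == "max":
        return None  # no bandwidth limit
    return int(quota) / int(period)

print(effective_cpus("50000 100000"))  # 0.5 CPU, as in the diagram above
```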

v1 vs v2 CPU Parameters

# ===== cgroup v1 =====
# CFS quota: in microseconds, -1 means unlimited
cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.cfs_quota_us
# 50000 (50ms)

cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.cfs_period_us
# 100000 (100ms)

# CPU shares: relative weight (default 1024)
cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.shares
# 512

# ===== cgroup v2 =====
# cpu.max: "quota period" format
cat /sys/fs/cgroup/kubepods.slice/.../cpu.max
# 50000 100000 (50ms quota, 100ms period)
# "max 100000" -> unlimited

# cpu.weight: 1-10000 (default 100)
cat /sys/fs/cgroup/kubepods.slice/.../cpu.weight
# 100

CPU shares/weight Conversion

v1 cpu.shares -> v2 cpu.weight conversion:

v2_weight = (1 + ((v1_shares - 2) * 9999) / 262142)

Examples:
  v1 shares=1024 (default) -> v2 weight ~ 39
  v1 shares=512            -> v2 weight ~ 20
  v1 shares=2048           -> v2 weight ~ 79
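The same conversion in code: a minimal Python sketch of the integer arithmetic above (this mirrors the mapping used by runc/kubelet, with integer division):

```python
def shares_to_weight(shares: int) -> int:
    # Map v1 cpu.shares range [2, 262144] onto v2 cpu.weight range [1, 10000]
    # using the formula above, with integer (floor) division.
    return 1 + ((shares - 2) * 9999) // 262142

print(shares_to_weight(1024))  # 39 (the v1 default lands near v2's low end)
```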

cpuset: CPU Pinning

Pin processes to specific CPU cores.

# v2: Specify CPU cores to use
echo "0-3" > /sys/fs/cgroup/mygroup/cpuset.cpus
echo "0" > /sys/fs/cgroup/mygroup/cpuset.mems

# Check current settings
cat /sys/fs/cgroup/mygroup/cpuset.cpus.effective

Hands-on: Creating a CPU-Limited cgroup

# Hands-on in a cgroup v2 environment

# 1. Create a new cgroup
sudo mkdir /sys/fs/cgroup/cpu-test

# 2. Enable CPU controller (from root)
echo "+cpu" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# 3. Limit to 50% CPU (50ms quota / 100ms period)
echo "50000 100000" | sudo tee /sys/fs/cgroup/cpu-test/cpu.max

# 4. Add current shell to cgroup
echo $$ | sudo tee /sys/fs/cgroup/cpu-test/cgroup.procs

# 5. Generate CPU load and verify the limit
stress --cpu 1 --timeout 10 &

# 6. Check throttling statistics
cat /sys/fs/cgroup/cpu-test/cpu.stat
# usage_usec 4500000
# user_usec 4500000
# system_usec 0
# nr_periods 100
# nr_throttled 50
# throttled_usec 5000000

# 7. Cleanup
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/cpu-test

CPU Throttling Issues and Burst

CPU throttling can cause severe latency spikes in bursty workloads.

Problem scenario: Web server with 30% average CPU usage (limit: 1 CPU)

Normal:  [###-------][###-------][###-------]  -> Response 10ms
Burst:   [##########][..........][###-------]  -> Response 110ms!
          ^- used all 100ms -^  ^- throttled -^

Solution: CPU burst (kernel 5.14+, cgroup v2)
cpu.max.burst = allowed burst in microseconds
Saves unused quota from previous periods for burst usage

# Set CPU burst (cgroup v2, kernel 5.14+)
echo 20000 > /sys/fs/cgroup/mygroup/cpu.max.burst
# Allow up to 20ms burst

3.2 Memory Controller

v2's 4-Tier Memory Limits

cgroup v2 provides more granular 4-tier memory control compared to v1.

Memory Control 4 Tiers (cgroup v2):

memory.min   : Absolute protection (no reclaim below this)
                ^ Minimum guaranteed zone
memory.low   : Soft protection (best-effort reclaim prevention)
                ^ Preferred zone
memory.high  : Soft limit (throttle on exceed, no OOM)
                ^ Warning zone
memory.max   : Hard limit (OOM Killer triggers on exceed)
                ^ Forbidden zone

  0 MB                                              1024 MB
  |=====[min]==[low]==========[high]====[max]========|
  |<protected>|<prefer>|<normal use>|<throttle>|<OOM>|
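To make the four thresholds concrete, here is an illustrative Python sketch (the function and zone names are mine, not a kernel API) that classifies a usage value against the tiers, using the 64/128/256/512 MB split from the diagram:

```python
def memory_zone(usage, min_, low, high, max_):
    """Which zone of the cgroup v2 memory ladder a usage value falls in."""
    if usage >= max_:
        return "oom"        # memory.max exceeded -> OOM Killer
    if usage >= high:
        return "throttle"   # memory.high exceeded -> reclaim/throttling
    if usage >= low:
        return "normal"
    if usage >= min_:
        return "preferred"  # reclaimed only under pressure (memory.low)
    return "protected"      # below memory.min -> never reclaimed

print(memory_zone(300, 64, 128, 256, 512))  # throttle
```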

v1 vs v2 Memory Parameters

# ===== cgroup v1 =====
# Hard limit
echo 536870912 > memory.limit_in_bytes   # 512MB

# Soft limit (reclaim pressure)
echo 268435456 > memory.soft_limit_in_bytes  # 256MB

# Swap inclusive limit
echo 1073741824 > memory.memsw.limit_in_bytes  # 1GB (mem+swap)

# OOM control
echo 1 > memory.oom_control  # Disable OOM Killer

# ===== cgroup v2 =====
echo 512M > memory.max       # Hard limit
echo 256M > memory.high      # Soft limit (throttle)
echo 128M > memory.low       # Best-effort protection
echo 64M  > memory.min       # Absolute protection
echo 256M > memory.swap.max  # Swap limit

Analyzing memory.stat

cat /sys/fs/cgroup/kubepods.slice/.../memory.stat
# anon 104857600          # Anonymous memory (heap, stack)
# file 52428800           # File cache
# kernel 8388608          # Kernel memory (slab, etc.)
# shmem 0                 # Shared memory
# pgfault 250000          # Page fault count
# pgmajfault 10           # Major page faults
# workingset_refault 500  # Refaults after eviction from working set
# oom_kill 0              # OOM kill count

Key metric interpretation:

  • High anon: Application heap memory usage is large
  • High file: Lots of file cache (usually normal)
  • High pgmajfault: Frequent disk reads (sign of memory pressure)
  • oom_kill > 0: OOM events have occurred
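When scripting these checks, a parse like the following Python sketch works because memory.stat is flat "key value" lines (sample values copied from the output above):

```python
SAMPLE = """\
anon 104857600
file 52428800
kernel 8388608
shmem 0
pgfault 250000
pgmajfault 10
workingset_refault 500
oom_kill 0"""

def parse_memory_stat(text):
    """memory.stat is flat "key value" lines; return them as a dict of ints."""
    return {k: int(v) for k, v in (line.split() for line in text.splitlines())}

stats = parse_memory_stat(SAMPLE)
# e.g. how much of (anon + file) is anonymous memory vs file cache
anon_share = stats["anon"] / (stats["anon"] + stats["file"])
print(f"anon share: {anon_share:.0%}, oom_kill={stats['oom_kill']}")
```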

Hands-on: Memory Limits and OOM Testing

# 1. Create a memory-limited cgroup
sudo mkdir /sys/fs/cgroup/mem-test
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# 2. Set 100MB hard limit
echo 104857600 | sudo tee /sys/fs/cgroup/mem-test/memory.max

# 3. Set 80MB soft limit (throttle on exceed)
echo 83886080 | sudo tee /sys/fs/cgroup/mem-test/memory.high

# 4. Add process to cgroup and allocate memory
echo $$ | sudo tee /sys/fs/cgroup/mem-test/cgroup.procs

# 5. Attempt to allocate over 100MB
python3 -c "
data = []
for i in range(200):
    data.append(bytearray(1024 * 1024))  # 1MB each
    print(f'Allocated {i+1} MB')
"
# -> OOM Killed at around 100MB

# 6. Check OOM events
cat /sys/fs/cgroup/mem-test/memory.events
# oom 1
# oom_kill 1
# high 150

# 7. Check OOM logs in dmesg
dmesg | grep -i "killed process"

Swap Control

# Swap limits in cgroup v2
echo 0 > /sys/fs/cgroup/mygroup/memory.swap.max  # Disable swap
echo max > /sys/fs/cgroup/mygroup/memory.swap.max  # Unlimited swap

# v1 uses combined memory + swap limit
echo 1073741824 > memory.memsw.limit_in_bytes  # mem + swap = 1GB

3.3 IO Controller

v1 (blkio) vs v2 (io)

The v1 blkio controller could only control Direct IO. Buffered IO goes through the page cache and was invisible to blkio. The v2 io controller handles Buffered IO as well through the writeback mechanism.

# ===== cgroup v2: io.max =====
# Format: MAJ:MIN rbps=NUM wbps=NUM riops=NUM wiops=NUM

# Check device numbers
lsblk -o NAME,MAJ:MIN
# sda   8:0

# Limit read 10MB/s, write 5MB/s
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/mygroup/io.max

# IOPS limit: read 1000 IOPS, write 500 IOPS
echo "8:0 riops=1000 wiops=500" > /sys/fs/cgroup/mygroup/io.max
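When generating these lines from a script, a small helper (hypothetical, Python) keeps the key=value format straight:

```python
def io_max_line(dev: str, **limits) -> str:
    """Build an io.max line such as "8:0 rbps=10485760 wbps=5242880".
    cgroup v2 accepts the keys rbps, wbps, riops, wiops."""
    allowed = ("rbps", "wbps", "riops", "wiops")
    for key in limits:
        if key not in allowed:
            raise ValueError(f"unknown io.max key: {key}")
    return dev + " " + " ".join(f"{k}={v}" for k, v in limits.items())

print(io_max_line("8:0", rbps=10485760, wbps=5242880))
# 8:0 rbps=10485760 wbps=5242880
```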

io.weight (Relative Weight)

# io.weight: 1-10000 (default 100)
echo "default 200" > /sys/fs/cgroup/mygroup/io.weight

# Set weight for a specific device
echo "8:0 500" > /sys/fs/cgroup/mygroup/io.weight

io.latency (v2 Only)

Set latency targets to guarantee IO response times.

# Set 5ms latency target
echo "8:0 target=5000" > /sys/fs/cgroup/mygroup/io.latency

3.4 PID Controller

Fork Bomb Prevention

# Set PID limit
echo 100 > /sys/fs/cgroup/mygroup/pids.max

# Check current PID count
cat /sys/fs/cgroup/mygroup/pids.current
# 5

# Fork bomb test (safe environments only!)
# Without limits, the entire system can freeze
# With pids.max set, fork() fails at 100 processes

Fork Bomb Prevention Principle:

pids.max = 100

Process Tree:
init(1)
├── bash(2)
│   ├── worker(3)
│   ├── worker(4)
│   │   ├── child(5)
│   │   └── child(6)
│   ...
│   └── worker(100)    <- pids.max reached
│       └── fork() -> EAGAIN (fails!)

3.5 Other Controllers

devices Controller

Control device access via whitelist/blacklist.

# v1: devices.allow / devices.deny
echo 'c 1:3 rmw' > devices.allow   # Allow /dev/null access
echo 'b 8:0 r' > devices.allow     # Allow /dev/sda read
echo 'a' > devices.deny            # Deny all devices

# GPU access control (NVIDIA)
echo 'c 195:* rmw' > devices.allow  # Allow NVIDIA devices

freezer Controller

Freeze and thaw process groups.

# v2: cgroup.freeze
echo 1 > /sys/fs/cgroup/mygroup/cgroup.freeze  # Freeze
echo 0 > /sys/fs/cgroup/mygroup/cgroup.freeze  # Thaw

# Check status
cat /sys/fs/cgroup/mygroup/cgroup.events
# frozen 1

hugetlb Controller

Limit huge page usage.

# 2MB huge pages limit
echo 1073741824 > hugetlb.2MB.limit_in_bytes  # 1GB
cat hugetlb.2MB.usage_in_bytes

4. Docker and cgroups

Docker Resource Limit Options

# CPU limits
#   --cpus="1.5"          -> 1.5 CPUs (quota=150000, period=100000)
#   --cpu-shares=512      -> CPU shares (relative weight)
#   --cpuset-cpus="0,1"   -> use CPU cores 0 and 1 only
docker run -d \
  --cpus="1.5" \
  --cpu-shares=512 \
  --cpuset-cpus="0,1" \
  nginx

# Memory limits
#   --memory=512m               -> hard memory limit 512MB
#   --memory-swap=1g            -> memory+swap total 1GB (i.e. 512MB swap)
#   --memory-reservation=256m   -> soft limit
#   --oom-kill-disable          -> disable OOM Killer (use with caution!)
docker run -d \
  --memory=512m \
  --memory-swap=1g \
  --memory-reservation=256m \
  --oom-kill-disable \
  nginx

# IO limits
#   --device-read-bps /dev/sda:10mb    -> read 10MB/s
#   --device-write-bps /dev/sda:5mb    -> write 5MB/s
#   --device-read-iops /dev/sda:1000   -> read 1000 IOPS
docker run -d \
  --device-read-bps /dev/sda:10mb \
  --device-write-bps /dev/sda:5mb \
  --device-read-iops /dev/sda:1000 \
  nginx

# PID limit: at most 100 processes
docker run -d \
  --pids-limit=100 \
  nginx
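The `--cpus` flag is syntactic sugar over the CFS quota/period pair; a minimal Python sketch of that conversion (my illustration, not Docker source code):

```python
def cpus_to_cfs(cpus: float, period_us: int = 100000):
    """--cpus=N -> (cpu.cfs_quota_us, cpu.cfs_period_us)."""
    return int(cpus * period_us), period_us

print(cpus_to_cfs(1.5))  # (150000, 100000)
```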

cgroup Paths Created by Docker

# With systemd cgroup driver
# /sys/fs/cgroup/system.slice/docker-CONTAINER_ID.scope/

# Get container ID
CONTAINER_ID=$(docker ps -q --filter name=my-nginx)

# Check cgroup path
docker inspect --format='{{.HostConfig.CgroupParent}}' $CONTAINER_ID

# Directly inspect cgroup files (v2)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.max

Real-Time Monitoring with docker stats

# Resource usage for all containers
docker stats

# CONTAINER ID  NAME    CPU %  MEM USAGE/LIMIT   MEM %  NET I/O       BLOCK I/O
# abc123        nginx   0.50%  50MiB / 512MiB    9.77%  1.2kB/0B      0B/0B
# xyz789        redis   1.20%  30MiB / 256MiB    11.7%  500B/200B     4kB/0B

# Specific container only
docker stats my-nginx --no-stream

Docker cgroup Driver: cgroupfs vs systemd

+---------------------------+---------------------+---------------------+
|                           |    cgroupfs          |     systemd         |
+---------------------------+---------------------+---------------------+
| cgroup management         | Docker manages       | Delegated to        |
|                           | directly             | systemd             |
| Conflict with init system | Possible             | None (integrated)   |
| Kubernetes recommendation | Not recommended      | Recommended         |
|                           |                      | (default)           |
| Config location           | Docker creates       | systemd scope/slice |
|                           | directly             |                     |
| cgroup v2 compatibility   | Limited              | Full support        |
+---------------------------+---------------------+---------------------+

// /etc/docker/daemon.json
// systemd cgroup driver configuration
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}

5. Kubernetes and cgroups

kubelet cgroup Driver Configuration

# kubelet config (/var/lib/kubelet/config.yaml)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd # Recommended
cgroupsPerQOS: true # Create QoS-based cgroup hierarchy
enforceNodeAllocatable:
  - pods
  - system-reserved
  - kube-reserved
kubeReserved:
  cpu: '500m'
  memory: '1Gi'
systemReserved:
  cpu: '500m'
  memory: '1Gi'

Pod QoS Classes and cgroups

Kubernetes assigns one of three QoS classes based on the Pod's requests/limits configuration.

Pod QoS Class Decision Logic:

1. Guaranteed: All containers have requests == limits
   - Highest priority, OOM score = -997

2. Burstable: requests < limits (for any container)
   - Medium priority, OOM score = 2~999

3. BestEffort: No requests/limits set at all
   - Lowest priority, OOM score = 1000
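The decision logic above can be sketched in Python (an illustration of the kubelet's rules, not its actual code):

```python
def qos_class(containers):
    """containers: list of {"requests": {...}, "limits": {...}} dicts,
    each mapping "cpu"/"memory" to a quantity string like "500m"."""
    pairs = [(c.get("requests", {}), c.get("limits", {})) for c in containers]
    if all(not req and not lim for req, lim in pairs):
        return "BestEffort"          # nothing set anywhere
    if all(req.get(r) is not None and req.get(r) == lim.get(r)
           for req, lim in pairs for r in ("cpu", "memory")):
        return "Guaranteed"          # requests == limits for cpu and memory
    return "Burstable"               # anything in between

print(qos_class([{"requests": {"cpu": "500m", "memory": "256Mi"},
                  "limits":   {"cpu": "500m", "memory": "256Mi"}}]))  # Guaranteed
```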

# Guaranteed Pod example
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: '500m'
          memory: '256Mi'
        limits:
          cpu: '500m' # requests == limits
          memory: '256Mi' # requests == limits

# Burstable Pod example
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: '250m'
          memory: '128Mi'
        limits:
          cpu: '500m' # requests < limits
          memory: '256Mi'

Kubernetes cgroup Hierarchy

/sys/fs/cgroup/
└── kubepods.slice/                              # All Pods
    ├── kubepods-burstable.slice/                # Burstable QoS
    │   └── kubepods-burstable-podABCD.slice/    # Per-Pod
    │       ├── cri-containerd-XXXX.scope        # Per-container
    │       │   ├── cpu.max                      # limits.cpu
    │       │   ├── cpu.weight                   # Based on requests.cpu
    │       │   ├── memory.max                   # limits.memory
    │       │   ├── memory.min                   # requests.memory (v2)
    │       │   └── pids.max                     # Pod pid limit
    │       └── cri-containerd-YYYY.scope
    ├── kubepods-besteffort.slice/               # BestEffort QoS
    │   └── kubepods-besteffort-podEFGH.slice/
    └── kubepods-podIJKL.slice/                  # Guaranteed QoS
        └── cri-containerd-ZZZZ.scope

How Kubernetes Resource Requests Map to cgroups

Kubernetes -> cgroup v2 Mapping:

resources.requests.cpu: 500m
  -> cpu.weight = proportional to (500m / total allocatable CPU)
  -> Guaranteed CPU ratio under contention

resources.limits.cpu: 1000m
  -> cpu.max = "100000 100000" (100ms/100ms = 1 CPU)

resources.requests.memory: 256Mi
  -> memory.min = 268435456 (for Guaranteed QoS in v2)

resources.limits.memory: 512Mi
  -> memory.max = 536870912

pid limit (kubelet setting):
  -> pids.max = podPidsLimit (default -1, unlimited)
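The CPU-limit side of this mapping as a Python sketch (illustrative; the real conversion lives in the kubelet/runc):

```python
def cpu_max_value(limit_millicores, period_us=100000):
    """limits.cpu (in millicores) -> cgroup v2 cpu.max string."""
    if limit_millicores is None:
        return f"max {period_us}"           # no CPU limit set
    quota = limit_millicores * period_us // 1000
    return f"{quota} {period_us}"

print(cpu_max_value(1000))  # "100000 100000" -> exactly 1 CPU
```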

Node Resource Reservation

Node Resource Distribution:

Total node resources (e.g., 16 CPU, 64Gi Memory)
├── kube-reserved    : kubelet, kube-proxy, etc. (0.5 CPU, 1Gi)
├── system-reserved  : sshd, journald, etc. (0.5 CPU, 1Gi)
├── eviction-threshold: Reclaim threshold (memory.available=100Mi)
└── allocatable      : Resources available for Pods (15 CPU, 61.9Gi)

allocatable = total - kube-reserved - system-reserved - eviction-threshold
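The same arithmetic, spelled out in Python for the example node (16 CPU / 64Gi, working in MiB):

```python
GI = 1024  # MiB per Gi

total_cpu, total_mem = 16.0, 64 * GI
kube_cpu, kube_mem = 0.5, 1 * GI       # kube-reserved
sys_cpu, sys_mem = 0.5, 1 * GI         # system-reserved
eviction_mem = 100                     # memory.available=100Mi threshold

alloc_cpu = total_cpu - kube_cpu - sys_cpu
alloc_mem = total_mem - kube_mem - sys_mem - eviction_mem
print(alloc_cpu, alloc_mem / GI)       # 15.0 CPUs, ~61.9 Gi
```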

cgroup v2 + Kubernetes: MemoryQoS

Kubernetes 1.22+ MemoryQoS feature (alpha/beta) leverages cgroup v2's memory.high.

MemoryQoS Behavior:

Burstable Pod (requests: 256Mi, limits: 512Mi)

cgroup v2 settings:
  memory.min  = 0       (No BestEffort protection)
  memory.low  = 0       (default)
  memory.high = 268435456 (based on requests, throttle point)
  memory.max  = 536870912 (based on limits, OOM point)

When memory usage exceeds requests (256Mi):
  -> Throttled by memory.high (allocation speed slows)
  -> NOT immediately OOM killed
  -> OOM kill when exceeding limits (512Mi)

6. Real-World Troubleshooting

6.1 Checking CPU Throttling

# Check CPU throttling for a container
# Find the Pod cgroup path
CGROUP_PATH=$(cat /proc/1/cgroup | grep -oP '(?<=::).*')

# Check cpu.stat
cat /sys/fs/cgroup${CGROUP_PATH}/cpu.stat
# usage_usec 45000000
# user_usec 40000000
# system_usec 5000000
# nr_periods 1000
# nr_throttled 350       <- 35% throttle ratio!
# throttled_usec 15000000

# Calculate throttle ratio
# throttle_ratio = nr_throttled / nr_periods
# 350 / 1000 = 35% -> High! Consider increasing limits
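Scripted, the ratio calculation might look like this Python sketch (sample values copied from the cpu.stat output above):

```python
SAMPLE_CPU_STAT = """\
usage_usec 45000000
user_usec 40000000
system_usec 5000000
nr_periods 1000
nr_throttled 350
throttled_usec 15000000"""

def throttle_ratio(cpu_stat_text):
    """Fraction of CFS periods in which the group was throttled."""
    stats = {k: int(v) for k, v in
             (line.split() for line in cpu_stat_text.splitlines())}
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0

print(f"{throttle_ratio(SAMPLE_CPU_STAT):.0%}")  # 35%
```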

CPU Throttling Resolution Strategy

CPU Throttling Diagnostic Flow:

1. Is nr_throttled / nr_periods > 5%?
   ├── Yes -> Proceed to step 2
   └── No  -> CPU throttling is not the issue

2. Is average CPU usage close to limits?
   ├── Yes -> Need to increase limits
   └── No  -> Burst pattern issue -> Proceed to step 3

3. Are there short CPU bursts?
   ├── Yes -> Set cpu.max.burst (kernel 5.14+)
   │         or consider removing CPU limits
   └── No  -> Consider period adjustment

6.2 OOM Killer Troubleshooting

# 1. Check OOM events in dmesg
dmesg | grep -i "killed process"
# [12345.678] Killed process 1234 (java) total-vm:4096000kB,
#   anon-rss:524288kB, file-rss:8192kB, shmem-rss:0kB,
#   UID:1000 pgrp:1234

# 2. Check cgroup memory events
cat /sys/fs/cgroup/kubepods.slice/.../memory.events
# low 0
# high 500       # memory.high exceeded count
# max 10         # memory.max exceeded count
# oom 2          # OOM event count
# oom_kill 2     # OOM kill count

# 3. Current memory usage vs limit
cat /sys/fs/cgroup/kubepods.slice/.../memory.current
# 524288000  (500MB)
cat /sys/fs/cgroup/kubepods.slice/.../memory.max
# 536870912  (512MB)  <- Nearly reached!

# 4. Detailed memory usage analysis
cat /sys/fs/cgroup/kubepods.slice/.../memory.stat | head -10

OOM Resolution Strategy

OOM Kill Diagnostic Flow:

1. Verify oom_kill > 0 in memory.events
   ├── oom_kill increasing -> Proceed to step 2
   └── oom_kill = 0       -> Not an OOM issue

2. Compare memory.current vs memory.max
   ├── current >= max * 0.9 -> Memory shortage
   │   ├── High anon   -> Possible heap memory leak -> Profile
   │   ├── High file   -> Excessive file cache -> May be normal
   │   └── High shmem  -> Check shared memory usage
   └── current << max  -> Check sudden allocation patterns

3. Solutions:
   a) Increase memory limits
   b) Fix application memory leaks
   c) JVM: Set -XX:MaxRAMPercentage=75
   d) Set memory.high for soft throttling

6.3 Checking Process cgroup Membership

# Check cgroup for a specific process
cat /proc/PID/cgroup
# v2 output: 0::/kubepods.slice/kubepods-burstable.slice/...
# v1 output:
# 12:pids:/docker/abc123
# 11:memory:/docker/abc123
# 10:cpu,cpuacct:/docker/abc123

# systemd-cgls: Display cgroup tree
systemd-cgls
# Control group /:
# -.slice
# ├─user.slice
# │ └─user-1000.slice
# ├─system.slice
# │ ├─docker.service
# │ └─sshd.service
# └─kubepods.slice
#   ├─kubepods-burstable.slice
#   └─kubepods-besteffort.slice

6.4 systemd-cgtop: Real-Time cgroup Resource Monitoring

# systemd-cgtop: top-like cgroup monitoring
systemd-cgtop

# Control Group              Tasks   %CPU   Memory  Input/s Output/s
# /                             235   15.2     3.5G        -        -
# /system.slice                  45    5.1     1.2G        -        -
# /kubepods.slice               120    8.3     2.0G        -        -
# /kubepods.slice/burstable      80    6.1     1.5G        -        -

6.5 cAdvisor and Prometheus Metrics

Key cAdvisor Prometheus Metrics:

CPU:
  container_cpu_usage_seconds_total      # Cumulative CPU time
  container_cpu_cfs_throttled_seconds_total  # Cumulative throttle time
  container_cpu_cfs_periods_total        # Total CFS periods
  container_cpu_cfs_throttled_periods_total  # Throttled periods

Memory:
  container_memory_usage_bytes           # Current memory usage
  container_memory_working_set_bytes     # Working set (OOM criterion)
  container_memory_rss                   # RSS (actual physical memory)
  container_memory_cache                 # Page cache

OOM:
  container_oom_events_total             # OOM event count
  kube_pod_container_status_last_terminated_reason  # OOMKilled, etc.

Useful PromQL Queries

# CPU throttle ratio (5-minute average)
rate(container_cpu_cfs_throttled_periods_total[5m])
/ rate(container_cpu_cfs_periods_total[5m])

# Memory utilization (relative to limits)
container_memory_working_set_bytes
/ container_spec_memory_limit_bytes

# Pods with OOM Kill events
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

7. cgroups and Security

The Three Pillars of Container Isolation

Container Security Isolation Layers:

+--------------------------------------------------+
|                 Host Kernel                       |
|                                                  |
|  ┌──────────┐ ┌──────────┐ ┌──────────┐         |
|  │ Namespace│ │  cgroup  │ │ Seccomp/ │         |
|  │(Isolate) │ │ (Limit)  │ │ AppArmor │         |
|  │          │ │          │ │(Security)│         |
|  │ - PID    │ │ - CPU    │ │ - syscall│         |
|  │ - NET    │ │ - Memory │ │   filter │         |
|  │ - MNT    │ │ - IO     │ │ - MAC    │         |
|  │ - USER   │ │ - PIDs   │ │   policy │         |
|  └──────────┘ └──────────┘ └──────────┘         |
|                                                  |
+--------------------------------------------------+

All three must be in place for secure isolation!

CVE-2022-0492: cgroup Escape Vulnerability

CVE-2022-0492 is a container escape vulnerability exploiting the cgroup v1 release_agent mechanism.

Attack Principle:

1. release_agent runs a program when the last process in a cgroup exits
2. Vulnerability: Missing permission check in cgroup_release_agent_write()
3. Attacker uses unshare() to create new user/cgroup namespaces
4. Mounts a writable cgroupfs
5. Sets release_agent to a binary to execute on the host
6. Achieves arbitrary code execution with host privileges

Defense:
- Seccomp combined with AppArmor/SELinux blocks the attack
- Block unshare() syscall, or
- Restrict cgroupfs mounting

Rootless Containers and cgroup v2 Delegation

# For rootless containers to work with cgroup v2,
# you must delegate a cgroup subtree to the user

# Configure delegation in systemd
sudo systemctl edit user@1000.service
# [Service]
# Delegate=cpu memory pids io

# Or configure per-user
mkdir -p /etc/systemd/system/user@.service.d/
cat > /etc/systemd/system/user@.service.d/delegate.conf << 'CONF'
[Service]
Delegate=cpu cpuset io memory pids
CONF
systemctl daemon-reload

Kubernetes Pod Security Standards and cgroups

# Pod Security Standard: restricted
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: nginx
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ['ALL']
        readOnlyRootFilesystem: true
      resources:
        limits: # cgroup resource limits are mandatory
          cpu: '500m'
          memory: '256Mi'
        requests:
          cpu: '250m'
          memory: '128Mi'

8. Summary

cgroup v1 to v2 Migration Checklist

  • Check kernel version: 5.8+ recommended (PSI, CPU burst, etc.)
  • Check systemd version: 236+
  • Check container runtime: Docker 20.10+, containerd 1.4+
  • Check Kubernetes version: 1.25+ (GA)
  • Configure GRUB boot parameters
  • Change kubelet cgroup driver to systemd
  • Verify monitoring tool compatibility (cAdvisor, Prometheus)
  • Update custom cgroup scripts to v2 interface
  • Thoroughly test in staging before production deployment

# Kubernetes node recommended settings
# kubelet config
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
cgroupsPerQOS: true
podPidsLimit: 1024 # Fork bomb prevention
enforceNodeAllocatable:
  - pods
  - system-reserved
  - kube-reserved
kubeReserved:
  cpu: '500m'
  memory: '1Gi'
  ephemeral-storage: '1Gi'
systemReserved:
  cpu: '500m'
  memory: '1Gi'
  ephemeral-storage: '1Gi'
evictionHard:
  memory.available: '100Mi'
  nodefs.available: '10%'
  imagefs.available: '15%'

Essential Command Cheat Sheet

# === Information ===
cat /proc/self/cgroup                    # Current process cgroup
stat -fc %T /sys/fs/cgroup/             # v1(tmpfs) vs v2(cgroup2fs)
cat /sys/fs/cgroup/cgroup.controllers   # Available controllers
systemd-cgls                            # Display cgroup tree
systemd-cgtop                           # Real-time cgroup monitoring

# === CPU ===
cat /sys/fs/cgroup/PATH/cpu.max         # CPU quota/period
cat /sys/fs/cgroup/PATH/cpu.weight      # CPU weight
cat /sys/fs/cgroup/PATH/cpu.stat        # Throttle statistics

# === Memory ===
cat /sys/fs/cgroup/PATH/memory.max      # Hard limit
cat /sys/fs/cgroup/PATH/memory.current  # Current usage
cat /sys/fs/cgroup/PATH/memory.stat     # Detailed statistics
cat /sys/fs/cgroup/PATH/memory.events   # OOM events

# === IO ===
cat /sys/fs/cgroup/PATH/io.max          # IO limits
cat /sys/fs/cgroup/PATH/io.stat         # IO statistics

# === PID ===
cat /sys/fs/cgroup/PATH/pids.max        # PID limit
cat /sys/fs/cgroup/PATH/pids.current    # Current PID count

# === Docker ===
docker stats                             # Real-time resource monitoring
docker inspect CONTAINER | grep -i cgroup  # cgroup config check

# === Troubleshooting ===
dmesg | grep -i "killed process"         # OOM kill logs
cat /proc/PID/cgroup                     # Process cgroup check
cat /proc/PID/oom_score                  # OOM score check

Quiz: Test Your cgroup Knowledge

Q1. What is the biggest structural difference between cgroup v1 and v2?

A: v1 has independent hierarchies per controller, while v2 uses a single unified hierarchy. In v2, a process belongs to the same cgroup across all controllers.

Q2. What conditions must be met for a Kubernetes Pod to get "Guaranteed" QoS class?

A: All containers must have CPU and memory requests set equal to their limits.

Q3. What is the difference between memory.high and memory.max in cgroup v2?

A: memory.high is a soft limit that throttles the process when exceeded. memory.max is a hard limit that triggers the OOM Killer when exceeded.

Q4. What does a 35% CPU throttle ratio mean?

A: It means that in 35% of all CFS periods, the CPU quota was fully consumed and the process could not run. This indicates you should consider increasing CPU limits or configuring CPU burst.

Q5. Why is the systemd cgroup driver recommended over cgroupfs in Docker?

A: When systemd is the init system and cgroupfs is used, two cgroup managers operate simultaneously, which can make the system unstable under resource pressure. Using the systemd driver unifies cgroup management under a single manager.

Q6. What cgroup mechanism does CVE-2022-0492 exploit?

A: It exploits the release_agent mechanism in cgroup v1. The release_agent executes a program when the last process in a cgroup exits. A missing permission check allowed attackers to execute arbitrary code with host privileges from within a container.

