Introduction
Kernel parameter tuning is a task that directly impacts server performance and stability. A single incorrect value can trigger OOM kills, drop network connections, or create security vulnerabilities.
This article distinguishes between sysctl (runtime parameters) and boot parameters (kernel command line), covering the meaning, recommended values, and application methods for each, and presents safe change procedures and rollback strategies for production environments.
1. Two Paths for Kernel Parameters
| Category | sysctl (Runtime) | Boot Params (Boot Time) |
|---|---|---|
| When Applied | Immediately (no reboot) | At next boot |
| Config File | /etc/sysctl.d/*.conf | /etc/default/grub -> grub.cfg |
| Check Command | sysctl <param> | cat /proc/cmdline |
| Persist | sysctl.d + sysctl -p | grub2-mkconfig / update-grub |
| Rollback | Restore previous values | Select previous GRUB entry |
| Scope | Items under /proc/sys/ | All kernel boot options |
2. Safe Change Procedures (Production Protocol)
2.1 Pre-Change Checklist
#!/usr/bin/env bash
# pre-tuning-check.sh - Back up state before tuning
BACKUP_DIR="/root/kernel-tuning-backup/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
# 1. Full sysctl dump
sysctl -a > "$BACKUP_DIR/sysctl-before.txt" 2>/dev/null
# 2. Current boot parameters
cat /proc/cmdline > "$BACKUP_DIR/cmdline-before.txt"
# 3. GRUB config backup
cp /etc/default/grub "$BACKUP_DIR/grub-before"
[[ -d /etc/sysctl.d ]] && cp -r /etc/sysctl.d "$BACKUP_DIR/sysctl.d-before"
# 4. System state snapshot
free -h > "$BACKUP_DIR/memory-before.txt"
ss -s > "$BACKUP_DIR/socket-stats-before.txt"
vmstat 1 5 > "$BACKUP_DIR/vmstat-before.txt"
cat /proc/net/sockstat > "$BACKUP_DIR/sockstat-before.txt"
ln -sfn "$BACKUP_DIR" /root/kernel-tuning-backup/latest  # "latest" symlink used by rollback scripts
echo "Backup complete: $BACKUP_DIR"
2.2 Change Steps
1. Test in staging environment
2. Apply to 1 canary server -> Monitor (minimum 24 hours)
3. If no issues, rolling apply by group
4. Compare metrics after application (before vs after)
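The before/after comparison in step 4 can be scripted by diffing sorted sysctl dumps. A minimal sketch, assuming the dump files produced by the backup script in section 2.1 (the function name is illustrative):

```bash
#!/usr/bin/env bash
# compare-sysctl.sh - list parameters whose values changed between two dumps
# Usage: compare-sysctl.sh sysctl-before.txt sysctl-after.txt
compare_dumps() {
    local before="$1" after="$2"
    # Sort first so ordering differences between runs do not show up as
    # changes; keep only the lines that differ
    diff <(sort "$before") <(sort "$after") | grep '^[<>]'
}

if [[ $# -eq 2 ]]; then
    compare_dumps "$1" "$2"
fi
```

Lines prefixed with `<` show the old value and `>` the new one, which makes the diff easy to paste into a change ticket.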
2.3 Rollback Procedure
# sysctl rollback - Restore a specific parameter from backup
# (assumes a "latest" symlink pointing at the most recent backup directory)
PARAM="net.core.somaxconn"
OLD_VALUE=$(grep "^${PARAM} " /root/kernel-tuning-backup/latest/sysctl-before.txt | awk '{print $3}')
sysctl -w "${PARAM}=${OLD_VALUE}"
# Full sysctl rollback (read-only keys fail and are silently skipped)
while IFS='= ' read -r key value; do
    sysctl -w "${key}=${value}" 2>/dev/null
done < /root/kernel-tuning-backup/latest/sysctl-before.txt
# Boot parameter rollback - Restore previous GRUB config
cp /root/kernel-tuning-backup/latest/grub-before /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg # RHEL
# update-grub # Ubuntu
3. Network Tuning
3.1 TCP Connection Management
# /etc/sysctl.d/10-network.conf
# TCP backlog - Essential for high-traffic servers
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# TCP socket buffers (bytes)
# min / default / max
net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216
# TCP congestion control
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
# TIME_WAIT management
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_max_tw_buckets = 2000000
# Keepalive (servers behind load balancers)
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
3.2 Network Parameter Reference Table
| Parameter | Default | Recommended | Description |
|---|---|---|---|
| net.core.somaxconn | 4096 (128 before 5.4) | 65535 | Max listen() backlog |
| net.ipv4.tcp_max_syn_backlog | 1024 | 65535 | Max SYN queue size |
| net.core.rmem_max | 212992 | 16777216 (16 MB) | Max receive socket buffer |
| net.core.wmem_max | 212992 | 16777216 (16 MB) | Max send socket buffer |
| net.ipv4.tcp_congestion_control | cubic | bbr | Congestion control algorithm |
| net.ipv4.tcp_tw_reuse | 2 (loopback only; 0 before 4.12) | 1 | TIME_WAIT socket reuse |
| net.ipv4.tcp_fin_timeout | 60 | 15 | FIN-WAIT-2 timeout (sec) |
| net.ipv4.tcp_keepalive_time | 7200 | 60 | Keepalive start time (sec) |
| net.ipv4.ip_local_port_range | 32768 60999 | 1024 65535 | Outbound port range |
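The 16 MB buffer ceilings come from the bandwidth-delay product (BDP): a socket buffer must hold at least bandwidth × RTT of in-flight data to keep the link full. A small sizing sketch (the function name is illustrative):

```bash
#!/usr/bin/env bash
# bdp_bytes <bandwidth_mbit> <rtt_ms> - bandwidth-delay product in bytes
bdp_bytes() {
    # BDP = bandwidth (bits/s) * RTT (s) / 8 bits-per-byte
    awk -v mbit="$1" -v rtt="$2" \
        'BEGIN { printf "%d\n", mbit * 1000000 * (rtt / 1000) / 8 }'
}

# A 1 Gbps link with 100 ms RTT needs ~12.5 MB of buffer, so the
# 16 MB rmem_max/wmem_max above covers it with some headroom
bdp_bytes 1000 100
```

If your links are faster or your RTTs longer, scale `rmem_max`/`wmem_max` (and the third field of `tcp_rmem`/`tcp_wmem`) accordingly.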
3.3 Enabling BBR
# Load BBR kernel module
modprobe tcp_bbr
echo "tcp_bbr" >> /etc/modules-load.d/bbr.conf
# Apply sysctl
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
# Verify
sysctl net.ipv4.tcp_congestion_control
# net.ipv4.tcp_congestion_control = bbr
BBR vs CUBIC: BBR uses bandwidth estimation-based congestion control rather than packet loss-based, significantly improving performance especially on long-distance, high-latency networks.
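Before switching, it is worth confirming that bbr actually appears in the kernel's list of available algorithms; otherwise the sysctl write fails. A hedged sketch (the helper name is illustrative; in real use the list comes from `sysctl -n net.ipv4.tcp_available_congestion_control`):

```bash
#!/usr/bin/env bash
# has_cc <algo> <available-list> - check whether a congestion control
# algorithm appears in a space-separated availability list
has_cc() {
    local algo="$1" available="$2"
    # Match as a whole word so "bbr" does not match e.g. "bbr2"
    grep -qw "$algo" <<< "$available"
}

# Real usage (requires root to load the module):
# available=$(sysctl -n net.ipv4.tcp_available_congestion_control)
# has_cc bbr "$available" || modprobe tcp_bbr
```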
4. Memory Tuning
4.1 Virtual Memory Management
# /etc/sysctl.d/20-memory.conf
# (sysctl.conf does not support trailing comments, so comments go on their own lines)
# Swap tendency (0=minimal, 100=aggressive)
# DB servers: 1-10, Web servers: 10-30
vm.swappiness = 10
# Dirty page thresholds - control how long dirty pages wait before disk writeback
# Max dirty page ratio relative to total memory
vm.dirty_ratio = 40
# Background flush start ratio
vm.dirty_background_ratio = 10
# OOM-related: 0=heuristic (default), 1=always allow, 2=strict accounting
vm.overcommit_memory = 0
# Whether to panic on OOM (0=run the OOM killer)
vm.panic_on_oom = 0
# Max memory map areas (Elasticsearch, MongoDB, etc.)
vm.max_map_count = 262144
# Filesystem cache release (emergency only; run manually, never via sysctl.d)
# echo 3 > /proc/sys/vm/drop_caches  # 1=pagecache, 2=dentries+inodes, 3=all
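Because vm.dirty_ratio is a percentage of total memory, the absolute writeback threshold grows with RAM, which is why large hosts often prefer the byte-based vm.dirty_bytes instead. A sketch of the conversion (helper name is illustrative):

```bash
#!/usr/bin/env bash
# dirty_bytes <total_mem_bytes> <dirty_ratio_percent>
# Absolute dirty-page threshold implied by a percentage-based ratio
dirty_bytes() {
    awk -v mem="$1" -v ratio="$2" 'BEGIN { printf "%d\n", mem * ratio / 100 }'
}

# On a 64 GiB host, vm.dirty_ratio=40 allows ~25.6 GiB of dirty pages
# to accumulate before writers are throttled - a large writeback burst
dirty_bytes $((64 * 1024 * 1024 * 1024)) 40
```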
4.2 Memory Parameter Guide
| Parameter | DB Server | Web/API Server | ML Workload |
|---|---|---|---|
| vm.swappiness | 1-5 | 10-30 | 1 |
| vm.dirty_ratio | 40 | 20 | 40 |
| vm.dirty_background_ratio | 10 | 5 | 10 |
| vm.overcommit_memory | 0 | 0 | 1 |
| vm.max_map_count | 262144 | 65530 | 262144 |
4.3 Huge Pages
# Transparent Huge Pages (THP) - Recommended to disable for DB
# Set via boot parameter
# Add to GRUB_CMDLINE_LINUX:
# transparent_hugepage=never
# Runtime check/change
cat /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Static Huge Pages (Oracle DB, DPDK, etc.)
# /etc/sysctl.d/20-memory.conf
# 1024 pages * 2 MB = 2 GB reserved
vm.nr_hugepages = 1024
# Verify
grep -i huge /proc/meminfo
5. Filesystem and I/O Tuning
# /etc/sysctl.d/30-fs.conf
# Max open files (system-wide)
fs.file-max = 2097152
# inotify watch limits (IDE, file watch services)
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 8192
# AIO (Asynchronous I/O) max request count
fs.aio-max-nr = 1048576
ulimit Integration
# /etc/security/limits.d/99-app.conf
# Must be configured together with sysctl's fs.file-max to take effect
* soft nofile 1048576
* hard nofile 1048576
* soft nproc 65535
* hard nproc 65535
* soft memlock unlimited
* hard memlock unlimited
I/O Scheduler Configuration
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# SSD: none or mq-deadline recommended
echo mq-deadline > /sys/block/sda/queue/scheduler
# Persistent config (udev rule)
# /etc/udev/rules.d/60-scheduler.rules
# ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
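The sysfs scheduler file marks the active scheduler with brackets (e.g. `[mq-deadline] none`), which is awkward to compare in scripts. A small parser, assuming that format (function name is illustrative):

```bash
#!/usr/bin/env bash
# active_scheduler <scheduler-line> - extract the bracketed (active) entry
# from a /sys/block/*/queue/scheduler line such as "[mq-deadline] none"
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' <<< "$1"
}

# Real usage:
# for dev in /sys/block/sd*; do
#     echo "$dev: $(active_scheduler "$(cat "$dev/queue/scheduler")")"
# done
```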
| Scheduler | Disk Type | Characteristics |
|---|---|---|
| none (noop) | NVMe SSD | Minimal overhead |
| mq-deadline | SATA SSD | Guaranteed latency |
| bfq | HDD | Fair bandwidth allocation |
| kyber | Fast SSD | Read/write latency control |
6. Security-Related Parameters
# /etc/sysctl.d/40-security.conf
# ASLR (Address Space Layout Randomization): 0=off, 1=partial, 2=full
kernel.randomize_va_space = 2
# SysRq restriction - bitmask 176 = sync(16) + remount-ro(32) + reboot(128)
kernel.sysrq = 176
# Core dump restriction
kernel.core_pattern = |/bin/false
fs.suid_dumpable = 0
# dmesg access restriction
kernel.dmesg_restrict = 1
# Kernel pointer hiding
kernel.kptr_restrict = 2
# BPF restriction (unprivileged users)
kernel.unprivileged_bpf_disabled = 1
# Network security
# Reverse path filtering (strict)
net.ipv4.conf.all.rp_filter = 1
# Reject ICMP redirects
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
# Reject source-routed packets
net.ipv4.conf.all.accept_source_route = 0
# Log packets with impossible source addresses (martians)
net.ipv4.conf.all.log_martians = 1
# Smurf attack defense
net.ipv4.icmp_echo_ignore_broadcasts = 1
# IP forwarding (disable unless router/container host)
net.ipv4.ip_forward = 0
# For container hosts:
# net.ipv4.ip_forward = 1
Security Parameter Checklist
| Parameter | Secure Value | CIS Benchmark | Notes |
|---|---|---|---|
| kernel.randomize_va_space | 2 | Required | Full ASLR enabled |
| kernel.dmesg_restrict | 1 | Recommended | Block dmesg for regular users |
| kernel.kptr_restrict | 2 | Recommended | Prevent kernel address exposure |
| net.ipv4.conf.all.rp_filter | 1 | Required | Prevent IP spoofing |
| net.ipv4.conf.all.accept_redirects | 0 | Required | Prevent MITM attacks |
| net.ipv4.conf.all.log_martians | 1 | Recommended | Audit abnormal packets |
| fs.suid_dumpable | 0 | Required | Prevent SUID core dumps |
7. Boot Parameters (Kernel Command Line)
7.1 Configuration Method
# Check current boot parameters
cat /proc/cmdline
# RHEL / Rocky
vi /etc/default/grub
# GRUB_CMDLINE_LINUX="... parameters_to_add"
grub2-mkconfig -o /boot/grub2/grub.cfg
# Ubuntu
vi /etc/default/grub
# GRUB_CMDLINE_LINUX_DEFAULT="... parameters_to_add"
update-grub
7.2 Key Boot Parameters
| Parameter | Value | Purpose |
|---|---|---|
| transparent_hugepage=never | never / always / madvise | Disable THP for DB servers |
| mitigations=auto | off / auto / auto,nosmt | CPU vulnerability mitigation |
| numa_balancing=disable | disable / enable | NUMA auto-balancing |
| isolcpus=2-7 | CPU list | Isolate specific CPUs from the scheduler |
| nohz_full=2-7 | CPU list | Tickless mode (real-time workloads) |
| intel_iommu=on | on / off | Enable IOMMU (SR-IOV, VFIO) |
| iommu=pt | pt / off | IOMMU pass-through |
| default_hugepagesz=1G | 2M / 1G | Default Huge Page size |
| hugepagesz=1G hugepages=16 | size + count | 1 GB Huge Page allocation |
| crashkernel=256M | size | kdump memory reservation |
| audit=1 | 0 / 1 | Kernel audit logging |
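After a reboot, whether a boot parameter actually took effect can be verified by parsing /proc/cmdline. A helper that works on any command-line string (the function name is illustrative):

```bash
#!/usr/bin/env bash
# cmdline_param <name> <cmdline> - print the value of a name=value pair
# on a kernel command line, or the bare name for valueless flags
cmdline_param() {
    local name="$1" cmdline="$2" tok
    for tok in $cmdline; do
        case "$tok" in
            "$name"=*) echo "${tok#*=}"; return 0 ;;
            "$name")   echo "$name";    return 0 ;;
        esac
    done
    return 1  # parameter not present
}

# Real usage:
# cmdline_param transparent_hugepage "$(cat /proc/cmdline)"
```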
7.3 CPU Vulnerability Mitigation vs Performance
# Check currently applied mitigations
grep -r . /sys/devices/system/cpu/vulnerabilities/ 2>/dev/null
# Disable mitigations (benchmark/isolated environments only!)
# Add to GRUB_CMDLINE_LINUX:
# mitigations=off
# Performance impact (varies by workload)
# mitigations=auto: 5-30% overhead on syscall-intensive workloads
# mitigations=off: Security risk - not recommended for production
Warning: mitigations=off disables all security mitigations, including those for Spectre, Meltdown, and MDS. Use it only in isolated benchmark environments, never in production.
8. Workload-Specific Tuning Profiles
8.1 Web Server / API Server
# /etc/sysctl.d/99-web-server.conf
# Network
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# Files
fs.file-max = 2097152
# Memory
vm.swappiness = 10
8.2 Database Server
# /etc/sysctl.d/99-database.conf
# Memory
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.overcommit_memory = 0
vm.max_map_count = 262144
# Files
fs.file-max = 2097152
fs.aio-max-nr = 1048576
# Network (primarily internal communication)
net.core.somaxconn = 65535
net.ipv4.tcp_keepalive_time = 60
# Huge Pages (PostgreSQL, Oracle, etc.)
# vm.nr_hugepages calculation: shared_buffers / 2MB + some headroom
# Example: shared_buffers=8GB -> vm.nr_hugepages = 4200
vm.nr_hugepages = 4200
Boot parameters:
transparent_hugepage=never
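The vm.nr_hugepages figure above can be derived from the shared buffer size. A sketch of the calculation with 2 MB pages (the headroom percentage is a judgment call, not a fixed rule):

```bash
#!/usr/bin/env bash
# nr_hugepages <shared_buffers_mb> [headroom_percent]
# Number of 2 MB huge pages needed for a given shared memory size
nr_hugepages() {
    local mb="$1" headroom="${2:-3}"
    # pages = MB / 2 MB-per-page, plus headroom, rounded up
    awk -v mb="$mb" -v h="$headroom" \
        'BEGIN { p = mb / 2 * (1 + h / 100); n = int(p); if (n < p) n++; print n }'
}

# shared_buffers = 8 GB (8192 MB) -> 4096 pages, plus a few percent headroom
nr_hugepages 8192
```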
8.3 Container Host (Docker/K8s)
# /etc/sysctl.d/99-container-host.conf
# IP forwarding and bridged-traffic filtering (requires the br_netfilter module)
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
# Network
net.core.somaxconn = 65535
net.ipv4.ip_local_port_range = 1024 65535
net.netfilter.nf_conntrack_max = 1048576
# inotify (when running many Pods)
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 8192
# PID limit
kernel.pid_max = 4194304
# Files
fs.file-max = 2097152
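When the conntrack table fills, new connections are silently dropped, so it helps to track usage as a percentage of nf_conntrack_max. A monitoring sketch (the helper name is illustrative; in real use both numbers come from sysctl):

```bash
#!/usr/bin/env bash
# conntrack_usage <count> <max> - percentage of the conntrack table in use
conntrack_usage() {
    awk -v c="$1" -v m="$2" 'BEGIN { printf "%.1f\n", c * 100 / m }'
}

# Real usage - alert well before the table fills:
# count=$(sysctl -n net.netfilter.nf_conntrack_count)
# max=$(sysctl -n net.netfilter.nf_conntrack_max)
# conntrack_usage "$count" "$max"
```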
9. Automation and Verification
Managing sysctl with Ansible
# roles/sysctl/tasks/main.yml
- name: Apply sysctl parameters
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    sysctl_file: /etc/sysctl.d/99-tuning.conf
    reload: true
    state: present
  loop: "{{ sysctl_params | dict2items }}"

# roles/sysctl/defaults/main.yml
sysctl_params:
  net.core.somaxconn: 65535
  net.ipv4.tcp_max_syn_backlog: 65535
  vm.swappiness: 10
  fs.file-max: 2097152
Post-Application Verification Script
#!/usr/bin/env bash
# verify-tuning.sh - Verify tuning values
declare -A EXPECTED=(
    ["net.core.somaxconn"]="65535"
    ["net.ipv4.tcp_congestion_control"]="bbr"
    ["vm.swappiness"]="10"
    ["fs.file-max"]="2097152"
)
FAILED=0
for param in "${!EXPECTED[@]}"; do
    actual=$(sysctl -n "$param" 2>/dev/null)
    expected="${EXPECTED[$param]}"
    if [[ "$actual" != "$expected" ]]; then
        echo "FAIL: $param = $actual (expected: $expected)"
        (( FAILED += 1 ))
    else
        echo "OK: $param = $actual"
    fi
done
echo "---"
if (( FAILED > 0 )); then
    echo "Verification failed: ${FAILED} item(s)"
    exit 1
else
    echo "All parameters verified successfully"
fi
10. Troubleshooting
| Symptom | Check Command | Related Parameter |
|---|---|---|
| "Too many open files" | ulimit -n, sysctl fs.file-max | fs.file-max, limits.conf |
| "Connection refused" or timeouts (backlog full) | ss -lnt, netstat -s \| grep -i overflow | net.core.somaxconn |
| TIME_WAIT explosion | ss -s | tcp_tw_reuse, tcp_fin_timeout |
| Frequent OOM kills | dmesg \| grep -i oom, /proc/meminfo | vm.swappiness, vm.overcommit_memory |
| High I/O wait | iostat -x 1, vmstat 1 | vm.dirty_ratio, I/O scheduler |
| "Cannot allocate memory" (mmap) | sysctl vm.max_map_count | vm.max_map_count |
| nf_conntrack table full | dmesg \| grep conntrack, sysctl net.netfilter.nf_conntrack_count | nf_conntrack_max |
Conclusion
Here are the key principles of kernel parameter tuning:
- Measure first, tune later: Do not change values unless a bottleneck has been identified.
- One at a time: Changing multiple parameters simultaneously makes it impossible to isolate effects.
- Always back up: Record current values before making changes. Never make changes without a rollback path.
- Canary deployment: Do not apply to all servers at once -- verify on 1-2 servers first.
- Document everything: Record why you changed to this value and what effect was observed.
Proper tuning can dramatically improve server performance, but incorrect tuning can directly cause outages. Always approach tuning safely, incrementally, and measurably.