Kernel Tuning for Low-Latency Trading Systems — The War Against Microseconds

Introduction
A Sense of Scale First — ns, us, ms
Decomposing Latency — Where Does a Packet Lose Time?
The Kernel-Bypass Spectrum — How Far Do You Go?
The CPU Isolation Recipe — Renting Out Cores
NUMA Alignment — Locality of Memory and the NIC
Interrupts vs Polling — Do Not Wait to Be Woken
Eliminating Idle States — A Sleeping Core Is a Slow Core
Memory — Exterminate Page Faults and TLB Misses
Network Stack Tuning Parameters
Time Synchronization — PTP and Hardware Timestamps
Measurement Methodology — Averages Lie
Application-Side Considerations — The Kernel Is Not the Whole Story
Why This Is Overkill for Ordinary Services — Settling the Trade-Offs
Checklist
Pitfalls and Anti-Patterns
Closing
References

Introduction

In most services, 100 microseconds of latency is noise you would not even bother measuring. In high-frequency trading (HFT) and market making, 100 microseconds is the difference between "your order fills" and "you watch someone else fill it." The order that reaches the exchange matching engine first takes the favorable price; the late order gets the worsened quote. In this domain, latency is not an abstract quality metric — it is a number recorded directly on the profit and loss statement.

This article organizes Linux kernel tuning from the perspective of low-latency trading systems. A disclaimer up front: most techniques here are overkill for ordinary services, and the latter part of the article covers "why you should not apply this to your web server." Also, this is a systems-engineering article, not investment advice.

A Sense of Scale First — ns, us, ms

To discuss low-latency tuning, you first need intuition for time units.

Unit	Magnitude	Intuition	Example operations
1 ns	One billionth of a second	A few CPU cycles (one cycle is about 0.3ns)	L1 cache access (about 1ns)
10 ns	---	L2 cache access	Branch misprediction penalty
100 ns	---	Main memory access (about 60 to 100ns)	Remote NUMA node access
1 us	One millionth of a second	Half a well-tuned userspace networking round trip	Kernel-bypass NIC transmit
10 us	---	One context switch plus cache pollution	Standard kernel stack UDP receive
100 us	---	Interrupt latency spike on a misconfigured box	Waking from a deep C-state
1 ms	One thousandth of a second	A "fast" response for a normal web service	SSD access, minor GC pause
10 ms	---	Longer than a Seoul-Busan fiber round trip	HDD seek, timeslice expiry

For trading systems, tick-to-trade (market data in, order out) competes in the single-digit microsecond range for well-built software systems and in the hundreds of nanoseconds for FPGA-based ones. The unit we fight in is microseconds, and the enemy is "the occasional millisecond-class spike."

Decomposing Latency — Where Does a Packet Lose Time?

Trace the path of a market data packet from NIC arrival to the application sending an order.

[Exchange market data]
    |
    v
(1) NIC receive: wire --> NIC buffer          physical layer, tens of ns
    |
(2) DMA + interrupt/polling                   hundreds of ns to several us
    |     - interrupt path: IRQ -> softirq -> protocol stack
    |     - polling/bypass path: userspace reads the ring buffer directly
    v
(3) Protocol processing (IP/UDP decode)       standard stack: 1-5 us / bypass: hundreds of ns
    |
(4) Socket/queue delivery --> app wakeup      large variance here:
    |     - if the thread is already busy-waiting on a core: ~0
    |     - if a scheduler wakeup is needed: 1-10 us plus spikes
    v
(5) Strategy logic (quote calc, risk checks)  hundreds of ns to several us (app domain)
    |
(6) Order encoding + transmit                 reverse of (2)-(3)
    v
[Exchange order gateway]

Two key insights. First, the mean is dominated by (3), (5), (6), but the tail is dominated by (2) and (4). Interrupt coalescing, scheduler interference, C-state wakeups, TLB misses, SMIs (System Management Interrupts) — these are what ruin your 99.99th percentile. Second, the essence of tuning is not reducing the mean but killing the variance. In trading, an unpredictable system is worse than a slow one.

The Kernel-Bypass Spectrum — How Far Do You Go?

How much of the kernel network stack to bypass is a cost-benefit spectrum.

Approach	Representative tech	Receive latency (approx.)	Dev difficulty	Ops difficulty	Best fit
Standard stack tuning	sysctl, busy_poll, IRQ affinity	5-20 us	Low	Low	General low latency, back-office feeds
XDP/AF_XDP	eBPF in-kernel processing	2-5 us	Medium	Medium	Filtering, market data fan-out, DDoS defense
Bypass with socket compat	Onload-style (socket API kept)	1-3 us	Low (no recompile)	Medium	Accelerating existing socket apps
Full userspace stack	DPDK + own UDP/TCP handling	Under 1 us	High	High	Tick-to-trade core path
Hardware offload	FPGA/SmartNIC	Hundreds of ns	Very high	Very high	Top-tier HFT

The decision criterion is simple: is your competition measured in microseconds or milliseconds? For market data analytics or risk batches, standard stack tuning suffices. For a speed race against the matching engine, DPDK or socket-compatible bypass (Onload, VMA-style) is the starting point. AF_XDP occupies a unique middle ground — "bypass in cooperation with the kernel" — and is useful for peeling only part of the traffic on one NIC into a low-latency path.

The CPU Isolation Recipe — Renting Out Cores

The first principle of low latency: evict everything else from the cores where hot-path threads run. The trio of kernel boot parameters is the foundation.

# /etc/default/grub
# Example: isolate cores 4-23 for trading on a 24-core machine
GRUB_CMDLINE_LINUX="isolcpus=nohz,domain,managed_irq,4-23 \
  nohz_full=4-23 \
  rcu_nocbs=4-23 \
  rcu_nocb_poll \
  irqaffinity=0-3 \
  intel_idle.max_cstate=0 processor.max_cstate=1 \
  intel_pstate=disable idle=poll \
  mitigations=off \
  transparent_hugepage=never \
  audit=0 nosoftlockup"

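# Apply (update-grub is the Debian/Ubuntu wrapper; RHEL-family systems regenerate
# the config with grub2-mkconfig instead)
sudo update-grub && sudo reboot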

What each piece means:

isolcpus: removes the listed cores from general scheduler load balancing. Only threads with explicit affinity land there.
nohz_full: when exactly one task runs on the core, the periodic scheduler tick (100 to 1000 per second by default) is stopped, removing tick-induced micro-jitter.
rcu_nocbs: moves RCU callback processing off the isolated cores onto housekeeping cores (0-3 here).
idle=poll / max_cstate: keeps cores from sleeping (details later).
mitigations=off: disables side-channel mitigations. Syscall and context-switch costs drop meaningfully, but this is an option only for closed, dedicated networks where the security trade-off is formally approved by the organization.
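
The isolation parameters only work as a set when they name the same cores, so it is worth reading them back from the running kernel after the reboot. The following is a minimal sketch, not a standard tool: the function names `cmdline_cpulist` and `check_isolation_cmdline` and the optional file argument (which defaults to /proc/cmdline) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: confirm that isolcpus, nohz_full and rcu_nocbs name the same CPU list.
# Function names and the optional file argument are illustrative assumptions.

# Print the CPU list of one boot parameter, minus any leading isolcpus flags.
cmdline_cpulist() {
  local param="$1" file="${2:-/proc/cmdline}" value
  value=$(tr ' ' '\n' < "$file" | sed -n "s/^${param}=//p" | tail -n 1)
  # isolcpus may put flags (nohz, domain, managed_irq) before the list:
  # drop leading comma-separated tokens that start with a letter.
  while :; do
    case "$value" in
      [a-z]*,*) value="${value#*,}" ;;
      [a-z]*)   value=""; break ;;
      *)        break ;;
    esac
  done
  printf '%s\n' "$value"
}

# Return 0 when all three parameters are present and identical, 1 otherwise.
check_isolation_cmdline() {
  local file="${1:-/proc/cmdline}" iso nohz rcu
  iso=$(cmdline_cpulist isolcpus "$file")
  nohz=$(cmdline_cpulist nohz_full "$file")
  rcu=$(cmdline_cpulist rcu_nocbs "$file")
  echo "isolcpus=$iso nohz_full=$nohz rcu_nocbs=$rcu"
  [ -n "$iso" ] && [ "$iso" = "$nohz" ] && [ "$iso" = "$rcu" ]
}
```

Calling `check_isolation_cmdline || exit 1` early in a deployment script turns a silent mismatch into a visible failure. The comparison is textual, so "4-23" and "4-22,23" would be reported as different even though they cover the same cores.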

After boot, place threads and interrupts.

#!/usr/bin/env bash
set -euo pipefail

# 1) irqbalance is the enemy of isolation — disable it
systemctl stop irqbalance || true
systemctl disable irqbalance || true

# 2) default all IRQ affinity to housekeeping cores (0-3)
for irq in /proc/irq/*/smp_affinity_list; do
  echo "0-3" > "$irq" 2>/dev/null || true
done

# 3) pin only the trading NIC RX queue IRQs to dedicated cores (4,5)
NIC="ens1f0"
i=0
for irq in $(grep "$NIC" /proc/interrupts | awk -F: '{print $1}'); do
  core=$((4 + i % 2))
  echo "$core" > "/proc/irq/$irq/smp_affinity_list"
  i=$((i + 1))
done

# 4) pin worker threads to isolated cores (pthread_setaffinity_np inside
#    the app is the proper way; from outside, taskset)
taskset -c 6 ./market_data_handler &
taskset -c 7 ./order_gateway &

# 5) kernel workqueues to housekeeping cores too
echo 0f > /sys/devices/virtual/workqueue/cpumask

Verify with perf and the /proc interfaces.

# Confirm ticks really stopped on isolated cores
watch -n1 'grep -E "LOC|RES" /proc/interrupts | awk "{print \$1, \$8, \$9, \$10}"'

# Monitor context switches on a given core (should be zero)
perf stat -C 6 -e context-switches,cpu-migrations -- sleep 10
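
Affinity writes can be rejected or later overwritten, so the placement is worth reading back rather than assuming. The sketch below lists each IRQ of one NIC with the CPU list it is currently allowed on, and fails when any of them sits outside an expected set. The function names and the `proc_root` argument (defaulting to /proc) are hypothetical helpers introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch: list each IRQ of one NIC with the CPU list it is allowed on.
# nic_irq_affinity, assert_nic_irqs_on and proc_root are illustrative names.

nic_irq_affinity() {
  local nic="$1" proc_root="${2:-/proc}" irq file
  # IRQ lines start with "<number>:"; the device name appears at the end of the line.
  awk -v nic="$nic" 'index($0, nic) && $1 ~ /^[0-9]+:$/ { sub(":", "", $1); print $1 }' \
      "$proc_root/interrupts" |
  while read -r irq; do
    file="$proc_root/irq/$irq/smp_affinity_list"
    if [ -r "$file" ]; then
      printf '%s %s\n' "$irq" "$(cat "$file")"
    else
      printf '%s unknown\n' "$irq"
    fi
  done
}

# Return 1 if any IRQ of the NIC is allowed on something other than one of the
# comma-separated entries in "allowed". This is a plain string match, which is
# enough for single-core pins such as "4" or "5".
assert_nic_irqs_on() {
  local nic="$1" allowed="$2" proc_root="${3:-/proc}" bad
  bad=$(nic_irq_affinity "$nic" "$proc_root" | while read -r irq cpus; do
    case ",$allowed," in
      *",$cpus,"*) ;;
      *) echo "$irq:$cpus" ;;
    esac
  done)
  if [ -n "$bad" ]; then
    echo "IRQs outside CPUs $allowed: $bad" >&2
    return 1
  fi
}
```

With the placement script above, `assert_nic_irqs_on ens1f0 4,5` would be the matching check.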

NUMA Alignment — Locality of Memory and the NIC

Ignore NUMA on a multi-socket server and you pay a tens-of-nanoseconds tax on every memory access that crosses sockets. More important still: which NUMA node is the NIC attached to? PCIe slots are wired to a specific socket, so the rule is to use cores and memory on the same node as the NIC.

# Find the NIC NUMA node
cat /sys/class/net/ens1f0/device/numa_node
# Output: 1  --> this NIC hangs off node 1

# List node 1 cores
lscpu | grep "NUMA node1"

# Launch the hot-path process bound to node 1 cores and memory
numactl --cpunodebind=1 --membind=1 ./trading_engine

# Verify there is no cross-node traffic
numastat -p $(pgrep trading_engine)

At the application design level, the canonical pattern is to keep the entire chain — market data thread, strategy thread, order send thread — on the same node, and handle inter-thread communication via lock-free ring buffers between core pairs sharing the same L3 cache.
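
Because the NIC's node can change after a hardware swap, a launcher can check it before binding anything. A minimal sketch follows; `assert_nic_numa_node` and its `sysfs_root` argument (defaulting to /sys) are names invented for this example, and the expected node is whatever the deployment was designed around.

```shell
#!/usr/bin/env bash
# Sketch: refuse to start when the NIC is not on the NUMA node the launcher expects.
# assert_nic_numa_node and the sysfs_root override are illustrative assumptions.

assert_nic_numa_node() {
  local nic="$1" expected="$2" sysfs_root="${3:-/sys}" file actual
  file="$sysfs_root/class/net/$nic/device/numa_node"
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 2
  fi
  actual=$(cat "$file")
  # The kernel reports -1 when it has no NUMA information for the device,
  # which this check also treats as a mismatch.
  if [ "$actual" != "$expected" ]; then
    echo "NIC $nic is on NUMA node $actual, expected $expected" >&2
    return 1
  fi
}
```

A launcher line such as `assert_nic_numa_node ens1f0 1 && exec numactl --cpunodebind=1 --membind=1 ./trading_engine` would then stop the engine from starting on the wrong side of the machine.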

Interrupts vs Polling — Do Not Wait to Be Woken

Interrupts are the efficient "wake me when something happens" model, but the wakeup cost (IRQ handling plus scheduler wakeup plus cache warming) runs in microseconds. On the low-latency path, you burn a core and poll.

Busy polling is the compromise that keeps the standard stack.

# Socket-layer busy poll (in microseconds)
sysctl -w net.core.busy_poll=50
sysctl -w net.core.busy_read=50

# For epoll-based apps, per-NAPI busy poll is also possible: SO_PREFER_BUSY_POLL and
# SO_BUSY_POLL_BUDGET arrived in kernel 5.11, the EPIOCSPARAMS epoll ioctl in 6.9

# Balance NAPI polling vs interrupt coalescing — for low latency, turn coalescing off
ethtool -C ens1f0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0

Note that disabling interrupt coalescing makes CPU usage soar at high PPS. Disable it only for traffic where "every single packet matters immediately" like market data, and keep coalescing for bulk-traffic NICs. With DPDK or Onload this trade-off disappears entirely — userspace poll mode is the default.

Eliminating Idle States — A Sleeping Core Is a Slow Core

Modern CPUs sleep into C-states when idle and shift P-states (frequency) under load. Both are enemies of low latency. Waking from deep C-states (C6) can take tens to hundreds of microseconds, and frequency transitions create hiccups of tens of microseconds.

# C-states: if pinned via boot parameters, just verify
cat /sys/devices/system/cpu/cpu6/cpuidle/state*/disable

# To control at runtime, open /dev/cpu_dma_latency, write 0, and keep the descriptor open
# (the limit holds only while that descriptor stays open; the tuned latency-performance
# profile relies on the same mechanism)
tuned-adm profile latency-performance

# P-states: pin the governor to performance
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$g"
done

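# Turbo boost: for determinism, disabling is the consistent choice
# (enabled, the mean improves but frequency wobbles with thermal conditions)
# With intel_pstate=disable on the boot line, acpi-cpufreq owns frequency control
# and exposes a global boost switch:
echo 0 > /sys/devices/system/cpu/cpufreq/boost
# If intel_pstate is the active driver instead, the equivalent knob is:
# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo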

Two easily missed items. First, SMIs (System Management Interrupts) are firmware-level stalls invisible to the OS and the usual culprit behind multi-hundred-microsecond spikes. Disable USB legacy support and power-management SMI sources in the BIOS, and watch the SMI counter with turbostat. Second, hyper-threading: on hot-path cores, disable it or leave the sibling thread empty — the sibling pollutes caches and execution ports.
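
To check idle-state settings against a wakeup budget instead of eyeballing the `disable` flags, the per-state exit latency that cpuidle publishes in sysfs can be compared with a threshold. The sketch below assumes the usual cpuidle layout (`name`, `latency` in microseconds, `disable` per state); the function name and the `sysfs_root` argument are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print idle states of one CPU that are still enabled and whose exit
# latency exceeds a budget in microseconds. Names here are illustrative.

deep_idle_states_enabled() {
  local cpu="$1" max_us="$2" sysfs_root="${3:-/sys}" dir
  for dir in "$sysfs_root/devices/system/cpu/cpu$cpu/cpuidle"/state*; do
    # Skip when the cpuidle directory is absent or the glob matched nothing.
    [ -r "$dir/latency" ] || continue
    if [ "$(cat "$dir/disable")" = "0" ] && [ "$(cat "$dir/latency")" -gt "$max_us" ]; then
      echo "$(cat "$dir/name") $(cat "$dir/latency")"
    fi
  done
  return 0
}
```

An empty result from `deep_idle_states_enabled 6 2` would mean that core 6 has no enabled state advertising a wakeup slower than 2 microseconds. It says nothing about SMIs, which stay invisible at this level.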

Memory — Exterminate Page Faults and TLB Misses

A single page fault on the hot path is a microsecond-class disaster. Touch and lock all memory at startup.

# Reserve huge pages (2MB) — reduces TLB misses
echo 4096 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# 1GB huge pages via boot parameters
# default_hugepagesz=1G hugepagesz=1G hugepages=16

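/* The canonical application startup sequence */
#include <stdio.h>
#include <sys/mman.h>

int run_engine(void);   /* the application's hot loop, defined elsewhere */

int main(void) {
    /* 1) pin all current and future pages in RAM; this needs CAP_IPC_LOCK
          or a large enough RLIMIT_MEMLOCK, so check the result */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    /* 2) preallocate heap/pools and touch every page (prefault) */
    /* 3) after entering the hot path: no malloc/free, no syscalls, no page faults */
    return run_engine();
}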

THP (Transparent Huge Pages) is contentious. It helps from a TLB perspective, but the moment khugepaged compacts pages in the background you can take a millisecond-class stall. The low-latency consensus: set THP to never and use explicit huge pages. If determinism is the goal, most "the kernel handles it for you" features should be turned off.

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
sysctl -w vm.swappiness=0
sysctl -w vm.stat_interval=120        # slow vmstat updates to cut jitter

Network Stack Tuning Parameters

The baseline set for paths that keep the standard stack (e.g. TCP to the order gateway).

# Buffers: for low latency the goal is "right-sized," not "big" — oversized buffers hide latency
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# TCP: disable Nagle in the app with TCP_NODELAY — that is the proper place
sysctl -w net.ipv4.tcp_low_latency=1        # a no-op since kernel 4.14; meaningful only on older kernels
sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_slow_start_after_idle=0

# Queueing discipline: minimize transmit latency
sysctl -w net.core.default_qdisc=fq
tc qdisc replace dev ens1f0 root mq

# UDP market data: drop monitoring is life or death
# (the last column of /proc/net/udp is the per-socket drop counter; NR>1 skips the header)
watch -n1 'awk "NR>1 {print \$13}" /proc/net/udp | sort | uniq -c'
ethtool -S ens1f0 | grep -iE "drop|miss|fifo"

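# RSS: weight the indirection table so hashed flows land only on RX queues 0 and 1.
# Pinning one multicast group to one queue (and so to one core) needs an ntuple
# rule via ethtool -N instead, on NICs that support flow steering.
ethtool -X ens1f0 weight 1 1 0 0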
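
A histogram of drop counters shows that drops exist but not which feed is losing packets. As a hedged sketch, the per-socket drop column can be joined with the local port, which /proc/net/udp stores in hexadecimal after the colon of the local address. The function name and the optional file argument are assumptions for this example.

```shell
#!/usr/bin/env bash
# Sketch: print "local_port drops" for every UDP socket with a nonzero drop counter.
# udp_sockets_with_drops is an illustrative name; the file argument defaults to
# /proc/net/udp.

udp_sockets_with_drops() {
  local file="${1:-/proc/net/udp}" addr drops
  # Field 2 is local_address as HEXIP:HEXPORT, field 13 is the drop counter.
  awk 'NR > 1 && $13 > 0 { print $2, $13 }' "$file" |
  while read -r addr drops; do
    printf '%d %s\n' "0x${addr##*:}" "$drops"
  done
}
```

Any line of output identifies a listening port whose socket buffer overflowed at least once since the socket was opened, which is usually the first thing to correlate with a gap in the feed's sequence numbers.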

Time Synchronization — PTP and Hardware Timestamps

As important as reducing latency is measuring it correctly, and the prerequisite is the clock. On the regulatory side too, MiFID II RTS 25 in Europe requires HFT operators to keep clocks within 100 microseconds of UTC and to record precise timestamps. NTP (millisecond class) is insufficient; PTP (IEEE 1588, sub-microsecond) is the standard.

# Check NIC hardware timestamp support
ethtool -T ens1f0
# look for "hardware-transmit / hardware-receive / PTP Hardware Clock: 0"

# linuxptp: sync the NIC PHC to the grandmaster
ptp4l -i ens1f0 -f /etc/ptp4l.conf --summary_interval 6 &

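# PHC --> system clock (-w waits for ptp4l and applies the UTC-TAI offset it reports;
# a fixed -O 0 would leave the system clock on the PHC's timescale)
phc2sys -s ens1f0 -w -u 64 &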

# Monitor offset (target: rms in the tens of ns)
pmc -u -b 0 'GET CURRENT_DATA_SET'

On the application side, use the SO_TIMESTAMPING socket option to receive NIC hardware RX timestamps and measure per-segment latency relative to "wire arrival time." Software timestamps have the contradiction that the thing being measured (the kernel path) is part of the measuring tool — serious measurement uses hardware timestamps or optical taps plus capture appliances.
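
Offsets are only useful if someone notices when they drift. The sketch below scans ptp4l output for summary lines and reports any rms value above a limit in nanoseconds. It assumes summary lines of the form `rms N max M freq ...`, which should be verified against the installed linuxptp version; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read ptp4l log lines on stdin, print every rms value above limit_ns,
# and return nonzero if there was at least one.
# Assumes summary lines containing "rms <value>"; ptp_rms_violations is an
# illustrative name.

ptp_rms_violations() {
  local limit_ns="$1"
  awk -v limit="$limit_ns" '
    {
      for (i = 1; i < NF; i++)
        if ($i == "rms" && $(i + 1) + 0 > limit) { print $(i + 1); bad = 1 }
    }
    END { exit bad + 0 }'
}
```

Fed from the ptp4l log, for instance `journalctl -u ptp4l | ptp_rms_violations 100` on a systemd host where ptp4l runs as a unit with that name, a nonzero exit status can drive an alert.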

Measurement Methodology — Averages Lie

The report card of a low-latency system is not the mean but the percentiles — especially p99.9 and p99.99.

A tick-to-trade distribution (fictional example)
  p50    :   3.2 us     <-- the number that goes in marketing slides
  p99    :   5.8 us
  p99.9  :  14.0 us     <-- interrupt/timer interference
  p99.99 : 180.0 us     <-- C-state wakeups, SMIs, page faults
  max    : 2.1 ms       <-- if this lands at the worst moment, it is an incident

Looking only at the mean (3.5us) without a histogram, this system is "great."
But if 1 in 10,000 events takes 180us precisely when the market is most
violent (= when load is highest), that is exactly the moment you lose money.
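
The gap between mean and tail is easy to reproduce on any file of raw samples. The following is a minimal sketch that assumes one latency value per line; `mean` and `percentile` are helper names chosen here, and the percentile uses the simple nearest-rank definition rather than any interpolation.

```shell
#!/usr/bin/env bash
# Sketch: mean and nearest-rank percentile over a file with one sample per line.
# Function names are illustrative.

mean() {
  awk '{ sum += $1 } END { if (NR == 0) exit 1; print sum / NR }' "$1"
}

# percentile P FILE: smallest sample such that at least P percent of all
# samples are less than or equal to it.
percentile() {
  local p="$1" file="$2"
  sort -n "$file" | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = p * NR / 100
      idx = int(rank + 1 - 1e-9)   # ceiling, tolerant of floating-point noise
      if (idx < 1) idx = 1
      if (idx > NR) idx = NR
      print v[idx]
    }'
}
```

With nine samples of 3 and one of 180, `mean` reports 20.7 while `percentile 50` reports 3 and `percentile 99` reports 180: neither the mean nor the median describes what the slow event felt like.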

The tool chain:

# 1) Platform jitter itself — cyclictest (rt-tests package)
#    20 minutes on an isolated core, histogram output
cyclictest -m -p95 -t1 -a 6 -i 100 -h 400 -D 20m -q

# 2) Kernel path tracing — softirq duration distribution with bpftrace
bpftrace -e 'tracepoint:irq:softirq_entry { @ts[cpu] = nsecs; }
tracepoint:irq:softirq_exit /@ts[cpu]/ {
  @dist = hist(nsecs - @ts[cpu]); delete(@ts[cpu]); }'

# 3) Watch for scheduler interference — events that must not happen on isolated cores
perf record -C 6 -e sched:sched_switch,sched:sched_wakeup -- sleep 60
perf script | head

# 4) Correlate with hardware counters for cache/TLB misses
perf stat -C 6 -e cycles,instructions,cache-misses,dTLB-load-misses -- sleep 10

At the application level, record everything continuously with loss-free histograms in the HdrHistogram family, and correct for the "coordinated omission" trap — the phenomenon where samples missing from the periods when the measurement loop itself was stalled make the distribution look better than reality. Always measure at production-equivalent message rates, reproducing burst conditions like the market open.
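
One common correction for coordinated omission is to back-fill the samples that a stalled measurement loop never took: if requests were meant to arrive every I time units and one took much longer than I, the requests that would have been issued during the stall are recorded with progressively shorter latencies. The sketch below applies that idea to a stream of samples on stdin. It follows the approach HdrHistogram offers through its expected-interval recording, but it is an illustration with an invented function name, not that library's code.

```shell
#!/usr/bin/env bash
# Sketch: back-fill samples hidden by coordinated omission.
# Reads one latency per line on stdin; "interval" is the intended spacing
# between measurements, in the same unit as the samples.
# correct_coordinated_omission is an illustrative name.

correct_coordinated_omission() {
  local interval="$1"
  awk -v iv="$interval" '
    {
      print $1
      if (iv <= 0) next
      # Each measurement that should have started during the stall would
      # have waited one interval less than the previous one.
      for (missed = $1 - iv; missed >= iv; missed -= iv) print missed
    }'
}
```

With an interval of 100, a single observed sample of 350 becomes 350, 250, 150. The corrected stream can then be fed to whatever percentile calculation is in use, and the tail it produces is closer to what a steady stream of orders would have experienced.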

Application-Side Considerations — The Kernel Is Not the Whole Story

What the kernel provides ends at "an undisturbed core." The code above it must obey the same discipline.

Java family: avoid GC altogether via zero-allocation design on the hot path. Object pools, primitive-based data structures, off-heap ring buffers (Aeron, Chronicle-style patterns) are standard; some shops permit GC only after market close. Low-pause collectors like ZGC target sub-millisecond pauses as a design goal rather than a hard guarantee, and sub-millisecond is still not zero.
C++ family: forbid dynamic allocation, locks, syscalls, and exceptions on the hot path. Spin instead of blocking, reduce branches, align hot data to cache lines, and eliminate false sharing with padding.
Common: logging is asynchronous, binary, and on a separate core. One printf on the hot path can nullify all your kernel tuning.

Why This Is Overkill for Ordinary Services — Settling the Trade-Offs

Why you should not apply these recipes to a general service, itemized.

Technique	Gain in trading	Cost in a general service
Core isolation + polling	Microsecond tail reduction	Cores pinned at 100% — wasted power and cost, lower density
C-states disabled	No wakeup latency	Idle power multiplies, heat, less turbo headroom
mitigations=off	Hundreds of ns off syscalls	Side-channel exposure — taboo in multi-tenant settings
Coalescing off	Immediate per-packet handling	CPU explosion at high PPS, throughput collapse
THP off + manual huge pages	No millisecond stalls	Ops complexity; THP often wins for general workloads
Kernel bypass	Sub-microsecond receive	Kernel security/observability tooling disabled, dedicated staffing

In short, low-latency tuning is a trade where you pay throughput, power efficiency, security, and operability to buy determinism. If your API server is fine with 50ms at p99, this trade is a plain loss. If microseconds are your profit and loss, every cost above is justified. The most important tuning decision is honestly judging which side your system is on.

Checklist

Hardware/firmware

BIOS reviewed for C-state limits, hyper-threading policy, SMI sources?
NIC supports PTP hardware timestamps and sits in a PCIe slot on the right NUMA node?
SMI counter verified near zero with turbostat?

Kernel boot/isolation

isolcpus, nohz_full, rcu_nocbs configured with the identical core set?
irqbalance disabled and IRQ affinity placed explicitly?
Zero context switches confirmed with perf on isolated cores?

Memory/power

mlockall and prefaulting included in the startup sequence?
THP set to never with explicit huge pages reserved?
Governor performance, C-states pinned, turbo policy as intended?

Network/time

Coalescing, drop counters, RSS mapping checked on the market data NIC?
ptp4l/phc2sys offsets held within target (tens of ns)?
Hardware-timestamp-based per-segment measurement working?

Measurement/operations

p99.99 and max recorded continuously as histograms?
A cyclictest-based platform jitter baseline documented?
A before/after distribution comparison procedure for every tuning change (one change at a time)?

Pitfalls and Anti-Patterns

Changing ten things at once. You will never know which change mattered. Keep the loop: baseline, single change, re-measure.
Shipping on the mean. Tails appear under load — when the market is violent. Benchmarks during quiet hours are nearly meaningless.
Isolated, but kernel threads remain. kworker, ksoftirqd, and timers can linger on isolated cores. Verify workqueue cpumask and nohz_full behavior empirically.
Upgrades that forget NUMA. After a server swap or NIC reseating, numa_node silently changes and everything quietly slows. Assert it at startup.
Copy-pasting mitigations=off carelessly. On internet-exposed or multi-tenant systems this is a shortcut to a security incident.
Measuring without a clock. Computing "latency" from timestamps of two unsynchronized hosts yields comedy like negative latency. PTP comes first.

Closing

The technical content of low-latency kernel tuning collapses into one sentence: remove everything unpredictable from the hot path. The scheduler's good intentions, power management's frugality, the kernel's conveniences — on the hot path they are all sources of jitter. This is tuning by subtraction, not addition, and the space left behind is filled with measurement.

The common trait of teams that win the war against microseconds is not a secret trick but the habit of looking at percentile histograms daily and the discipline of validating every change with instrumentation. Tuning without measurement is superstition; tuning with measurement is engineering.