Kernel Tuning for Low-Latency Trading Systems — The War Against Microseconds

Introduction

In most services, 100 microseconds of latency is noise you would not even bother measuring. In high-frequency trading (HFT) and market making, 100 microseconds is the difference between "your order fills" and "you watch someone else fill it." The order that reaches the exchange matching engine first takes the favorable price; the late order gets a worse one. In this domain, latency is not an abstract quality metric; it is a number recorded directly on the profit and loss statement.

This article organizes Linux kernel tuning from the perspective of low-latency trading systems. A disclaimer up front: most techniques here are overkill for ordinary services, and the latter part of the article covers "why you should not apply this to your web server." Also, this is a systems-engineering article, not investment advice.

A Sense of Scale First — ns, us, ms

To discuss low-latency tuning, you first need intuition for time units.

| Unit | Magnitude | Intuition | Example operations |
| --- | --- | --- | --- |
| 1 ns | One billionth of a second | A few CPU cycles (one cycle is about 0.3ns) | L1 cache access (about 1ns) |
| 10 ns | - | L2 cache access | Branch misprediction penalty |
| 100 ns | - | Main memory access (about 60 to 100ns) | Remote NUMA node access |
| 1 us | One millionth of a second | Half a well-tuned userspace networking round trip | Kernel-bypass NIC transmit |
| 10 us | - | One context switch plus cache pollution | Standard kernel stack UDP receive |
| 100 us | - | Interrupt latency spike on a misconfigured box | Waking from a deep C-state |
| 1 ms | One thousandth of a second | A "fast" response for a normal web service | SSD access, minor GC pause |
| 10 ms | - | Longer than a Seoul-Busan fiber round trip | HDD seek, timeslice expiry |

For trading systems, tick-to-trade (market data in, order out) competes in the single-digit microsecond range for well-built software systems and in the hundreds of nanoseconds for FPGA-based ones. The unit we fight in is microseconds, and the enemy is "the occasional millisecond-class spike."

Decomposing Latency — Where Does a Packet Lose Time?

Trace the path of a market data packet from NIC arrival to the application sending an order.

[Exchange market data]
  |
  v
(1) NIC receive: wire --> NIC buffer            physical layer, tens of ns
  |
(2) DMA + interrupt/polling                     hundreds of ns to several us
  |    - interrupt path: IRQ -> softirq -> protocol stack
  |    - polling/bypass path: userspace reads the ring buffer directly
  v
(3) Protocol processing (IP/UDP decode)         standard stack: 1-5 us / bypass: hundreds of ns
  |
(4) Socket/queue delivery --> app wakeup        large variance here:
  |    - if the thread is already busy-waiting on a core: ~0
  |    - if a scheduler wakeup is needed: 1-10 us plus spikes
  v
(5) Strategy logic (quote calc, risk checks)    hundreds of ns to several us (app domain)
  |
(6) Order encoding + transmit                   reverse of (2)-(3)
  v
[Exchange order gateway]

Two key insights. First, the mean is dominated by (3), (5), (6), but **the tail is dominated by (2) and (4)**. Interrupt coalescing, scheduler interference, C-state wakeups, TLB misses, SMIs (System Management Interrupts) — these are what ruin your 99.99th percentile. Second, the essence of tuning is not reducing the mean but **killing the variance**. In trading, an unpredictable system is worse than a slow one.

The Kernel-Bypass Spectrum — How Far Do You Go?

How much of the kernel network stack to bypass is a cost-benefit spectrum.

| Approach | Representative tech | Receive latency (approx.) | Dev difficulty | Ops difficulty | Best fit |
| --- | --- | --- | --- | --- | --- |
| Standard stack tuning | sysctl, busy_poll, IRQ affinity | 5-20 us | Low | Low | General low latency, back-office feeds |
| XDP/AF_XDP | eBPF in-kernel processing | 2-5 us | Medium | Medium | Filtering, market data fan-out, DDoS defense |
| Bypass with socket compat | Onload-style (socket API kept) | 1-3 us | Low (no recompile) | Medium | Accelerating existing socket apps |
| Full userspace stack | DPDK + own UDP/TCP handling | Under 1 us | High | High | Tick-to-trade core path |
| Hardware offload | FPGA/SmartNIC | Hundreds of ns | Very high | Very high | Top-tier HFT |

The decision criterion is simple: **is your competition measured in microseconds or milliseconds?** For market data analytics or risk batches, standard stack tuning suffices. For a speed race against the matching engine, DPDK or socket-compatible bypass (Onload, VMA-style) is the starting point. AF_XDP occupies a unique middle ground — "bypass in cooperation with the kernel" — and is useful for peeling only part of the traffic on one NIC into a low-latency path.

The CPU Isolation Recipe — Renting Out Cores

The first principle of low latency: evict everything else from the cores where hot-path threads run. The trio of kernel boot parameters is the foundation.

# /etc/default/grub
# Example: isolate cores 4-23 for trading on a 24-core machine
GRUB_CMDLINE_LINUX="isolcpus=nohz,domain,managed_irq,4-23 \
  nohz_full=4-23 \
  rcu_nocbs=4-23 \
  rcu_nocb_poll \
  irqaffinity=0-3 \
  intel_idle.max_cstate=0 processor.max_cstate=1 \
  intel_pstate=disable idle=poll \
  mitigations=off \
  transparent_hugepage=never \
  audit=0 nosoftlockup"

# Apply (Debian/Ubuntu; on RHEL-family systems use grub2-mkconfig -o /boot/grub2/grub.cfg)
sudo update-grub && sudo reboot

What each piece means:

- **isolcpus**: removes the listed cores from general scheduler load balancing. Only threads with explicit affinity land there.

- **nohz_full**: when exactly one runnable task occupies the core, the periodic scheduler tick (CONFIG_HZ, typically 100 to 1000 per second) is stopped, removing tick-induced micro-jitter. It only takes effect on kernels built with CONFIG_NO_HZ_FULL.

- **rcu_nocbs**: moves RCU callback processing off the isolated cores onto housekeeping cores (0-3 here).

- **idle=poll / max_cstate**: keeps cores from sleeping (details later).

- **mitigations=off**: disables side-channel mitigations. Syscall and context-switch costs drop meaningfully, but this is an option **only for closed, dedicated networks where the security trade-off is formally approved by the organization**.
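Because isolcpus, nohz_full, and rcu_nocbs only work as a trio when they name the same cores, it can help to check the running kernel's command line after every boot. The following is a minimal sketch in POSIX sh, not an established tool: the function names `cmdline_value`, `strip_isolcpus_flags`, and `isolation_consistent` are hypothetical, and the comparison is purely textual, so `4-23` and `4-11,12-23` would be reported as different.

```shell
# Hypothetical helper: print the value of one boot parameter.
# $1 = parameter name, $2 = full kernel command line (e.g. from /proc/cmdline)
cmdline_value() {
    value=$(printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1)
    [ -n "$value" ] && printf '%s\n' "$value"
}

# Drop the leading isolcpus flags (nohz, domain, managed_irq) and keep the CPU list.
strip_isolcpus_flags() {
    printf '%s\n' "$1" | sed -E 's/^((nohz|domain|managed_irq),)+//'
}

# Succeeds only when all three parameters are present and list the same cores.
isolation_consistent() {
    iso=$(cmdline_value isolcpus "$1") || return 1
    nohz=$(cmdline_value nohz_full "$1") || return 1
    rcu=$(cmdline_value rcu_nocbs "$1") || return 1
    iso=$(strip_isolcpus_flags "$iso")
    [ "$iso" = "$nohz" ] && [ "$nohz" = "$rcu" ]
}
```

A typical call would be `isolation_consistent "$(cat /proc/cmdline)" || echo "isolation parameters disagree" >&2` in a boot-time health check.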

After boot, place threads and interrupts.

#!/usr/bin/env bash
set -euo pipefail

# 1) irqbalance is the enemy of isolation: disable it
systemctl stop irqbalance || true
systemctl disable irqbalance || true

# 2) default all IRQ affinity to housekeeping cores (0-3)
for irq in /proc/irq/*/smp_affinity_list; do
  echo "0-3" > "$irq" 2>/dev/null || true
done

# 3) pin only the trading NIC RX queue IRQs to dedicated cores (4,5)
NIC="ens1f0"
i=0
for irq in $(grep "$NIC" /proc/interrupts | awk -F: '{print $1}'); do
  core=$((4 + i % 2))
  echo "$core" > "/proc/irq/$irq/smp_affinity_list"
  i=$((i + 1))
done

# 4) pin worker threads to isolated cores (pthread_setaffinity_np inside
#    the app is the proper way; from outside, taskset)
taskset -c 6 ./market_data_handler &
taskset -c 7 ./order_gateway &

# 5) kernel workqueues to housekeeping cores too
echo 0f > /sys/devices/virtual/workqueue/cpumask
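The workqueue cpumask takes a hex bitmask rather than a core list, which is why housekeeping cores 0-3 appear as `0f`. As a sketch of how that value is derived, the hypothetical helper `cpulist_to_mask` below converts a core list into a mask. It relies on the shell's 64-bit arithmetic, so it is only meant for CPU numbers below 63; larger machines need the comma-separated 32-bit word format instead.

```shell
# Convert a CPU list such as "0-3" or "0-3,8" into a hex bitmask.
# Sketch only: valid for CPU numbers below 63.
cpulist_to_mask() {
    mask=0
    old_ifs=$IFS
    IFS=,
    for range in $1; do
        first=${range%-*}   # "4-23" -> 4, "8" -> 8
        last=${range#*-}    # "4-23" -> 23, "8" -> 8
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            mask=$((mask | (1 << cpu)))
            cpu=$((cpu + 1))
        done
    done
    IFS=$old_ifs
    printf '%x\n' "$mask"
}
```

For example, `cpulist_to_mask 0-3` prints `f` (the kernel accepts it with or without the leading zero), and `cpulist_to_mask 4-23` prints `fffff0`, the mask of the isolated cores in this article's layout.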

Verify with perf and the /proc interfaces.

# Confirm ticks really stopped on isolated cores
watch -n1 'grep -E "LOC|RES" /proc/interrupts | awk "{print \$1, \$8, \$9, \$10}"'

# Monitor context switches on a given core (should be zero)
perf stat -C 6 -e context-switches,cpu-migrations -- sleep 10
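The `watch` command above hard-codes awk columns 8 to 10, which correspond to CPU6 through CPU8. A small sketch that takes the CPU number as a parameter and compares two snapshots of /proc/interrupts is shown below. The helper names `loc_count` and `loc_delta` are hypothetical, and the snapshot-file approach is an assumption chosen so the logic can be exercised offline.

```shell
# Print the local-timer (LOC) interrupt count for one CPU.
# $1 = CPU number, $2 = /proc/interrupts-style file (default: /proc/interrupts)
loc_count() {
    # Column 1 is the "LOC:" label, so CPU n lives in column n + 2.
    awk -v col="$(( $1 + 2 ))" '$1 == "LOC:" { print $col }' "${2:-/proc/interrupts}"
}

# Difference in LOC count between two snapshots for one CPU.
# $1 = CPU number, $2 = earlier snapshot, $3 = later snapshot
loc_delta() {
    before=$(loc_count "$1" "$2")
    after=$(loc_count "$1" "$3")
    echo $((after - before))
}
```

In practice one would run `cat /proc/interrupts > a; sleep 10; cat /proc/interrupts > b; loc_delta 6 a b`. On a correctly isolated core running a single busy thread, the delta should be close to zero rather than thousands.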

NUMA Alignment — Locality of Memory and the NIC

Ignore NUMA on a multi-socket server and you pay a tens-of-nanoseconds tax on every memory access. More important still: **which NUMA node is the NIC attached to?** PCIe slots are wired to a specific socket, so the rule is to use cores and memory on the same node as the NIC.

# Find the NIC NUMA node
cat /sys/class/net/ens1f0/device/numa_node
# Output: 1 --> this NIC hangs off node 1

# List node 1 cores
lscpu | grep "NUMA node1"

# Launch the hot-path process bound to node 1 cores and memory
numactl --cpunodebind=1 --membind=1 ./trading_engine

# Verify there is no cross-node traffic
numastat -p $(pgrep trading_engine)

At the application design level, the canonical pattern is to keep the entire chain — market data thread, strategy thread, order send thread — on the same node, and handle inter-thread communication via lock-free ring buffers between core pairs sharing the same L3 cache.
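Since the NIC's node can change silently after a hardware change, a startup guard that refuses to launch on a mismatch is cheap insurance. The sketch below is one possible shape for it. The function names and the optional sysfs-root argument are assumptions made for illustration (the override exists only so the check can run against a fake directory tree).

```shell
# Print the NUMA node a NIC is attached to.
# $1 = interface name, $2 = sysfs root (default /sys)
nic_numa_node() {
    cat "${2:-/sys}/class/net/$1/device/numa_node"
}

# Return non-zero, with a message on stderr, if the NIC is not on the expected node.
# $1 = interface name, $2 = expected node, $3 = optional sysfs root
assert_nic_node() {
    actual=$(nic_numa_node "$1" "${3:-/sys}") || return 2
    if [ "$actual" != "$2" ]; then
        echo "NUMA mismatch: $1 is on node $actual, expected $2" >&2
        return 1
    fi
}
```

A launcher could then run `assert_nic_node ens1f0 1 && numactl --cpunodebind=1 --membind=1 ./trading_engine`. Note that the kernel reports `-1` when it has no node information for the device, which this check treats as a mismatch.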

Interrupts vs Polling — Do Not Wait to Be Woken

Interrupts are the efficient "wake me when something happens" model, but the wakeup cost (IRQ handling plus scheduler wakeup plus cache warming) runs in microseconds. On the low-latency path, you burn a core and poll.

Busy polling is the compromise that keeps the standard stack.

# Socket-layer busy poll (in microseconds)
sysctl -w net.core.busy_poll=50
sysctl -w net.core.busy_read=50

# For epoll-based apps, per-NAPI busy poll is also possible:
# SO_PREFER_BUSY_POLL and SO_BUSY_POLL_BUDGET arrived in kernel 5.11,
# the EPIOCSPARAMS epoll ioctl in kernel 6.9

# Balance NAPI polling vs interrupt coalescing: for low latency, turn coalescing off
ethtool -C ens1f0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0

Note that disabling interrupt coalescing makes CPU usage soar at high PPS. Disable it only for traffic where "every single packet matters immediately" like market data, and keep coalescing for bulk-traffic NICs. With DPDK or Onload this trade-off disappears entirely — userspace poll mode is the default.
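Coalescing settings are easy to lose on a driver reload or link flap, so it can be worth verifying them rather than only setting them. The following sketch parses the text that `ethtool -c <nic>` prints. The function name `coalescing_off` is hypothetical, and the parser assumes the usual `Adaptive RX: off  TX: off` and `rx-usecs: 0` line shapes; some drivers clamp or reject a value of 0, in which case the check will report failure.

```shell
# Read `ethtool -c <nic>` output on stdin.
# Succeeds only if adaptive RX is off and rx-usecs is 0.
coalescing_off() {
    awk '
        $1 == "Adaptive"  { adaptive_rx = $3 }   # "Adaptive RX: off  TX: off"
        $1 == "rx-usecs:" { rx_usecs = $2 }      # "rx-usecs: 0"
        END { exit !(adaptive_rx == "off" && rx_usecs == "0") }
    '
}
```

Usage would look like `ethtool -c ens1f0 | coalescing_off || echo "coalescing is active on ens1f0" >&2`.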

Eliminating Idle States — A Sleeping Core Is a Slow Core

Modern CPUs sleep into C-states when idle and shift P-states (frequency) under load. Both are enemies of low latency. Waking from deep C-states (C6) can take tens to hundreds of microseconds, and frequency transitions create hiccups of tens of microseconds.

# C-states: if pinned via boot parameters, just verify
cat /sys/devices/system/cpu/cpu6/cpuidle/state*/disable

# To control at runtime, write 0 to /dev/cpu_dma_latency and keep the file open
# (applies only while the opening process lives; tuned latency-performance does this)
tuned-adm profile latency-performance

# P-states: pin the governor to performance
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$g"
done

# Turbo boost: for determinism, disabling is the consistent choice
# (enabled, the mean improves but frequency wobbles with thermal conditions)
# With the intel_pstate driver active:
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# With intel_pstate=disable, as in the boot line earlier, acpi-cpufreq is in charge
# and the path above does not exist; use this instead:
# echo 0 > /sys/devices/system/cpu/cpufreq/boost

Two easily missed items. First, **SMIs (System Management Interrupts)** are firmware-level stalls invisible to the OS and the usual culprit behind multi-hundred-microsecond spikes. Disable USB legacy support and power-management SMI sources in the BIOS, and watch the SMI counter with turbostat. Second, **hyper-threading**: on hot-path cores, disable it or leave the sibling thread empty — the sibling pollutes caches and execution ports.
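Reading the `disable` files by eye gets tedious across many cores, so a small audit helper can list only the states that would actually hurt. The sketch below walks one CPU's cpuidle directory and reports states that are still enabled and whose advertised exit latency exceeds a limit. The function name `enabled_deep_states` is hypothetical, and the directory may be absent entirely when idle states are disabled at boot, in which case the function prints nothing.

```shell
# List idle states that are still enabled and slower to exit than a limit.
# $1 = cpuidle directory (e.g. /sys/devices/system/cpu/cpu6/cpuidle)
# $2 = exit-latency limit in microseconds
enabled_deep_states() {
    for state in "$1"/state*; do
        [ -r "$state/latency" ] || continue      # skip if the glob matched nothing
        latency=$(cat "$state/latency")
        disabled=$(cat "$state/disable")
        if [ "$disabled" = "0" ] && [ "$latency" -gt "$2" ]; then
            echo "$(cat "$state/name") ${latency}us"
        fi
    done
}
```

For instance, `enabled_deep_states /sys/devices/system/cpu/cpu6/cpuidle 5` should print nothing on a correctly tuned hot-path core.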

Memory — Exterminate Page Faults and TLB Misses

A single page fault on the hot path is a microsecond-class disaster. Touch and lock all memory at startup.

# Reserve huge pages (2MB): reduces TLB misses
echo 4096 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# 1GB huge pages via boot parameters
default_hugepagesz=1G hugepagesz=1G hugepages=16

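/* The canonical application startup sequence */
#include <stdio.h>
#include <sys/mman.h>

int run_engine(void);  /* hot-path entry point, defined elsewhere in the application */

int main(void) {
    /* 1) pin all current and future pages in RAM; fails if RLIMIT_MEMLOCK is too low */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }
    /* 2) preallocate heap/pools and touch every page (prefault) */
    /* 3) after entering the hot path: no malloc/free, no syscalls, no page faults */
    return run_engine();
}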

THP (Transparent Huge Pages) is contentious. It helps from a TLB perspective, but the moment khugepaged compacts pages in the background you can take a millisecond-class stall. The low-latency consensus: **set THP to never and use explicit huge pages.** If determinism is the goal, most "the kernel handles it for you" features should be turned off.

echo never > /sys/kernel/mm/transparent_hugepage/enabled

echo never > /sys/kernel/mm/transparent_hugepage/defrag

sysctl -w vm.swappiness=0

sysctl -w vm.stat_interval=120 # slow vmstat updates to cut jitter
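Two small checks follow naturally from the settings above: confirming which THP mode is actually active, and working out how many huge pages a given pool size needs. This is a sketch with hypothetical helper names (`thp_mode`, `hugepages_needed`). The first relies on the kernel marking the active mode in square brackets, as in `always madvise [never]`.

```shell
# Print the active THP mode (the bracketed word in the sysfs file).
# $1 = file to read (default: the live sysfs path)
thp_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Number of huge pages needed to back a pool, rounded up.
# $1 = pool size in MiB, $2 = huge page size in KiB (2048 or 1048576)
hugepages_needed() {
    echo $(( ($1 * 1024 + $2 - 1) / $2 ))
}
```

With these, `hugepages_needed 8192 2048` gives 4096, which is the 8 GiB reservation written to nr_hugepages earlier, and `hugepages_needed 16384 1048576` gives 16, matching `hugepages=16` with 1 GiB pages.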

Network Stack Tuning Parameters

The baseline set for paths that keep the standard stack (e.g. TCP to the order gateway).

# Buffers: for low latency the goal is "right-sized," not "big"; oversized buffers hide latency
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# TCP: disable Nagle in the app with TCP_NODELAY; that is the proper place
sysctl -w net.ipv4.tcp_low_latency=1   # meaningful only on older kernels (a no-op since 4.14)
sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_slow_start_after_idle=0

# Queueing discipline: minimize transmit latency
sysctl -w net.core.default_qdisc=fq
tc qdisc replace dev ens1f0 root mq

# UDP market data: drop monitoring is life or death
watch -n1 'cat /proc/net/udp | awk "{print \$13}" | sort | uniq -c'
ethtool -S ens1f0 | grep -iE "drop|miss|fifo"

# RSS: restrict hashed flows to RX queues 0 and 1 (of four), each served by its own core.
# Steering a specific multicast group to a specific queue additionally needs an ntuple rule (ethtool -N).
ethtool -X ens1f0 weight 1 1 0 0
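The `watch` one-liner above shows a histogram of the drops column, which is good for eyeballing but awkward for alerting. A sketch of the same idea as functions is shown below: one sums the column, the other names the offending sockets. The function names are hypothetical, and both assume the standard /proc/net/udp layout in which `drops` is the thirteenth whitespace-separated field of each socket row.

```shell
# Sum the drops column across all sockets.
# $1 = /proc/net/udp-style file (default: /proc/net/udp)
udp_drops_total() {
    awk 'NR > 1 { total += $13 } END { print total + 0 }' "${1:-/proc/net/udp}"
}

# Print "local_address drops" for every socket that has dropped datagrams.
udp_drops_by_socket() {
    awk 'NR > 1 && $13 > 0 { print $2, $13 }' "${1:-/proc/net/udp}"
}
```

The local address is printed in the kernel's hex form (for example `00000000:7531` is port 30001 on any address), so a monitoring job can alert whenever `udp_drops_total` increases between two polls.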

Time Synchronization — PTP and Hardware Timestamps

As important as reducing latency is **measuring it correctly**, and the prerequisite is the clock. On the regulatory side too, MiFID II RTS 25 in Europe requires HFT operators to keep clocks within 100 microseconds of UTC and to record precise timestamps. NTP (millisecond class) is insufficient; PTP (IEEE 1588, sub-microsecond) is the standard.

# Check NIC hardware timestamp support
ethtool -T ens1f0
# look for "hardware-transmit / hardware-receive / PTP Hardware Clock: 0"

# linuxptp: sync the NIC PHC to the grandmaster
ptp4l -i ens1f0 -f /etc/ptp4l.conf --summary_interval 6 &

# PHC --> system clock
# (-O 0 assumes the PHC already carries UTC; PTP normally distributes TAI, in which case
#  replace -O 0 with -w so phc2sys waits for ptp4l and applies its UTC offset)
phc2sys -s ens1f0 -O 0 -u 64 &

# Monitor offset (target: rms in the tens of ns)
pmc -u -b 0 'GET CURRENT_DATA_SET'

On the application side, use the SO_TIMESTAMPING socket option to receive NIC hardware RX timestamps and measure per-segment latency relative to "wire arrival time." Software timestamps have the contradiction that the thing being measured (the kernel path) is part of the measuring tool — serious measurement uses hardware timestamps or optical taps plus capture appliances.
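With `--summary_interval` set, ptp4l periodically logs a summary line containing an `rms` offset in nanoseconds, and that line is a convenient hook for alerting. The sketch below extracts the worst rms value from a stretch of log and compares it with a threshold. The function names are hypothetical, and the parser assumes summary lines of the form `ptp4l[...]: rms 23 max 41 freq ...`, so it should be checked against the log format of the linuxptp version in use.

```shell
# Read ptp4l log lines on stdin and print the largest "rms" value seen (ns).
ptp_worst_rms() {
    awk '
        {
            for (i = 1; i < NF; i++)
                if ($i == "rms" && $(i + 1) + 0 > worst)
                    worst = $(i + 1) + 0
        }
        END { print worst + 0 }
    '
}

# Succeeds if the worst rms on stdin is at or below the threshold in $1 (ns).
ptp_rms_within() {
    [ "$(ptp_worst_rms)" -le "$1" ]
}
```

For example, `journalctl -u ptp4l --since "-5min" | ptp_rms_within 100` would fail when any summary in the last five minutes exceeded 100 ns, assuming ptp4l runs as a systemd unit of that name.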

Measurement Methodology — Averages Lie

The report card of a low-latency system is not the mean but the percentiles — especially p99.9 and p99.99.

A tick-to-trade distribution (fictional example)

p50    :   3.2 us   <-- the number that goes in marketing slides
p99    :   5.8 us
p99.9  :  14.0 us   <-- interrupt/timer interference
p99.99 : 180.0 us   <-- C-state wakeups, SMIs, page faults
max    :   2.1 ms   <-- if this lands at the worst moment, it is an incident

Looking only at the mean (3.5us) without a histogram, this system is "great."
But if 1 in 10,000 events takes 180us precisely when the market is most
violent (= when load is highest), that is exactly the moment you lose money.
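Percentile rows like the ones above can be produced from raw samples with very little machinery. The sketch below uses the nearest-rank definition (the smallest sample such that at least p percent of samples are less than or equal to it). The function name `percentile` is hypothetical, and it is meant for offline analysis of a capture file, not for the hot path.

```shell
# Nearest-rank percentile of numeric samples (one per line) on stdin.
# $1 = percentile, e.g. 50, 99, 99.9, 99.99
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = p * NR / 100
            idx = int(rank)
            if (rank - idx > 1e-9) idx++   # ceiling, tolerant of float noise
            if (idx < 1) idx = 1
            print v[idx]
        }
    '
}
```

Given a file with one latency per line, `percentile 99.99 < latencies_ns.txt` prints the p99.99 value. Note that with fewer than 10,000 samples the p99.99 is simply the maximum, which is one more reason to measure for long enough.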

The tool chain:

# 1) Platform jitter itself: cyclictest (rt-tests package)
#    20 minutes on an isolated core, histogram output
cyclictest -m -p95 -t1 -a 6 -i 100 -h 400 -D 20m -q

# 2) Kernel path tracing: softirq duration distribution with bpftrace
bpftrace -e 'tracepoint:irq:softirq_entry { @ts[cpu] = nsecs; }
  tracepoint:irq:softirq_exit /@ts[cpu]/ {
    @dist = hist(nsecs - @ts[cpu]); delete(@ts[cpu]); }'

# 3) Watch for scheduler interference: events that must not happen on isolated cores
perf record -C 6 -e sched:sched_switch,sched:sched_wakeup -- sleep 60
perf script | head

# 4) Correlate with hardware counters for cache/TLB misses
perf stat -C 6 -e cycles,instructions,cache-misses,dTLB-load-misses -- sleep 10

At the application level, record everything continuously with loss-free histograms in the HdrHistogram family, and correct for the "coordinated omission" trap — the phenomenon where samples missing from the periods when the measurement loop itself was stalled make the distribution look better than reality. Always measure at production-equivalent message rates, reproducing burst conditions like the market open.
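To make the coordinated-omission correction concrete: when a sample took longer than the interval at which events were supposed to arrive, the events that would have arrived during that stall were never measured, so they are back-filled with the latencies they would have seen. The sketch below follows the same idea as HdrHistogram's expected-interval recording, written here as a post-processing filter. The function name `correct_coordinated_omission` is hypothetical, and the samples and the interval are assumed to share one unit.

```shell
# Back-fill samples hidden by stalls in the measurement loop.
# stdin = one latency per line, $1 = expected interval between events (same unit)
correct_coordinated_omission() {
    awk -v interval="$1" '
        BEGIN { if (interval <= 0) exit 1 }   # guard against an endless loop
        {
            print $1
            # Each event that should have arrived during the stall would have
            # waited one interval less than the one before it.
            for (missing = $1 - interval; missing >= interval; missing -= interval)
                print missing
        }
    '
}
```

With an expected interval of 10 us, a single 35 us sample expands to 35, 25, and 15, while a 5 us sample passes through unchanged. Feeding the corrected stream into a percentile calculation then shows the tail the unmeasured events would have experienced.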

Application-Side Considerations — The Kernel Is Not the Whole Story

What the kernel provides ends at "an undisturbed core." The code above it must obey the same discipline.

- **Java family**: avoid GC altogether via zero-allocation design on the hot path. Object pools, primitive-based data structures, off-heap ring buffers (Aeron, Chronicle-style patterns) are standard; some shops permit GC only after market close. Low-pause collectors like ZGC target sub-millisecond pauses, but that is a design goal and not a hard guarantee, and sub-millisecond is still not zero.

- **C++ family**: forbid dynamic allocation, locks, syscalls, and exceptions on the hot path. Spin instead of blocking, reduce branches, align hot data to cache lines, and eliminate false sharing with padding.

- **Common**: logging is asynchronous, binary, and on a separate core. One printf on the hot path can nullify all your kernel tuning.

Why This Is Overkill for Ordinary Services — Settling the Trade-Offs

Why you should not apply these recipes to a general service, itemized.

| Technique | Gain in trading | Cost in a general service |
| --- | --- | --- |
| Core isolation + polling | Microsecond tail reduction | Cores pinned at 100%: wasted power and cost, lower density |
| C-states disabled | No wakeup latency | Idle power multiplies, heat, less turbo headroom |
| mitigations=off | Hundreds of ns off syscalls | Side-channel exposure: taboo in multi-tenant settings |
| Coalescing off | Immediate per-packet handling | CPU explosion at high PPS, throughput collapse |
| THP off + manual huge pages | No millisecond stalls | Ops complexity; THP often wins for general workloads |
| Kernel bypass | Sub-microsecond receive | Kernel security/observability tooling disabled, dedicated staffing |

In short, low-latency tuning is **a trade where you pay throughput, power efficiency, security, and operability to buy determinism**. If your API server is fine with 50ms at p99, this trade is a plain loss. If microseconds are your profit and loss, every cost above is justified. The most important tuning decision is honestly judging which side your system is on.

Checklist

**Hardware/firmware**

- BIOS reviewed for C-state limits, hyper-threading policy, SMI sources?

- NIC supports PTP hardware timestamps and sits in a PCIe slot on the right NUMA node?

- SMI counter verified near zero with turbostat?

**Kernel boot/isolation**

- isolcpus, nohz_full, rcu_nocbs configured with the identical core set?

- irqbalance disabled and IRQ affinity placed explicitly?

- Zero context switches confirmed with perf on isolated cores?

**Memory/power**

- mlockall and prefaulting included in the startup sequence?

- THP set to never with explicit huge pages reserved?

- Governor performance, C-states pinned, turbo policy as intended?

**Network/time**

- Coalescing, drop counters, RSS mapping checked on the market data NIC?

- ptp4l/phc2sys offsets held within target (tens of ns)?

- Hardware-timestamp-based per-segment measurement working?

**Measurement/operations**

- p99.99 and max recorded continuously as histograms?

- A cyclictest-based platform jitter baseline documented?

- A before/after distribution comparison procedure for every tuning change (one change at a time)?

Pitfalls and Anti-Patterns

- **Changing ten things at once.** You will never know which change mattered. Keep the loop: baseline, single change, re-measure.

- **Shipping on the mean.** Tails appear under load — when the market is violent. Benchmarks during quiet hours are nearly meaningless.

- **Isolated, but kernel threads remain.** kworker, ksoftirqd, and timers can linger on isolated cores. Verify workqueue cpumask and nohz_full behavior empirically.

- **Upgrades that forget NUMA.** After a server swap or NIC reseating, numa_node silently changes and everything quietly slows. Assert it at startup.

- **Copy-pasting mitigations=off carelessly.** On internet-exposed or multi-tenant systems this is a shortcut to a security incident.

- **Measuring without a clock.** Computing "latency" from timestamps of two unsynchronized hosts yields comedy like negative latency. PTP comes first.

Closing

The technical content of low-latency kernel tuning collapses into one sentence: **remove everything unpredictable from the hot path.** The scheduler's good intentions, power management's frugality, the kernel's conveniences — on the hot path they are all sources of jitter. This is tuning by subtraction, not addition, and the space left behind is filled with measurement.

The common trait of teams that win the war against microseconds is not a secret trick but the habit of looking at percentile histograms daily and the discipline of validating every change with instrumentation. Tuning without measurement is superstition; tuning with measurement is engineering.

References

- [Linux Kernel Documentation — NO_HZ: Reducing Scheduling-Clock Ticks](https://docs.kernel.org/timers/no_hz.html)

- [Linux Kernel Documentation — Reducing OS jitter due to per-cpu kthreads](https://docs.kernel.org/admin-guide/kernel-per-CPU-kthreads.html)

- [Linux Kernel Documentation — The kernel command-line parameters](https://docs.kernel.org/admin-guide/kernel-parameters.html)

- [Linux Kernel Documentation — Timestamping](https://docs.kernel.org/networking/timestamping.html)

- [Linux Kernel Documentation — NAPI and Busy Polling](https://docs.kernel.org/networking/napi.html)

- [DPDK Documentation](https://doc.dpdk.org/guides/)

- [AF_XDP — Linux Kernel Documentation](https://docs.kernel.org/networking/af_xdp.html)

- [ebpf.io — eBPF introduction and ecosystem](https://ebpf.io/)

- [The linuxptp project](https://linuxptp.sourceforge.net/)

- [rt-tests (cyclictest) — Linux Foundation Wiki](https://wiki.linuxfoundation.org/realtime/documentation/howto/tools/cyclictest/start)

- [perf — Linux profiling with performance counters](https://perf.wiki.kernel.org/)

- [ESMA — MiFID II regulatory technical standards (including RTS 25 on clock synchronization)](https://www.esma.europa.eu/)

- [Red Hat Enterprise Linux for Real Time — Low Latency Tuning Guides](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/)
