Introduction
When p99 latency on a web server suddenly spikes, or packet drop counters start climbing in your monitoring, many engineers reach for the application code first. In reality, before a packet ever reaches the application, it has already passed through dozens of stages inside the kernel: NIC hardware, DMA, interrupts, softirqs, the protocol stack, socket queues. If any one of these stages is a bottleneck, it does not matter how fast your application is.
In this post we follow the entire journey of a single packet, from the moment it arrives at the network card until it reaches the read call of an application. We look at what happens at each stage, where packets can get dropped, and how to observe and tune each point from a practitioner's perspective. If you tune Kubernetes nodes, operate low-latency services, or analyze network incidents, this should serve as a practical map.
The Big Picture — A Large Diagram of the Receive Path
Let us first see the whole path at a glance. The diagram below shows the key stages of the packet receive (RX) path.
[Physical network]
|
v
+---------------------------+
| NIC hardware | 1. Receive frame, verify FCS
| (RSS hash selects RX queue)| 2. RSS: 4-tuple hash -> queue
+---------------------------+
|
v DMA (no CPU involvement)
+---------------------------+
| RX ring buffer (descrs) | 3. Packet data written directly
| check size: ethtool -g | into pre-allocated kernel memory
+---------------------------+
|
v Hardware interrupt (IRQ)
+---------------------------+
| Hard IRQ handler (short) | 4. Only marks "packet arrived"
| calls napi_schedule() | and returns (microseconds)
+---------------------------+
|
v NET_RX_SOFTIRQ
+---------------------------+
| softirq / ksoftirqd | 5. NAPI polling starts;
| net_rx_action() | processes up to budget packets
+---------------------------+
|
v
+---------------------------+
| NAPI poll (driver) | 6. Pulls packets off the ring,
| builds sk_buff, GRO | converts to sk_buff, GRO merge
+---------------------------+
|
v
+---------------------------+
| netif_receive_skb() | 7. Protocol demux point;
| (tcpdump/AF_PACKET tap, | XDP generic and tc ingress
| tc ingress hook here) | hooks live around here
+---------------------------+
|
v
+---------------------------+
| IP layer (ip_rcv) | 8. Checksum, routing decision,
| netfilter PREROUTING/ | netfilter hooks; if local,
| INPUT hooks | deliver upward
+---------------------------+
|
v
+---------------------------+
| TCP layer (tcp_v4_rcv) | 9. Socket lookup, sequencing,
| congestion control, ACK | reassembly, ACK generation
+---------------------------+
|
v
+---------------------------+
| Socket receive queue | 10. Queued on sk_receive_queue
| (bounded by sk_rcvbuf) | and waiting process woken up
+---------------------------+
|
v
+---------------------------+
| Application read/recv | 11. Data copied to user space
| (epoll event fires) | via the system call
+---------------------------+
Among these 11 stages, there are four major points where packets can be dropped: when the ring buffer fills up (stages 2-3), when softirq processing falls behind (stage 5), at netfilter rules (stage 8), and when the socket buffer is full (stage 10). We will cover the diagnosis of each below.
sk_buff — The Kernel Avatar of a Packet
Inside the kernel, a packet is represented by a structure called sk_buff (socket buffer, abbreviated skb). It carries both the packet data itself and the metadata needed to interpret that data.
/* Conceptual structure excerpted from include/linux/skbuff.h */
struct sk_buff {
    struct sk_buff *next;          /* queue linkage */
    struct sock *sk;               /* owning socket */
    struct net_device *dev;        /* RX/TX device */
    unsigned int len;              /* data length */
    __u16 transport_header;        /* L4 header offset */
    __u16 network_header;          /* L3 header offset */
    __u16 mac_header;              /* L2 header offset */
    sk_buff_data_t tail;           /* end of the valid data */
    sk_buff_data_t end;            /* end of the allocated buffer */
    unsigned char *head;           /* start of buffer */
    unsigned char *data;           /* current layer data start */
};
There are two core ideas.
First, the packet data lives in memory exactly once, and each protocol layer merely moves the data pointer (and records header offsets) while head and end stay fixed at the buffer boundaries. When the Ethernet layer finishes, it advances the data pointer by 14 bytes, the length of an untagged Ethernet header, so it points at the IP header. Never copying data as the packet moves between layers is fundamental to performance.
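Second, an skb can be cloned. A clone duplicates only the skb metadata and shares the packet data through a reference count, so a tap such as tcpdump does not need a private copy of every payload up front; bytes are copied only when they are delivered to the capture socket, and only up to the snapshot length. That is why capture overhead is smaller than people expect (though it is not free).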
From Interrupts to NAPI — A History of Interrupt Mitigation
What would happen if every arriving packet raised an interrupt? On 10GbE, roughly 14.88 million 64-byte packets can arrive per second. With one interrupt per packet, the CPU would do nothing but interrupt handling. This is interrupt livelock.
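The 14.88 million figure follows from Ethernet framing. Here is a minimal sketch of the arithmetic, assuming the standard 20 bytes of wire overhead per frame (preamble, start-of-frame delimiter, inter-frame gap); the function name is only illustrative.

```python
def max_packets_per_second(link_bps: float, frame_bytes: int = 64) -> float:
    """Theoretical maximum frame rate on an Ethernet link."""
    # Assumption: 20 bytes of wire overhead per frame
    # (7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap).
    wire_overhead_bytes = 20
    bits_per_frame = (frame_bytes + wire_overhead_bytes) * 8
    return link_bps / bits_per_frame


pps = max_packets_per_second(10e9)   # 10GbE, minimum-size frames
ns_per_packet = 1e9 / pps            # time budget per packet at line rate
print(f"{pps / 1e6:.2f} Mpps, {ns_per_packet:.1f} ns per packet")
```

At that rate a single core has roughly 67 ns per packet, which is less than an interrupt entry and exit typically costs.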
NAPI (New API) solves this with a hybrid of interrupts and polling.
Low traffic: interrupt mode (minimize latency)
packet arrives -> IRQ -> napi_schedule -> poll -> queue empty
-> re-enable IRQ
High traffic: polling mode (maximize throughput)
IRQ stays disabled -> softirq keeps polling the ring buffer
within the budget -> IRQ is never re-enabled while packets
keep coming
The operating rules are as follows.
1. When the first packet arrives, a hard IRQ fires; the driver disables that queue's IRQ and schedules NAPI polling.
2. In softirq context, net_rx_action runs and processes at most budget packets per round (300 globally by default, 64 per device).
3. If the ring is fully drained, the IRQ is re-enabled and we return to interrupt mode. If packets remain, the IRQ stays off and processing continues in the next softirq round.
4. If softirq monopolizes the CPU for too long, work is handed to the ksoftirqd kernel thread, which runs under normal scheduler fairness.
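The rules above can be sketched as a toy model of one softirq round. This is a simplification for intuition, not kernel code: the function and ring names are made up, the time limit is not modeled, and the only inputs are the global budget (300) and per-device weight (64) quoted in the list.

```python
from collections import deque


def net_rx_action(pending, budget=300, weight=64):
    """Toy model of one softirq round.

    pending maps a (hypothetical) RX ring name to the number of packets
    waiting in it; the dict is updated in place.
    Returns (packets_processed, squeezed).
    """
    poll_list = deque(dev for dev, count in pending.items() if count > 0)
    processed = 0
    squeezed = False
    while poll_list:
        dev = poll_list.popleft()
        done = min(weight, pending[dev])   # one NAPI poll, capped by weight
        pending[dev] -= done
        processed += done
        if pending[dev] > 0:
            poll_list.append(dev)          # still busy: IRQ stays off, poll again
        # else: ring drained, the driver would re-enable the IRQ here
        if processed >= budget:
            squeezed = True                # counted as time_squeeze
            break
    return processed, squeezed


rings = {"eth0-rx0": 1000, "eth0-rx1": 10}
print(net_rx_action(rings), rings)
```

A ring that stays busy keeps being polled round-robin with the others until the budget runs out; whatever is left waits for the next round, which is what the time_squeeze counter records.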
The relevant parameters:
# Upper bound of total packets processed in one net_rx_action
sysctl net.core.netdev_budget # default 300
# Time limit for one net_rx_action (microseconds)
sysctl net.core.netdev_budget_usecs # default 2000 (may differ with the kernel tick rate)
# How often processing stopped due to exhausted budget or time
# (3rd column: time_squeeze)
cat /proc/net/softnet_stat
If the third column (time_squeeze) of softnet_stat keeps growing, softirq cannot finish its work within the budget; consider raising the budget or spreading load across CPUs (RPS).
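The columns of softnet_stat are hexadecimal, one row per CPU, which makes them awkward to read by eye. A small sketch of a parser follows; the sample text is made up for illustration, and only the first three columns described in this post are decoded.

```python
def parse_softnet_stat(text: str) -> list:
    """Decode the first three hex columns of /proc/net/softnet_stat."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        rows.append({
            "row": len(rows),                     # usually the CPU index
            "processed": int(fields[0], 16),
            "dropped": int(fields[1], 16),
            "time_squeeze": int(fields[2], 16),
        })
    return rows


# Illustrative sample, not real measurements.
SAMPLE = """\
0000a3f1 00000000 00000002 00000000 00000000
0001b200 00000010 0000001f 00000000 00000000
"""

for row in parse_softnet_stat(SAMPLE):
    print(row)
```

On a live system you would pass the contents of /proc/net/softnet_stat instead of SAMPLE and compare two readings taken a few seconds apart, since the counters are cumulative.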
There is also hardware-level interrupt mitigation: the NIC can accumulate several packets, or wait a fixed time, and raise a single interrupt.
# Inspect/change interrupt coalescing
ethtool -c eth0
ethtool -C eth0 rx-usecs 50 rx-frames 64
# For low latency, lower rx-usecs; for throughput, raise it.
# With adaptive-rx on, the NIC auto-adjusts to traffic patterns.
ethtool -C eth0 adaptive-rx on
Multi-core Scaling — RSS, RPS, RFS
A single core cannot keep up with a fast NIC. Three mechanisms spread packet processing across cores.
| Technique | Where it runs | Distribution key | Notes |
| --- | --- | --- | --- |
| RSS | NIC hardware | 4-tuple hash (IPs and ports) selects RX queue | Most efficient; queue count limited by hardware |
| RPS | Kernel software | Hash selects processing CPU | Backfill when NIC lacks RSS or queues |
| RFS | Kernel software | Steers to the CPU running the app | Cache locality optimization; extends RPS |
RSS is a hardware feature where the NIC hashes source/destination IP and port and steers each packet to one of several RX queues. Each queue has its own IRQ, so distributing IRQs across cores naturally yields multi-core processing.
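To make the steering concrete, here is a toy model of RSS queue selection. Real NICs compute a Toeplitz hash with a secret key; this sketch swaps in CRC32 purely as a stand-in, and the table size of 128 and the queue count of 8 are assumptions for illustration.

```python
import ipaddress
import struct
import zlib

NUM_QUEUES = 8
# Indirection table: hash bucket -> RX queue.
# Assumption: 128 entries filled round-robin.
INDIRECTION_TABLE = [i % NUM_QUEUES for i in range(128)]


def select_rx_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Map a flow's 4-tuple to an RX queue (toy model)."""
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )
    flow_hash = zlib.crc32(key)  # stand-in for the NIC's Toeplitz hash
    return INDIRECTION_TABLE[flow_hash % len(INDIRECTION_TABLE)]


queue = select_rx_queue("10.0.0.1", "10.0.0.2", 40000, 443)
print(f"flow steered to rx-{queue}")
```

The property that matters is that every packet of a flow hashes to the same queue, so a flow is never reordered across cores, while different flows spread out over the table.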
# Inspect and change the RX queue count
ethtool -l eth0
ethtool -L eth0 combined 8
# IRQs per queue
grep eth0 /proc/interrupts
# Pin an IRQ to a CPU (with irqbalance disabled, manual control)
echo 2 > /proc/irq/125/smp_affinity_list # IRQ 125 to CPU2
RPS is the software version of RSS. The CPU that took the hard IRQ computes a hash and hands the packet off to the backlog queue of another CPU.
# Spread packets of eth0 rx-0 across CPUs 0-3 (bitmask f = 0b1111)
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
# Size of the RPS backlog queue (prevents drops)
sysctl -w net.core.netdev_max_backlog=16384
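Hand-computing the rps_cpus bitmask gets error-prone beyond a few CPUs. A small helper, sketched under the assumption that the mask is written as hexadecimal with comma-separated 32-bit groups for systems with more than 32 CPUs; the function name is illustrative.

```python
def cpus_to_mask(cpus) -> str:
    """Convert an iterable of CPU numbers into a hex CPU bitmask string."""
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    # Split into 32-bit groups, least significant first.
    groups = []
    while True:
        groups.append(bits & 0xFFFFFFFF)
        bits >>= 32
        if bits == 0:
            break
    groups.reverse()  # most significant group is written first
    return ",".join([f"{groups[0]:x}"] + [f"{g:08x}" for g in groups[1:]])


print(cpus_to_mask(range(4)))      # CPUs 0-3
print(cpus_to_mask([0, 32]))       # one CPU in each 32-bit group
```

The first call reproduces the f used in the command above.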
RFS goes one step further and steers packets of a flow to the CPU where the consuming application is running. The packet data is more likely to already be warm in that CPU's cache, reducing latency.
# Global flow table size
sysctl -w net.core.rps_sock_flow_entries=32768
# Per-queue flow count = global value / number of queues (here 32768 / 8 = 4096)
echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
The practical rule of thumb: modern server NICs have plenty of queues, so RSS plus IRQ affinity solves most cases. RPS/RFS are complements for queue-poor situations such as virtio-net in virtualized environments, or when traffic piles onto specific cores.
Offloads — GRO, GSO, TSO
The other axis for reducing per-segment cost is the offload strategy: let the kernel stack handle big chunks, and split/merge only at the boundaries.
| Technique | Direction | Performed by | What it does |
| --- | --- | --- | --- |
| TSO | Transmit | NIC hardware | Kernel hands a 64KB-class chunk; NIC splits to MSS |
| GSO | Transmit | Kernel software | Software split just before the driver when TSO is absent |
| GRO | Receive | Kernel (NAPI stage) | Merges consecutive segments of a flow into a large skb |
Thanks to GRO, forty 1500-byte packets can be merged into a single skb of about 60KB that traverses the IP/TCP layers once. Stack traversal cost becomes proportional to the number of batches, not the number of packets.
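The batching effect can be shown with a toy merge function. This is a deliberately simplified sketch: it only coalesces adjacent segments of the same flow up to an assumed 64KB limit, whereas real GRO tracks several flows at once and also checks sequence numbers and header flags.

```python
GRO_MAX_BYTES = 65535  # assumption: classic 64KB aggregate limit


def gro_merge(segments):
    """Coalesce adjacent same-flow segments.

    segments: iterable of (flow_id, nbytes) in arrival order.
    Returns a list of (flow_id, nbytes) batches handed to the stack.
    """
    merged = []
    for flow, nbytes in segments:
        if merged and merged[-1][0] == flow and merged[-1][1] + nbytes <= GRO_MAX_BYTES:
            merged[-1] = (flow, merged[-1][1] + nbytes)  # grow the current batch
        else:
            merged.append((flow, nbytes))                # start a new batch
    return merged


batches = gro_merge([("flow-a", 1500)] * 40)
print(len(batches), batches[0][1])
```

Forty arrivals become one stack traversal; a forty-fourth segment would not fit under the limit and would start a second batch.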
# Inspect offload state
ethtool -k eth0 | grep -E "generic-receive|generic-segmentation|tcp-segmentation"
# Toggle
ethtool -K eth0 gro on gso on tso on
A few caveats. GRO briefly holds packets to merge them, so extreme low-latency workloads sometimes disable it. On forwarding devices such as routers/bridges, merging with GRO and re-splitting with GSO changes packet boundaries and can create subtle issues. Seeing packets larger than the MTU in tcpdump is almost always GRO/TSO at work and is normal.
The Transmit Path — qdisc and Queueing
Transmission is roughly the reverse of reception, with one important extra stage: the queueing discipline (qdisc).
Application write/send
|
v
+---------------------+
| Socket send buffer | bounded by sk_sndbuf; for TCP the
| (sk_write_queue) | congestion window and receiver window
+---------------------+ govern the sending rate
|
v
+---------------------+
| TCP/IP layers | header construction, routing,
| | netfilter OUTPUT
+---------------------+
|
v
+---------------------+
| qdisc | fq_codel (common distro default), fq, pfifo_fast...
| controlled via tc | pacing, shaping, AQM happen here
+---------------------+
|
v
+---------------------+
| Driver TX ring | handed to the NIC via DMA
+---------------------+
|
v
NIC -> physical network (TSO split happens here)
A qdisc is not a plain FIFO. fq_codel, the default on most current distributions (systemd sets net.core.default_qdisc to it; the kernel's own built-in default is pfifo_fast), combines per-flow fair queueing with the CoDel AQM (active queue management) to suppress bufferbloat. If you run BBR, an fq-family qdisc that supports pacing is recommended (recent kernels also support TCP internal pacing).
# Current qdisc
tc qdisc show dev eth0
# Replace with fq (pairs well with BBR pacing)
tc qdisc replace dev eth0 root fq
# qdisc statistics: drops and backlog
tc -s qdisc show dev eth0
Two more things worth knowing on the TX side: the TX ring buffer size, and TCP Small Queues (TSQ). TSQ limits in-flight bytes per socket so a single socket cannot monopolize the qdisc/driver queues.
XDP — Processing Before the Stack
XDP (eXpress Data Path) runs an eBPF program at the earliest possible point in the driver, before an sk_buff is even allocated.
Normal path:
NIC -> DMA -> [skb allocation] -> GRO -> netif_receive_skb
-> netfilter -> IP -> TCP -> socket (cost accrues per stage)
XDP path:
NIC -> DMA -> [XDP program runs]
|-- XDP_DROP: discard immediately (no skb at all)
|-- XDP_TX: bounce out the same NIC
|-- XDP_REDIRECT: to another NIC/CPU/AF_XDP socket
+-- XDP_PASS: continue into the normal stack
XDP is fast for a simple reason: it skips the per-packet costs of the normal path entirely, namely skb allocation and initialization, memory reclamation, protocol layer traversal, and netfilter evaluation. For workloads that drop most packets, such as DDoS mitigation, the throughput difference versus the normal stack ranges from severalfold to over an order of magnitude. Cilium, Katran (the Facebook L4 load balancer), and Cloudflare's DDoS defense all build on XDP.
There are three modes: native mode supported by the driver, offloaded mode executed by the NIC itself, and generic (skb) mode which runs after stack entry and yields little performance benefit. For performance purposes, first verify your driver supports native XDP.
# Check loaded XDP programs
ip link show dev eth0 # look for prog/xdp
bpftool net list
Socket Buffers and Key sysctl Tuning
The socket queue is the terminus of the packet journey and the last drop point. The main parameters:
| Parameter | Default (approx.) | Meaning |
| --- | --- | --- |
| net.core.rmem_max | 212992 | Socket receive buffer ceiling (bytes) |
| net.core.wmem_max | 212992 | Socket send buffer ceiling |
| net.ipv4.tcp_rmem | 4096 131072 6291456 | TCP receive buffer min default max, autotuning range |
| net.ipv4.tcp_wmem | 4096 16384 4194304 | TCP send buffer min default max |
| net.core.netdev_max_backlog | 1000 | Per-CPU backlog length for RPS/non-NAPI paths |
| net.core.somaxconn | 4096 | Accept queue (completed connections) ceiling |
| net.ipv4.tcp_max_syn_backlog | 1024 | SYN queue (incomplete connections) length |
| net.ipv4.tcp_congestion_control | cubic | Congestion control algorithm |
| net.ipv4.tcp_notsent_lowat | 4294967295 | Unsent data threshold (low-latency tuning) |
In environments with a large bandwidth-delay product (BDP) — say a 10Gbps path with 100ms RTT — you theoretically need about 125MB of window. If the max of tcp_rmem is not raised enough, throughput is capped by the window.
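The BDP arithmetic is worth keeping as a helper when sizing buffers. A minimal sketch using integer math; the function names are illustrative, and the throughput ceiling ignores the share of the buffer the kernel reserves for bookkeeping overhead, so real numbers come out somewhat lower.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_ms // 8000  # bits -> bytes, ms -> s


def max_throughput_bps(window_bytes: int, rtt_ms: int) -> int:
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 * 1000 // rtt_ms


needed = bdp_bytes(10_000_000_000, 100)          # 10Gbps, 100ms RTT
ceiling = max_throughput_bps(6_291_456, 100)     # default tcp_rmem max from the table
print(f"window needed: {needed / 1e6:.0f} MB")
print(f"ceiling with default max: {ceiling / 1e6:.0f} Mbps")
```

With the default 6MB maximum, a single connection on this path tops out at roughly 500Mbps regardless of link speed.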
# Example for high-bandwidth/high-RTT links (/etc/sysctl.d/90-network.conf)
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 262144 134217728
net.ipv4.tcp_wmem = 4096 262144 134217728
net.core.netdev_max_backlog = 16384
net.core.somaxconn = 8192
One important trap. The max values in tcp_rmem/tcp_wmem operate independently of rmem_max/wmem_max (TCP autotuning follows tcp_rmem). However, if the application sets SO_RCVBUF explicitly via setsockopt, the value is clamped by rmem_max and, at the same time, TCP autotuning is disabled for that socket. This is the classic cause of the story where someone hardcoded a bigger buffer with setsockopt and things got slower.
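You can observe the clamping from user space. A small sketch, assuming a Linux host: it asks for a 64MB receive buffer and prints what the kernel actually granted. The function name is illustrative, and the reported numbers depend on the local rmem_max, so none are shown here.

```python
import socket


def rcvbuf_before_after(requested_bytes: int):
    """Return SO_RCVBUF as reported before and after an explicit setsockopt."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        before = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        # From here on the size is fixed by the application:
        # receive-buffer autotuning no longer applies to this socket.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        after = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return before, after


before, after = rcvbuf_before_after(64 * 1024 * 1024)  # ask for 64MB
print(f"before={before} after={after}")
```

On Linux the value read back is typically twice the clamped request, because the kernel doubles it to account for bookkeeping overhead. If after is far below what you asked for, rmem_max is the limit, and the socket has also lost autotuning.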
A Step Inside TCP — Congestion Control and BBR
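The amount of unacknowledged data a TCP sender may keep in flight is bounded by the smaller of the receiver's advertised window and the congestion window (cwnd), which in turn caps the sending rate at roughly one window per RTT. Congestion control is implemented as pluggable modules.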
| Algorithm | Signal | Characteristics |
| --- | --- | --- |
| CUBIC (default) | Packet loss | Loss-based, tends to fill buffers, solid default |
| BBR | Measured delivery rate and RTT | Loss-tolerant, avoids bufferbloat, strong on long/lossy paths |
| DCTCP | ECN marks | Datacenter only; requires switch ECN configuration |
BBR continuously estimates the bottleneck bandwidth and minimum RTT of the path, and paces transmission to those estimates before loss ever happens. On wireless or intercontinental segments, where loss occurs independently of congestion, it delivers far more stable throughput than CUBIC.
# Available congestion control modules
sysctl net.ipv4.tcp_available_congestion_control
# Enable BBR (recommended together with the fq qdisc)
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr
# default_qdisc applies only to qdiscs created from now on
sysctl -w net.core.default_qdisc=fq
# Verify per-connection algorithm and state
ss -tin | grep -E "bbr|cubic"
From ss -ti output, the cwnd, rtt, retrans, and pacing_rate fields tell you almost everything about the health of a single connection. Steadily growing retrans points to path loss; a small, pinned cwnd points to congestion control or buffer limits.
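If you want these fields in a script rather than on screen, a few regular expressions go a long way. The sketch below parses one info line from ss -ti; the sample line is hand-written for illustration, field layout can vary between iproute2 versions, and the function name is made up.

```python
import re


def parse_ss_info(line: str) -> dict:
    """Extract cwnd, rtt, retrans, and pacing_rate from one ss -ti info line."""
    info = {}
    match = re.search(r"\bcwnd:(\d+)", line)
    if match:
        info["cwnd"] = int(match.group(1))
    match = re.search(r"\brtt:([\d.]+)/([\d.]+)", line)
    if match:
        info["rtt_ms"] = float(match.group(1))      # smoothed RTT
        info["rttvar_ms"] = float(match.group(2))   # RTT variance
    match = re.search(r"\bretrans:(\d+)/(\d+)", line)
    if match:
        info["retrans_total"] = int(match.group(2))  # cumulative retransmits
    match = re.search(r"\bpacing_rate (\S+)", line)
    if match:
        info["pacing_rate"] = match.group(1)
    return info


# Illustrative sample line, not captured from a real host.
SAMPLE_LINE = (
    "cubic wscale:7,7 rto:204 rtt:1.25/0.5 mss:1448 cwnd:10 "
    "bytes_retrans:4344 retrans:0/3 pacing_rate 92.6Mbps minrtt:0.9"
)
print(parse_ss_info(SAMPLE_LINE))
```

The word-boundary anchors keep rtt from matching inside minrtt and retrans from matching inside bytes_retrans.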
Container Networking — The Extra Legs of the Journey
In container environments, more stages are appended to the path above. The receive path of a typical bridge + veth setup:
NIC -> host stack (IP/netfilter, NAT/conntrack)
-> bridge or routing decision
-> transmit into host-side veth end
-> "received again" at container-side veth end (softirq re-entry)
-> container netns IP/TCP stack traversal
-> socket inside the container
A veth pair is a virtual cable: a packet sent into one end is received at the other. The problem is that this second reception triggers another protocol stack traversal and another round of netfilter evaluation. Add kube-proxy iptables rules and conntrack, and each namespace boundary accumulates from a few to tens of microseconds of overhead. This cost is exactly why eBPF-based CNIs like Cilium bypass the netfilter path and optimize veth-to-veth redirection.
When diagnosing, first narrow down which interface in which namespace shows the problem.
# Enter the container netns and read interface statistics
# (12345 is a placeholder for a PID running inside the container)
nsenter -t 12345 -n ip -s link
nsenter -t 12345 -n ss -tin
# Check conntrack table saturation (new connections drop when full)
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
dmesg | grep conntrack
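Conntrack saturation is easy to turn into an alert. A minimal sketch, assuming the two values are readable under /proc/sys/net/netfilter (the same names as the sysctls above); the 80% threshold and the function names are arbitrary choices for illustration.

```python
import os

COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"
MAX_PATH = "/proc/sys/net/netfilter/nf_conntrack_max"


def conntrack_alert(count: int, maximum: int, threshold: float = 0.8) -> bool:
    """True when table usage is at or above the threshold."""
    if maximum <= 0:
        return False
    return count / maximum >= threshold


def read_int(path: str) -> int:
    with open(path) as handle:
        return int(handle.read().strip())


# Only meaningful on hosts where the conntrack module is loaded.
if os.path.exists(COUNT_PATH) and os.path.exists(MAX_PATH):
    count, maximum = read_int(COUNT_PATH), read_int(MAX_PATH)
    print(f"conntrack {count}/{maximum}, alert={conntrack_alert(count, maximum)}")
```

Run it inside each network namespace you care about, since namespaces keep separate counts.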
Observability Toolbox
Tools for inspecting each stage, organized by layer.
# 1. NIC/driver level: hardware counters (drops, errors, per-queue)
ethtool -S eth0 | grep -iE "drop|err|miss|fifo"
ip -s link show eth0 # rx_dropped, rx_missed, ...
# 2. softirq level: budget exhaustion, backlog drops
cat /proc/net/softnet_stat # col1 processed, col2 drops, col3 time_squeeze
# 3. Protocol level: stack-internal counters
nstat -az | grep -iE "drop|retrans|listen|prune|collapse"
cat /proc/net/snmp # basic IP/TCP/UDP counters
cat /proc/net/netstat # TcpExt extended counters
# 4. Socket level: per-connection state
ss -tinm # TCP internals + memory (skmem)
ss -lnt # accept queue usage of listeners (Recv-Q/Send-Q)
# 5. Kernel function level: eBPF
# Trace kernel drop points with the reason: dropwatch, tcpdrop from BCC, or bpftrace
bpftrace -e 'tracepoint:skb:kfree_skb { @[args->reason] = count(); }'
The kfree_skb tracepoint in particular exposes a drop reason field, so you can answer "where and why was this dropped" with measurement instead of guesswork.
Diagnosis Scenario 1 — Tracking Down Packet Drops
Situation: node rx_dropped is increasing in monitoring and applications hit intermittent timeouts. We narrow from the outside (NIC) inward (socket), stage by stage.
Step 1, look at the NIC/ring buffer.
ethtool -S eth0 | grep -iE "rx.*(drop|miss|no_buf|fifo)"
ethtool -g eth0 # current/max ring buffer sizes
If rx_missed or fifo-family counters are climbing, the ring buffer could not absorb the traffic. Enlarge the ring (ethtool -G eth0 rx 4096) and review interrupt coalescing and IRQ distribution.
Step 2, look at the softirq stage.
awk '{print NR-1, "drops:", strtonum("0x"$2), "squeeze:", strtonum("0x"$3)}' \
/proc/net/softnet_stat
Growth in column 2 (drops) means netdev_max_backlog is too small; growth in column 3 (squeeze) means insufficient budget or an overloaded CPU. Spread with RPS or raise backlog/budget.
Step 3, look at the protocol/socket stage.
nstat -az | grep -iE "ListenDrops|ListenOverflows|PruneCalled|RcvCollapsed"
ss -lnt # is Recv-Q of a listener pinned at Send-Q (the somaxconn cap)?
Growing ListenOverflows means accept queue saturation: examine application accept throughput together with somaxconn and the listen backlog argument. PruneCalled/RcvCollapsed signal receive buffer pressure; review tcp_rmem.
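The Recv-Q versus Send-Q check on listeners can be automated. A sketch that scans ss -lnt output for listeners whose accept queue has reached its limit; the sample output is hand-written for illustration and the function name is made up.

```python
def saturated_listeners(ss_output: str) -> list:
    """Return (address, recv_q, send_q) for listeners with a full accept queue."""
    result = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # header or non-listener line
        recv_q, send_q = int(fields[1]), int(fields[2])
        # For listeners, Recv-Q is the current accept queue length and
        # Send-Q is the configured backlog limit.
        if send_q > 0 and recv_q >= send_q:
            result.append((fields[3], recv_q, send_q))
    return result


# Illustrative sample, not captured from a real host.
SAMPLE_SS = """\
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       4096    0.0.0.0:22          0.0.0.0:*
LISTEN  129     128     0.0.0.0:8080        0.0.0.0:*
"""
print(saturated_listeners(SAMPLE_SS))
```

A listener that shows up here repeatedly is not accepting connections fast enough; raising the backlog only buys time unless the accept loop itself gets faster.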
Step 4, if still unresolved, count kfree_skb reasons directly.
bpftrace -e 'tracepoint:skb:kfree_skb /args->reason > 2/
{ @[args->reason] = count(); } interval:s:5 { print(@); clear(@); }'
NETFILTER_DROP points to firewall rules, NO_SOCKET to a wrong destination or a race, SOCKET_RCVBUFF to receive buffer shortage — the reason code tells you the next action directly.
Diagnosis Scenario 2 — Latency Analysis
Situation: throughput is normal but p99 response time spikes periodically. Latency is a question of where time accumulates, so we cut the path into segments.
Step 1, separate network segment from host segment.
ss -ti '( dport = :443 )' # mean/variance of rtt, retrans
ping -c 100 target-host # baseline RTT and deviation (mdev)
If rtt from ss is much larger and more variable than ping, the host side (send/receive queues, scheduling) is the likely culprit. If retrans grows alongside, path loss is creating retransmission-timeout latency.
Step 2, if it is inside the host, look at softirq latency and queue buildup.
# Distribution of softirq handler run time (BCC)
softirqs -d 10 1
# Queue buildup: qdisc backlog and TX drops
tc -s qdisc show dev eth0
# Socket buffer buildup: connections with nonzero Recv-Q/Send-Q
ss -tnp | awk '$2>0 || $3>0'
If Recv-Q accumulates, the kernel delivered the data but the application cannot read it. The cause is then not the network but application thread shortage or CPU contention (scheduling delay). Verify run queue latency with perf sched or runqlat (BCC).
Step 3, if the spikes are periodic, catch the culprit on the time axis. GC cycles, cron, backup traffic, and deep CPU idle states (C-states) are common causes. Check deep C-state exit latency with cpupower idle-info, and if low latency is mandatory, constrain it with kernel boot parameters or PM QoS.
Pitfalls and Anti-patterns
- Changing all sysctls at once. Change one at a time and measure, or you will never know causality.
- Applying "magic tuning collections" from the internet verbatim. Values like somaxconn 65535 can waste memory or mask problems depending on the workload.
- Pinning SO_RCVBUF via setsockopt and thereby disabling TCP autotuning (explained above).
- Misreading giant packets in tcpdump as an MTU misconfiguration. That is GRO/TSO working as intended.
- Looking only at host netns counters in container environments. Half of the problems live inside the container netns.
- Forgetting conntrack. On NAT-ing nodes with high connection churn, nf_conntrack_max saturation is a recurring outage cause.
- Running irqbalance and manual IRQ affinity at the same time. They overwrite each other.
Operations Checklist
- [ ] Are ethtool -S drop/miss counters collected into monitoring?
- [ ] Are drops and squeeze from /proc/net/softnet_stat tracked as node metrics?
- [ ] Are ListenOverflows and RetransSegs from nstat alerting targets?
- [ ] Are RX queue counts and IRQ affinity laid out to match CPU topology?
- [ ] Is the ring buffer large enough for traffic bursts (ethtool -g)?
- [ ] Did you size tcp_rmem/tcp_wmem max from a BDP calculation?
- [ ] Did you deliberately choose the congestion control (cubic/bbr) and qdisc (fq/fq_codel) combination?
- [ ] Is conntrack usage versus max monitored?
- [ ] Do you have the means (nsenter, eBPF) to collect metrics inside container netns?
- [ ] One change at a time, recorded with before/after measurements?
Closing
Following the packet journey reveals a consistent design philosophy behind the Linux networking stack: reduce interrupts (NAPI, coalescing), process in batches (GRO/TSO), spread across cores (RSS/RPS/RFS), and when needed, skip the stack itself (XDP). When you face a performance problem, recall this map and walk the per-stage counters from NIC to socket — you will find the bottleneck by measurement, not intuition. In the next post we cover the CPU-side counterpart of this journey: the scheduler.
References
- Kernel networking scaling documentation (RSS/RPS/RFS): https://www.kernel.org/doc/html/latest/networking/scaling.html
- Official NAPI documentation: https://docs.kernel.org/networking/napi.html
- Segmentation offloads (GSO/GRO/TSO): https://docs.kernel.org/networking/segmentation-offloads.html
- AF_XDP documentation: https://docs.kernel.org/networking/af_xdp.html
- tcp(7) man page — full sysctl list: https://man7.org/linux/man-pages/man7/tcp.7.html
- ss(8) man page: https://man7.org/linux/man-pages/man8/ss.8.html
- eBPF official site: https://ebpf.io
- BCC tools collection (tcpdrop, runqlat, ...): https://github.com/iovisor/bcc
- Cilium BPF/XDP reference guide: https://docs.cilium.io/en/stable/reference-guides/bpf/
- BBR paper (ACM Queue): https://queue.acm.org/detail.cfm?id=3022184
- CUBIC RFC 9438: https://datatracker.ietf.org/doc/html/rfc9438
- tc(8) man page: https://man7.org/linux/man-pages/man8/tc.8.html