Complete Guide to Network Performance Analysis: Measurement, Diagnosis, and Monitoring

Overview

Network performance issues manifest in many forms: slow application response times, failed file transfers, degraded streaming quality, and more. Effective troubleshooting requires understanding core performance metrics and systematically diagnosing problems using appropriate tools.

This guide covers everything you need to measure and analyze network performance, from understanding fundamental metrics to resolving common real-world scenarios.

1. Core Network Performance Metrics

1.1 Latency

Latency is the time it takes a packet to travel from source to destination. In practice it is usually measured as RTT (Round Trip Time), since accurate one-way measurement requires synchronized clocks at both ends.

# Basic ping test
ping -c 20 target-server.example.com

# Ping with timestamps
ping -c 100 -D target-server.example.com

# Example output
# PING target-server.example.com (10.0.1.50): 56 data bytes
# 64 bytes from 10.0.1.50: icmp_seq=0 ttl=64 time=0.523 ms
# 64 bytes from 10.0.1.50: icmp_seq=1 ttl=64 time=0.481 ms
# ...
# round-trip min/avg/max/stddev = 0.481/0.512/0.623/0.042 ms

Latency is composed of several components:

  • Propagation Delay: Limited by the speed of light, proportional to physical distance
  • Transmission Delay: Time to push data onto the link (packet size / bandwidth)
  • Processing Delay: Time for routers to process packet headers
  • Queuing Delay: Time spent waiting in router buffers
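The first two components can be estimated directly. A back-of-the-envelope sketch, using illustrative values (a 1,500-byte packet, a 1 Gbit/s link, a 1,000 km fiber path):

```shell
#!/bin/bash
# Back-of-the-envelope estimates for transmission and propagation delay
# (all values below are illustrative, not measurements)
PACKET_BITS=$((1500 * 8))   # 1500-byte packet
LINK_BPS=1000000000         # 1 Gbit/s link
DISTANCE_KM=1000            # 1000 km fiber path
FIBER_KM_PER_MS=200         # signal speed in fiber: ~200 km/ms (about 2/3 c)

# Transmission delay = packet size / bandwidth
awk -v b=$PACKET_BITS -v r=$LINK_BPS \
    'BEGIN { printf "Transmission delay: %.3f us\n", b / r * 1e6 }'
# Propagation delay = distance / signal speed
awk -v d=$DISTANCE_KM -v v=$FIBER_KM_PER_MS \
    'BEGIN { printf "Propagation delay:  %.1f ms\n", d / v }'
```

With these numbers the transmission delay is 12 µs and the propagation delay 5 ms; on long paths, propagation dominates and no amount of bandwidth reduces it.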

# Latency summary script (-q suppresses per-packet output, keeping only the statistics)
#!/bin/bash
TARGET="target-server.example.com"
COUNT=1000

echo "=== Latency Summary: $TARGET ==="
ping -q -c $COUNT $TARGET | tail -2

# TCP-based latency measurement using hping3 (useful when ICMP is blocked)
sudo hping3 -S -p 443 -c 20 $TARGET

1.2 Throughput

Throughput is the actual amount of data transferred per unit time. It is often confused with bandwidth, but bandwidth represents theoretical maximum capacity while throughput reflects the achievable transfer rate in practice.
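A common reason throughput falls short of bandwidth is the TCP window: sustained throughput cannot exceed window size divided by RTT. A quick sketch with illustrative numbers (the classic 64 KiB unscaled window on a 50 ms path):

```shell
#!/bin/bash
# TCP throughput ceiling = window size / RTT (illustrative values)
WINDOW_BYTES=65536   # 64 KiB: the maximum without window scaling
RTT_MS=50            # 50 ms round-trip path
awk -v w=$WINDOW_BYTES -v r=$RTT_MS \
    'BEGIN { printf "Throughput ceiling: %.2f Mbit/s\n", w * 8 / (r / 1000) / 1e6 }'
```

Here the ceiling is roughly 10.5 Mbit/s regardless of link capacity, which is why window scaling and buffer tuning (section 5) matter on long fat networks.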

# Measure TCP throughput with iperf3
# Server side
iperf3 -s -p 5201

# Client side - basic test
iperf3 -c server-ip -p 5201 -t 30

# Bidirectional simultaneous test
iperf3 -c server-ip -p 5201 -t 30 --bidir

# Example output
# [ ID] Interval           Transfer     Bitrate         Retr
# [  5]   0.00-30.00  sec  3.28 GBytes   939 Mbits/sec   12  sender
# [  5]   0.00-30.00  sec  3.27 GBytes   937 Mbits/sec       receiver

1.3 Packet Loss

Packet loss is the percentage of transmitted packets that fail to reach their destination. Loss rates above 1% typically cause noticeable performance degradation.

# Measure packet loss
ping -c 1000 -i 0.01 target-server.example.com

# Check per-hop packet loss with mtr
mtr -r -c 100 target-server.example.com

# Example output
# HOST: myhost                     Loss%   Snt   Last   Avg  Best  Wrst StDev
#  1.|-- gateway                    0.0%   100    0.5   0.6   0.3   1.2   0.2
#  2.|-- isp-router                 0.0%   100    3.2   3.5   2.8   5.1   0.5
#  3.|-- core-router                2.0%   100    8.1  12.3   7.5  45.2   8.1
#  4.|-- target-server              2.0%   100   10.2  14.1   9.8  48.3   9.2
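A small helper can pull the loss percentage out of ping's summary line and flag it against the 1% threshold (GNU iputils output format assumed):

```shell
#!/bin/bash
# Extract the loss percentage from ping's summary line (iputils format assumed)
TARGET="target-server.example.com"
LOSS=$(ping -c 100 -i 0.2 "$TARGET" | grep -oE '[0-9.]+% packet loss' | grep -oE '^[0-9.]+')
echo "Loss: ${LOSS}%"
# Flag anything above the 1% threshold mentioned above
if awk -v l="$LOSS" 'BEGIN { exit !(l > 1) }'; then
    echo "WARNING: loss above 1%"
fi
```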

1.4 Jitter

Jitter is the variation in packet arrival times. It is especially critical for real-time communications such as VoIP and video calls.

# Measure jitter with iperf3 UDP mode
# Server
iperf3 -s

# Client - UDP mode, 100Mbps target
iperf3 -c server-ip -u -b 100M -t 30

# Example output
# [ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total
# [  5]   0.00-30.00  sec   358 MBytes   100 Mbits/sec  0.042 ms  12/45892 (0.026%)
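iperf3 reports jitter as a smoothed mean deviation of transit-time differences (per RFC 3550). The same idea can be roughly approximated from consecutive ping RTTs:

```shell
# Approximate jitter as the mean absolute difference of consecutive RTTs
# (GNU iputils ping output format assumed)
TARGET="target-server.example.com"
ping -c 50 -i 0.2 "$TARGET" | grep -oE 'time=[0-9.]+' | cut -d= -f2 | awk '
    NR > 1 { sum += ($1 > prev ? $1 - prev : prev - $1); n++ }
    { prev = $1 }
    END { if (n) printf "Mean jitter: %.3f ms over %d samples\n", sum / n, n }'
```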

2. Network Diagnostic Tools

2.1 iperf3 - Bandwidth Testing

iperf3 is the most widely used tool for measuring network bandwidth.

# Multi-stream test (utilize full bandwidth with parallel connections)
iperf3 -c server-ip -P 4 -t 30

# Test with bandwidth cap
iperf3 -c server-ip -b 500M -t 30

# Specify TCP window size
iperf3 -c server-ip -w 256K -t 30

# JSON output (useful for automation)
iperf3 -c server-ip -t 30 -J > iperf3_result.json

# Reverse test (server to client direction)
iperf3 -c server-ip -R -t 30

# Set MSS
iperf3 -c server-ip -M 1400 -t 30

# Set reporting interval
iperf3 -c server-ip -t 60 -i 5

2.2 mtr - Path Analysis

mtr combines traceroute and ping to continuously monitor performance at each hop along the network path.

# Basic report mode
mtr -r -c 200 target-server.example.com

# TCP mode (for ICMP-blocked environments)
mtr -r -c 100 -T -P 443 target-server.example.com

# UDP mode
mtr -r -c 100 -u target-server.example.com

# Wide report with AS numbers
mtr -r -c 200 -w -z target-server.example.com

# CSV output
mtr -r -c 100 --csv target-server.example.com > mtr_report.csv

Key considerations when interpreting mtr results:

  • If loss appears only at intermediate hops but the final destination shows no loss, it is likely ICMP rate limiting
  • If latency spikes sharply starting at a specific hop, that segment is the bottleneck
  • If loss increases at the last few hops, there is likely a real problem
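The first and third rules can be checked mechanically; a sketch that keys off the final hop's Loss% column (column positions follow mtr's report format shown above):

```shell
#!/bin/bash
# Distinguish ICMP rate limiting from real loss using the final hop's Loss%
# column (column positions assume mtr's report-mode output)
TARGET="target-server.example.com"
mtr -r -c 100 "$TARGET" | awk '
    NR > 1 { loss[NR] = $3 }      # Loss% is the 3rd column of each hop row
    END {
        final = loss[NR]
        sub(/%/, "", final)
        if (final + 0 > 1)
            print "Loss persists to the destination: likely a real problem (" final "%)"
        else
            print "No loss at the destination: intermediate loss is likely ICMP rate limiting"
    }'
```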

2.3 traceroute

# ICMP traceroute
traceroute target-server.example.com

# TCP traceroute (useful for bypassing firewalls)
sudo traceroute -T -p 443 target-server.example.com

# UDP traceroute (specific port)
traceroute -U -p 33434 target-server.example.com

# Set maximum hop count
traceroute -m 30 target-server.example.com

# Paris-traceroute (accurate path tracing in load-balanced environments)
paris-traceroute target-server.example.com

2.4 netperf - Advanced Performance Testing

# Start netperf server
netserver -p 12865

# TCP stream test
netperf -H server-ip -p 12865 -t TCP_STREAM -l 30

# TCP RR (Request/Response) test - measures transaction performance
netperf -H server-ip -p 12865 -t TCP_RR -l 30

# TCP CRR (Connect/Request/Response) - includes connection establishment
netperf -H server-ip -p 12865 -t TCP_CRR -l 30

# Specify message size
netperf -H server-ip -t TCP_STREAM -l 30 -- -m 65536

# UDP stream test
netperf -H server-ip -t UDP_STREAM -l 30

3. Bandwidth Testing and Bottleneck Identification

3.1 Systematic Bandwidth Testing

#!/bin/bash
# bandwidth_test.sh - Systematic bandwidth test suite

SERVER="10.0.1.50"
PORT=5201
DURATION=30
LOGDIR="/var/log/bandwidth_tests"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

mkdir -p $LOGDIR

echo "=== Bandwidth Test Suite - $TIMESTAMP ==="

# 1. Single stream TCP
echo "[1/5] Single Stream TCP Test"
iperf3 -c $SERVER -p $PORT -t $DURATION -J > "$LOGDIR/tcp_single_${TIMESTAMP}.json"

# 2. Multi-stream TCP
echo "[2/5] Multi Stream TCP Test (4 streams)"
iperf3 -c $SERVER -p $PORT -t $DURATION -P 4 -J > "$LOGDIR/tcp_multi_${TIMESTAMP}.json"

# 3. Reverse TCP
echo "[3/5] Reverse TCP Test"
iperf3 -c $SERVER -p $PORT -t $DURATION -R -J > "$LOGDIR/tcp_reverse_${TIMESTAMP}.json"

# 4. UDP bandwidth
echo "[4/5] UDP Bandwidth Test"
iperf3 -c $SERVER -p $PORT -t $DURATION -u -b 1G -J > "$LOGDIR/udp_${TIMESTAMP}.json"

# 5. Bidirectional
echo "[5/5] Bidirectional Test"
iperf3 -c $SERVER -p $PORT -t $DURATION --bidir -J > "$LOGDIR/bidir_${TIMESTAMP}.json"

echo "=== Tests Complete. Results in $LOGDIR ==="

3.2 Identifying Bottleneck Points

# Step 1: Check local interface speed
ethtool eth0 | grep -i speed
# Speed: 10000Mb/s

# Step 2: Test loopback performance (identify NIC/CPU limits)
iperf3 -c 127.0.0.1 -t 10

# Step 3: Test between servers on the same switch
iperf3 -c same-switch-server -t 30

# Step 4: Test to a server on a different subnet (routing impact)
iperf3 -c different-subnet-server -t 30

# Step 5: Test across the WAN
iperf3 -c remote-server -t 30

# The segment where throughput drops significantly is the bottleneck
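The five steps above can be wrapped into a loop that prints one throughput figure per segment. A sketch assuming an iperf3 server runs on each target and jq is installed (hostnames are placeholders for your environment):

```shell
#!/bin/bash
# Compare throughput per network segment (steps 2-5 above)
# Requires bash 4+, jq, and a running iperf3 server on each target
declare -A TARGETS=(
    [loopback]=127.0.0.1
    [same_switch]=same-switch-server
    [other_subnet]=different-subnet-server
    [wan]=remote-server
)
for SEG in loopback same_switch other_subnet wan; do
    BPS=$(iperf3 -c "${TARGETS[$SEG]}" -t 10 -J | jq '.end.sum_received.bits_per_second')
    awk -v s="$SEG" -v b="$BPS" 'BEGIN { printf "%-13s %10.1f Mbit/s\n", s, b / 1e6 }'
done
# Throughput should drop sharply at the bottleneck segment
```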

4. Network Interface Statistics and Error Counters

4.1 ethtool Statistics

# Basic interface information
ethtool eth0

# Detailed statistics
ethtool -S eth0

# Key counters to check
ethtool -S eth0 | grep -E "(rx_errors|tx_errors|rx_dropped|tx_dropped|rx_crc|collisions)"

# Driver information
ethtool -i eth0

# Check ring buffer size
ethtool -g eth0

# Increase ring buffer size (to reduce drops)
sudo ethtool -G eth0 rx 4096 tx 4096

# Check offload settings
ethtool -k eth0

# Configure TSO/GSO/GRO
sudo ethtool -K eth0 tso on gso on gro on

4.2 Interface Statistics with ip Command

# Interface statistics summary
ip -s link show eth0

# Example output
# 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
#     RX: bytes  packets  errors  dropped overrun mcast
#     948271623  1523847  0       0       0       12847
#     TX: bytes  packets  errors  dropped carrier collsns
#     523841267  892341   0       0       0       0

# Detailed statistics
ip -s -s link show eth0

# All interface statistics
ip -s link show

# Script to track statistical changes
#!/bin/bash
IFACE="eth0"
while true; do
    echo "=== $(date) ==="
    ip -s link show $IFACE | grep -A 2 "RX\|TX"
    sleep 5
done

4.3 Error Counter Monitoring

# Check real-time stats from /proc/net/dev
cat /proc/net/dev

# Interface statistics via netstat
netstat -i

# Continuous error counter monitoring script
#!/bin/bash
IFACE="eth0"
INTERVAL=10

echo "Monitoring $IFACE errors every ${INTERVAL}s..."
echo "Time | RX_errors | TX_errors | RX_dropped | TX_dropped"

while true; do
    STATS=$(ip -s link show $IFACE)
    RX_ERR=$(echo "$STATS" | awk '/RX:/{getline; print $3}')
    TX_ERR=$(echo "$STATS" | awk '/TX:/{getline; print $3}')
    RX_DROP=$(echo "$STATS" | awk '/RX:/{getline; print $4}')
    TX_DROP=$(echo "$STATS" | awk '/TX:/{getline; print $4}')
    echo "$(date +%H:%M:%S) | $RX_ERR | $TX_ERR | $RX_DROP | $TX_DROP"
    sleep $INTERVAL
done

5. TCP Window Analysis and Congestion Control

5.1 TCP Window Size Analysis

# Check window size of current TCP connections
ss -ti dst target-server.example.com

# Example output
# State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
# ESTAB  0       0       10.0.1.10:42856     10.0.1.50:443
#   cubic wscale:7,7 rto:204 rtt:1.523/0.742 ato:40 mss:1448
#   pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 ssthresh:7
#   bytes_sent:15234 bytes_acked:15235 bytes_received:45678
#   send 76.1Mbps pacing_rate 152.1Mbps delivery_rate 45.2Mbps

# Key fields explained
# cwnd: Congestion window size (in segments)
# ssthresh: Slow start threshold
# rtt: Round trip time / standard deviation
# mss: Maximum segment size
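These fields also explain the reported send rate: ss derives it as roughly cwnd × mss × 8 / rtt. Plugging in the example values above reproduces the 76.1 Mbps figure:

```shell
#!/bin/bash
# Reproduce ss's "send" estimate: cwnd * mss * 8 / rtt
# Values below are taken from the example ss output above
CWND=10        # segments
MSS=1448       # bytes
RTT_MS=1.523   # milliseconds
awk -v c=$CWND -v m=$MSS -v r=$RTT_MS \
    'BEGIN { printf "cwnd-limited send rate: %.1f Mbit/s\n", c * m * 8 / (r / 1000) / 1e6 }'
```

If this estimate tracks the observed throughput, the connection is congestion-window limited rather than bandwidth limited.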

5.2 TCP Congestion Control Algorithms

# Check current congestion control algorithm
sysctl net.ipv4.tcp_congestion_control
# net.ipv4.tcp_congestion_control = cubic

# List available algorithms
sysctl net.ipv4.tcp_available_congestion_control
# net.ipv4.tcp_available_congestion_control = reno cubic bbr

# Switch to BBR
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

# Set fq scheduler for BBR
sudo tc qdisc replace dev eth0 root fq

# TCP buffer size tuning
# min/default/max (bytes)
sudo sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Enable TCP window scaling
sudo sysctl -w net.ipv4.tcp_window_scaling=1
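Settings applied with sysctl -w are lost on reboot. To persist them, the conventional approach is a drop-in file under /etc/sysctl.d (the filename below is arbitrary):

```shell
# Persist the TCP tuning across reboots
sudo tee /etc/sysctl.d/99-tcp-tuning.conf > /dev/null <<'EOF'
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_rmem = 4096 131072 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_window_scaling = 1
EOF
# Reload all sysctl configuration files
sudo sysctl --system
```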

5.3 TCP Analysis with tcpdump

# Capture TCP handshakes
sudo tcpdump -i eth0 -c 50 'tcp[tcpflags] & (tcp-syn|tcp-fin) != 0' -nn

# Capture TCP communication with a specific host
sudo tcpdump -i eth0 host target-server.example.com -w capture.pcap

# Identify retransmissions (tcpdump is stateless and cannot flag them directly;
# capture to a file and let tshark's TCP analysis find them)
sudo tcpdump -i eth0 -w retrans.pcap host target-server.example.com
tshark -r retrans.pcap -Y tcp.analysis.retransmission

# Detect zero-window packets
sudo tcpdump -i eth0 'tcp[14:2] = 0' -nn

# Analyze capture file with tshark
tshark -r capture.pcap -q -z io,stat,1
tshark -r capture.pcap -q -z conv,tcp

6. MTU and Fragmentation Issues

6.1 MTU Verification and Path MTU Discovery

# Check interface MTU
ip link show eth0 | grep mtu

# Path MTU Discovery
# Attempt transmission with DF bit set (no fragmentation)
ping -c 5 -M do -s 1472 target-server.example.com
# PING target-server.example.com: 1472 data bytes
# 1480 bytes from 10.0.1.50: icmp_seq=1 ttl=64 time=0.523 ms

# A payload one byte larger exceeds the 1500-byte MTU and fails
ping -c 3 -M do -s 1473 target-server.example.com
# ping: local error: message too long, mtu=1500

# Automated MTU discovery script
#!/bin/bash
TARGET=$1
SIZE=1500

while [ $SIZE -gt 0 ]; do
    ping -c 1 -M do -s $SIZE $TARGET > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        echo "Path MTU: $((SIZE + 28)) bytes (payload: $SIZE + 20 IP + 8 ICMP)"
        break
    fi
    SIZE=$((SIZE - 1))
done

6.2 Diagnosing Fragmentation Issues

# Check fragmentation statistics
cat /proc/net/snmp | grep -i frag
# Ip: ... FragOKs FragFails FragCreates

# Check fragmentation with netstat
netstat -s | grep -i frag

# Real-time fragmentation monitoring
watch -n 1 'cat /proc/net/snmp | grep Ip: | head -2'

# Change MTU
sudo ip link set dev eth0 mtu 9000  # Jumbo Frame

# Check PMTUD status
sysctl net.ipv4.ip_no_pmtu_disc
# 0 = PMTUD enabled (recommended)

6.3 Jumbo Frame Configuration and Validation

# Verify Jumbo Frame support
ethtool -i eth0

# Enable Jumbo Frames
sudo ip link set dev eth0 mtu 9000

# Validate Jumbo Frame path
ping -c 5 -M do -s 8972 target-server.example.com

# Compare Jumbo Frame performance
echo "=== MTU 1500 ==="
iperf3 -c server-ip -t 10 -M 1460
echo "=== MTU 9000 ==="
iperf3 -c server-ip -t 10 -M 8960

7. Network Monitoring with Prometheus and Grafana

7.1 node_exporter Configuration

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'
          - 'server3:9100'
    scrape_interval: 5s

Key network metrics collected by node_exporter:

  • node_network_receive_bytes_total: Total received bytes
  • node_network_transmit_bytes_total: Total transmitted bytes
  • node_network_receive_errs_total: Total receive errors
  • node_network_transmit_errs_total: Total transmit errors
  • node_network_receive_drop_total: Total receive drops
  • node_network_transmit_drop_total: Total transmit drops

7.2 PromQL Query Examples

# Per-interface receive traffic rate (bps)
rate(node_network_receive_bytes_total{device!="lo"}[5m]) * 8

# Per-interface transmit traffic rate (bps)
rate(node_network_transmit_bytes_total{device!="lo"}[5m]) * 8

# Packet error rate (%)
rate(node_network_receive_errs_total{device="eth0"}[5m])
  / rate(node_network_receive_packets_total{device="eth0"}[5m]) * 100

# Packet drop rate
rate(node_network_receive_drop_total{device="eth0"}[5m])

# Bandwidth utilization (%) - bytes/sec over link speed in bytes/sec
rate(node_network_receive_bytes_total{device="eth0"}[5m])
  / node_network_speed_bytes{device="eth0"} * 100

# TCP retransmission rate
rate(node_netstat_Tcp_RetransSegs[5m])
  / rate(node_netstat_Tcp_OutSegs[5m]) * 100

# TCP connections by state
node_netstat_Tcp_CurrEstab

7.3 Prometheus Alert Rules

# Prometheus alerting rules (loaded via rule_files in prometheus.yml;
# Grafana displays the resulting alerts)
groups:
  - name: network_alerts
    rules:
      - alert: HighPacketLoss
        expr: |
          rate(node_network_receive_errs_total{device="eth0"}[5m])
          / rate(node_network_receive_packets_total{device="eth0"}[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High packet loss on {{ $labels.instance }}'
          description: 'Packet loss rate is {{ $value | humanizePercentage }}'

      - alert: HighBandwidthUsage
        expr: |
          rate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8
          / 10000000000 > 0.85  # denominator assumes a 10 Gbit/s link
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'High bandwidth usage on {{ $labels.instance }}'

      - alert: NetworkInterfaceDown
        expr: node_network_up{device="eth0"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'Network interface down on {{ $labels.instance }}'

      - alert: HighTcpRetransmission
        expr: |
          rate(node_netstat_Tcp_RetransSegs[5m])
          / rate(node_netstat_Tcp_OutSegs[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High TCP retransmission rate on {{ $labels.instance }}'

7.4 SNMP Exporter Integration

# SNMP exporter configuration (for network device monitoring)
# /etc/prometheus/snmp.yml (generated by snmp_exporter generator)

# Add SNMP targets to prometheus.yml
scrape_configs:
  - job_name: 'snmp'
    static_configs:
      - targets:
          - 'switch01.example.com'
          - 'router01.example.com'
    metrics_path: /snmp
    params:
      module: [if_mib]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116

8. Practical Performance Diagnosis Scenarios

8.1 Scenario: Application Response Delay

# Step 1: Check DNS resolution delay
dig target-server.example.com | grep "Query time"
# Query time: 245 msec  <-- DNS is slow

# Resolution: Use local DNS cache or faster DNS servers
# Modify nameserver in /etc/resolv.conf

# Step 2: Check TCP connection establishment time
curl -o /dev/null -s -w "\
  DNS: %{time_namelookup}s\n\
  Connect: %{time_connect}s\n\
  TLS: %{time_appconnect}s\n\
  TTFB: %{time_starttransfer}s\n\
  Total: %{time_total}s\n" \
  https://target-server.example.com

# Step 3: Check for issues along the path
mtr -r -c 50 target-server.example.com

8.2 Scenario: Slow File Transfers

# Step 1: Check local interface status
ethtool eth0 | grep -E "(Speed|Duplex|Link)"
# Speed: 1000Mb/s
# Duplex: Full
# Link detected: yes

# If auto-negotiation failed and speed is stuck at 100Mbps
sudo ethtool -s eth0 speed 1000 duplex full autoneg on

# Step 2: Check error counters
ethtool -S eth0 | grep -E "(error|drop|crc|collision)"

# Step 3: Check TCP tuning state
sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem
sysctl net.core.rmem_max
sysctl net.core.wmem_max

# Step 4: Measure bandwidth
iperf3 -c remote-server -t 30 -P 4

8.3 Scenario: Intermittent Connection Drops

# Step 1: Long-duration ping to identify patterns
ping -c 3600 -i 1 target-server.example.com | while read line; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $line"
done | tee ping_log.txt
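Once the log exists, gaps in the icmp_seq numbers show exactly when drops occurred (GNU iputils output format assumed):

```shell
# Find gaps in icmp_seq from the timestamped log (iputils output assumed)
grep -oE 'icmp_seq=[0-9]+' ping_log.txt | cut -d= -f2 | awk '
    NR > 1 && $1 != prev + 1 {
        printf "Gap: seq %d -> %d (%d packets lost)\n", prev, $1, $1 - prev - 1
    }
    { prev = $1 }'
```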

# Step 2: Check for interface flapping
dmesg | grep -i "link\|eth0\|carrier"
journalctl -u NetworkManager --since "1 hour ago" | grep -i "disconnect\|connect"

# Step 3: Check ARP table for anomalies
ip neigh show | grep -i "FAILED\|STALE"

# Step 4: Check switch port statistics (via SNMP or management interface)
snmpwalk -v2c -c public switch01 IF-MIB::ifOperStatus
snmpwalk -v2c -c public switch01 IF-MIB::ifInErrors

8.4 Scenario: VoIP Quality Issues

# Step 1: Measure jitter and packet loss
iperf3 -c voip-server -u -b 100K -t 60 -l 160
# VoIP typically uses small packets of 64-160 bytes

# Step 2: Check QoS configuration
tc qdisc show dev eth0
tc class show dev eth0
tc filter show dev eth0

# Step 3: Apply QoS policy (prioritize VoIP traffic)
sudo tc qdisc add dev eth0 root handle 1: htb default 30
sudo tc class add dev eth0 parent 1: classid 1:1 htb rate 1000mbit
sudo tc class add dev eth0 parent 1:1 classid 1:10 htb rate 100mbit ceil 200mbit prio 1
sudo tc class add dev eth0 parent 1:1 classid 1:30 htb rate 900mbit ceil 1000mbit prio 3

# Classify VoIP traffic into the high-priority class
sudo tc filter add dev eth0 parent 1: protocol ip prio 1 \
    u32 match ip dport 5060 0xffff flowid 1:10

Summary

Network performance analysis should be approached systematically in the following order:

  1. Collect Metrics: Measure latency, throughput, packet loss, and jitter
  2. Analyze Segments: Identify problem segments using mtr and traceroute
  3. Inspect Interfaces: Check for physical issues with ethtool and ip commands
  4. Analyze Protocols: Examine TCP window and congestion control state
  5. Verify MTU: Check path MTU and fragmentation issues
  6. Continuous Monitoring: Observe trends with Prometheus and Grafana

The key is selecting the right tool at each stage and accurately interpreting the results. Avoid relying on a single tool; instead, cross-validate results from multiple tools to pinpoint the root cause of the problem.