
Network Engineering & Security Master Guide: From TCP/IP to Service Mesh and AI Serving Networks

Table of Contents

  1. Network Fundamentals: OSI & TCP/IP
  2. Socket Programming: asyncio & aiohttp
  3. Service Mesh: Istio & Envoy
  4. Network Security: TLS & Zero Trust
  5. CDN & Edge Computing
  6. AI Serving Networks: gRPC & SSE
  7. Packet Analysis in Practice
  8. Quiz

1. Network Fundamentals

The OSI 7-Layer Model

The OSI (Open Systems Interconnection) model abstracts network communication into seven distinct layers, each with a specific responsibility.

Layer  Name          Protocol Examples   PDU
7      Application   HTTP, DNS, SMTP     Message
6      Presentation  TLS, JPEG, ASCII    Message
5      Session       NetBIOS, RPC        Message
4      Transport     TCP, UDP            Segment
3      Network       IP, ICMP            Packet
2      Data Link     Ethernet, 802.11    Frame
1      Physical      Cable, Fiber        Bits

The TCP/IP Stack

The real internet uses a simplified 4-layer TCP/IP model rather than the full OSI stack:

  • Application Layer: HTTP/2, HTTP/3, DNS, TLS
  • Transport Layer: TCP (reliability), UDP (speed)
  • Internet Layer: IPv4, IPv6, ICMP
  • Link Layer: Ethernet, Wi-Fi

HTTP/2 vs HTTP/3

HTTP/1.1 suffers from Head-of-Line (HOL) blocking because each connection handles only one request at a time.

HTTP/2 improvements:

  • Multiplexing: multiple streams over a single TCP connection
  • Header compression: the HPACK algorithm removes redundant header bytes across requests
  • Server push: proactively send resources before the client requests them (little used in practice, and since removed from major browsers)
  • Binary framing: binary frames instead of plain text

HTTP/3 & QUIC: HTTP/3 runs over QUIC (UDP-based) instead of TCP, eliminating TCP-level HOL blocking entirely.

HTTP/1.1:  [Req1][Res1][Req2][Res2]  (sequential)
HTTP/2:    [Req1, Req2, Req3][Res1, Res2, Res3]  (mux, single TCP)
HTTP/3:    [Req1, Req2, Req3][Res1, Res2, Res3]  (mux, independent QUIC streams)

How DNS Works

When resolving api.example.com:

  1. Browser cache lookup
  2. OS cache lookup (/etc/hosts)
  3. Query sent to Recursive Resolver (ISP DNS)
  4. Hierarchical resolution: Root NS → .com TLD NS → example.com Authoritative NS
  5. Returns A record (IPv4) or AAAA record (IPv6)
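The resolution chain above can be exercised from code through the OS resolver. A minimal sketch using only the Python standard library (the `resolve` helper name is illustrative):

```python
import socket

def resolve(hostname: str, port: int = 443) -> list[str]:
    """Return the unique IP addresses the OS resolver finds for a hostname.

    getaddrinfo() walks the same chain described above: the OS cache and
    /etc/hosts first, then the configured recursive resolver.
    """
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({sockaddr[0] for *_, sockaddr in infos})

print(resolve("localhost"))  # loopback answers, e.g. ['127.0.0.1', '::1']
```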

TLS 1.3 Handshake

TLS 1.3 reduces round trips to 1-RTT for new connections or 0-RTT for session resumption.

Client                            Server
  |--- ClientHello (key share) ----->|
  |<-- ServerHello (key share) ------|
  |<-- EncryptedExtensions ----------|
  |<-- Certificate + CertVerify -----|
  |<-- Finished ---------------------|
  |--- Finished -------------------->|
  |<-> Encrypted application data <->|
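On the client side, Python's ssl module can pin connections to this handshake. A minimal sketch:

```python
import ssl

# Build a client context that refuses anything older than TLS 1.3,
# so every new connection uses the 1-RTT handshake shown above
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_3

print(ctx.minimum_version.name)  # TLSv1_3
```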

2. Socket Programming

Python asyncio TCP Server

import asyncio

async def handle_client(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    addr = writer.get_extra_info('peername')
    print(f"Connected: {addr}")

    try:
        while True:
            data = await reader.read(1024)
            if not data:
                break
            message = data.decode('utf-8').strip()
            print(f"Received: {message} from {addr}")

            # Echo response
            response = f"Echo: {message}\n"
            writer.write(response.encode('utf-8'))
            await writer.drain()
    except ConnectionResetError:
        pass  # client dropped the connection abruptly
    finally:
        print(f"Disconnected: {addr}")
        writer.close()
        await writer.wait_closed()

async def main():
    server = await asyncio.start_server(
        handle_client, '0.0.0.0', 8888
    )
    addr = server.sockets[0].getsockname()
    print(f"Server started: {addr}")

    async with server:
        await server.serve_forever()

if __name__ == '__main__':
    asyncio.run(main())

Async HTTP Client with aiohttp

import asyncio
import aiohttp
from typing import List

async def fetch_url(session: aiohttp.ClientSession, url: str) -> dict:
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
        return {
            'url': url,
            'status': response.status,
            'body': await response.text()
        }

async def fetch_all(urls: List[str]) -> List[dict]:
    connector = aiohttp.TCPConnector(
        limit=100,          # max concurrent connections
        limit_per_host=10,  # max per host
        keepalive_timeout=30
    )
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_url(session, url) for url in urls]
        # return_exceptions=True: failed requests come back as Exception
        # objects in the result list instead of cancelling the whole batch
        return await asyncio.gather(*tasks, return_exceptions=True)

# Run
urls = [f"https://httpbin.org/get?id={i}" for i in range(20)]
results = asyncio.run(fetch_all(urls))

gRPC vs REST Comparison

Aspect           gRPC                           REST
Protocol         HTTP/2                         HTTP/1.1 or HTTP/2
Serialization    Protocol Buffers (binary)      JSON (text)
Streaming        Bidirectional streaming        Limited (SSE, WebSocket separate)
Type safety      Strong typing (.proto schema)  Weak typing
Latency          Low                            Relatively higher
Browser support  Requires grpc-web              Native

3. Service Mesh

Istio Architecture

Istio uses the sidecar pattern, injecting Envoy proxies into each Pod to control service-to-service communication.

  • Control Plane (Istiod): manages configuration, issues certificates, distributes traffic policies
  • Data Plane (Envoy Sidecar): handles actual traffic, collects metrics, enforces mTLS

VirtualService Configuration Example

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ml-inference-vs
  namespace: production
spec:
  hosts:
    - ml-inference-svc
  http:
    - match:
        - headers:
            x-model-version:
              exact: 'v2'
      route:
        - destination:
            host: ml-inference-svc
            subset: v2
          weight: 100
    - route:
        - destination:
            host: ml-inference-svc
            subset: v1
          weight: 90
        - destination:
            host: ml-inference-svc
            subset: v2
          weight: 10
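The subsets v1 and v2 referenced above must be defined in a companion DestinationRule. A sketch, assuming the Deployments carry matching version pod labels:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: ml-inference-dr
  namespace: production
spec:
  host: ml-inference-svc
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```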

Load Balancing Algorithms

  • Round Robin: distribute requests in order (a common default)
  • Least Connections: select server with fewest active connections
  • Weighted Round Robin: distribute based on server weights
  • IP Hash: pin client to a server based on IP (session affinity)
  • Consistent Hashing: minimize redistribution when adding/removing servers (distributed caching)
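The last of these can be sketched in a few lines of Python — a toy consistent-hash ring with virtual nodes (class and parameter names are illustrative, not any library's API):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring so load
        # spreads evenly and a removal moves only a small share of keys
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash
        idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get("user:42"))  # the same key always lands on the same server
```

Removing one node leaves the other nodes' ring positions untouched, so only the keys that mapped to the removed node get reassigned.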

Service Discovery

In Kubernetes, CoreDNS handles service discovery. Services are reachable via service-name.namespace.svc.cluster.local.


4. Network Security

TLS Certificate Chain

Root CA (trusted by browsers/OS)
  └── Intermediate CA
        └── Leaf Certificate (actual server cert)

Each certificate is signed by the private key of the CA above it, forming a chain of trust.

mTLS (Mutual TLS)

Standard TLS only authenticates the server. mTLS requires both parties to present certificates, enabling bidirectional authentication.

Standard TLS:  Client ----[verifies server cert]----> Server
mTLS:          Client <---[each verifies the other]--> Server

In a service mesh, mTLS is handled automatically by sidecar proxies, securing all service-to-service communication without application changes.

Nginx TLS Configuration

server {
    listen 443 ssl http2;
    server_name api.example.com;

    ssl_certificate     /etc/ssl/certs/server.crt;
    ssl_certificate_key /etc/ssl/private/server.key;

    # Only TLS 1.2+
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
    ssl_prefer_server_ciphers off;

    # HSTS (1 year)
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

    # OCSP Stapling
    ssl_stapling on;
    ssl_stapling_verify on;

    location /api/ {
        proxy_pass http://backend_pool;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Zero Trust Architecture

Core Zero Trust principle: "Never Trust, Always Verify"

  • Identity-based access: authenticate by user/service identity, not IP address
  • Least privilege: grant access only to required resources
  • Continuous verification: evaluate trust level throughout the session
  • Micro-segmentation: divide network into small zones to block lateral movement

JWT & OAuth 2.0

A JWT (JSON Web Token) consists of three parts: Header.Payload.Signature

OAuth 2.0 Authorization Code Flow:

  1. Client redirects user to Authorization Server
  2. After user authentication, Authorization Code is issued
  3. Backend exchanges Code for Access Token
  4. Access Token is used to call Resource Server APIs
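The Header.Payload.Signature structure can be inspected with nothing but base64. A minimal sketch that decodes a hand-built example token without verifying it (in production, always verify the signature with a JWT library):

```python
import base64
import json

def decode_jwt_unverified(token: str) -> tuple[dict, dict]:
    """Split a JWT into header and payload (no signature check --
    for inspection only)."""
    def b64url_decode(seg: str) -> bytes:
        # base64url segments drop their '=' padding; restore it
        return base64.urlsafe_b64decode(seg + "=" * (-len(seg) % 4))
    header_b64, payload_b64, _signature = token.split(".")
    return (json.loads(b64url_decode(header_b64)),
            json.loads(b64url_decode(payload_b64)))

# A hand-built example token (HS256 header, minimal payload)
token = ".".join(
    base64.urlsafe_b64encode(json.dumps(part).encode()).rstrip(b"=").decode()
    for part in ({"alg": "HS256", "typ": "JWT"}, {"sub": "user-1"})
) + ".fake-signature"

header, payload = decode_jwt_unverified(token)
print(header["alg"], payload["sub"])  # HS256 user-1
```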

5. CDN & Edge Computing

How CDNs Work

A CDN (Content Delivery Network) caches content at PoPs (Points of Presence) worldwide to reduce latency.

Cache strategies:

  • Cache-Control: max-age=86400: cache for 1 day in browser and CDN
  • Cache-Control: no-cache: always revalidate with origin
  • ETag / Last-Modified: conditional requests to check for changes
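The ETag mechanism can be sketched with a toy origin handler (function names here are illustrative): the server hashes the body into an ETag, and when the client echoes it back in If-None-Match, an empty 304 replaces the full response.

```python
import hashlib
from typing import Optional

def make_etag(body: bytes) -> str:
    # Strong ETag derived from the content hash
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body: bytes, if_none_match: Optional[str]):
    """Toy handler: return 304 when the client's cached copy is current."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, b"", etag     # Not Modified: client reuses its cache
    return 200, body, etag        # full response, ETag sent for next time

body = b"<h1>hello</h1>"
status, _, etag = respond(body, None)
print(status)                     # 200 on the first request
print(respond(body, etag)[0])     # 304 once the client echoes the ETag
```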

Cloudflare Workers Example

export default {
  async fetch(request, env) {
    const url = new URL(request.url)

    // Route AI inference at the edge
    if (url.pathname.startsWith('/inference/')) {
      const modelId = url.searchParams.get('model') || 'default'

      // Route to the nearest GPU cluster
      // (selectBackend is an app-defined routing helper, not shown here)
      const region = request.cf.region
      const backendUrl = selectBackend(region, modelId)

      return fetch(backendUrl, {
        method: request.method,
        headers: request.headers,
        body: request.body,
      })
    }

    // Static asset caching
    const cache = caches.default
    const cachedResponse = await cache.match(request)
    if (cachedResponse) return cachedResponse

    const response = await fetch(request)
    // The Cache API only stores responses to GET requests
    if (request.method === 'GET' && response.status === 200) {
      await cache.put(request, response.clone())
    }
    return response
  },
}

6. AI Serving Networks

ML Model Serving with gRPC

Protocol Buffers definition:

syntax = "proto3";
package inference;

service InferenceService {
  rpc Predict(PredictRequest) returns (PredictResponse);
  rpc StreamPredict(PredictRequest) returns (stream PredictResponse);
}

message PredictRequest {
  string model_name = 1;
  repeated float input_data = 2;
  map<string, string> metadata = 3;
}

message PredictResponse {
  repeated float output_data = 1;
  float confidence = 2;
  int64 latency_ms = 3;
}

Python gRPC server:

import grpc
from concurrent import futures
import inference_pb2
import inference_pb2_grpc
import numpy as np
import time

class InferenceServicer(inference_pb2_grpc.InferenceServiceServicer):
    def Predict(self, request, context):
        start = time.time()
        input_array = np.array(request.input_data)

        # Actual model inference (example)
        output = input_array * 2.0

        latency = int((time.time() - start) * 1000)
        return inference_pb2.PredictResponse(
            output_data=output.tolist(),
            confidence=0.95,
            latency_ms=latency
        )

    def StreamPredict(self, request, context):
        # Streaming response for LLM token generation
        tokens = ["Hello", " world", "!", " gRPC", " streaming", "."]
        for token in tokens:
            yield inference_pb2.PredictResponse(
                output_data=[float(ord(c)) for c in token],
                confidence=0.9,
                latency_ms=10
            )

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    inference_pb2_grpc.add_InferenceServiceServicer_to_server(
        InferenceServicer(), server
    )
    server.add_insecure_port('[::]:50051')
    server.start()
    print("gRPC server started on port 50051")
    server.wait_for_termination()

if __name__ == '__main__':
    serve()

Real-Time LLM Streaming with SSE

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio
import json

app = FastAPI()

async def llm_token_generator(prompt: str):
    """Stream LLM tokens as SSE events"""
    # In production, use an LLM library here
    tokens = prompt.split()
    for i, token in enumerate(tokens):
        data = json.dumps({"token": token, "index": i})
        yield f"data: {data}\n\n"
        await asyncio.sleep(0.05)  # simulate token generation delay
    yield "data: [DONE]\n\n"  # sentinel event marking the end of the stream

@app.get("/stream")
async def stream_llm(prompt: str = "Hello world"):
    return StreamingResponse(
        llm_token_generator(prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # Disable Nginx buffering
        }
    )
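On the receiving side, the text/event-stream wire format is simple to parse: data: lines accumulate, and a blank line terminates each event. A minimal sketch (real clients should use an SSE library, which also handles event:, id:, and reconnection):

```python
def parse_sse(lines):
    """Minimal parser for the text/event-stream wire format:
    accumulate data: lines, yield one payload per blank line."""
    data_lines = []
    for line in lines:
        if line.startswith("data:"):
            data_lines.append(line[5:].lstrip())
        elif line == "" and data_lines:   # blank line terminates an event
            yield "\n".join(data_lines)
            data_lines = []

events = list(parse_sse(['data: {"token": "Hi"}', "", "data: [DONE]", ""]))
print(events)  # ['{"token": "Hi"}', '[DONE]']
```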

7. Packet Analysis in Practice

tcpdump Commands

# Capture HTTP traffic on a specific interface
sudo tcpdump -i eth0 -w capture.pcap port 80 or port 443

# Filter TCP handshakes only (SYN packets)
sudo tcpdump -i any 'tcp[tcpflags] & tcp-syn != 0'

# Monitor traffic with a specific host
sudo tcpdump -i eth0 host 10.0.0.1 -n

# Monitor DNS queries
sudo tcpdump -i any udp port 53 -v

# Capture gRPC (HTTP/2) traffic
sudo tcpdump -i eth0 port 50051 -w grpc_trace.pcap

Network Debugging with curl

# Check TLS certificate info
curl -vI https://api.example.com 2>&1 | grep -A 20 "SSL connection"

# Verify HTTP/2 is being used
curl --http2 -I https://api.example.com

# Measure response time breakdown
curl -o /dev/null -s -w \
  "DNS: %{time_namelookup}s\nTCP: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" \
  https://api.example.com

# Test gRPC endpoint (grpcurl)
grpcurl -plaintext localhost:50051 list
grpcurl -plaintext -d '{"model_name": "bert", "input_data": [1.0, 2.0]}' \
  localhost:50051 inference.InferenceService/Predict

Key Network Performance Metrics

  • RTT (Round-Trip Time): packet round-trip time, measured with ping
  • Throughput: data transferred per unit time, measured with iperf3
  • Packet Loss: loss rate, critical in unstable UDP environments
  • Jitter: RTT variation, important for real-time streaming

# Measure network bandwidth
iperf3 -s                           # server
iperf3 -c server_ip -t 30 -P 4     # client (4 parallel streams)

Quiz

Q1. Explain the TCP 3-way handshake and 4-way termination process.

Answer:

3-way handshake (connection setup):

  1. Client → Server: SYN (seq=x)
  2. Server → Client: SYN-ACK (seq=y, ack=x+1)
  3. Client → Server: ACK (ack=y+1)

4-way termination (connection teardown):

  1. Client → Server: FIN
  2. Server → Client: ACK
  3. Server → Client: FIN
  4. Client → Server: ACK (enters TIME_WAIT, then fully closes)

Explanation: Termination requires 4 steps because after receiving a FIN, the server may still have data to send. The client enters TIME_WAIT for 2MSL (Maximum Segment Lifetime) to handle any delayed packets from the old connection.
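Both exchanges can be triggered from ordinary socket code: connect() drives the 3-way handshake, and close() starts the teardown. A loopback sketch:

```python
import socket

# Listener on an ephemeral loopback port
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))   # SYN -> SYN-ACK -> ACK happens here
conn, peer = srv.accept()
print("established with", peer)

cli.close()    # client FIN -> server ACK; the closer ends in TIME_WAIT
conn.close()   # server FIN -> client ACK
srv.close()
```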

Q2. How does HTTP/2 multiplexing solve the HOL blocking problem of HTTP/1.1?

Answer: HTTP/2 sends multiple streams concurrently over a single TCP connection. Each stream has an independent ID, so a delay in one stream does not block other streams.

Explanation: HTTP/1.1 supports pipelining but responses must arrive in request order, so a slow response blocks all subsequent ones (HOL blocking). HTTP/2 solves this at the application layer by interleaving frames from different streams. However, TCP-level HOL blocking (a dropped packet stalls the entire connection) is only resolved in HTTP/3, which uses QUIC's independently delivered streams over UDP.

Q3. How does mTLS differ from standard TLS, and what role does it play in a service mesh?

Answer: Standard TLS only authenticates the server certificate. mTLS (Mutual TLS) requires both client and server to present certificates, enabling bidirectional authentication.

Explanation: In a service mesh like Istio, mTLS is handled transparently by sidecar proxies (Envoy). Each service is issued an X.509 certificate with a unique SPIFFE identity. All inter-service traffic is mutually authenticated and encrypted without any application code changes. This defends against eavesdropping, spoofing, and man-in-the-middle attacks even inside the private network.

Q4. Why is gRPC better suited for ML model serving than REST APIs?

Answer: gRPC offers binary serialization with Protocol Buffers, HTTP/2 multiplexing, and bidirectional streaming — all of which are highly advantageous for ML serving workloads.

Explanation: Large tensor payloads serialize 3-5x more compactly with Protocol Buffers than JSON. Server-side streaming enables real-time delivery of LLM-generated tokens to the client. The .proto schema enforces a strict API contract, ensuring type safety across ML pipeline integrations. Persistent HTTP/2 connections also reduce the overhead of establishing new connections per request.

Q5. What are the types of CDN cache invalidation strategies and their trade-offs?

Answer: The main strategies are TTL-based expiration, versioned URLs, API-based instant invalidation, and surrogate key (tag) based invalidation.

Explanation:

  • TTL-based: simple to implement, but stale content may be served until expiration
  • Versioned URL (e.g., style.v2.css): preserves cache hit rate while allowing instant refresh; URL management adds complexity
  • Instant invalidation API: fast propagation but incurs CDN costs; propagation delay of tens of seconds still applies
  • Surrogate keys: invalidate groups of related content at once (e.g., Cloudflare Cache Tags)

In practice, static assets (JS/CSS) use versioned URLs while API responses use short TTLs combined with instant invalidation as needed.