
Network Engineering & Security Master Guide: From TCP/IP to Service Mesh and AI Serving Networks

Table of Contents

  1. Network Fundamentals: OSI & TCP/IP
  2. Socket Programming: asyncio & aiohttp
  3. Service Mesh: Istio & Envoy
  4. Network Security: TLS & Zero Trust
  5. CDN & Edge Computing
  6. AI Serving Networks: gRPC & SSE
  7. Packet Analysis in Practice
  8. Quiz

1. Network Fundamentals

The OSI 7-Layer Model

The OSI (Open Systems Interconnection) model abstracts network communication into seven distinct layers, each with a specific responsibility.

Layer  Name          Protocol Examples   PDU
7      Application   HTTP, DNS, SMTP     Message
6      Presentation  TLS, JPEG, ASCII    Message
5      Session       NetBIOS, RPC        Message
4      Transport     TCP, UDP            Segment
3      Network       IP, ICMP            Packet
2      Data Link     Ethernet, 802.11    Frame
1      Physical      Cable, Fiber        Bits

The TCP/IP Stack

The real internet uses a simplified 4-layer TCP/IP model rather than the full OSI stack:

  • Application Layer: HTTP/2, HTTP/3, DNS, TLS
  • Transport Layer: TCP (reliability), UDP (speed)
  • Internet Layer: IPv4, IPv6, ICMP
  • Link Layer: Ethernet, Wi-Fi

HTTP/2 vs HTTP/3

HTTP/1.1 suffers from Head-of-Line (HOL) blocking because each connection handles only one request at a time.

HTTP/2 improvements:

  • Multiplexing: multiple streams over a single TCP connection
  • Header compression: the HPACK algorithm removes redundant header bytes across requests
  • Server push: proactively send resources before the client requests them (little used in practice, and since removed from major browsers)
  • Binary framing: binary frames instead of plain text

HTTP/3 & QUIC: HTTP/3 runs over QUIC (UDP-based) instead of TCP, eliminating TCP-level HOL blocking entirely.

HTTP/1.1:  [Req1][Res1][Req2][Res2]  (sequential)
HTTP/2:    [Req1, Req2, Req3][Res1, Res2, Res3]  (mux, single TCP)
HTTP/3:    [Req1, Req2, Req3][Res1, Res2, Res3]  (mux, independent QUIC streams)

How DNS Works

When resolving api.example.com:

  1. Browser cache lookup
  2. OS cache lookup (/etc/hosts)
  3. Query sent to Recursive Resolver (ISP DNS)
  4. Hierarchical resolution: Root NS → .com TLD NS → example.com Authoritative NS
  5. Returns A record (IPv4) or AAAA record (IPv6)
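The resolution chain above can be exercised from code through the OS resolver. A minimal sketch using only the Python standard library (the `resolve` helper name is illustrative):

```python
import socket

def resolve(hostname: str, port: int = 443) -> list[str]:
    """Return the unique IP addresses the OS resolver finds for a hostname.

    getaddrinfo() walks the same chain described above: the OS cache and
    /etc/hosts first, then the configured recursive resolver.
    """
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({sockaddr[0] for *_, sockaddr in infos})

print(resolve("localhost"))  # loopback answers, e.g. ['127.0.0.1', '::1']
```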

TLS 1.3 Handshake

TLS 1.3 reduces round trips to 1-RTT for new connections or 0-RTT for session resumption.

Client                            Server
  |--- ClientHello (key share) ----->|
  |<-- ServerHello (key share) ------|
  |<-- EncryptedExtensions ----------|
  |<-- Certificate + CertVerify -----|
  |<-- Finished ---------------------|
  |--- Finished -------------------->|
  |<-> Encrypted application data <->|
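On the client side, Python's ssl module can pin connections to this handshake. A minimal sketch:

```python
import ssl

# Build a client context that refuses anything older than TLS 1.3,
# so every new connection uses the 1-RTT handshake shown above
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_3

print(ctx.minimum_version.name)  # TLSv1_3
```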

2. Socket Programming

Python asyncio TCP Server

import asyncio

async def handle_client(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    addr = writer.get_extra_info('peername')
    print(f"Connected: {addr}")

    try:
        while True:
            data = await reader.read(1024)
            if not data:
                break
            message = data.decode('utf-8').strip()
            print(f"Received: {message} from {addr}")

            # Echo response
            response = f"Echo: {message}\n"
            writer.write(response.encode('utf-8'))
            await writer.drain()
    except ConnectionResetError:
        pass  # client dropped the connection abruptly
    finally:
        print(f"Disconnected: {addr}")
        writer.close()
        await writer.wait_closed()

async def main():
    server = await asyncio.start_server(
        handle_client, '0.0.0.0', 8888
    )
    addr = server.sockets[0].getsockname()
    print(f"Server started: {addr}")

    async with server:
        await server.serve_forever()

if __name__ == '__main__':
    asyncio.run(main())

Async HTTP Client with aiohttp

import asyncio
import aiohttp
from typing import List

async def fetch_url(session: aiohttp.ClientSession, url: str) -> dict:
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
        return {
            'url': url,
            'status': response.status,
            'body': await response.text()
        }

async def fetch_all(urls: List[str]) -> List[dict]:
    connector = aiohttp.TCPConnector(
        limit=100,          # max concurrent connections
        limit_per_host=10,  # max per host
        keepalive_timeout=30
    )
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_url(session, url) for url in urls]
        # return_exceptions=True: failed requests come back as Exception
        # objects in the result list instead of cancelling the whole batch
        return await asyncio.gather(*tasks, return_exceptions=True)

# Run
urls = [f"https://httpbin.org/get?id={i}" for i in range(20)]
results = asyncio.run(fetch_all(urls))

gRPC vs REST Comparison

Aspect           gRPC                           REST
Protocol         HTTP/2                         HTTP/1.1 or HTTP/2
Serialization    Protocol Buffers (binary)      JSON (text)
Streaming        Bidirectional streaming        Limited (SSE, WebSocket separate)
Type safety      Strong typing (.proto schema)  Weak typing
Latency          Low                            Relatively higher
Browser support  Requires grpc-web              Native

3. Service Mesh

Istio Architecture

Istio uses the sidecar pattern, injecting Envoy proxies into each Pod to control service-to-service communication.

  • Control Plane (Istiod): manages configuration, issues certificates, distributes traffic policies
  • Data Plane (Envoy Sidecar): handles actual traffic, collects metrics, enforces mTLS

VirtualService Configuration Example

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ml-inference-vs
  namespace: production
spec:
  hosts:
    - ml-inference-svc
  http:
    - match:
        - headers:
            x-model-version:
              exact: 'v2'
      route:
        - destination:
            host: ml-inference-svc
            subset: v2
          weight: 100
    - route:
        - destination:
            host: ml-inference-svc
            subset: v1
          weight: 90
        - destination:
            host: ml-inference-svc
            subset: v2
          weight: 10
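The subsets v1 and v2 referenced above must be defined in a companion DestinationRule. A sketch, assuming the Deployments carry matching version pod labels:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: ml-inference-dr
  namespace: production
spec:
  host: ml-inference-svc
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```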

Load Balancing Algorithms

  • Round Robin: distribute requests in order (a common default)
  • Least Connections: select server with fewest active connections
  • Weighted Round Robin: distribute based on server weights
  • IP Hash: pin client to a server based on IP (session affinity)
  • Consistent Hashing: minimize redistribution when adding/removing servers (distributed caching)
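The last of these can be sketched in a few lines of Python — a toy consistent-hash ring with virtual nodes (class and parameter names are illustrative, not any library's API):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring so load
        # spreads evenly and a removal moves only a small share of keys
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash
        idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get("user:42"))  # the same key always lands on the same server
```

Removing one node leaves the other nodes' ring positions untouched, so only the keys that mapped to the removed node get reassigned.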

Service Discovery

In Kubernetes, CoreDNS handles service discovery. Services are reachable via service-name.namespace.svc.cluster.local.


4. Network Security

TLS Certificate Chain

Root CA (trusted by browsers/OS)
  └── Intermediate CA
        └── Leaf Certificate (actual server cert)

Each certificate is signed by the private key of the CA above it, forming a chain of trust.

mTLS (Mutual TLS)

Standard TLS only authenticates the server. mTLS requires both parties to present certificates, enabling bidirectional authentication.

Standard TLS:  Client ----[verifies server cert]----> Server
mTLS:          Client <---[each verifies the other]--> Server

In a service mesh, mTLS is handled automatically by sidecar proxies, securing all service-to-service communication without application changes.

Nginx TLS Configuration

server {
    listen 443 ssl http2;
    server_name api.example.com;

    ssl_certificate     /etc/ssl/certs/server.crt;
    ssl_certificate_key /etc/ssl/private/server.key;

    # Only TLS 1.2+
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
    ssl_prefer_server_ciphers off;

    # HSTS (1 year)
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

    # OCSP Stapling
    ssl_stapling on;
    ssl_stapling_verify on;

    location /api/ {
        proxy_pass http://backend_pool;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Zero Trust Architecture

Core Zero Trust principle: "Never Trust, Always Verify"

  • Identity-based access: authenticate by user/service identity, not IP address
  • Least privilege: grant access only to required resources
  • Continuous verification: evaluate trust level throughout the session
  • Micro-segmentation: divide network into small zones to block lateral movement

JWT & OAuth 2.0

A JWT (JSON Web Token) consists of three parts: Header.Payload.Signature

OAuth 2.0 Authorization Code Flow:

  1. Client redirects user to Authorization Server
  2. After user authentication, Authorization Code is issued
  3. Backend exchanges Code for Access Token
  4. Access Token is used to call Resource Server APIs
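The Header.Payload.Signature structure can be inspected with nothing but base64. A minimal sketch that decodes a hand-built example token without verifying it (in production, always verify the signature with a JWT library):

```python
import base64
import json

def decode_jwt_unverified(token: str) -> tuple[dict, dict]:
    """Split a JWT into header and payload (no signature check --
    for inspection only)."""
    def b64url_decode(seg: str) -> bytes:
        # base64url segments drop their '=' padding; restore it
        return base64.urlsafe_b64decode(seg + "=" * (-len(seg) % 4))
    header_b64, payload_b64, _signature = token.split(".")
    return (json.loads(b64url_decode(header_b64)),
            json.loads(b64url_decode(payload_b64)))

# A hand-built example token (HS256 header, minimal payload)
token = ".".join(
    base64.urlsafe_b64encode(json.dumps(part).encode()).rstrip(b"=").decode()
    for part in ({"alg": "HS256", "typ": "JWT"}, {"sub": "user-1"})
) + ".fake-signature"

header, payload = decode_jwt_unverified(token)
print(header["alg"], payload["sub"])  # HS256 user-1
```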

5. CDN & Edge Computing

How CDNs Work

A CDN (Content Delivery Network) caches content at PoPs (Points of Presence) worldwide to reduce latency.

Cache strategies:

  • Cache-Control: max-age=86400: cache for 1 day in browser and CDN
  • Cache-Control: no-cache: always revalidate with origin
  • ETag / Last-Modified: conditional requests to check for changes
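The ETag mechanism can be sketched with a toy origin handler (function names here are illustrative): the server hashes the body into an ETag, and when the client echoes it back in If-None-Match, an empty 304 replaces the full response.

```python
import hashlib
from typing import Optional

def make_etag(body: bytes) -> str:
    # Strong ETag derived from the content hash
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body: bytes, if_none_match: Optional[str]):
    """Toy handler: return 304 when the client's cached copy is current."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, b"", etag     # Not Modified: client reuses its cache
    return 200, body, etag        # full response, ETag sent for next time

body = b"<h1>hello</h1>"
status, _, etag = respond(body, None)
print(status)                     # 200 on the first request
print(respond(body, etag)[0])     # 304 once the client echoes the ETag
```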

Cloudflare Workers Example

export default {
  async fetch(request, env) {
    const url = new URL(request.url)

    // Route AI inference at the edge
    if (url.pathname.startsWith('/inference/')) {
      const modelId = url.searchParams.get('model') || 'default'

      // Route to the nearest GPU cluster
      // (selectBackend is an app-defined routing helper, not shown here)
      const region = request.cf.region
      const backendUrl = selectBackend(region, modelId)

      return fetch(backendUrl, {
        method: request.method,
        headers: request.headers,
        body: request.body,
      })
    }

    // Static asset caching
    const cache = caches.default
    const cachedResponse = await cache.match(request)
    if (cachedResponse) return cachedResponse

    const response = await fetch(request)
    // The Cache API only stores responses to GET requests
    if (request.method === 'GET' && response.status === 200) {
      await cache.put(request, response.clone())
    }
    return response
  },
}

6. AI Serving Networks

ML Model Serving with gRPC

Protocol Buffers definition:

syntax = "proto3";
package inference;

service InferenceService {
  rpc Predict(PredictRequest) returns (PredictResponse);
  rpc StreamPredict(PredictRequest) returns (stream PredictResponse);
}

message PredictRequest {
  string model_name = 1;
  repeated float input_data = 2;
  map<string, string> metadata = 3;
}

message PredictResponse {
  repeated float output_data = 1;
  float confidence = 2;
  int64 latency_ms = 3;
}

Python gRPC server:

import grpc
from concurrent import futures
import inference_pb2
import inference_pb2_grpc
import numpy as np
import time

class InferenceServicer(inference_pb2_grpc.InferenceServiceServicer):
    def Predict(self, request, context):
        start = time.time()
        input_array = np.array(request.input_data)

        # Actual model inference (example)
        output = input_array * 2.0

        latency = int((time.time() - start) * 1000)
        return inference_pb2.PredictResponse(
            output_data=output.tolist(),
            confidence=0.95,
            latency_ms=latency
        )

    def StreamPredict(self, request, context):
        # Streaming response for LLM token generation
        tokens = ["Hello", " world", "!", " gRPC", " streaming", "."]
        for token in tokens:
            yield inference_pb2.PredictResponse(
                output_data=[float(ord(c)) for c in token],
                confidence=0.9,
                latency_ms=10
            )

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    inference_pb2_grpc.add_InferenceServiceServicer_to_server(
        InferenceServicer(), server
    )
    server.add_insecure_port('[::]:50051')
    server.start()
    print("gRPC server started on port 50051")
    server.wait_for_termination()

if __name__ == '__main__':
    serve()

Real-Time LLM Streaming with SSE

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio
import json

app = FastAPI()

async def llm_token_generator(prompt: str):
    """Stream LLM tokens as SSE events"""
    # In production, use an LLM library here
    tokens = prompt.split()
    for i, token in enumerate(tokens):
        data = json.dumps({"token": token, "index": i})
        yield f"data: {data}\n\n"
        await asyncio.sleep(0.05)  # simulate token generation delay
    yield "data: [DONE]\n\n"  # sentinel event marking the end of the stream

@app.get("/stream")
async def stream_llm(prompt: str = "Hello world"):
    return StreamingResponse(
        llm_token_generator(prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # Disable Nginx buffering
        }
    )
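On the receiving side, the text/event-stream wire format is simple to parse: data: lines accumulate, and a blank line terminates each event. A minimal sketch (real clients should use an SSE library, which also handles event:, id:, and reconnection):

```python
def parse_sse(lines):
    """Minimal parser for the text/event-stream wire format:
    accumulate data: lines, yield one payload per blank line."""
    data_lines = []
    for line in lines:
        if line.startswith("data:"):
            data_lines.append(line[5:].lstrip())
        elif line == "" and data_lines:   # blank line terminates an event
            yield "\n".join(data_lines)
            data_lines = []

events = list(parse_sse(['data: {"token": "Hi"}', "", "data: [DONE]", ""]))
print(events)  # ['{"token": "Hi"}', '[DONE]']
```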

7. Packet Analysis in Practice

tcpdump Commands

# Capture HTTP traffic on a specific interface
sudo tcpdump -i eth0 -w capture.pcap port 80 or port 443

# Filter TCP handshakes only (SYN packets)
sudo tcpdump -i any 'tcp[tcpflags] & tcp-syn != 0'

# Monitor traffic with a specific host
sudo tcpdump -i eth0 host 10.0.0.1 -n

# Monitor DNS queries
sudo tcpdump -i any udp port 53 -v

# Capture gRPC (HTTP/2) traffic
sudo tcpdump -i eth0 port 50051 -w grpc_trace.pcap

Network Debugging with curl

# Check TLS certificate info
curl -vI https://api.example.com 2>&1 | grep -A 20 "SSL connection"

# Verify HTTP/2 is being used
curl --http2 -I https://api.example.com

# Measure response time breakdown
curl -o /dev/null -s -w \
  "DNS: %{time_namelookup}s\nTCP: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" \
  https://api.example.com

# Test gRPC endpoint (grpcurl)
grpcurl -plaintext localhost:50051 list
grpcurl -plaintext -d '{"model_name": "bert", "input_data": [1.0, 2.0]}' \
  localhost:50051 inference.InferenceService/Predict

Key Network Performance Metrics

  • RTT (Round-Trip Time): packet round-trip time, measured with ping
  • Throughput: data transferred per unit time, measured with iperf3
  • Packet Loss: loss rate, critical in unstable UDP environments
  • Jitter: RTT variation, important for real-time streaming

# Measure network bandwidth
iperf3 -s                           # server
iperf3 -c server_ip -t 30 -P 4     # client (4 parallel streams)

Quiz

Q1. Explain the TCP 3-way handshake and 4-way termination process.

Answer:

3-way handshake (connection setup):

  1. Client → Server: SYN (seq=x)
  2. Server → Client: SYN-ACK (seq=y, ack=x+1)
  3. Client → Server: ACK (ack=y+1)

4-way termination (connection teardown):

  1. Client → Server: FIN
  2. Server → Client: ACK
  3. Server → Client: FIN
  4. Client → Server: ACK (enters TIME_WAIT, then fully closes)

Explanation: Termination requires 4 steps because after receiving a FIN, the server may still have data to send. The client enters TIME_WAIT for 2MSL (Maximum Segment Lifetime) to handle any delayed packets from the old connection.
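Both exchanges can be triggered from ordinary socket code: connect() drives the 3-way handshake, and close() starts the teardown. A loopback sketch:

```python
import socket

# Listener on an ephemeral loopback port
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))   # SYN -> SYN-ACK -> ACK happens here
conn, peer = srv.accept()
print("established with", peer)

cli.close()    # client FIN -> server ACK; the closer ends in TIME_WAIT
conn.close()   # server FIN -> client ACK
srv.close()
```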

Q2. How does HTTP/2 multiplexing solve the HOL blocking problem of HTTP/1.1?

Answer: HTTP/2 sends multiple streams concurrently over a single TCP connection. Each stream has an independent ID, so a delay in one stream does not block other streams.

Explanation: HTTP/1.1 supports pipelining but responses must arrive in request order, so a slow response blocks all subsequent ones (HOL blocking). HTTP/2 solves this at the application layer by interleaving frames from different streams. However, TCP-level HOL blocking (a dropped packet stalls the entire connection) is only resolved in HTTP/3, which uses QUIC's independently delivered streams over UDP.

Q3. How does mTLS differ from standard TLS, and what role does it play in a service mesh?

Answer: Standard TLS only authenticates the server certificate. mTLS (Mutual TLS) requires both client and server to present certificates, enabling bidirectional authentication.

Explanation: In a service mesh like Istio, mTLS is handled transparently by sidecar proxies (Envoy). Each service is issued an X.509 certificate with a unique SPIFFE identity. All inter-service traffic is mutually authenticated and encrypted without any application code changes. This defends against eavesdropping, spoofing, and man-in-the-middle attacks even inside the private network.

Q4. Why is gRPC better suited for ML model serving than REST APIs?

Answer: gRPC offers binary serialization with Protocol Buffers, HTTP/2 multiplexing, and bidirectional streaming — all of which are highly advantageous for ML serving workloads.

Explanation: Large tensor payloads serialize 3-5x more compactly with Protocol Buffers than JSON. Server-side streaming enables real-time delivery of LLM-generated tokens to the client. The .proto schema enforces a strict API contract, ensuring type safety across ML pipeline integrations. Persistent HTTP/2 connections also reduce the overhead of establishing new connections per request.

Q5. What are the types of CDN cache invalidation strategies and their trade-offs?

Answer: The main strategies are TTL-based expiration, versioned URLs, API-based instant invalidation, and surrogate key (tag) based invalidation.

Explanation:

  • TTL-based: simple to implement, but stale content may be served until expiration
  • Versioned URL (e.g., style.v2.css): preserves cache hit rate while allowing instant refresh; URL management adds complexity
  • Instant invalidation API: fast propagation but incurs CDN costs; propagation delay of tens of seconds still applies
  • Surrogate keys: invalidate groups of related content at once (e.g., Cloudflare Cache Tags)

In practice, static assets (JS/CSS) use versioned URLs while API responses use short TTLs combined with instant invalidation as needed.