Ingress Controller High Availability and Scaling

Introduction
HA Topology: DaemonSet vs Deployment + HPA
HA per Exposure Method
Graceful Drain and Zero-Downtime Deploys
Minimizing Reload Impact
Leader Election
Multi-AZ Spread
Large-Scale Ingress Object Performance
Resource Sizing and Load Testing
Failure Scenarios and Responses
HA in the Gateway API Era
Conclusion
References

Introduction

The Ingress controller is the single point of entry to your cluster. If the entry point dies, every service behind it — no matter how healthy — looks like a total outage from the outside. That is why controller high availability (HA) is not a "nice to have" but a precondition of operations.

HA is not merely "we bumped replicas to 2." Connections in flight must not be cut when the controller reloads, traffic must drain gracefully when a Pod terminates, and traffic must keep flowing even if an entire node or AZ disappears. On top of that, it must scale automatically as traffic grows, and the controller must hold up even as ingress objects swell into the thousands.

Using ingress-nginx as the reference, this article covers HA topology choices, availability per exposure method, graceful drain and zero-downtime deploys, minimizing reload impact, leader election, multi-AZ spread, performance in large-scale ingress environments, resource sizing, load-testing methods, and failure scenarios — all from an operations perspective. As of 2026 the Ingress API is frozen and Gateway API is the successor standard, so we close with how HA design carries over to Gateway API.

HA Topology: DaemonSet vs Deployment + HPA

How you place the controller is the first fork in the road. There are two mainstream patterns.

Aspect	Deployment + HPA	DaemonSet
Placement	N replicas on arbitrary nodes	One per (or labeled) node
Scaling	HPA auto-adjusts replicas	Tied to node count
Fits exposure	cloud LoadBalancer Service	hostNetwork/hostPort
Suited for	Most cloud environments	Bare metal, edge, node=capacity
Resource efficiency	Elastic to load	Fixed cost per node

The default choice in cloud environments is Deployment + HPA. A cloud LoadBalancer receives traffic and distributes it to nodes, while controller replicas scale up and down with load. On bare metal or edge, where the node itself is the entry point, DaemonSet + hostNetwork is common — the controller directly occupies 80/443 on each node, removing one hop.

A baseline Deployment-based HA configuration looks like this:

controller:
  kind: Deployment
  replicaCount: 3
  # the ingress-nginx chart renders a PodDisruptionBudget from minAvailable
  # once more than one replica is configured
  minAvailable: 2
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx

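Three things matter here. (1) A PodDisruptionBudget limits how many Pods voluntary disruptions (node drains, cluster upgrades) can take down at once, so a minimum number of replicas stays available during node maintenance. It does not protect against crashes or node failures, which is what the next two settings are for. (2) podAntiAffinity places replicas on different nodes. (3) topologySpreadConstraints distributes them evenly across AZs as well.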
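The baseline above pins the replica count at three. A minimal sketch of the HPA half of the pattern follows, assuming the ingress-nginx Helm chart's controller.autoscaling values and a working metrics-server in the cluster. The replica bounds and utilization targets are illustrative assumptions, not recommendations.

```yaml
controller:
  autoscaling:
    enabled: true
    # keep the floor at or above the number of zones so the zone spread survives scale-in
    minReplicas: 3
    # assumption: the cluster has at least this many nodes eligible for the controller
    maxReplicas: 9
    # illustrative targets; derive real ones from load testing
    targetCPUUtilizationPercentage: 60
    targetMemoryUtilizationPercentage: 75
  resources:
    requests:
      # utilization is computed against requests, so they must be set
      cpu: 500m
      memory: 512Mi
```

With the required hostname anti-affinity shown earlier, replicas beyond the number of eligible nodes stay Pending, so maxReplicas is worth checking against node count.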

HA per Exposure Method

HA characteristics differ depending on how the controller receives external traffic.

Method A: cloud LoadBalancer Service
[internet] → [cloud LB] → [NodePort] → [controller Pod]
  - LB auto-excludes dead nodes via health checks
  - externalTrafficPolicy: Local preserves client IP + removes an extra hop

Method B: hostNetwork (DaemonSet)
[internet] → [node IP:80/443] → [controller (host network)]
  - removes one hop, lowest latency
  - one controller per node, watch for port conflicts

Method C: NodePort + external LB (self-managed)
[internet] → [external LB] → [NodeIP:NodePort] → [controller]
  - combine with MetalLB etc. on bare metal

externalTrafficPolicy in Method A is an important choice. Cluster (the default) lets every node accept traffic and forward it to a Pod on any node, so distribution is even, but the client IP is masked by source NAT and an extra hop may be added. Local delivers traffic only to Pods on the receiving node, which preserves the client IP and removes the hop. Nodes without a controller Pod then fail the LB health check and are taken out of rotation, so distribution depends on the per-node Pod count. From an HA standpoint, Local makes it important to keep controller Pods evenly distributed across nodes.
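As a sketch of Method A with the Local policy, assuming the ingress-nginx Helm chart's controller.service values (any cloud-specific load balancer annotations are omitted here because they depend on the provider):

```yaml
controller:
  service:
    type: LoadBalancer
    # route only to controller Pods on the receiving node; preserves the client IP
    externalTrafficPolicy: Local
```

Pairing this with the hostname anti-affinity from the baseline configuration keeps one controller Pod per node, which is what makes per-node load even under Local.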

Graceful Drain and Zero-Downtime Deploys

When a controller Pod terminates, if in-flight requests are cut, users see 502s. The key to zero downtime is putting enough grace between "removal from endpoints" and "actual termination."

The termination sequence should flow like this.

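1. The Pod is marked Terminating, and Kubernetes starts removing it from Endpoints (this runs in parallel with the steps below, not ahead of them)
2. preStop hook: sleep to give the LB/kube-proxy time to propagate the removal
3. nginx begins graceful shutdown (stop accepting new connections, finish existing ones)
4. In-flight requests complete within what remains of terminationGracePeriodSeconds
5. Process exits; if the grace period runs out first, the kubelet force-kills the container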

An example configuration in ingress-nginx:

controller:
  terminationGracePeriodSeconds: 300
  lifecycle:
    preStop:
      exec:
        command:
          - /bin/sh
          - -c
          - sleep 30; /wait-shutdown

The preStop sleep is necessary because endpoint removal takes time to propagate to every LB and kube-proxy. Without that grace, a Pod that terminates immediately is still receiving traffic on paths that have not caught up, and those connections are cut. The terminationGracePeriodSeconds countdown includes the time spent in the preStop hook, so it must cover the sleep plus the longest legitimate request (for example a large upload or streaming).

The deploy strategy uses RollingUpdate with maxSurge to keep enough replicas alive at all times.

controller:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

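maxUnavailable: 0 prevents existing Pods from being taken down before new ones are Ready, avoiding a capacity dip during deploys. Combined with the required podAntiAffinity shown earlier, the surge Pod needs a node that does not already run a controller, so keep at least one more eligible node than replicas; otherwise the rollout stalls with a Pending Pod.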

Minimizing Reload Impact

A quirk of ingress-nginx is that it reloads the nginx configuration whenever an Ingress resource changes. Reloads are graceful by default (old workers finish their connections while new workers take new connections), but frequent reloads can cause memory-usage fluctuations, drops of long-lived connections (WebSocket, gRPC streams), and momentary latency spikes.

The following ConfigMap settings help reduce reload shock:

# ConfigMap
data:
  # dynamic endpoint changes are applied via Lua without a reload (enabled by default)
  # increase worker shutdown grace to better preserve in-flight requests
  worker-shutdown-timeout: "240s"
  # keepalive tuning to ease the load right after a reload
  upstream-keepalive-connections: "320"
  upstream-keepalive-timeout: "60"

The key insight is that ingress-nginx handles endpoint (Pod IP) changes without a reload. Common events like Pods scaling or restarting are absorbed by Lua-based dynamic configuration and do not trigger a reload. What triggers a reload is mainly Ingress spec, annotation, or certificate changes. So to reduce reload frequency, it is effective to audit frequent Ingress changes (for example, automation that toggles an annotation on every deploy).

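In operation, monitor reload frequency via metrics (for example, how often nginx_ingress_controller_config_last_reload_successful_timestamp_seconds changes), and if you carry a lot of long-lived traffic like WebSocket/gRPC, set worker-shutdown-timeout generously to minimize drops on reload. Keep it below terminationGracePeriodSeconds minus the preStop sleep, so that a terminating Pod is not killed while its workers are still draining.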
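Reload monitoring could be sketched as a Prometheus rule file like the one below. It assumes controller metrics are enabled and scraped by Prometheus. The alert names, time windows, and thresholds are hypothetical and should be tuned to your own change rate.

```yaml
groups:
  - name: ingress-nginx-reload
    rules:
      # the controller could not apply its most recent configuration
      - alert: IngressNginxReloadFailed
        expr: nginx_ingress_controller_config_last_reload_successful == 0
        for: 5m
        labels:
          severity: critical
      # the last-successful-reload timestamp moved more than 10 times in 15 minutes
      - alert: IngressNginxFrequentReloads
        expr: changes(nginx_ingress_controller_config_last_reload_successful_timestamp_seconds[15m]) > 10
        labels:
          severity: warning
```

The second rule is the one that surfaces automation that edits Ingress objects on every deploy.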

Leader Election

All replicas of ingress-nginx handle traffic, but some work — particularly updating the Ingress status.loadBalancer field — must be done by a single instance. Leader election handles this.

The leader is elected via a Lease object, and only the leader updates the status of Ingress objects. The data plane (actual traffic handling) is served equally by all replicas, so even if the leader dies, traffic is not interrupted — a new leader is simply elected. What operators should know: for leader election to work, RBAC must grant permissions on the Lease resource, and if election-related warning logs repeat, check permissions or the API server connection.

[replica A] ──┐
[replica B] ──┼─→ all handle the data plane (traffic)
[replica C] ──┘
     │
     └─ via Lease, only one is leader → responsible for updating Ingress status
        if the leader dies → automatic re-election, no traffic impact

Multi-AZ Spread

If controllers are concentrated in a single AZ, an AZ failure takes everything down. The topologySpreadConstraints seen earlier are the key tool for AZ spread.

controller:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx

A practical combination is strict spread at the zone level with DoNotSchedule, and best-effort spread at the hostname level with ScheduleAnyway so that scheduling is never blocked. Cloud LBs are usually multi-AZ aware and send traffic only to live AZs, so if the controller is evenly present in every AZ, service continues on the rest even when one AZ dies. The principle for replica count is to size so that "even if one whole AZ drops out, the remaining AZs can carry peak traffic" (N+1 or more).
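The N+1 rule can be made concrete with a small worked example. The inputs are hypothetical: peak traffic that needs 4 replicas' worth of capacity, in a cluster that spans 3 zones.

```yaml
controller:
  # Worked example with assumed inputs:
  #   replicas needed at peak      = 4
  #   zones                        = 3
  #   zones left after one AZ loss = 3 - 1 = 2
  #   replicas needed per zone     = ceil(4 / 2) = 2
  #   total replicas               = 2 per zone x 3 zones = 6
  replicaCount: 6
```

With maxSkew: 1 and DoNotSchedule at the zone level, six replicas land two per zone, so losing any one zone leaves the four replicas that peak traffic needs.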

Large-Scale Ingress Object Performance

The controller behaves differently with dozens of ingress objects than with thousands. As objects grow, the following become problems.

  - The generated nginx config file becomes huge, lengthening reload time
  - Controller memory/CPU usage rises
  - API server watch load increases
  - The cost of a single reload grows, so change frequency directly becomes a burden

Mitigation strategies follow. First, shard the controllers by namespace or domain group. By running multiple IngressClasses and controller instances, each watching only a subset of Ingresses, config size and reload cost are distributed.

# shard A controller
controller:
  ingressClassResource:
    name: nginx-shard-a
    controllerValue: "k8s.io/ingress-nginx-shard-a"
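  # watch only namespaces carrying this label, e.g. team-a and team-b
  # (the label name is an example; verify the scope options for your chart version)
  scope:
    namespaceSelector: "ingress-shard=a"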
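An Ingress opts into a shard through spec.ingressClassName. A minimal sketch, where the Ingress name, host, and backend Service are hypothetical and only the class name comes from the shard definition above:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app            # hypothetical
  namespace: team-a
spec:
  # matches ingressClassResource.name of the shard A controller
  ingressClassName: nginx-shard-a
  rules:
    - host: app.team-a.example.com   # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app            # hypothetical Service
                port:
                  number: 80
```

Only the shard A controller renders this object into its nginx config, so changes to it trigger a reload on that shard alone.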

Second, reduce unnecessary annotation changes to lower reload frequency. Third, monitor reload duration via metrics and increase sharding when it crosses a threshold. In large-scale environments, several small shards are often operationally more stable than a single huge controller.

Resource Sizing and Load Testing

Without proper requests/limits, HA crumbles. If the controller is throttled by hitting its CPU limit, you get a vicious cycle of health-check failure, restart, and traffic oscillation.

Sizing starts with measurement. Use load testing to derive "CPU cost per request" and "memory per concurrent connection," then add headroom on top of peak traffic.

# simple load generation (vegeta)
echo "GET https://app.example.com/" | \
  vegeta attack -rate=2000 -duration=300s | \
  vegeta report

# or a fixed 500 virtual users with k6 (define stages in loadtest.js for a stepwise ramp)
k6 run --vus 500 --duration 5m loadtest.js

Metrics to watch during load testing: controller CPU/memory, active connections, p99 latency, 5xx ratio, and recovery time when you deliberately kill a Pod (chaos). In particular, trigger reloads during load (for example, repeatedly modify an Ingress) and measure whether long-lived connections drop and how large the latency spikes get; only then do you understand the real production behavior.

Sizing guidelines vary by environment, but avoid setting requests so low that the controller cannot get resources when the node is packed, and leave enough room in limits to absorb the momentary memory increase during a reload.
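As a starting point for the guideline above, a sketch using the chart's controller.resources values. The numbers are placeholders to be replaced by what your load test measures, and leaving out a CPU limit is one possible design choice rather than a rule.

```yaml
controller:
  resources:
    requests:
      # placeholder values; set from measured peak usage plus headroom
      cpu: "1"
      memory: 1Gi
    limits:
      # memory limit leaves room for old and new workers coexisting during a reload
      memory: 2Gi
      # no CPU limit here, to avoid the throttling cycle described above;
      # add one if your cluster policy requires it
```

Setting requests close to real usage also keeps the HPA's utilization percentage meaningful.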

Failure Scenarios and Responses

HA design is about drawing out "what can die" in advance.

Failure	Impact	How HA design prevents it
One controller Pod crashes	Partial capacity loss	N replicas + PDB keep remaining capacity
One node down	Loss of that node's Pods	podAntiAffinity spreads across nodes
One AZ down	Loss of that entire AZ	topologySpread + N+1 sizing
Traffic surge	Saturation, rising latency	HPA auto-scale + headroom
Reload storm	Long-lived connection drops, latency spikes	worker-shutdown-timeout, change-frequency control
Ingress object explosion	Reload slowdown	Sharding
Certificate expiry	Total TLS failure	cert-manager auto-renewal + expiry alerts

For each scenario, define in advance how it is detected (metrics/alerts), how it is mitigated automatically (HPA/reschedule), and when manual intervention is needed. With that in place, a real incident can be handled without panic. AZ failure and traffic surges often come together (when one AZ drops, load concentrates on the rest), so verify that N+1 capacity and HPA work in concert.
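The detection side of two rows in the table (saturation showing up as 5xx responses, and certificate expiry) could be sketched as Prometheus rules. This assumes controller metrics are scraped. The alert names and thresholds are hypothetical.

```yaml
groups:
  - name: ingress-nginx-availability
    rules:
      # more than 1% of requests through the controller end in a 5xx
      - alert: IngressNginxHigh5xxRatio
        expr: |
          sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
            / sum(rate(nginx_ingress_controller_requests[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
      # a served certificate expires in under 14 days
      - alert: IngressNginxCertExpiringSoon
        expr: nginx_ingress_controller_ssl_expire_time_seconds - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
```

The certificate rule is a backstop for cert-manager: it fires only when auto-renewal has already failed to happen on schedule.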

HA in the Gateway API Era

As of 2026 the Ingress API is frozen and Gateway API is the successor standard. Most HA principles carry over, but a few things improve. Gateway API implementations (Envoy-based Contour/Istio, Cilium, NGINX Gateway Fabric, and others) separate the control plane and data plane more clearly, so config changes often do not directly translate into a data-plane reload. Envoy-based implementations push configuration dynamically via xDS, reducing the ingress-nginx-style "file reload" shock.

The three-tier model of Gateway API (GatewayClass/Gateway/HTTPRoute) also has clear separation of responsibilities, letting you naturally split the problem of too many Routes on one Gateway across multiple Gateways. This expresses the "sharding" seen earlier more elegantly at the API level. If you build PDBs, antiAffinity, topologySpread, and graceful drain properly on ingress-nginx today, these operational patterns remain almost entirely valid when you migrate to Gateway API.
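Sharding at the API level could be sketched as one Gateway per team group, with each HTTPRoute attaching to its Gateway through parentRefs. All names, the namespace label, and the GatewayClass below are hypothetical. The GatewayClass depends on which implementation you run.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway-shard-a          # hypothetical
  namespace: infra               # hypothetical
spec:
  gatewayClassName: example-gateway-class   # assumption: provided by your implementation
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          # accept Routes only from namespaces labeled for this shard
          from: Selector
          selector:
            matchLabels:
              gateway-shard: a
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app                      # hypothetical
  namespace: team-a
spec:
  parentRefs:
    - name: gateway-shard-a
      namespace: infra
  hostnames:
    - app.team-a.example.com     # hypothetical host
  rules:
    - backendRefs:
        - name: app              # hypothetical Service
          port: 80
```

A second Gateway with a different label selector takes the next group of namespaces, which mirrors the IngressClass-per-shard layout used with ingress-nginx.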

Conclusion

The heart of Ingress controller HA is "not turning a single entry point into a single point of failure." Protect capacity with replica multiplicity and PDBs, spread across nodes and AZs with antiAffinity and topologySpread, never cut traffic during changes with graceful drain and zero-downtime deploys, adapt to load with HPA, and handle scale with sharding.

None of this is completed in one pass. Validate your assumptions with load testing and deliberate fault injection (chaos), observe real behavior via metrics, and refine gradually — that is the essence of HA operations. Only when the entry point does not waver can every service behind it run with confidence.

References

Kubernetes Ingress concepts: https://kubernetes.io/docs/concepts/services-networking/ingress/
ingress-nginx deploy/HA guide: https://kubernetes.github.io/ingress-nginx/deploy/
ingress-nginx ConfigMap options: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/
Kubernetes PodDisruptionBudget: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
Kubernetes topologySpreadConstraints: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
Kubernetes HorizontalPodAutoscaler: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
Gateway API: https://gateway-api.sigs.k8s.io/
Contour (Envoy-based): https://projectcontour.io/docs/
Traefik high availability documentation: https://doc.traefik.io/traefik/
cert-manager: https://cert-manager.io/docs/