ingress-nginx Production Tuning — Performance, Timeouts, Connections, keepalive

Introduction
Worker and Connection Tuning
Upstream keepalive
The Three Timeouts
proxy buffering
gzip and brotli Compression
Large Uploads
Rate Limiting Annotations
HPA and Controller Scaling
Graceful Reload and Drain
Metric-Based Capacity Planning
Session Affinity and Load-Balancing Algorithm
Connection Handling and Client-Side keepalive
Load Scenarios and Pitfalls
WebSocket and gRPC Long-Lived Connections
Precedence of Global vs Ingress Settings
Warm-Up and Scale-Out Stabilization
Relationship with Gateway API
Timeout Budget and End-to-End Consistency
A Practical Tuning Workflow
Summary of Frequently Used ConfigMap Options
Key Takeaways
Conclusion
References

Introduction

Installing ingress-nginx and bringing up an ingress or two is not hard. The real difficulty begins as traffic grows. An ingress that behaves well under normal load starts emitting intermittent 502s at peak, large uploads are rejected with 413, and on days when the backend is slow, 504s pour in. On top of that, frequent configuration changes trigger reloads that can spike latency.

This article gathers the tuning know-how needed to run ingress-nginx reliably in production: workers and connections, keepalive, timeouts, buffering, compression, rate limiting, scaling, graceful reload, and a 502/504 debugging flow, all in a form you can apply directly. The right value for every setting depends on your traffic profile, so treat the numbers as starting points and validate them with metrics.

Worker and Connection Tuning

nginx's processing capacity is ultimately the product of the number of worker processes and the connections each worker can handle. In ingress-nginx you control these via the ConfigMap.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  worker-processes: "auto"
  max-worker-connections: "16384"
  max-worker-open-files: "65536"

Setting worker-processes to auto sizes the worker count from the CPUs the controller detects. If the container has a clear CPU limit, set the value explicitly to match that limit.
max-worker-connections is the number of connections a single worker can hold open simultaneously. It counts both client-side and upstream-side connections, so a proxied request uses two. Set it generously, and raise it together with the file descriptor limit (max-worker-open-files).

Too many workers relative to CPU adds context-switching overhead; too few bottlenecks throughput. Generally auto, matching CPU cores 1:1, is a safe choice.
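
As a minimal sketch of aligning the worker count with a CPU limit, the two fragments below pin both to the same number. The figure of 4 CPUs and the memory size are illustrative assumptions, not recommendations.

```yaml
# Controller container fragment (Deployment pod template).
# Assumption: the controller is limited to 4 CPUs; sizes are illustrative.
resources:
  requests:
    cpu: "4"
    memory: 2Gi
  limits:
    cpu: "4"
    memory: 2Gi
---
# ConfigMap fragment: one worker per CPU in the limit above.
data:
  worker-processes: "4"
```

Keeping requests equal to limits here is a design choice for predictable CPU availability; a lower request with the same limit is also valid if the cluster is tightly packed.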

Upstream keepalive

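One of the highest-impact items in performance tuning is upstream keepalive. Without it, every request to the backend opens a new TCP connection (plus a TLS handshake when the backend speaks HTTPS) and leaves a TIME_WAIT socket behind. Recent ingress-nginx releases ship with upstream keepalive enabled, and the values below match the documented defaults, so the task is usually to size the pool and align the timeout rather than to switch the feature on.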

data:
  upstream-keepalive-connections: "320"
  upstream-keepalive-requests: "10000"
  upstream-keepalive-timeout: "60"

upstream-keepalive-connections is the maximum number of idle connections each worker keeps open to the backends. Reusing these connections reduces both latency and CPU. But if the timeout is misaligned with the backend (especially app servers that close keep-alive connections quickly), nginx can send a request on a connection the backend is already closing, which surfaces as a 502. It is therefore safer to set upstream-keepalive-timeout shorter than the backend's idle keep-alive timeout.

The Three Timeouts

Timeouts are what separate 504 from 502. Let us clearly distinguish the three most commonly tuned.

Item	Meaning	Annotation
connect timeout	Wait to establish backend TCP connection	proxy-connect-timeout
send timeout	Wait while sending the request to the backend	proxy-send-timeout
read timeout	Wait while reading the backend response	proxy-read-timeout

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"

A long connect timeout slows fault detection when the backend is down, so keep it short (a few seconds). Set the read timeout high enough to allow slow backend work (report generation, etc.), but stretching it indefinitely exhausts worker connections. If 504s are frequent, compare the read timeout against actual backend processing time.

proxy buffering

Buffering is the mechanism that protects the backend from slow clients. When nginx buffers the backend response, the backend can finish responding quickly and move on to the next request.

data:
  proxy-buffering: "on"
  proxy-buffer-size: "8k"
  proxy-buffers-number: "4"

However, for streaming (SSE, long downloads, gRPC streams) buffering actually delays the response. Turn buffering off per-ingress on such routes.

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "off"

gzip and brotli Compression

Text responses (JSON, HTML, JS) can greatly improve bandwidth and perceived speed through compression.

data:
  use-gzip: "true"
  gzip-level: "5"
  gzip-types: "application/json application/javascript text/css text/plain"

The compression level is a CPU vs ratio trade-off; usually 4 to 6 is appropriate. brotli requires the module to be included in the build and offers a better ratio than gzip for static assets. Exclude already-compressed content (images, video) from compression.
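
A minimal sketch of the brotli counterpart to the gzip settings, assuming the controller image you run includes the brotli module. The level and MIME types are illustrative and simply mirror the gzip example above.

```yaml
data:
  # Assumption: the brotli module is present in the controller image.
  enable-brotli: "true"
  # Same CPU vs ratio trade-off as gzip-level; validate under load.
  brotli-level: "4"
  brotli-types: "application/json application/javascript text/css text/plain"
```

Clients that do not advertise brotli support in Accept-Encoding still fall back to gzip, so the two settings can be enabled side by side.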

Large Uploads

By default, bodies larger than 1MB are blocked with 413. On upload routes you must adjust body size together with buffers and timeouts.

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "200m"
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"

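Setting proxy-request-buffering to off makes nginx stream the request body to the backend as it arrives instead of buffering the whole body first, which reduces disk and memory use for large uploads. The backend must be able to handle a streamed body, though, and nginx can no longer retry the request against another upstream once it has started sending the body.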

Rate Limiting Annotations

ingress-nginx provides per-ingress rate limiting via annotations.

metadata:
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "20"
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"
    nginx.ingress.kubernetes.io/limit-connections: "10"

limit-rps is requests per second per client IP; limit-connections is simultaneous connections per client IP. The burst multiplier sets the burst size as a multiple of the rate, so momentary spikes are tolerated. Note that limits are counted per controller pod, so precise global limits are hard to enforce with multiple controller replicas. If you need strict global rate limiting, it is better to add a separate API gateway layer.
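
To make the arithmetic concrete, here is a sketch of a complete Ingress carrying these annotations. The resource name, namespace, host, and backend Service are hypothetical placeholders.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api            # hypothetical
  namespace: default   # hypothetical
  annotations:
    # Steady rate: 20 requests per second per client IP.
    nginx.ingress.kubernetes.io/limit-rps: "20"
    # Burst size = 20 x 3 = 60 requests.
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"
    # At most 10 concurrent connections per client IP.
    nginx.ingress.kubernetes.io/limit-connections: "10"
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com   # hypothetical
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api     # hypothetical
                port:
                  number: 80
```

With three controller pods behind the load balancer, a single client whose requests are spread evenly across them could reach roughly 3 x 20 = 60 requests per second in aggregate, which is why these limits are only approximate at the global level.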

HPA and Controller Scaling

The ingress-nginx controller itself must scale with traffic. It is usually deployed as a Deployment or DaemonSet; if a Deployment, you can attach an HPA.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingress-nginx-controller
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

The controller must not be a single point of failure, so set minReplicas to 2 or more (ideally 3) and configure topologySpreadConstraints or anti-affinity for node distribution. Also account for the temporary reload cost while a new pod warms up during scale-out.
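
A minimal sketch of the node-distribution part, written as a pod template fragment for the controller Deployment. The label values are an assumption about how the controller pods are labeled; adjust them to match your installation.

```yaml
# Pod template fragment for the controller Deployment.
# Assumption: controller pods carry the labels used in the selector below.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    # DoNotSchedule is strict; ScheduleAnyway is the softer alternative.
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
        app.kubernetes.io/component: controller
```

The strict setting can leave replicas pending when there are fewer schedulable nodes than the spread requires, so the softer mode may suit small clusters better.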

Graceful Reload and Drain

Reloads are unavoidable during deploys and config changes. The key is to ensure that reloads and pod termination do not cut in-flight connections.

data:
  worker-shutdown-timeout: "240s"

# pod spec (controller Deployment)
terminationGracePeriodSeconds: 300
# under the controller container
lifecycle:
  preStop:
    exec:
      command: ["/wait-shutdown"]

worker-shutdown-timeout is how long old workers wait after a reload to finish in-progress requests before they are forced to exit. A preStop hook and a terminationGracePeriodSeconds longer than worker-shutdown-timeout let the pod drain in-flight requests on termination. It is also important that the terminating pod is removed from the load balancer pool before it stops accepting new connections.

Metric-Based Capacity Planning

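Tuning should be done with metrics, not guesses. ingress-nginx exposes Prometheus metrics on the controller's metrics port (10254 by default). Metrics collection is controlled by the --enable-metrics controller flag rather than by the ConfigMap; with the Helm chart, the metrics endpoint is exposed through chart values.

# Helm values (ingress-nginx chart)
controller:
  metrics:
    enabled: true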

Key metrics to observe are below.

Metric	Meaning	Use
nginx_ingress_controller_requests	Requests (by status code)	Traffic / error rate
nginx_ingress_controller_request_duration_seconds	Request latency distribution (histogram)	Latency p50/p95/p99
nginx_ingress_controller_nginx_process_connections	Active connections	Connection limit check
nginx_ingress_controller_config_last_reload_successful	Reload success flag	Detect reload failure

Watch p99 latency and active connection trends to tune max-worker-connections and the keepalive pool. Overlaying error-rate spikes with reload timestamps lets you separate reload-caused issues from backend-caused ones.
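
As a sketch of how the latency and error-rate signals might be precomputed, the recording rules below derive a per-ingress p99 and a 5xx ratio. This assumes the Prometheus Operator is installed (the PrometheusRule resource comes from it); the rule and resource names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ingress-nginx-capacity   # hypothetical
  namespace: ingress-nginx
spec:
  groups:
    - name: ingress-nginx.capacity
      rules:
        # p99 request latency per ingress over a 5-minute window.
        - record: ingress:request_duration_seconds:p99_5m
          expr: |
            histogram_quantile(0.99,
              sum by (le, ingress) (
                rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])))
        # Share of responses with a 5xx status per ingress.
        - record: ingress:requests:error_ratio_5m
          expr: |
            sum by (ingress) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
            /
            sum by (ingress) (rate(nginx_ingress_controller_requests[5m]))
```

Plotting both series on the same dashboard as reload events makes the reload-versus-backend comparison described above easier to do by eye.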

Session Affinity and Load-Balancing Algorithm

For stateful applications or cache efficiency, you sometimes need to send the same client to the same backend. ingress-nginx provides cookie-based session affinity.

metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"

However, session affinity can skew load distribution. If traffic concentrates on a particular backend, scaling out stops helping, so apply affinity only on routes that truly need it and design applications to be stateless where possible.

The load-balancing algorithm is also adjustable via the global ConfigMap.

data:
  load-balance: "ewma"

The default is round_robin, but ewma (Exponentially Weighted Moving Average) uses each backend's recent response time as a weight and sends traffic to faster backends. It often improves p99 latency in environments with high variance in backend response times.

Connection Handling and Client-Side keepalive

So far we have covered upstream (backend) keepalive, but client-side keepalive matters too. Keeping client connections open for an appropriate time reduces TCP and TLS handshake cost.

data:
  keep-alive: "75"
  keep-alive-requests: "1000"

keep-alive is how long (seconds) to keep a client connection; keep-alive-requests is the max requests per connection. If a CDN or load balancer sits in front, align with its idle timeout. If the front LB's idle timeout is longer than the ingress keep-alive, the LB may try to reuse an already-closed connection and cause a 502, so it is safer to keep the ingress side longer.

Load Scenarios and Pitfalls

The most common pitfall is a reload storm. If ConfigMaps, Secrets, or ingresses change frequently, nginx keeps reloading, and each reload spins up new workers, making memory and connections lurch. It is especially risky when cert-manager renews certificates often or automation repeatedly edits ingresses. Batch your changes and lower the frequency.

[ 502 vs 504 cause flowchart ]

request failed
   │
   ├─ 504 Gateway Timeout?
   │     │
   │     ├─ backend response slow ──▶ check proxy-read-timeout + backend latency
   │     └─ backend connect delay ──▶ proxy-connect-timeout + backend health
   │
   └─ 502 Bad Gateway?
         │
         ├─ upstream keepalive race ──▶ keepalive-timeout shorter than backend
         ├─ backend closes early ──▶ backend keep-alive config
         ├─ backend OOM/crash ──▶ check pod logs/restarts
         └─ transient during reload ──▶ tune worker-shutdown-timeout/grace

A 504 is almost always a "slowness" problem (timeout or slow backend), while a 502 is a "breakage" problem (a connection terminated improperly). Distinguishing these two first lets you narrow the cause far faster.

WebSocket and gRPC Long-Lived Connections

WebSocket and gRPC streaming need different tuning from ordinary HTTP requests. Because the connection stays open for a long time, short read/send timeouts cut healthy connections.

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"

For WebSocket, ingress-nginx handles the Upgrade header automatically, so it needs little extra config beyond raising the timeouts enough. gRPC requires backend-protocol set to GRPC for the HTTP/2-based proxy to work; that annotation belongs only on gRPC ingresses, not on WebSocket ones. Services with many long-lived connections can reach the max-worker-connections limit quickly, so watch the connection metrics more carefully than usual.
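
The combined snippet above can be read as two separate cases. A sketch of how the annotations would be split across a WebSocket ingress and a gRPC ingress, reusing the same illustrative one-hour timeouts:

```yaml
# WebSocket ingress: only the timeouts are raised.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
---
# gRPC ingress: backend-protocol switches the proxy to gRPC.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
```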

Precedence of Global vs Ingress Settings

If you set the same option in both the ConfigMap (global) and an annotation (ingress), which wins? Generally the ingress annotation overrides the global value for that ingress only. This enables a pattern of keeping conservative global defaults and making exceptions for specific routes.

Layer	Scope	Precedence
ConfigMap	Whole controller	Low (default)
Annotation	Individual ingress	High (overrides)

For example, set proxy-body-size to 10m in the global ConfigMap and 200m via annotation only on the upload-route ingress. This protects most routes with a conservative limit while relaxing only the routes that need it. Note, however, that security-related global settings such as allow-snippet-annotations and annotations-risk-level are designed so that annotations cannot override them.
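
The proxy-body-size example above, sketched as the two fragments involved:

```yaml
# Global default (controller ConfigMap).
data:
  proxy-body-size: "10m"
---
# Exception on the upload-route Ingress only.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "200m"
```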

Warm-Up and Scale-Out Stabilization

Right after scale-out, a new controller pod has an empty cache and an empty upstream keepalive pool, so latency is temporarily high. If the readiness probe passes too early, traffic flows to an unready pod and 502s can spike.

# pod spec
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
  initialDelaySeconds: 10
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 3

Set initialDelaySeconds appropriately so a new pod takes traffic only after warming up enough, and avoid setting the HPA scale-out too aggressively for stability. If traffic spikes are predictable (events, sales), consider prewarming via pre-scaling.
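
One way to keep scale-out from being too aggressive is the behavior field of the autoscaling/v2 HPA. The fragment below is a sketch with assumed values (two pods per minute on the way up, a five-minute window on the way down) that would need tuning against real traffic.

```yaml
# Fragment to merge into the HorizontalPodAutoscaler spec.
spec:
  behavior:
    scaleUp:
      # Wait for a sustained signal before adding pods.
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2            # add at most 2 pods per period
          periodSeconds: 60
    scaleDown:
      # Scale in slowly so warmed-up pods are not discarded too early.
      stabilizationWindowSeconds: 300
```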

Relationship with Gateway API

From a performance-tuning angle, it is also worth keeping the shift toward Gateway API in mind. The worker, connection, keepalive, and timeout concepts here are intrinsic to the nginx data plane, so they apply largely unchanged to Gateway API implementations, especially when nginx is the data plane, as in NGINX Gateway Fabric. What changes is how they are expressed: standard CRDs or policy resources instead of annotations. Since the Ingress API is frozen, designing new tuning policy in Gateway API's policy model pays off long term.

Timeout Budget and End-to-End Consistency

Timeouts must be consistent across the whole path, not just within a single component. In the chain client to LB to ingress to backend to DB, if the stages' timeouts are misaligned, one side has already given up while another keeps waiting, which is pure waste.

[ end-to-end timeout budget example ]

client          LB           ingress        backend        DB
  30s    ──▶   28s   ──▶    25s    ──▶     20s   ──▶   10s
 (longest)                                         (shortest)

The principle is: longer at the outside, shorter at the inside. If the inner (DB) timeout is longer than the outer (ingress), the backend keeps waiting for the DB response even after the ingress cut it with a 504, holding connections and resources. This accelerates resource exhaustion under load. Conversely, keeping the inside short makes slow work fail fast and lets retry policies act cleanly.

The ingress proxy-read-timeout should be slightly longer than backend processing time within this budget but shorter than the LB's idle timeout. The key is making "who cuts first" predictable.

For example, reflecting the budget above into an ingress looks like this.

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "3"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "25"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "25"

Explicitly documenting a staged budget like this and reflecting it into each component makes it far easier to estimate "where it was cut" during an incident. Think of a timeout not as a single number but as a contract across the whole path.

A Practical Tuning Workflow

Applying everything covered so far all at once leaves you unable to tell which change had which effect. Apply changes gradually with the following workflow.

[ tuning workflow ]

1. measure baseline ── record current p50/p95/p99, error rate, connections, reload freq
2. form hypothesis ── be explicit about which metric improves via which change
3. single change ──── change only one option at a time (staging first)
4. load test ─────── apply load that mimics real traffic patterns
5. compare/judge ──── verify improvement/regression vs baseline
6. roll out/back ──── if better, production canary; if regression, roll back immediately

The most common mistake is changing several options at once. Adjusting keepalive, timeouts, and buffers together means that even if p99 improves you cannot pinpoint the cause, and if a regression appears later it is hard to trace where it came from. Changing one thing at a time and validating with metrics is ultimately the fastest path.

For load-testing tools, k6, vegeta, and wrk are commonly used. Beyond raw RPS, you must mimic the traffic shape of the real service (concurrent connections, body sizes, the proportion of long-lived connections) to get meaningful results.

Summary of Frequently Used ConfigMap Options

Finally, here is a single table of global options frequently tuned in production.

Option	Role	Starting point
worker-processes	Number of workers	auto
max-worker-connections	Connections per worker	16384
upstream-keepalive-connections	Backend idle pool	320
keep-alive	Client keepalive seconds	75
proxy-body-size	Body size limit	per use case
use-gzip	gzip compression	true
load-balance	Balancing algorithm	round_robin or ewma
worker-shutdown-timeout	Graceful shutdown wait	240s

These are starting points, not answers. Always validate with each environment's metrics.

Key Takeaways

Here are the principles to remember in tuning.

Do not guess; measure a baseline, then change one thing at a time and validate with metrics.
Cap capacity with worker-processes (auto) and max-worker-connections.
Upstream keepalive has high impact, but keep it shorter than the backend keepalive to avoid 502 races.
Keep connect short; set read/send to match the work. But keep them shorter than the LB so "who cuts" is predictable.
Turn buffering off on streaming/upload routes, and compress text responses.
Reload storms are the most common pitfall. Batch changes and lower frequency via GitOps.
504 means "slow", 502 means "broken". This distinction drives your debugging speed.

These principles are intrinsic to the nginx data plane, so they carry over largely unchanged to nginx-based Gateway API implementations.

Conclusion

The essence of ingress-nginx tuning is to understand how the data plane works and validate with metrics rather than guessing. Set capacity with workers and connections, raise efficiency with upstream keepalive, detect faults fast with the three timeouts, save resources with buffering and compression, and avoid dropped connections with graceful reload. Just learning the flow to separate 502 from 504 dramatically changes your incident-response speed. All of this intuition carries over as an asset into the Gateway API era.

References

ingress-nginx ConfigMap options: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/
ingress-nginx annotations: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/
ingress-nginx metrics/monitoring: https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/
Kubernetes HPA: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
Kubernetes Ingress concept: https://kubernetes.io/docs/concepts/services-networking/ingress/
Gateway API: https://gateway-api.sigs.k8s.io/
NGINX Gateway Fabric: https://docs.nginx.com/nginx-gateway-fabric/
HAProxy Kubernetes Ingress: https://www.haproxy.com/documentation/kubernetes-ingress/
Traefik docs: https://doc.traefik.io/traefik/
Envoy/Contour: https://projectcontour.io/