ingress-nginx Production Tuning — Performance, Timeouts, Connections, keepalive
- Author: Youngju Kim (@fjvbn20031)
- Introduction
- Worker and Connection Tuning
- Upstream keepalive
- The Three Timeouts
- proxy buffering
- gzip and brotli Compression
- Large Uploads
- Rate Limiting Annotations
- HPA and Controller Scaling
- Graceful Reload and Drain
- Metric-Based Capacity Planning
- Session Affinity and Load-Balancing Algorithm
- Connection Handling and Client-Side keepalive
- Load Scenarios and Pitfalls
- WebSocket and gRPC Long-Lived Connections
- Precedence of Global vs Ingress Settings
- Warm-Up and Scale-Out Stabilization
- Relationship with Gateway API
- Timeout Budget and End-to-End Consistency
- A Practical Tuning Workflow
- Summary of Frequently Used ConfigMap Options
- Key Takeaways
- Conclusion
- References
Introduction
Installing ingress-nginx and bringing up an ingress or two is not hard. The real difficulty begins as traffic grows. An ingress that behaves fine normally starts emitting intermittent 502s at peak, large uploads get blocked with 413, and on days when the backend is slow, 504s pour in. And every deploy triggers a reload that can spike latency.
This article gathers the tuning know-how needed to run ingress-nginx reliably in production. It covers workers and connections, keepalive, timeouts, buffering, compression, rate limiting, scaling, graceful reload, and a 502/504 debugging flow, all in a form you can apply directly. The right value for every setting depends on your traffic profile, so treat the numbers as starting points and validate them with metrics.
Worker and Connection Tuning
nginx's processing capacity is ultimately the product of the number of worker processes and the connections each worker can handle. In ingress-nginx you control these via the ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  worker-processes: "auto"
  max-worker-connections: "16384"
  max-worker-open-files: "65536"
- Setting worker-processes to auto matches the CPUs allocated to the container. If you have a clear CPU limit, align it to that.
- max-worker-connections is how many connections a single worker can open simultaneously. It counts both client and upstream sides, so set it generously, and raise it together with the file descriptor limit (max-worker-open-files).
Too many workers relative to CPU adds context-switching overhead; too few bottlenecks throughput. Generally auto, matching CPU cores 1:1, is a safe choice.
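As a rough worked example of the product described above, the fragment below assumes a controller container limited to 4 CPUs (a hypothetical figure, not taken from any measured workload) and pins the worker count to match. The arithmetic in the comments is only an upper bound, since each proxied request holds one client-side and one upstream-side connection.

```yaml
# Controller container spec (fragment). The 4-CPU sizing is an assumption for illustration.
resources:
  requests:
    cpu: "4"
  limits:
    cpu: "4"
---
# ConfigMap data (fragment)
data:
  worker-processes: "4"              # one worker per allocated CPU
  max-worker-connections: "16384"
  # 4 workers x 16384 connections = 65536 connection slots in total.
  # A proxied request uses 2 slots (client + upstream),
  # so the ceiling is roughly 32768 concurrent proxied requests.
```

Idle keepalive connections also occupy slots, so the usable figure is lower in practice; the connection metrics are the real guide.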
Upstream keepalive
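One of the highest-impact items in performance tuning is upstream keepalive. Without it, each request to the backend may open a new TCP connection, which multiplies handshakes (including TLS handshakes when the backend speaks HTTPS) and TIME_WAIT sockets. Recent ingress-nginx releases enable upstream keepalive by default, and the values below match the documented defaults, so the work here is mostly sizing the pool and aligning the timeout with the backend.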
data:
  upstream-keepalive-connections: "320"
  upstream-keepalive-requests: "10000"
  upstream-keepalive-timeout: "60"
upstream-keepalive-connections is the idle connection pool each worker keeps to the backend. Enabling it reduces both latency and CPU through connection reuse. But if the timeout is misaligned with the backend (especially app servers that close keep-alive quickly), it can cause 502s from races, so it is safer to set it shorter than the backend's keepalive.
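A minimal sketch of that alignment rule, assuming a hypothetical backend whose own keep-alive idle timeout is 60 seconds (check your app server's actual setting): the ingress side is set a few seconds shorter so that nginx, not the backend, closes idle connections first.

```yaml
# ConfigMap data (fragment). The 60s backend keep-alive is an assumed value.
data:
  upstream-keepalive-connections: "320"
  upstream-keepalive-requests: "10000"
  upstream-keepalive-timeout: "50"   # shorter than the backend's assumed 60s
```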
The Three Timeouts
Timeouts are what separate 504 from 502. Let us clearly distinguish the three most commonly tuned.
| Item | Meaning | Annotation |
|---|---|---|
| connect timeout | Wait to establish backend TCP connection | proxy-connect-timeout |
| send timeout | Max wait between two successive writes while sending the request to the backend | proxy-send-timeout |
| read timeout | Max wait between two successive reads while receiving the backend response | proxy-read-timeout |
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
A long connect timeout slows fault detection when the backend is down, so keep it short (a few seconds). Set the read timeout high enough to allow slow backend work (report generation, etc.), but stretching it indefinitely exhausts worker connections. If 504s are frequent, compare the read timeout against actual backend processing time.
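To show where these annotations sit in a complete manifest, here is a sketch of an Ingress for a slow report-generation route. The names reports-api, reports.example.com, and reports-svc, the port, and the 120-second read timeout are all hypothetical; the ingressClassName assumes the controller is registered as nginx.

```yaml
# Hypothetical names and values throughout; adjust to your own service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: reports-api
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"    # fail fast if the backend is down
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"     # allows slow report generation
spec:
  ingressClassName: nginx
  rules:
    - host: reports.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: reports-svc
                port:
                  number: 8080
```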
proxy buffering
Buffering is the mechanism that protects the backend from slow clients. When nginx buffers the backend response, the backend can finish responding quickly and move on to the next request.
data:
  proxy-buffering: "on"
  proxy-buffer-size: "8k"
  proxy-buffers-number: "4"
However, for streaming (SSE, long downloads, gRPC streams) buffering actually delays the response. Turn buffering off per-ingress on such routes.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
gzip and brotli Compression
Text responses (JSON, HTML, JS) can greatly improve bandwidth and perceived speed through compression.
data:
  use-gzip: "true"
  gzip-level: "5"
  gzip-types: "application/json application/javascript text/css text/plain"
The compression level is a CPU vs ratio trade-off; usually 4 to 6 is appropriate. brotli requires the module to be included in the build and offers a better ratio than gzip for static assets. Exclude already-compressed content (images, video) from compression.
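For brotli, a ConfigMap fragment might look like the following. The keys are the ones listed in the ingress-nginx ConfigMap documentation; the level and type list are illustrative, and the settings only take effect if your controller image actually includes the brotli module, so verify against your version.

```yaml
# ConfigMap data (fragment). Assumes the controller build ships the brotli module.
data:
  enable-brotli: "true"
  brotli-level: "4"
  brotli-types: "application/json application/javascript text/css text/plain"
```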
Large Uploads
By default, bodies larger than 1MB are blocked with 413. On upload routes you must adjust body size together with buffers and timeouts.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "200m"
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
Setting proxy-request-buffering off makes nginx stream to the backend before receiving the full request body, reducing disk/memory use for large uploads. The backend must be able to handle a streamed body, though.
Rate Limiting Annotations
ingress-nginx provides per-ingress rate limiting via annotations.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "20"
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"
    nginx.ingress.kubernetes.io/limit-connections: "10"
limit-rps is requests per second per client IP; limit-connections is simultaneous connections. The burst multiplier allows some momentary spikes. Note that this counts per controller pod, so precise global limits are hard with multiple controllers. If you need strict global rate limiting, it is better to add a separate API Gateway layer.
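The effective limits implied by these annotations can be worked out as in the comments below. The replica count of 3 is an assumption for illustration, and the aggregate figure presumes traffic from one client is spread evenly across controller pods.

```yaml
# Ingress annotations (fragment) with the resulting limits worked out in comments.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "20"
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"
    # burst size = limit-rps x multiplier = 20 x 3 = 60 requests
    nginx.ingress.kubernetes.io/limit-connections: "10"
    # With 3 controller replicas (assumed) and evenly spread traffic,
    # one client IP could reach up to about 3 x 20 = 60 rps in aggregate.
```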
HPA and Controller Scaling
The ingress-nginx controller itself must scale with traffic. It is usually deployed as a Deployment or DaemonSet; if a Deployment, you can attach an HPA.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingress-nginx-controller
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
The controller must not be a single point of failure, so set minReplicas to 2 or more (ideally 3) and configure topologySpreadConstraints or anti-affinity for node distribution. Also account for the temporary reload cost while a new pod warms up during scale-out.
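One way to express the node distribution is with topologySpreadConstraints in the controller's pod template, sketched below. The label values are assumptions based on labels commonly applied by the Helm chart; match them to the labels your controller pods actually carry.

```yaml
# Pod template spec (fragment). Label values are assumed; check your own pod labels.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
        app.kubernetes.io/component: controller
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
        app.kubernetes.io/component: controller
```

DoNotSchedule on the hostname key can leave scale-out pods Pending on small clusters; ScheduleAnyway trades strict spreading for schedulability.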
Graceful Reload and Drain
Reloads are unavoidable during deploys and config changes. The key is to ensure that reloads and pod termination do not cut in-flight connections.
# ConfigMap
data:
  worker-shutdown-timeout: "240s"
# pod spec
terminationGracePeriodSeconds: 300
containers:
  - name: controller
    lifecycle:
      preStop:
        exec:
          command: ["/wait-shutdown"]
worker-shutdown-timeout is how long an old worker waits after a reload to finish in-progress requests. A preStop hook and an ample terminationGracePeriod let the pod drain in-flight requests on termination. Using the readiness probe to remove a terminating pod from the load balancer pool first is also important.
Metric-Based Capacity Planning
Tuning should be done with metrics, not guesses. ingress-nginx exposes Prometheus metrics.
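# Helm values; metrics are toggled by the controller's --enable-metrics flag, not by a ConfigMap key
controller:
  metrics:
    enabled: true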
Key metrics to observe are below.
| Metric | Meaning | Use |
|---|---|---|
| nginx_ingress_controller_requests | Requests (by status code) | Traffic / error rate |
| nginx_ingress_controller_request_duration_seconds | Request latency distribution | Latency p50/p95/p99 |
| nginx_ingress_controller_nginx_process_connections | Active connections | Connection limit check |
| nginx_ingress_controller_config_last_reload_successful | Reload success flag | Detect reload failure |
Watch p99 latency and active connection trends to tune max-worker-connections and the keepalive pool. Overlaying error-rate spikes with reload timestamps lets you separate reload-caused issues from backend-caused ones.
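These metrics can be turned into a recording rule and alerts along the following lines. This is a sketch in Prometheus rule-file format; the group name, rule names, thresholds, and durations are illustrative assumptions rather than recommended values.

```yaml
# Prometheus rule file (sketch). Names and thresholds are illustrative assumptions.
groups:
  - name: ingress-nginx-capacity
    rules:
      # p99 latency per ingress, derived from the duration histogram
      - record: ingress:request_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, ingress) (
              rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])
            )
          )
      # Share of 5xx responses per ingress above 1%
      - alert: IngressHigh5xxRate
        expr: |
          sum by (ingress) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
            /
          sum by (ingress) (rate(nginx_ingress_controller_requests[5m]))
            > 0.01
        for: 10m
        labels:
          severity: warning
      # Last configuration reload did not succeed
      - alert: IngressReloadFailed
        expr: nginx_ingress_controller_config_last_reload_successful == 0
        for: 5m
        labels:
          severity: critical
```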
Session Affinity and Load-Balancing Algorithm
For stateful applications or cache efficiency, you sometimes need to send the same client to the same backend. ingress-nginx provides cookie-based session affinity.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"
However, session affinity can skew load distribution. If traffic concentrates on a particular backend, adding replicas no longer spreads the load, so apply it only on routes that truly need it and design applications to be stateless where possible.
The load-balancing algorithm is also adjustable via the global ConfigMap.
data:
  load-balance: "ewma"
The default is round_robin, but ewma (Exponentially Weighted Moving Average) uses each backend's recent response time as a weight and sends traffic to faster backends. It often improves p99 latency in environments with high variance in backend response times.
Connection Handling and Client-Side keepalive
So far we have covered upstream (backend) keepalive; client-side keepalive matters too. Keeping client connections open for an appropriate time reduces TLS handshake cost.
data:
  keep-alive: "75"
  keep-alive-requests: "1000"
keep-alive is how long (seconds) to keep a client connection; keep-alive-requests is the max requests per connection. If a CDN or load balancer sits in front, align with its idle timeout. If the front LB's idle timeout is longer than the ingress keep-alive, the LB may try to reuse an already-closed connection and cause a 502, so it is safer to keep the ingress side longer.
Load Scenarios and Pitfalls
The most common pitfall is a reload storm. If ConfigMaps, Secrets, or ingresses change frequently, nginx keeps reloading, and each reload spins up new workers, making memory and connections lurch. It is especially risky when cert-manager renews certificates often or automation repeatedly edits ingresses. Batch your changes and lower the frequency.
[ 502 vs 504 cause flowchart ]
request failed
│
├─ 504 Gateway Timeout?
│ │
│ ├─ backend response slow ──▶ check proxy-read-timeout + backend latency
│ └─ backend connect delay ──▶ proxy-connect-timeout + backend health
│
└─ 502 Bad Gateway?
│
├─ upstream keepalive race ──▶ keepalive-timeout shorter than backend
├─ backend closes early ──▶ backend keep-alive config
├─ backend OOM/crash ──▶ check pod logs/restarts
└─ transient during reload ──▶ tune worker-shutdown-timeout/grace
A 504 is almost always a "slowness" problem (timeout or slow backend), while a 502 is a "breakage" problem (a connection terminated improperly). Distinguishing these two first lets you narrow the cause far faster.
WebSocket and gRPC Long-Lived Connections
WebSocket and gRPC streaming need different tuning from ordinary HTTP requests. Because the connection stays open for a long time, short read/send timeouts cut healthy connections.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    # gRPC backends only; omit this line on WebSocket routes
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
For WebSocket, ingress-nginx handles the Upgrade header automatically, so it needs little extra config, but you must raise the timeouts enough. gRPC requires backend-protocol set to GRPC for the HTTP/2-based proxy to work. Services with many long-lived connections can reach the max-worker-connections limit quickly, so watch the connection metrics more carefully than usual.
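Because the two protocols need different annotations, it can help to keep them on separate Ingress objects. The fragments below sketch that split; the names chat-ws and orders-grpc are hypothetical, and the one-hour timeouts are simply carried over from the example above.

```yaml
# WebSocket route (fragment): long timeouts only, no backend-protocol override.
metadata:
  name: chat-ws              # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
---
# gRPC route (fragment): backend-protocol switches the proxy to gRPC over HTTP/2.
metadata:
  name: orders-grpc          # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
```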
Precedence of Global vs Ingress Settings
If you set the same option in both the ConfigMap (global) and an annotation (ingress), which wins? Generally the ingress annotation overrides the global value for that ingress only. This enables a pattern of keeping conservative global defaults and making exceptions for specific routes.
| Layer | Scope | Precedence |
|---|---|---|
| ConfigMap | Whole controller | Low (default) |
| Annotation | Individual ingress | High (overrides) |
For example, set proxy-body-size to 10m in the global ConfigMap and 200m via annotation only on the upload-route ingress. This protects most routes with a conservative limit while relaxing only the routes that need it. Note, however, that security-related global settings such as allow-snippet-annotations (which controls whether snippet annotations are accepted) or annotations-risk-level are designed not to be defeated by annotations.
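The proxy-body-size example can be written out as the pair of manifests below. The Ingress name uploads is hypothetical, and its spec is omitted for brevity.

```yaml
# Global default in the controller ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  proxy-body-size: "10m"
---
# Exception on the upload route only ("uploads" is a hypothetical name)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: uploads
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "200m"
# spec omitted for brevity
```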
Warm-Up and Scale-Out Stabilization
Right after scale-out, a new controller pod has an empty cache and an empty upstream keepalive pool, so latency is temporarily high. If the readiness probe passes too early, traffic flows to an unready pod and 502s can spike.
# controller container spec
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
  initialDelaySeconds: 10
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 3
Set initialDelaySeconds appropriately so a new pod takes traffic only after warming up enough, and avoid setting the HPA scale-out too aggressively for stability. If traffic spikes are predictable (events, sales), consider prewarming via pre-scaling.
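On the HPA side, scale-out aggressiveness can be tempered with the behavior field of autoscaling/v2. The fragment below is a sketch; every number in it is an illustrative assumption to be tuned against your own traffic.

```yaml
# HorizontalPodAutoscaler spec (fragment). All numbers are illustrative assumptions.
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # wait for a sustained signal before adding pods
      policies:
        - type: Pods
          value: 2                      # add at most 2 pods per period
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping after a spike
      policies:
        - type: Percent
          value: 25                     # remove at most 25% of pods per period
          periodSeconds: 60
```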
Relationship with Gateway API
From a performance-tuning angle too, keep the shift toward Gateway API (the 2026 trend) in mind. The worker/connection/keepalive/timeout concepts here are intrinsic to the nginx data plane, so they apply largely unchanged to Gateway API implementations (especially when nginx is the data plane, as in NGINX Gateway Fabric). What changes is how they are expressed: standard CRDs or policy resources instead of annotations. Since the Ingress API is frozen, designing new tuning policy in Gateway API's policy model pays off long term.
Timeout Budget and End-to-End Consistency
Timeouts must be consistent across the whole path, not within a single component. In the chain client to LB to ingress to backend to DB, if each stage's timeout is misaligned, one side has already given up while another keeps waiting — pure waste.
[ end-to-end timeout budget example ]
client         LB         ingress      backend       DB
 30s    ──▶    28s   ──▶    25s   ──▶    20s   ──▶   10s
(longest)                                         (shortest)
The principle is: longer at the outside, shorter at the inside. If the inner (DB) timeout is longer than the outer (ingress), the backend keeps waiting for the DB response even after the ingress cut it with a 504, holding connections and resources. This accelerates resource exhaustion under load. Conversely, keeping the inside short makes slow work fail fast and lets retry policies act cleanly.
The ingress proxy-read-timeout should be slightly longer than backend processing time within this budget but shorter than the LB's idle timeout — the key is making "who cuts first" predictable.
For example, reflecting the budget above into an ingress looks like this.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "3"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "25"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "25"
Explicitly documenting a staged budget like this and reflecting it into each component makes it far easier to estimate "where it was cut" during an incident. Think of a timeout not as a single number but as a contract across the whole path.
A Practical Tuning Workflow
Applying everything covered so far all at once leaves you unable to tell which change had which effect. Apply changes gradually with the following workflow.
[ tuning workflow ]
1. measure baseline ── record current p50/p95/p99, error rate, connections, reload freq
2. form hypothesis ── be explicit about which metric improves via which change
3. single change ──── change only one option at a time (staging first)
4. load test ─────── apply load that mimics real traffic patterns
5. compare/judge ──── verify improvement/regression vs baseline
6. roll out/back ──── if better, production canary; if regression, roll back immediately
The most common mistake is changing several options at once. Adjusting keepalive, timeouts, and buffers together means that even if p99 improves you cannot pinpoint the cause, and if a regression appears later it is hard to trace where it came from. One change at a time, validated with metrics — that discipline is ultimately the fastest path.
For load-testing tools, k6, vegeta, and wrk are commonly used. Beyond raw RPS, you must mimic the traffic shape of the real service — concurrent connections, body sizes, the proportion of long-lived connections — to get meaningful results.
Summary of Frequently Used ConfigMap Options
Finally, here is a single table of global options frequently tuned in production.
| Option | Role | Starting point |
|---|---|---|
| worker-processes | Number of workers | auto |
| max-worker-connections | Connections per worker | 16384 |
| upstream-keepalive-connections | Backend idle pool | 320 |
| keep-alive | Client keepalive seconds | 75 |
| proxy-body-size | Body size limit | per use case |
| use-gzip | gzip compression | true |
| load-balance | Balancing algorithm | round_robin or ewma |
| worker-shutdown-timeout | Graceful shutdown wait | 240s |
These are starting points, not answers. Always validate with each environment's metrics.
Key Takeaways
Here are the principles to remember in tuning.
- Do not guess; measure a baseline, then change one thing at a time and validate with metrics.
- Cap capacity with worker-processes (auto) and max-worker-connections.
- Upstream keepalive has high impact, but keep it shorter than the backend keepalive to avoid 502 races.
- Keep connect short; set read/send to match the work. But keep them shorter than the LB so "who cuts" is predictable.
- Turn buffering off on streaming/upload routes, and compress text responses.
- Reload storms are the most common pitfall. Batch changes and lower frequency via GitOps.
- 504 means "slow", 502 means "broken". This distinction drives your debugging speed.
These principles are intrinsic to the nginx data plane, so they carry over largely unchanged to Gateway API implementations built on nginx.
Conclusion
The essence of ingress-nginx tuning is to understand how the data plane works and validate with metrics rather than guessing. Set capacity with workers and connections, raise efficiency with upstream keepalive, detect faults fast with the three timeouts, save resources with buffering and compression, and protect zero-downtime with graceful reload. Just learning the flow to separate 502 from 504 dramatically changes your incident-response speed. All of this intuition carries over as an asset into the Gateway API era.
References
- ingress-nginx ConfigMap options: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/
- ingress-nginx annotations: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/
- ingress-nginx metrics/monitoring: https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/
- Kubernetes HPA: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- Kubernetes Ingress concept: https://kubernetes.io/docs/concepts/services-networking/ingress/
- Gateway API: https://gateway-api.sigs.k8s.io/
- NGINX Gateway Fabric: https://docs.nginx.com/nginx-gateway-fabric/
- HAProxy Kubernetes Ingress: https://www.haproxy.com/documentation/kubernetes-ingress/
- Traefik docs: https://doc.traefik.io/traefik/
- Envoy/Contour: https://projectcontour.io/