Introduction

One night, a login API that normally handled about 200 requests per second was suddenly flooded with 40,000 requests per second. The database connection pool was exhausted in an instant, and legitimate users were greeted with 504 Gateway Timeout errors. Tracing the traffic revealed that the same payload was being sent repeatedly to the same endpoint from thousands of IPs. It was a textbook L7 application-layer attack.

In situations like this, the first line of defense that comes to mind is application code. But the moment the application receives a request, it means a connection has already been established, the TLS handshake has completed, and a worker thread is occupied. In other words, the cost has already been incurred. Real defense must happen before requests reach expensive resources, as far out at the edge as possible. In a Kubernetes environment, that outer boundary is the **Ingress layer**.

This article walks through how to configure rate limiting at the Ingress layer, why accurate counting is hard in distributed environments, and how to separate the defense of L3/L4 volumetric attacks from L7 application attacks across different layers, all with hands-on code. As of 2026, the Ingress API is effectively frozen and the Gateway API has established itself as the successor standard, so we will also examine the relationship between the two APIs and the migration perspective.

Fundamentals of Rate Limiting

Why Limit at L7

Network defense plays a different role at each OSI layer. The table below summarizes what each layer can see and what it can block.

| --- | --- | --- | --- |

The key point is that **each layer should block what it is best at blocking**. Trying to stop a volumetric attack of hundreds of gigabits per second at the L7 Ingress means packets enter the cluster network and consume bandwidth before being blocked. Conversely, a fine-grained rule like five requests per second on a specific login endpoint cannot be expressed by L3/L4 appliances.

Token Bucket and Leaky Bucket

The two pillars of rate-limiting algorithms are the token bucket and the leaky bucket.

    [Token Bucket]                        [Leaky Bucket]

    refill rate r ──> ( bucket b )        request ──> ( queue b ) ──> leak rate r ──> handle
                           │                               │
    each request consumes 1 token         if the queue is full, drop the request
    no token means reject                 drains only at a fixed rate

- **Token bucket**: maintains an average rate r while allowing instantaneous bursts up to the bucket capacity b. It naturally absorbs short spikes.

- **Leaky bucket**: strictly flattens the output rate. It does not absorb bursts and lets traffic out only at a fixed rate.

ingress-nginx rate limiting internally uses nginx `limit_req` (a model close to a leaky bucket) and `limit_conn`. The burst parameter allows a small amount of bursting.
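The token-bucket behavior described above can be sketched in a few lines. This is a minimal illustration, not any controller's implementation: the class name `TokenBucket`, the explicit `now` argument (used instead of a real clock so the run is reproducible), and the numbers are all assumptions.

```python
class TokenBucket:
    """Token bucket: refill at `rate` tokens/sec, hold at most `capacity` tokens."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last = 0.0

    def allow(self, now):
        # Refill for the elapsed time, capped at the bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # each accepted request consumes one token
            return True
        return False  # no token available: reject


bucket = TokenBucket(rate=5, capacity=10)
burst = sum(bucket.allow(0.0) for _ in range(20))  # 20 requests at the same instant
later = sum(bucket.allow(1.0) for _ in range(20))  # 20 more, one second later
print(burst, later)
```

With these numbers the first batch drains the full bucket, and one second later only the tokens refilled in that second are available, which is the "average rate r, burst up to b" property.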

Deep Dive into ingress-nginx Annotations

ingress-nginx declares rate limiting through annotations on the Ingress resource. Here are the four most commonly used.

| Annotation | Meaning | Unit |
| --- | --- | --- |
| `limit-rps` | Allowed requests per second | requests per second |
| `limit-rpm` | Allowed requests per minute | requests per minute |
| `limit-connections` | Concurrent connection cap | connections |
| `limit-burst-multiplier` | Burst multiplier (default 5) | multiplier |

The following is an example Ingress protecting a login endpoint.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: login-ingress          # placeholder name
      namespace: production
      annotations:
        nginx.ingress.kubernetes.io/limit-rps: "5"
        nginx.ingress.kubernetes.io/limit-burst-multiplier: "2"
        nginx.ingress.kubernetes.io/limit-connections: "10"
        # Exclude trusted internal ranges from limiting
        nginx.ingress.kubernetes.io/limit-whitelist: "10.0.0.0/8,192.168.0.0/16"
    spec:
      ingressClassName: nginx
      rules:
        - host: api.example.com
          http:
            paths:
              - path: /auth/login
                pathType: Prefix
                backend:
                  service:
                    name: auth-service   # placeholder Service name
                    port:
                      number: 8080

Here, combining `limit-rps: 5` with `limit-burst-multiplier: 2` gives a burst size of 10 requests (5 × 2). Note that this is a count of requests, not a rate: each client IP is held to an average of 5 RPS, a momentary spike of up to 10 extra requests is let through, and anything beyond that is rejected with 503 Service Temporarily Unavailable by default.
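To make the burst arithmetic concrete, here is a simplified model of leaky-bucket accounting with a burst allowance. It assumes an "excess" counter that drains at the configured rate and that burst requests are served without delay; `LimitReqModel` is a hypothetical name and the code is a sketch of the idea, not nginx's actual implementation.

```python
class LimitReqModel:
    """Simplified per-key leaky bucket: `excess` drains at `rate` per second."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.excess = None  # None until the first request for this key
        self.last = None

    def allow(self, now):
        if self.excess is None:
            self.excess, self.last = 0.0, now
            return True
        # Drain for the elapsed time, then add this request.
        excess = max(self.excess - self.rate * (now - self.last) + 1.0, 0.0)
        if excess > self.burst:
            return False  # over the burst allowance: reject, state unchanged
        self.excess, self.last = excess, now
        return True


model = LimitReqModel(rate_per_sec=5, burst=10)  # limit-rps 5, multiplier 2
first_spike = sum(model.allow(0.0) for _ in range(20))
second_spike = sum(model.allow(1.0) for _ in range(20))
print(first_spike, second_spike)
```

In this model the first spike admits the initial request plus the 10-request burst, and a second spike one second later admits only what drained in the meantime (5 requests), so the long-run average stays at the configured rate.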

What Is the Key Based On

The default rate-limiting key in ingress-nginx is the client IP. But if you sit behind a proxy or load balancer, you must accurately identify the real client IP. The crucial setting here is the trust configuration in the ConfigMap.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ingress-nginx-controller   # name used by the default install; adjust to yours
      namespace: ingress-nginx
    data:
      # Trusted proxy ranges (e.g., cloud LB CIDRs)
      proxy-real-ip-cidr: "130.211.0.0/22,35.191.0.0/16"
      use-forwarded-headers: "true"
      compute-full-forwarded-for: "true"

Enabling `use-forwarded-headers` makes nginx treat the IP in the X-Forwarded-For header as the client. However, this header can be forged by the client, so you must always specify the trusted ranges (`proxy-real-ip-cidr`) alongside it. Otherwise, an attacker could insert a fake X-Forwarded-For value on every request to bypass the counter.
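The trust rule can be illustrated with a small sketch: only believe X-Forwarded-For when the direct peer is a trusted proxy, and walk the header from the right, skipping trusted hops. The function names are hypothetical and the CIDRs are the ones from the ConfigMap example above; this shows the reasoning, not the controller's code.

```python
import ipaddress

# Trusted proxy ranges, taken from the example ConfigMap above.
TRUSTED = [ipaddress.ip_network(c) for c in ("130.211.0.0/22", "35.191.0.0/16")]


def is_trusted(ip):
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # malformed values are never trusted
    return any(addr in net for net in TRUSTED)


def resolve_client_ip(peer_ip, xff_header):
    """Pick the address to use as the rate-limiting key."""
    if not is_trusted(peer_ip):
        return peer_ip  # header came from an untrusted peer: ignore it
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    for hop in reversed(hops):  # rightmost entries were added by our own proxies
        if not is_trusted(hop):
            return hop
    return peer_ip


# A direct client forging the header is still keyed by its own address.
print(resolve_client_ip("203.0.113.9", "1.2.3.4"))
# Behind the LB, a forged leftmost entry is ignored in favor of the rightmost untrusted hop.
print(resolve_client_ip("130.211.0.5", "9.9.9.9, 198.51.100.7"))
```

Keying on the rightmost untrusted hop is what stops an attacker from rotating fake leftmost values to reset the counter on every request.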

Global Limits and Custom Responses

The default limits applied across all Ingresses are set in the ConfigMap.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ingress-nginx-controller   # name used by the default install; adjust to yours
      namespace: ingress-nginx
    data:
      limit-rate: "0"                  # response bandwidth limit (bytes/sec), 0 is unlimited
      limit-req-status-code: "429"     # return 429 Too Many Requests instead of the default 503
      limit-conn-status-code: "429"

The default response code when rate limiting rejects a request is 503, but it is common to change it to `429 Too Many Requests` so clients clearly recognize they should back off. Where possible, it is good to also return a `Retry-After` header.
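On the client side, the point of 429 plus `Retry-After` is that the caller can compute a sensible wait. The sketch below is one hypothetical way to do that; the function name, the defaults, and the plain-dict header lookup are assumptions, and it only handles the delay-in-seconds form of `Retry-After` (an HTTP-date value falls back to exponential backoff).

```python
def retry_delay(status, headers, attempt, base=0.5, cap=30.0):
    """Return seconds to wait before retrying, or None if no retry is advised."""
    if status != 429:
        return None
    value = headers.get("Retry-After", "")
    if value.isdigit():
        # Server told us how long to wait; still cap it defensively.
        return min(float(value), cap)
    # No usable header: fall back to capped exponential backoff.
    return min(base * (2 ** attempt), cap)


print(retry_delay(429, {"Retry-After": "3"}, attempt=0))
print(retry_delay(429, {}, attempt=2))
```

A production client would normally add random jitter to the fallback delay so that many clients do not retry in lockstep.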

Comparing Traefik / Kong / APISIX

While ingress-nginx is the most widely used, you might choose a different controller depending on your operating environment and requirements. From a rate-limiting perspective, the comparison is as follows.

| --- | --- | --- | --- | --- |

Traefik Middleware

Traefik declares rate limiting with a Middleware CRD.

    apiVersion: traefik.io/v1alpha1
    kind: Middleware
    metadata:
      name: api-ratelimit        # placeholder name
      namespace: production
    spec:
      rateLimit:
        average: 100             # average 100 requests per second
        burst: 50                # allow a momentary burst of 50
        period: 1s
        sourceCriterion:
          ipStrategy:
            depth: 1             # use the 1st IP from the end of X-Forwarded-For

Kong Plugin

Kong declares it with a KongPlugin resource, and specifying Redis as the backend enables distributed counting.

    apiVersion: configuration.konghq.com/v1
    kind: KongPlugin
    metadata:
      name: rate-limit-redis     # placeholder name
      namespace: production
    plugin: rate-limiting
    config:
      minute: 60
      policy: redis
      redis_host: redis.production.svc.cluster.local
      redis_port: 6379
      fault_tolerant: true

APISIX Route

APISIX provides the limit-req, limit-conn, and limit-count plugins, and supports global counting through a Redis cluster.

    apiVersion: apisix.apache.org/v2
    kind: ApisixRoute
    metadata:
      name: api-rate-limit       # placeholder name
      namespace: production
    spec:
      http:
        - name: limited
          match:
            hosts:
              - api.example.com
            paths:
              - /api/*
          backends:
            - serviceName: api-service
              servicePort: 8080
          plugins:
            - name: limit-count
              enable: true
              config:
                count: 100       # illustrative value; limit-count requires count
                time_window: 60
                rejected_code: 429
                policy: redis
                redis_host: redis.production.svc.cluster.local
                redis_port: 6379

The Counting Problem in Distributed Environments

This is where teams get tripped up most often in practice. An Ingress controller usually runs as multiple pods (replicas). If each pod keeps its counter in its own memory, the overall limit becomes a multiple of the intended value.

                         [ client traffic ]
                                  │
            ┌─────────────────────┼─────────────────────┐
            ▼                     ▼                     ▼
    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐
    │ ingress pod1   │    │ ingress pod2   │    │ ingress pod3   │
    │ memory counter │    │ memory counter │    │ memory counter │
    │ limit 100      │    │ limit 100      │    │ limit 100      │
    └────────────────┘    └────────────────┘    └────────────────┘

    intended limit: 100 RPS → actual allowed: up to 300 RPS (pods x 100)

The characteristics of the in-memory approach are summarized below.

| --- | --- | --- | --- | --- |

The default ingress-nginx implementation is per-pod memory, so when you set a limit you should **divide it by the replica count** to approximate the intended global limit. For example, if you want a global 300 RPS with three pods, you set 100 RPS per pod. Be aware, though, that if HPA changes the pod count, the global limit shifts with it.
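The divide-by-replicas rule and its HPA caveat are simple arithmetic, sketched here with hypothetical helper names. The model assumes traffic is spread evenly across pods, which real load balancing only approximates.

```python
import math


def per_pod_limit(global_limit, replicas):
    """Per-pod setting that approximates a desired global limit."""
    return max(1, math.floor(global_limit / replicas))


def effective_global(pod_limit, replicas):
    """What the cluster actually allows once every pod counts independently."""
    return pod_limit * replicas


pod_limit = per_pod_limit(global_limit=300, replicas=3)
# If HPA later scales the controller, the effective global limit drifts with it.
drift = {n: effective_global(pod_limit, n) for n in (2, 3, 6)}
print(pod_limit, drift)
```

The drift is why a per-pod value tuned for three replicas silently doubles the allowed traffic when the controller scales to six.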

If you need accurate global counting, choose Kong or APISIX with a central store like Redis as the backend, or consider placing a separate rate-limiting gateway in front of ingress-nginx. When using Redis, the following settings matter.

- Atomic counting: eliminate race conditions by running INCR and EXPIRE as one atomic unit, either in a MULTI/EXEC transaction or in a Lua script

- Fault-tolerant mode: decide whether to block (fail-closed) or pass (fail-open) on Redis failure

- Key expiry: pick a fixed or sliding window and set key TTLs to match it, so counters expire when the window ends

- Latency budget: measure the impact of Redis round-trips on p99 latency
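The counting logic behind the first and third points can be sketched without a Redis dependency. The class below is an in-memory stand-in for a shared store, with a lock playing the role of atomic INCR and the cleanup step playing the role of EXPIRE; the name `FixedWindowCounter` and the explicit `now` argument are assumptions made for illustration.

```python
import threading


class FixedWindowCounter:
    """Fixed-window counter keyed by (client key, window id)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._counts = {}
        self._lock = threading.Lock()

    def allow(self, key, now):
        window_id = int(now // self.window)
        with self._lock:  # increment-and-check happens as one atomic step
            # Drop counters from other windows (the role key expiry plays in Redis).
            self._counts = {k: v for k, v in self._counts.items() if k[1] == window_id}
            count = self._counts.get((key, window_id), 0) + 1
            self._counts[(key, window_id)] = count
            return count <= self.limit


counter = FixedWindowCounter(limit=3, window_seconds=60)
same_window = [counter.allow("198.51.100.7", t) for t in (0, 1, 2, 3)]
next_window = counter.allow("198.51.100.7", 61)
print(same_window, next_window)
```

Because every ingress pod would consult the same counter, the limit holds globally regardless of replica count, at the price of one store round-trip per request.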

Fail-open and fail-closed are a trade-off. When Redis dies, fail-open preserves availability but exposes you to attacks, while fail-closed is safe but blocks legitimate traffic too. We recommend fail-closed for sensitive endpoints like login and payment, and fail-open for general read APIs.
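That per-endpoint policy can be expressed as a small wrapper around the limiter call. This is a sketch under assumptions: the path prefixes, the function name `allow_request`, and the use of `ConnectionError` to signal a store outage are all hypothetical.

```python
# Endpoints where blocking is preferred over passing unchecked traffic.
FAIL_CLOSED_PREFIXES = ("/auth/", "/payment/")


def allow_request(path, check_limit):
    """Apply the limiter, choosing fail-open or fail-closed if the store is down."""
    try:
        return check_limit(path)
    except ConnectionError:
        # Store unavailable: sensitive paths fail closed, everything else fails open.
        return not path.startswith(FAIL_CLOSED_PREFIXES)


def store_down(_path):
    raise ConnectionError("counter store unavailable")


print(allow_request("/auth/login", store_down))   # fail-closed
print(allow_request("/api/items", store_down))    # fail-open
```

Kong's `fault_tolerant` setting shown earlier is the same decision made at the plugin level.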

Whitelists and Geo Controls

You should not limit all traffic equally. Trusted partners, internal monitoring, and health checks should be excluded.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: partner-ingress        # placeholder name
      namespace: production
      annotations:
        nginx.ingress.kubernetes.io/limit-rps: "20"
        # Whitelisted ranges bypass rate limiting
        nginx.ingress.kubernetes.io/limit-whitelist: "203.0.113.0/24,198.51.100.10/32"
    spec:
      ingressClassName: nginx
      rules:
        - host: partner.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: partner-service   # placeholder Service name
                    port:
                      number: 80

Geo blocking is implemented with a GeoIP module. ingress-nginx supports country-code-based control through nginx's GeoIP2 module.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ingress-nginx-controller   # name used by the default install; adjust to yours
      namespace: ingress-nginx
    data:
      use-geoip2: "true"
      # Policies to allow or block specific countries are written with server-snippet

That said, GeoIP blocking is easily bypassed with VPNs and proxies, so it is realistic to use it as a supplementary signal rather than a standalone defense. It is well suited as a first filter for hostile traffic coming from regions with no legitimate users at all.

Separating L3/L4 DDoS from L7 Defense

The most common design mistake is trying to block all DDoS in one place at the Ingress. Volumetric attacks (L3/L4) and application attacks (L7) must be defended at fundamentally different locations.

    [internet]
        │
        ▼
    ┌──────────────────────────────────────────────────┐
    │ cloud edge (CDN / Anycast / scrubbing)           │ <- L3/L4 volumetric defense
    │ - absorb SYN floods, UDP amp, large packets      │    (AWS Shield, Cloud Armor, etc.)
    └──────────────────────────────────────────────────┘
        │ (only near-legitimate traffic passes)
        ▼
    ┌──────────────────────────────────────────────────┐
    │ Ingress layer (ingress-nginx / Gateway)          │ <- L7 application defense
    │ - per-endpoint RPS, connection caps, WAF rules   │    (rate limiting, bot blocking)
    └──────────────────────────────────────────────────┘
        │
        ▼
    ┌──────────────────────────────────────────────────┐
    │ application (services/pods)                      │ <- business-logic quotas
    │ - per-user quotas, idempotency, circuit breaker  │
    └──────────────────────────────────────────────────┘

The division of responsibilities per layer is summarized in the table below.

| --- | --- | --- | --- |

When you absorb volumetric attacks at the cloud edge, the traffic reaching the Ingress is already substantially cleaned up. At this stage the Ingress can focus solely on L7 rules. Conversely, if the Ingress takes hundreds of gigabits alone without edge protection, the node's NIC bandwidth and conntrack table collapse first.

Slow attacks like slowloris need separate attention. Since they occupy connections for a long time with little bandwidth, defend against them by setting timeouts short as follows.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ingress-nginx-controller   # name used by the default install; adjust to yours
      namespace: ingress-nginx
    data:
      client-header-timeout: "10"
      client-body-timeout: "10"
      keep-alive-requests: "100"
      worker-shutdown-timeout: "30s"

Handling Bot Traffic

A large share of DDoS is generated by botnets. There are several signals that distinguish bots from legitimate users.

[bot identification signals]

- Missing or abnormal User-Agent patterns

- Stateless repeated requests that do not maintain cookies/sessions

- Failure to pass JavaScript challenges

- TLS fingerprint (JA3/JA4) matching known bot tools

- Abnormally uniform request intervals (humans have jitter)
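The last signal, overly uniform request spacing, is easy to quantify with the coefficient of variation of the gaps between requests. The sketch below is illustrative only: the function name and both thresholds are assumptions, and a real system would combine this with the other signals rather than act on it alone.

```python
import statistics


def looks_automated(timestamps, min_requests=5, cv_threshold=0.1):
    """Flag a client whose inter-request gaps are suspiciously regular."""
    if len(timestamps) < min_requests:
        return False  # not enough data to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = statistics.mean(gaps)
    if mean_gap <= 0:
        return True  # many requests at the same instant
    # Coefficient of variation: spread of the gaps relative to their mean.
    cv = statistics.pstdev(gaps) / mean_gap
    return cv < cv_threshold


metronome = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # perfectly regular, bot-like
human_like = [0.0, 0.4, 3.1, 3.5, 9.0, 9.2]  # irregular gaps
print(looks_automated(metronome), looks_automated(human_like))
```

This kind of per-client history is usually only available at a CDN/WAF or in the application, which is part of why such checks rarely live in the Ingress itself.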

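At the Ingress layer, simple User-Agent blocking or header validation is about as far as you can go. Below is an example that uses server-snippet to reject requests whose User-Agent is empty or matches common scripting tools. Note that ingress-nginx has disabled snippet annotations by default since v1.9 (`allow-snippet-annotations: "false"` in the ConfigMap), so an administrator must explicitly enable them before this works.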

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: web-ingress            # placeholder name
      namespace: production
      annotations:
        nginx.ingress.kubernetes.io/server-snippet: |
          if ($http_user_agent = "") {
            return 403;
          }
          if ($http_user_agent ~* "(curl|wget|python-requests|scrapy)") {
            return 403;
          }
    spec:
      ingressClassName: nginx
      rules:
        - host: www.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: web-service   # placeholder Service name
                    port:
                      number: 80

However, since the User-Agent is trivially forged, its effect against sophisticated bots is limited. Advanced bot-management features like JavaScript challenges and JA3/JA4 fingerprinting are usually handled at the CDN/WAF layer (cloud edge), and it is reasonable to have the Ingress act as a first-pass filter.

Behavior Through Load Scenarios

When deciding on configuration values, you must validate with concrete scenarios. Below is the behavior of an API that normally runs at 200 RPS with a limit of 300 RPS (600 including burst) under three situations.

| Scenario | Incoming load | Forwarded to backend | Rejected | Outcome |
| --- | --- | --- | --- | --- |
| Normal peak | 280 RPS | 280 | 0 | Normal |
| L7 flood | 40000 RPS | 300 | 39700 | Legitimate users protected |

The L7 flood scenario is the very reason rate limiting exists. Even when 40,000 RPS arrives, only the configured limit is forwarded to the backend and the rest is cut off at the Ingress with 429, so the database and application are protected. Rejecting a request at the Ingress costs far less than processing it in the backend.
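The forwarded and rejected figures follow directly from the limit. The sketch below reproduces them for the sustained case, ignoring the short-lived burst allowance; `split_traffic` is a hypothetical helper name.

```python
def split_traffic(incoming_rps, limit_rps):
    """Return (forwarded, rejected) per second under a sustained load."""
    forwarded = min(incoming_rps, limit_rps)
    return forwarded, incoming_rps - forwarded


for name, rps in (("normal peak", 280), ("L7 flood", 40000)):
    forwarded, rejected = split_traffic(rps, limit_rps=300)
    print(f"{name}: forwarded={forwarded} rejected={rejected}")
```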

Always run a load test before going to production. Below are simple validation command examples.

    # run 200 concurrent workers for 30 seconds with hey
    hey -z 30s -c 200 https://api.example.com/api/items

    # show the status-code distribution to check the rejection rate
    hey -z 30s -c 500 https://api.example.com/auth/login \
      | grep -A20 "Status code distribution"

Relationship with the Gateway API

As of 2026, the Ingress API is frozen with no new feature additions, and the successor standard is the **Gateway API**. The Gateway API is designed for role separation (infrastructure operator / cluster operator / app developer), protocol extensibility, and expressive routing.

| Perspective | Ingress | Gateway API |
| --- | --- | --- |
| Standard status | Frozen | Actively evolving |
| Resources | Single Ingress | GatewayClass / Gateway / HTTPRoute, etc. |
| Role separation | Weak | Explicit (RBAC-friendly) |
| Rate limiting | Implementation annotations | Policy attachment + implementation extensions |
| Traffic splitting | Annotations | Weight-based standard fields |

That said, rate limiting is not fully part of the Gateway API standard spec and is often provided through implementation-specific policies (policy attachment) or extension filters. So during migration you must check how the implementation you use exposes rate limiting.

ingress-nginx has effectively entered maintenance mode as of 2025 and is increasingly operated with a focus on responding to security vulnerabilities. If you are designing a new cluster, it is safer in the long run to first consider Gateway API-based implementations (such as Envoy Gateway, Traefik's Gateway support, and Cilium Gateway). Below is a simple HTTPRoute example.

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: api-route              # placeholder name
      namespace: production
    spec:
      parentRefs:
        - name: prod-gateway
      hostnames:
        - api.example.com
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /api
          backendRefs:
            - name: api-service
              port: 8080

Operations and Tuning

Establishing Observability

Rate limiting does not end with turning it on. You must continually observe how much is being rejected and whether the rejected traffic is an attack or legitimate users.

[observability metrics]

- 429/503 response ratio (relative to total)

- IP/path/User-Agent distribution of rejected requests

- Changes in backend p50/p95/p99 latency

- CPU/memory/connection count of Ingress pods

- upstream 5xx ratio (to judge whether the limit is effective)

If you collect ingress-nginx metrics with Prometheus, you can track the 429 ratio via the status label of `nginx_ingress_controller_requests`. A sudden surge in the rejection rate indicates either an attack or limits that are too tight.

Phased Rollout

Applying strong rate limiting directly to production risks blocking legitimate users. We recommend the following sequence.

1. Observe mode: set very high limits and collect metrics only, with no rejections

2. Establish a baseline: analyze normal peak RPS, p99, and normal burst patterns

3. Conservative application: set the limit to 3-5 times the normal peak

4. Gradual tightening: adjust the limit while watching rejection rate and user impact

5. Endpoint differentiation: strict for login/payment, loose for static reads
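Steps 2 and 3 can be sketched as a small calculation over observed per-second request counts. The helper name, the sample data, and the default multiplier are assumptions for illustration; the 3x to 5x range comes from the list above.

```python
import math


def initial_limit(per_second_counts, multiplier=3):
    """Conservative starting limit: a multiple of the observed normal peak."""
    peak = max(per_second_counts)
    return math.ceil(peak * multiplier)


observed = [180, 200, 280, 240]  # hypothetical samples from observe mode
print(initial_limit(observed), initial_limit(observed, multiplier=5))
```

Starting from a value like this and tightening gradually keeps the first rollout from rejecting traffic that is merely a normal peak.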

Common Pitfalls and Troubleshooting

Here are pitfalls you will encounter repeatedly in practice.

[Pitfall 1] X-Forwarded-For not configured

→ all requests counted as a single LB IP, so everyone is blocked at once

→ always set proxy-real-ip-cidr and use-forwarded-headers together

[Pitfall 2] Mistaking per-pod counting for global

→ limit is 100 but with 3 pods up to 300 is actually allowed

→ divide by replica count, or introduce a central counter (Redis)

[Pitfall 3] Health checks/probes hit the rate limit

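→ load balancer health checks or external uptime monitors receive 429 and the backend is marked unhealthy (kubelet probes hit pods directly and do not pass through the Ingress)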

→ add internal ranges to the whitelist

[Pitfall 4] Leaving the reject code as 503

→ clients mistake it for a server failure and retry endlessly → vicious cycle

→ induce backoff with 429 + Retry-After

[Pitfall 5] Trying to block volumetric attacks at L7

→ packets reach the node and saturate NIC/conntrack first

→ absorb volumetric attacks at the cloud edge

[Pitfall 6] Setting the limit too low

→ 429 occurs even at normal peak, causing user churn

→ measure the baseline first in observe mode

Debugging Commands

Commands you can use to quickly check when you suspect a problem.

    # check limiting-related messages in the Ingress controller logs
    kubectl logs -n ingress-nginx deploy/ingress-nginx-controller | grep -i limit

    # inspect the actually applied nginx config (verify the generated limit_req zone)
    kubectl exec -n ingress-nginx deploy/ingress-nginx-controller -- \
      cat /etc/nginx/nginx.conf | grep -A3 limit_req_zone

    # quick check of the status-code distribution seen from a specific IP
    for i in $(seq 1 20); do
      curl -s -o /dev/null -w "%{http_code}\n" https://api.example.com/auth/login
    done | sort | uniq -c

Operational Checklist

The final pre-deployment review list.

[ ] Did you set differentiated limits per endpoint (strict for login, loose for reads)?

[ ] Did you accurately specify the X-Forwarded-For trust range (proxy-real-ip-cidr)?

[ ] Did you set the reject response to 429 + Retry-After?

[ ] Did you add health checks/monitoring/internal ranges to the whitelist?

[ ] Did you intentionally choose per-pod vs Redis counting?

[ ] Did you account for the effect of replica count changes (HPA) on the global limit?

[ ] Did you place L3/L4 volumetric defense at the cloud edge?

[ ] Did you set short timeouts to guard against slowloris?

[ ] Are you observing the 429/503 ratio and backend latency on a dashboard?

[ ] Did you pre-validate limit behavior with a load test?

[ ] Did you review the Gateway API migration path?

Conclusion

The core of rate limiting and DDoS mitigation is **blocking the right attacks at the right location, layer by layer**. Volumetric attacks must be stopped at the cloud edge, application-layer abuse at the Ingress, and business-rule violations at the application. A design that piles every responsibility onto a single layer will buckle under a real attack.

The counting problem in distributed environments is especially easy to miss. Mistaking per-pod memory counting for a global limit lets traffic leak through at several times the configured value. If accuracy matters, introduce a central counter, but be sure to clearly define the fail-open/fail-closed policy for when Redis fails.

Finally, the direction of 2026 is clear. The Ingress API is frozen, ingress-nginx has entered maintenance mode, and the Gateway API is establishing itself as the successor standard. Refine your current ingress-nginx configuration well, but for new designs, we recommend actively evaluating Gateway API-based implementations and their rate-limiting models.

References

- Kubernetes Ingress concept: https://kubernetes.io/docs/concepts/services-networking/ingress/

- Gateway API official docs: https://gateway-api.sigs.k8s.io/

- ingress-nginx annotations reference: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/

- ingress-nginx rate-limiting guide: https://kubernetes.github.io/ingress-nginx/examples/customization/custom-configuration/

- Traefik Rate Limit middleware: https://doc.traefik.io/traefik/middlewares/http/ratelimit/

- Kong Rate Limiting plugin: https://docs.konghq.com/hub/kong-inc/rate-limiting/

- Apache APISIX limit-count plugin: https://apisix.apache.org/docs/apisix/plugins/limit-count/

- AWS Shield docs: https://docs.aws.amazon.com/waf/latest/developerguide/shield-chapter.html

- Google Cloud Armor docs: https://cloud.google.com/armor/docs/security-policy-overview

- nginx limit_req module: https://nginx.org/en/docs/http/ngx_http_limit_req_module.html