
Keycloak Observability: Metrics, Audit Logs, and Event-Driven Monitoring


Introduction

The moment an authentication system fails, every service fails. If login does not work, the entire product is effectively down from the user's point of view — and what is scarier is the "silent security incident": credential stuffing running for days while nobody notices. This is why the maturity of Keycloak operations ultimately converges on the maturity of observability.

Fortunately, the Keycloak of 2026 is incomparably better at observability than it used to be. The 26.x line ships micrometer-based metrics, user event metrics, and OpenTelemetry tracing out of the box, and 26.6 added operational features such as zero-downtime rolling updates. In this article we draw the full observability picture: the Keycloak event system and retention policies, exporting events through the EventListener SPI, building a Prometheus + Grafana monitoring stack, tracing and log collection, brute force and credential stuffing detection, alerting rules, and finally ISMS-P/SOC 2 audit readiness.

The Keycloak Event System

Login Events and Admin Events

Keycloak emits two kinds of events.

| Aspect | Login Events (User Events) | Admin Events |
| --- | --- | --- |
| Triggered by | End-user actions | Administrator / Admin API actions |
| Representative events | LOGIN, LOGIN_ERROR, LOGOUT, REGISTER, UPDATE_PASSWORD, REFRESH_TOKEN | CREATE/UPDATE/DELETE (user, client, role, realm settings) |
| Primary use | Security monitoring, user behavior analysis | Change auditing (who changed what) |
| Payload | User, client, IP, outcome, error code | Resource path, resource representation (when details are enabled) |

Event collection is enabled per realm. You can do it in the Admin Console under Realm Settings, but let us manage it as code.

    # Event configuration (kcadm.sh)
    ./kcadm.sh update events/config -r myrealm \
      -s eventsEnabled=true \
      -s 'eventsExpiration=2592000' \
      -s 'enabledEventTypes=["LOGIN","LOGIN_ERROR","LOGOUT","REGISTER","UPDATE_PASSWORD","UPDATE_EMAIL","REMOVE_TOTP","UPDATE_TOTP","TOKEN_EXCHANGE","REFRESH_TOKEN_ERROR"]' \
      -s adminEventsEnabled=true \
      -s adminEventsDetailsEnabled=true

- `eventsExpiration` is the retention period in seconds. The example above is 30 days.

- Enabling `adminEventsDetailsEnabled` stores the changed representation (JSON) alongside, letting you trace "what changed and how." This is mandatory if you have audit requirements.

- In 26.x, admin events have their own expiration setting as well, so specify retention for both kinds explicitly.
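As a quick sanity check on the `eventsExpiration` arithmetic, the sketch below converts a retention period in days into the seconds value used above and derives the cutoff instant before which events would count as expired. The class and method names are hypothetical helpers for illustration and are not part of any Keycloak API.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative helper (hypothetical name) for eventsExpiration arithmetic. */
final class EventRetention {
    private EventRetention() {}

    /** Converts a retention period in days to the seconds value eventsExpiration expects. */
    static long toExpirationSeconds(int days) {
        return Duration.ofDays(days).getSeconds();
    }

    /** Returns the instant before which events would be considered expired. */
    static Instant cutoff(Instant now, long expirationSeconds) {
        return now.minusSeconds(expirationSeconds);
    }
}
```

For example, `EventRetention.toExpirationSeconds(30)` yields 2592000, the value passed to `eventsExpiration` in the command above.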

Querying is available through the Admin API.

    # Fetch recent login failure events
    ./kcadm.sh get events -r myrealm \
      -q type=LOGIN_ERROR -q max=50

    # Admin change history for a specific user
    ./kcadm.sh get admin-events -r myrealm \
      -q resourcePath=users/USER_UUID -q max=20

Event retention and DB load — a recurring operational incident

Events are stored in Keycloak's main database (the EVENT_ENTITY and ADMIN_EVENT_ENTITY tables). Two incident patterns repeat themselves here.

    Incident pattern 1: infinite retention
      eventsExpiration unset → table balloons to hundreds of millions of rows
      → event INSERT slows down → entire login transaction slows down
      → flood of "login is slow" tickets

    Incident pattern 2: the bulk-delete bomb
      Expiration configured belatedly → expiry job tries to delete
      hundreds of millions of rows → DB lock contention, WAL/redo surge
      → production DB paralyzed

Practical guidelines:

- Keep event retention **short, for operational queries (7–30 days)**, and export long-term storage to external systems (see the SPI section below). The Keycloak DB is not an audit log archive.

- For tables that have already ballooned, do not rely on the expiration setting — clean up with batched deletes (partition-wise, repeated LIMIT) during a maintenance window, then configure expiration.

- If `enabledEventTypes` is left empty, every type is stored. Review whether high-frequency events like CODE_TO_TOKEN and REFRESH_TOKEN are needed, and list only the types you require. In environments with many token-refreshing SPAs, this choice can change row counts by an order of magnitude.
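The batched-delete guideline above can be sketched as a small loop. This is a minimal illustration, assuming the actual delete step is supplied by the caller: `deleteBatch` is a hypothetical callback that would run one bounded DELETE in its own transaction and return the number of rows removed. Nothing here is Keycloak API, and the batch size and pause are placeholders to tune for your database.

```java
import java.util.function.IntUnaryOperator;

/** Sketch of a batched purge loop; the delete step is injected by the caller. */
final class BatchedPurge {
    private BatchedPurge() {}

    /**
     * Calls deleteBatch(batchSize) until a batch comes back short, pausing
     * between batches. Returns the total number of rows removed.
     */
    static long purge(IntUnaryOperator deleteBatch, int batchSize, long pauseMillis) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        long total = 0;
        while (true) {
            int deleted = deleteBatch.applyAsInt(batchSize);
            total += deleted;
            if (deleted < batchSize) {
                return total; // a short batch means nothing is left to delete
            }
            try {
                Thread.sleep(pauseMillis); // give replication and cleanup time to catch up
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return total; // stop early but report what was removed
            }
        }
    }
}
```

Keeping each batch in its own short transaction is the point of the design: locks are held briefly and WAL/redo volume arrives in small increments instead of one surge.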

EventListener SPI — Shipping Events Out

Instead of polling the DB, the EventListener SPI pushes events to external systems (Kafka, Loki, SIEM) at the moment they occur. Built-in listeners include `jboss-logging` (log output) and `email` (user notification), and you can deploy custom implementations as JARs.

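    // A custom listener that publishes to Kafka
    package com.example.keycloak;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.keycloak.events.Event;
    import org.keycloak.events.EventListenerProvider;
    import org.keycloak.events.admin.AdminEvent;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class KafkaEventListenerProvider implements EventListenerProvider {
        private static final Logger LOG = LoggerFactory.getLogger(KafkaEventListenerProvider.class);

        private final KafkaProducer<String, String> producer;
        private final String topic;
        private final ObjectMapper mapper = new ObjectMapper();

        public KafkaEventListenerProvider(KafkaProducer<String, String> producer, String topic) {
            this.producer = producer;
            this.topic = topic;
        }

        @Override
        public void onEvent(Event event) {
            // Asynchronously publish security events such as LOGIN_ERROR
            try {
                // A mutable map is used because userId, clientId, and error are
                // often null, and Map.of rejects null values
                Map<String, Object> fields = new LinkedHashMap<>();
                fields.put("category", "USER_EVENT");
                fields.put("type", event.getType().name());
                fields.put("realmId", event.getRealmId());
                fields.put("clientId", event.getClientId());
                fields.put("userId", event.getUserId());
                fields.put("ipAddress", event.getIpAddress());
                fields.put("error", event.getError());
                fields.put("time", event.getTime());
                fields.put("details", event.getDetails());
                String payload = mapper.writeValueAsString(fields);
                producer.send(new ProducerRecord<>(topic, event.getRealmId(), payload));
            } catch (Exception e) {
                // Never break the authentication flow: log and proceed
                LOG.warn("event publish failed", e);
            }
        }

        @Override
        public void onEvent(AdminEvent adminEvent, boolean includeRepresentation) {
            try {
                Map<String, Object> fields = new LinkedHashMap<>();
                fields.put("category", "ADMIN_EVENT");
                fields.put("operation", adminEvent.getOperationType().name());
                fields.put("resourceType", adminEvent.getResourceTypeAsString());
                fields.put("resourcePath", adminEvent.getResourcePath());
                fields.put("realmId", adminEvent.getRealmId());
                // Auth details can be absent, so guard before dereferencing
                fields.put("authUserId",
                        adminEvent.getAuthDetails() != null ? adminEvent.getAuthDetails().getUserId() : null);
                fields.put("time", adminEvent.getTime());
                String payload = mapper.writeValueAsString(fields);
                producer.send(new ProducerRecord<>(topic, adminEvent.getRealmId(), payload));
            } catch (Exception e) {
                LOG.warn("admin event publish failed", e);
            }
        }

        @Override
        public void close() {
            // The producer is shared and owned by the factory, so nothing to release here
        }
    }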

You also need the factory and the registration file.

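    package com.example.keycloak;

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.keycloak.Config;
    import org.keycloak.events.EventListenerProvider;
    import org.keycloak.events.EventListenerProviderFactory;
    import org.keycloak.models.KeycloakSession;
    import org.keycloak.models.KeycloakSessionFactory;

    public class KafkaEventListenerProviderFactory implements EventListenerProviderFactory {
        private KafkaProducer<String, String> producer;
        private String topic;

        @Override
        public EventListenerProvider create(KeycloakSession session) {
            // One shared, thread-safe producer is reused for every session
            return new KafkaEventListenerProvider(producer, topic);
        }

        @Override
        public void init(Config.Scope config) {
            Properties props = new Properties();
            props.put("bootstrap.servers", config.get("bootstrapServers", "kafka:9092"));
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            props.put("acks", "1");
            props.put("linger.ms", "20");
            // Bound how long send() may block on metadata or a full buffer
            // (the Kafka default is 60 s, far too long for a login request)
            props.put("max.block.ms", "1000");
            this.producer = new KafkaProducer<>(props);
            this.topic = config.get("topic", "keycloak-events");
        }

        @Override
        public void postInit(KeycloakSessionFactory factory) {
            // Required by the ProviderFactory interface; nothing to do after startup
        }

        @Override
        public String getId() {
            // This ID is what the realm's eventsListeners setting refers to
            return "kafka-event-listener";
        }

        @Override
        public void close() {
            if (producer != null) producer.close();
        }
    }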

    # File: META-INF/services/org.keycloak.events.EventListenerProviderFactory
    # File content:
    com.example.keycloak.KafkaEventListenerProviderFactory

Drop the JAR into the providers directory, rebuild, and register the listener with the realm.

    cp keycloak-kafka-listener.jar /opt/keycloak/providers/
    /opt/keycloak/bin/kc.sh build

    ./kcadm.sh update events/config -r myrealm \
      -s 'eventsListeners=["jboss-logging","kafka-event-listener"]'

There is one iron rule when implementing this: **a listener failure must never fail authentication.** onEvent is invoked on the authentication transaction path, so external publishing must be asynchronous/buffered and exceptions must be swallowed. A design where company-wide login stops because Kafka is down is the worst possible outcome.
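One way to honor that rule is to place a bounded in-memory buffer between `onEvent` and the publisher. The sketch below is a minimal illustration, not a Keycloak or Kafka API: `AsyncEventBuffer` and its `sink` callback are hypothetical names, and the sink stands in for whatever actually ships the payload (for example a wrapper around `producer.send`). The caller side never blocks and never throws; when the buffer is full the event is dropped and counted.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Sketch: bounded hand-off between the authentication thread and a publisher thread. */
final class AsyncEventBuffer implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final Consumer<String> sink;
    private final AtomicLong dropped = new AtomicLong();
    private final Thread worker;
    private volatile boolean running = true;

    AsyncEventBuffer(int capacity, Consumer<String> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.sink = sink;
        this.worker = new Thread(this::drain, "event-publisher");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Called from onEvent: never blocks and never throws; drops when the buffer is full. */
    boolean submit(String payload) {
        boolean accepted = queue.offer(payload);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    long droppedCount() {
        return dropped.get();
    }

    private void drain() {
        // Keep going after shutdown is requested until the queue is empty
        while (running || !queue.isEmpty()) {
            try {
                String payload = queue.poll(100, TimeUnit.MILLISECONDS);
                if (payload != null) {
                    sink.accept(payload);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (RuntimeException e) {
                // A failing sink must not kill the worker; a real listener would log here
            }
        }
    }

    @Override
    public void close() {
        running = false;
        try {
            worker.join(1000); // wait briefly for queued events to flush
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Dropping on overflow is a deliberate trade-off in this sketch: losing a few events during an outage of the downstream system is preferable to stalling logins. Exposing `droppedCount()` as a metric makes that loss visible.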

A lightweight alternative: shipping to Loki

To start light without a custom JAR, a practical setup is to emit the event logs written by the `jboss-logging` listener as structured JSON and have Promtail/Alloy collect them into Loki.

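    # JSON log output + raise the event log level
    # The file handler is enabled so a collector can tail /var/log/keycloak/*.log
    bin/kc.sh start \
      --log=console,file \
      --log-console-output=json \
      --log-file=/var/log/keycloak/keycloak.log \
      --log-file-output=json \
      --log-level=INFO,org.keycloak.events:DEBUG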

    # promtail config excerpt: label only Keycloak events
    scrape_configs:
      - job_name: keycloak
        static_configs:
          - targets: [localhost]
            labels:
              job: keycloak
              __path__: /var/log/keycloak/*.log
        pipeline_stages:
          - json:
              expressions:
                logger: loggerName
                message: message
          - match:
              selector: '{job="keycloak"} |= "org.keycloak.events"'
              stages:
                - labels:
                    logger:

The Kafka path suits SIEM/real-time detection pipelines, while the Loki path suits operational queries and mid-term retention. They are not mutually exclusive; many organizations run both.

The Metrics Endpoint — Micrometer and /metrics

Keycloak exposes micrometer-based metrics at /metrics on the management port (9000 by default).

    bin/kc.sh start \
      --metrics-enabled=true \
      --event-metrics-user-enabled=true \
      --event-metrics-user-tags=realm,clientId,idp \
      --http-metrics-histograms-enabled=true

- `metrics-enabled`: exposes JVM, DB connection pool (Agroal), HTTP, and Infinispan cache metrics.

- `event-metrics-user-enabled`: the user event metrics of 26.x. Events such as login success/failure are aggregated as counters, giving you real-time statistics without scraping the DB event tables.

- `event-metrics-user-tags`: controls the tags attached to the metrics. The clientId and idp tags increase cardinality, so choose carefully in environments with many clients.

A Prometheus scrape configuration example:

    # prometheus.yml excerpt
    scrape_configs:
      - job_name: keycloak
        metrics_path: /metrics
        static_configs:
          - targets: ['keycloak-0.mgmt:9000', 'keycloak-1.mgmt:9000']
        scheme: https
        tls_config:
          insecure_skip_verify: false

The core metrics to watch

| Area | Metric (example) | Meaning and alert criteria |
| --- | --- | --- |
| Logins | keycloak_user_events_total (event tags login, login_error) | Failure rate surge = attack or outage |
| Tokens | keycloak_user_events_total (event tags refresh_token, code_to_token) | Sudden issuance change = client anomaly |
| HTTP | http_server_requests_seconds (uri, status tags) | p99 latency, 5xx ratio |
| DB pool | agroal_available_count, agroal_blocking_time_average_milliseconds | Pool exhaustion = precursor to total outage |
| Cache | vendor_statistics family (Infinispan sessions, realms caches) | Hit-rate drop, cluster rebalancing |
| JVM | jvm_memory_used_bytes, jvm_gc_pause_seconds | Correlation between GC pause and login latency |

The **DB connection pool (agroal)** in particular is the front-line indicator of Keycloak outages. Once pool wait times start growing, login timeouts often follow within minutes — set alerts on available_count exhaustion and blocking_time growth.

Building Grafana Dashboards

We recommend splitting the dashboards into two boards: "security operations" and "system health."

    +----------------------------------------------------------------+
    | Keycloak Security Operations                                   |
    +--------------------------+-------------------------------------+
    | Login success/failure    | Failure rate % (vs success+failure) |
    | (rate), stacked by       | with threshold lines (5%, 20%)      |
    | realm/client             |                                     |
    +--------------------------+-------------------------------------+
    | LOGIN_ERROR reasons      | Top 10 failing IPs (Loki query)     |
    | invalid_user_credentials | user_not_found surge = enumeration  |
    +--------------------------+-------------------------------------+
    | Registrations/password   | Admin event timeline                |
    | changes                  |                                     |
    +--------------------------+-------------------------------------+

    +----------------------------------------------------------------+
    | Keycloak System Health                                         |
    +--------------------------+-------------------------------------+
    | HTTP p50/p95/p99         | 5xx ratio                           |
    +--------------------------+-------------------------------------+
    | Agroal pool usage/wait   | Infinispan cache hit rate / entries |
    +--------------------------+-------------------------------------+
    | JVM heap / GC pause      | Per-instance CPU / threads          |
    +--------------------------+-------------------------------------+

A few frequently used PromQL queries:

    # Login failure rate over a 5-minute window (%)
    100 *
      sum(rate(keycloak_user_events_total{event="login_error"}[5m]))
      /
      sum(rate(keycloak_user_events_total{event=~"login|login_error"}[5m]))

    # p99 latency of the token endpoint
    histogram_quantile(0.99,
      sum by (le) (rate(http_server_requests_seconds_bucket{uri=~".*token.*"}[5m])))

    # Average DB pool blocking time (ms)
    avg(agroal_blocking_time_average_milliseconds)
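To make the arithmetic behind the failure-rate query concrete, here is a minimal sketch that computes the same percentage from two samples of each counter. It is a simplification of what `rate()` does (no extrapolation, a single reset heuristic), and the class and method names are hypothetical.

```java
/** Sketch of the failure-rate arithmetic behind the PromQL query (hypothetical helper). */
final class FailureRate {
    private FailureRate() {}

    /** Per-second increase between two counter samples; a decrease is treated as a counter reset. */
    static double perSecondRate(double earlier, double later, double windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        double increase = later >= earlier ? later - earlier : later;
        return increase / windowSeconds;
    }

    /** 100 * errors / (successes + errors); NaN when there was no traffic at all (0/0). */
    static double failurePercent(double successRate, double errorRate) {
        return 100.0 * errorRate / (successRate + errorRate);
    }
}
```

With 270 successful logins and 30 failures in a 300-second window, the rates are 0.9/s and 0.1/s, so the failure rate is 10%. The denominator is success plus failure, which mirrors the `event=~"login|login_error"` selector in the query.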

OpenTelemetry Tracing

Distributed tracing is the answer to "login is slow but we do not know where." Keycloak 26.x supports OpenTelemetry tracing natively.

    bin/kc.sh start \
      --tracing-enabled=true \
      --tracing-endpoint=http://otel-collector:4317 \
      --tracing-protocol=grpc \
      --tracing-sampler-type=traceidratio \
      --tracing-sampler-ratio=0.05 \
      --tracing-service-name=keycloak-prod

- Traces include spans for HTTP requests, DB queries, LDAP calls, and more, so you can see at a glance that "of the 2.5-second login, the LDAP bind took 2.1 seconds."

- In production, start with a low sampling ratio (1–5%). A parent-based sampler such as `parentbased_traceidratio` respects trace context propagated from the gateway.

- Tail-based sampling at the Collector — "keep all slow requests and errors" — pairs especially well with authentication systems.

- Combine with JSON logs so that trace_id is printed in log lines, and you complete the trinity in Grafana: drill down from metrics to traces to logs.
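The "keep all slow requests and errors" idea from the list above can be sketched as a simple decision function. This is only an illustration of the policy logic, not the OpenTelemetry Collector's tail-sampling processor or its configuration; the class name, the threshold, and the way the trace ID is mapped to a ratio are all assumptions made for the example.

```java
/** Sketch of a tail-sampling decision: keep errors and slow traces, sample the rest. */
final class TailSamplingPolicy {
    private final long slowThresholdMillis;
    private final double baselineRatio;

    TailSamplingPolicy(long slowThresholdMillis, double baselineRatio) {
        if (baselineRatio < 0.0 || baselineRatio > 1.0) {
            throw new IllegalArgumentException("baselineRatio must be within [0, 1]");
        }
        this.slowThresholdMillis = slowThresholdMillis;
        this.baselineRatio = baselineRatio;
    }

    /**
     * Decides after the trace has finished. traceIdLow64 is the low 64 bits of
     * the trace ID, used as a stable pseudo-random source so the same trace
     * always gets the same decision.
     */
    boolean keep(long durationMillis, boolean hasError, long traceIdLow64) {
        if (hasError) {
            return true;
        }
        if (durationMillis >= slowThresholdMillis) {
            return true;
        }
        // Map the top 53 bits onto [0, 1) and compare with the baseline ratio
        double position = (traceIdLow64 >>> 11) / (double) (1L << 53);
        return position < baselineRatio;
    }
}
```

With `new TailSamplingPolicy(1000, 0.05)`, a 2.5-second login is always kept, a failed login is always kept, and roughly 5% of fast successful logins are retained as a baseline.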

Anomaly Detection — Brute Force and Credential Stuffing

Built-in brute force detection

A realm-level built-in defense:

    ./kcadm.sh update realms/myrealm \
      -s bruteForceProtected=true \
      -s failureFactor=5 \
      -s waitIncrementSeconds=60 \
      -s maxFailureWaitSeconds=900 \
      -s maxDeltaTimeSeconds=43200 \
      -s permanentLockout=false

When the failure count crosses the threshold, the account is locked with increasing wait times. But this alone is not enough — the built-in feature is a **per-account** defense.
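To see how the settings above interact, the sketch below computes the temporary lockout for a given failure count. It is a minimal illustration under a stated assumption: that the wait follows `waitIncrementSeconds * (failures / failureFactor)` with integer division, capped at `maxFailureWaitSeconds`, as the Server Administration Guide describes the temporary lockout algorithm. It ignores the quick-login check and the `maxDeltaTimeSeconds` reset, and behavior may differ by version and lockout strategy, so verify against your deployment. The class name is hypothetical.

```java
/** Sketch of the temporary lockout arithmetic (assumed formula, hypothetical helper). */
final class LockoutWait {
    private LockoutWait() {}

    static long waitSeconds(int failures, int failureFactor,
                            long waitIncrementSeconds, long maxFailureWaitSeconds) {
        if (failureFactor <= 0) {
            throw new IllegalArgumentException("failureFactor must be positive");
        }
        // Integer division: the wait grows only at multiples of failureFactor
        long wait = waitIncrementSeconds * (failures / failureFactor);
        return Math.min(wait, maxFailureWaitSeconds);
    }
}
```

Under that assumption and the values configured above, 4 failures produce no lockout, 5 failures produce 60 seconds, 10 failures produce 120 seconds, and the wait never exceeds 900 seconds.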

Credential stuffing has a different pattern

| Attack | Pattern | Built-in defense effectiveness | Additional response |
| --- | --- | --- | --- |
| Brute force | Many passwords against one account | Effective (account lockout) | Built-in feature + alerting |
| Credential stuffing | 1–2 attempts each against many accounts | Nearly powerless | IP/ASN-level analysis, WAF, bot detection |
| Password spraying | Same password against many accounts | Nearly powerless | Attempted-password pattern analysis, MFA |

Credential stuffing produces only 1–2 failures per account, so it never trips account lockout. The clue for detection is the **global pattern**:

- A surge in the overall LOGIN_ERROR rate (especially user_not_found and invalid_user_credentials rising together)

- Many distinct usernames attempted from a single IP/range (aggregate by ipAddress in Loki)

- Unusual geography/ASN, abnormal User-Agent distribution

- Uniform request intervals during early-morning hours (a bot signature)

If you are streaming events to Kafka, stream processing (Flink/ksqlDB) can evaluate rules in real time such as "more than M distinct usernames from one IP within N minutes." The fundamental countermeasures remain expanding MFA/passkeys (the passkeys login-form integration in 26.6 has lowered the adoption barrier significantly) and breached-password blocking policies.
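The rule "more than M distinct usernames from one IP within N minutes" can be sketched in plain Java before committing to a stream processor. This is a minimal single-process illustration, assuming LOGIN_ERROR events arrive roughly in time order and carry the attempted username (for example from the event details); the class name and thresholds are hypothetical, and a production version would also need to evict idle IPs to bound memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: flags an IP that tries too many distinct usernames inside a sliding window. */
final class StuffingDetector {
    private static final class Attempt {
        final long timeMillis;
        final String username;

        Attempt(long timeMillis, String username) {
            this.timeMillis = timeMillis;
            this.username = username;
        }
    }

    private final long windowMillis;
    private final int maxDistinctUsernames;
    private final Map<String, Deque<Attempt>> attemptsByIp = new HashMap<>();

    StuffingDetector(long windowMillis, int maxDistinctUsernames) {
        this.windowMillis = windowMillis;
        this.maxDistinctUsernames = maxDistinctUsernames;
    }

    /** Records one failed login and returns true if the IP now exceeds the threshold. */
    boolean onLoginError(String ip, String username, long timeMillis) {
        Deque<Attempt> attempts = attemptsByIp.computeIfAbsent(ip, k -> new ArrayDeque<>());
        // Evict attempts that have fallen out of the window
        while (!attempts.isEmpty() && attempts.peekFirst().timeMillis <= timeMillis - windowMillis) {
            attempts.removeFirst();
        }
        attempts.addLast(new Attempt(timeMillis, username));
        Set<String> distinct = new HashSet<>();
        for (Attempt attempt : attempts) {
            distinct.add(attempt.username);
        }
        return distinct.size() > maxDistinctUsernames;
    }
}
```

Counting distinct usernames rather than raw failures is what separates this from brute force detection: one user mistyping a password five times never trips it, while one IP walking through a leaked credential list does.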

Alerting Rule Examples (Prometheus Alertmanager)

    groups:
      - name: keycloak-security
        rules:
          - alert: KeycloakLoginFailureRateHigh
            expr: |
              100 *
                sum(rate(keycloak_user_events_total{event="login_error"}[5m]))
                /
                sum(rate(keycloak_user_events_total{event=~"login|login_error"}[5m]))
              > 20
            for: 10m
            labels:
              severity: warning
              team: identity
            annotations:
              summary: "Login failure rate above 20% (sustained 10m)"
              description: "Possible attack or authentication-path outage. Check the security dashboard."

          - alert: KeycloakLoginErrorSpike
            expr: |
              sum(rate(keycloak_user_events_total{event="login_error"}[5m]))
              > 4 * sum(rate(keycloak_user_events_total{event="login_error"}[5m] offset 1d))
            for: 15m
            labels:
              severity: critical
            annotations:
              summary: "Login failures 4x higher than the same time yesterday"

      - name: keycloak-health
        rules:
          - alert: KeycloakDbPoolExhausted
            expr: min(agroal_available_count) == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "DB connection pool exhausted; full login outage imminent"

          - alert: KeycloakTokenLatencyHigh
            expr: |
              histogram_quantile(0.99,
                sum by (le) (rate(http_server_requests_seconds_bucket{uri=~".*token.*"}[5m])))
              > 1
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Token endpoint p99 above 1 second"

          - alert: KeycloakInstanceDown
            expr: up{job="keycloak"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Keycloak instance down"

The principle of alert design is to "separate the recipients of security alerts and availability alerts." A failure-rate surge should go to the security team channel, pool exhaustion to the platform on-call, and every alert should carry links to the dashboard and runbook.

Audit Compliance — the ISMS-P / SOC 2 Perspective

Your observability stack doubles as an audit-readiness asset. Mapping it to the items auditors repeatedly request:

| Audit requirement (common to ISMS-P / SOC 2) | Keycloak answer |
| --- | --- |
| Retain authentication success/failure records | Login events + external long-term storage (SIEM/object storage) |
| Trace administrator actions | Admin events with details (stored representation of the changed resource) |
| Tamper-proof logs | Ship externally, then immutable storage (WORM, bucket lock) |
| Retention period policy | Short internal + 1+ year external (varies by regulation), documented |
| Periodic access reviews | Regular role/group membership extraction reports via the Admin API |
| Anomalous access monitoring | Failure-rate/geo/IP alert rules + response runbooks |
| Time synchronization | NTP, the precondition for trustworthy event timestamps |

A few practical tips:

- What auditors demand is not "logs exist" but **"evidence that you looked at the logs and acted."** Build a workflow (ticket integration) that records alert fired → reviewed → acted.

- Admin event representations may contain sensitive data. Define masking policies in the export pipeline, and verify with legal that personal-data retention rules do not conflict with log retention periods.

- If you manage realm settings/policies as IaC (Terraform, keycloak-config-cli), a large part of the "change control" requirement is evidenced by Git history. Admin events then become the safety net that catches out-of-band changes made through the console.
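The masking policy mentioned in the tips above could look like the sketch below, applied to a parsed admin event representation before it leaves the export pipeline. It is a minimal illustration: the class name and the deny-list of keys are assumptions for the example, not a list of fields Keycloak guarantees to emit, and a real policy should be derived from your own data classification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Sketch: recursively masks sensitive keys in a parsed JSON representation. */
final class RepresentationMasker {
    private static final String MASK = "***";

    // Hypothetical deny-list, compared case-insensitively
    private static final Set<String> SENSITIVE_KEYS =
            Set.of("credentials", "secret", "clientsecret", "password", "secretdata");

    private RepresentationMasker() {}

    /** Returns a masked copy; the input map is left untouched. */
    static Map<String, Object> mask(Map<?, ?> representation) {
        Map<String, Object> masked = new LinkedHashMap<>();
        for (Map.Entry<?, ?> entry : representation.entrySet()) {
            String key = String.valueOf(entry.getKey());
            boolean sensitive = SENSITIVE_KEYS.contains(key.toLowerCase(Locale.ROOT));
            masked.put(key, sensitive ? MASK : maskValue(entry.getValue()));
        }
        return masked;
    }

    private static Object maskValue(Object value) {
        if (value instanceof Map) {
            return mask((Map<?, ?>) value);
        }
        if (value instanceof List) {
            List<Object> items = new ArrayList<>();
            for (Object item : (List<?>) value) {
                items.add(maskValue(item));
            }
            return items;
        }
        return value; // scalars under non-sensitive keys pass through unchanged
    }
}
```

Masking by key name keeps the policy auditable, since the deny-list itself can live in version control and be reviewed alongside the retention policy.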

Summary of Operational Best Practices

- Enable events, but keep DB retention short — externalize long-term storage via the SPI/log pipeline.

- Listener implementations must never block the authentication path: asynchronous + exception isolation.

- Make metrics-enabled plus event-metrics-user-enabled part of your default start options.

- Split dashboards into security operations and system health; alerts on agroal pool metrics are mandatory.

- Start tracing with low sampling and graduate to tail-based sampling.

- Design per-account brute force defense and global-pattern stuffing detection as separate concerns.

- Alerts need runbook links, separated security/availability recipients, and periodic alert-fatigue reviews.

- Audit readiness means not just "collection" but "response evidence" — build the workflow and immutable storage.

Closing

Keycloak observability is complete when designed as four mutually reinforcing pillars: events (what happened), metrics (what state are we in now), traces (why is it slow), and logs (detailed context). With metrics and tracing built in since 26.x, the barrier to entry has dropped sharply; what remains is organizational discipline — retention policies, alert design, response runbooks.

Monitoring an authentication system sits at the intersection of security detection and compliance, beyond plain infrastructure operations. Stack up the components in this article one by one, and you can reach an operational posture where you "know about login outages before your users do" — and "know about attacks before the attackers succeed."

