Keycloak Observability — Metrics, Audit Logs, and Event-Driven Monitoring

Introduction

When the authentication system fails, every service behind it fails. If login does not work, the entire product is effectively down from the user's point of view. Scarier still is the "silent security incident": credential stuffing that runs for days while nobody notices. This is why the maturity of Keycloak operations is ultimately bounded by the maturity of its observability.

Fortunately, Keycloak in 2026 is far more observable than it used to be. The 26.x line ships Micrometer-based metrics, user event metrics, and OpenTelemetry tracing out of the box, and 26.6 added operational features such as zero-downtime rolling updates. This article draws the full observability picture: the Keycloak event system and retention policies, exporting events through the EventListener SPI, building a Prometheus + Grafana monitoring stack, tracing and log collection, brute force and credential stuffing detection, alerting rules, and finally ISMS-P/SOC 2 audit readiness.

The Keycloak Event System

Login Events and Admin Events

Keycloak emits two kinds of events.

| Aspect | Login Events (User Events) | Admin Events |
|---|---|---|
| Triggered by | End-user actions | Administrator / Admin API actions |
| Representative events | LOGIN, LOGIN_ERROR, LOGOUT, REGISTER, UPDATE_PASSWORD, REFRESH_TOKEN | CREATE/UPDATE/DELETE (user, client, role, realm settings) |
| Primary use | Security monitoring, user behavior analysis | Change auditing (who changed what) |
| Payload | User, client, IP, outcome, error code | Resource path, representation of the change (when details are enabled) |

Event collection is enabled per realm. You can do it in the Admin Console under Realm Settings, but let us manage it as code.

# Event configuration (kcadm.sh)
./kcadm.sh update events/config -r myrealm \
  -s eventsEnabled=true \
  -s 'eventsExpiration=2592000' \
  -s 'enabledEventTypes=["LOGIN","LOGIN_ERROR","LOGOUT","REGISTER","UPDATE_PASSWORD","UPDATE_EMAIL","REMOVE_TOTP","UPDATE_TOTP","TOKEN_EXCHANGE","REFRESH_TOKEN_ERROR"]' \
  -s adminEventsEnabled=true \
  -s adminEventsDetailsEnabled=true

  • eventsExpiration is the retention period in seconds. The example above is 30 days.
  • Enabling adminEventsDetailsEnabled stores the changed representation (JSON) alongside, letting you trace "what changed and how." This is mandatory if you have audit requirements.
  • In 26.x, admin events have their own expiration setting as well, so specify retention for both kinds explicitly.
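Because eventsExpiration takes raw seconds, it is easy to slip by a factor of 60 or 24. A minimal sketch for deriving the value from a retention period in days follows; the class and method names are hypothetical and exist only for illustration.

```java
import java.time.Duration;

final class EventRetention {
  /** Converts a retention period in days into the seconds value eventsExpiration expects. */
  static long expirationSeconds(int days) {
    if (days <= 0) {
      throw new IllegalArgumentException("retention must be at least one day");
    }
    return Duration.ofDays(days).toSeconds();
  }

  public static void main(String[] args) {
    System.out.println(expirationSeconds(30)); // 2592000, the value used in the kcadm.sh example
    System.out.println(expirationSeconds(7));  // 604800
  }
}
```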

Querying is available through the Admin API.

# Fetch recent login failure events
./kcadm.sh get events -r myrealm \
  -q type=LOGIN_ERROR -q max=50

# Admin change history for a specific user
./kcadm.sh get admin-events -r myrealm \
  -q resourcePath=users/USER_UUID -q max=20
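kcadm.sh is a thin wrapper over the Admin REST API, so the same query can be issued from code. The sketch below only builds the URL for the events endpoint (/admin/realms/{realm}/events). The host name is hypothetical, the realm name is assumed to be URL-safe, and obtaining and attaching an admin bearer token is left out.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class EventQuery {
  /** Builds the Admin REST URL equivalent to `kcadm.sh get events -r realm -q type=... -q max=...`. */
  static URI userEvents(String baseUrl, String realm, String type, int max) {
    String encodedType = URLEncoder.encode(type, StandardCharsets.UTF_8);
    return URI.create(baseUrl + "/admin/realms/" + realm
        + "/events?type=" + encodedType + "&max=" + max);
  }

  public static void main(String[] args) {
    // Hypothetical host; older deployments may still serve the API under an /auth prefix.
    System.out.println(userEvents("https://sso.example.com", "myrealm", "LOGIN_ERROR", 50));
  }
}
```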

Event retention and DB load — a recurring operational incident

Events are stored in Keycloak's main database (the EVENT_ENTITY and ADMIN_EVENT_ENTITY tables). Two incident patterns repeat themselves here.

Incident pattern 1: infinite retention
  eventsExpiration unset → table balloons to hundreds of millions of rows
  → event INSERT slows down → entire login transaction slows down
  → flood of "login is slow" tickets

Incident pattern 2: the bulk-delete bomb
  Expiration configured belatedly → expiry job tries to delete
  hundreds of millions of rows → DB lock contention, WAL/redo surge
  → production DB paralyzed

Practical guidelines:

  • Keep event retention short, for operational queries (7–30 days), and export long-term storage to external systems (see the SPI section below). The Keycloak DB is not an audit log archive.
  • For tables that have already ballooned, do not rely on the expiration setting — clean up with batched deletes (partition-wise, repeated LIMIT) during a maintenance window, then configure expiration.
  • If enabledEventTypes is left empty, every type is stored. Review whether high-frequency events like CODE_TO_TOKEN and REFRESH_TOKEN are needed, and list only the types you require. In environments with many token-refreshing SPAs, this difference shows up as an order-of-magnitude difference in row counts.
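The batched cleanup of an already bloated table can be sketched as a simple loop: delete a bounded number of rows, pause, repeat until a batch comes back short. The code below is only the control loop. The actual DELETE is supplied by the caller, and the SQL in the comment is a hypothetical PostgreSQL-style statement that assumes EVENT_TIME holds epoch milliseconds; verify it against your own schema before use.

```java
import java.util.function.IntUnaryOperator;

final class BatchedPurge {
  /**
   * Calls deleteBatch(batchSize) until a batch comes back short, pausing between
   * batches so the database can keep up. Returns the total number of rows deleted.
   *
   * deleteBatch is expected to run one small DELETE in its own transaction, e.g.
   * (hypothetical, PostgreSQL-style):
   *   DELETE FROM EVENT_ENTITY WHERE ID IN
   *     (SELECT ID FROM EVENT_ENTITY WHERE EVENT_TIME < ? LIMIT ?)
   */
  static long purge(IntUnaryOperator deleteBatch, int batchSize, long pauseMillis) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    long total = 0;
    while (true) {
      int deleted = deleteBatch.applyAsInt(batchSize);
      total += deleted;
      if (deleted < batchSize) {
        return total; // a short batch means nothing older is left
      }
      try {
        Thread.sleep(pauseMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return total; // stop early; the caller can resume later
      }
    }
  }
}
```

Small batches keep each transaction short, which is what avoids the lock contention and WAL/redo surge described in incident pattern 2.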

EventListener SPI — Shipping Events Out

Instead of polling the DB, the EventListener SPI pushes events to external systems (Kafka, Loki, SIEM) at the moment they occur. Built-in listeners include jboss-logging (log output) and email (user notification), and you can deploy custom implementations as JARs.

A custom listener that publishes to Kafka

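package com.example.keycloak;

import java.util.HashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.keycloak.events.Event;
import org.keycloak.events.EventListenerProvider;
import org.keycloak.events.admin.AdminEvent;
import org.keycloak.events.admin.AuthDetails;
import org.slf4j.LoggerFactory;

public class KafkaEventListenerProvider implements EventListenerProvider {

  private final KafkaProducer<String, String> producer;
  private final String topic;
  private final ObjectMapper mapper = new ObjectMapper();

  public KafkaEventListenerProvider(KafkaProducer<String, String> producer, String topic) {
    this.producer = producer;
    this.topic = topic;
  }

  @Override
  public void onEvent(Event event) {
    // Asynchronously publish security events such as LOGIN_ERROR
    try {
      // HashMap rather than Map.of: userId, clientId, error, and details are often
      // null (for example a failed login for an unknown user), and Map.of throws
      // NullPointerException on null values, which would silently drop the event.
      Map<String, Object> fields = new HashMap<>();
      fields.put("category", "USER_EVENT");
      fields.put("type", event.getType().name());
      fields.put("realmId", event.getRealmId());
      fields.put("clientId", event.getClientId());
      fields.put("userId", event.getUserId());
      fields.put("ipAddress", event.getIpAddress());
      fields.put("error", event.getError());
      fields.put("time", event.getTime());
      fields.put("details", event.getDetails());
      producer.send(new ProducerRecord<>(topic, event.getRealmId(), mapper.writeValueAsString(fields)));
    } catch (Exception e) {
      // Never break the authentication flow: log and proceed
      LoggerFactory.getLogger(getClass()).warn("event publish failed", e);
    }
  }

  @Override
  public void onEvent(AdminEvent adminEvent, boolean includeRepresentation) {
    try {
      AuthDetails auth = adminEvent.getAuthDetails();
      Map<String, Object> fields = new HashMap<>();
      fields.put("category", "ADMIN_EVENT");
      fields.put("operation", adminEvent.getOperationType().name());
      fields.put("resourceType", String.valueOf(adminEvent.getResourceTypeAsString()));
      fields.put("resourcePath", adminEvent.getResourcePath());
      fields.put("realmId", adminEvent.getRealmId());
      fields.put("authUserId", auth != null ? auth.getUserId() : null);
      fields.put("time", adminEvent.getTime());
      producer.send(new ProducerRecord<>(topic, adminEvent.getRealmId(), mapper.writeValueAsString(fields)));
    } catch (Exception e) {
      LoggerFactory.getLogger(getClass()).warn("admin event publish failed", e);
    }
  }

  @Override
  public void close() {
    // The producer is shared and owned by the factory, so nothing to release here
  }
}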

You also need the factory and the registration file.

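package com.example.keycloak;

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.keycloak.Config;
import org.keycloak.events.EventListenerProvider;
import org.keycloak.events.EventListenerProviderFactory;
import org.keycloak.models.KeycloakSession;
import org.keycloak.models.KeycloakSessionFactory;

public class KafkaEventListenerProviderFactory implements EventListenerProviderFactory {

  private KafkaProducer<String, String> producer;
  private String topic;

  @Override
  public EventListenerProvider create(KeycloakSession session) {
    // One lightweight provider per session, all sharing the same producer
    return new KafkaEventListenerProvider(producer, topic);
  }

  @Override
  public void init(Config.Scope config) {
    Properties props = new Properties();
    props.put("bootstrap.servers", config.get("bootstrapServers", "kafka:9092"));
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());
    props.put("acks", "1");
    props.put("linger.ms", "20");
    // send() can block for up to max.block.ms (60 s by default) while waiting for
    // metadata or buffer space; cap it so a Kafka outage cannot stall logins.
    props.put("max.block.ms", "1000");
    this.producer = new KafkaProducer<>(props);
    this.topic = config.get("topic", "keycloak-events");
  }

  @Override
  public void postInit(KeycloakSessionFactory factory) {
    // Required by ProviderFactory; nothing to do after startup
  }

  @Override
  public String getId() {
    return "kafka-event-listener";
  }

  @Override
  public void close() {
    if (producer != null) producer.close();
  }
}

Registration file: META-INF/services/org.keycloak.events.EventListenerProviderFactory
File content: com.example.keycloak.KafkaEventListenerProviderFactory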

Drop the JAR into the providers directory, rebuild, and register the listener with the realm.

cp keycloak-kafka-listener.jar /opt/keycloak/providers/
/opt/keycloak/bin/kc.sh build

./kcadm.sh update events/config -r myrealm \
  -s 'eventsListeners=["jboss-logging","kafka-event-listener"]'

There is one iron rule when implementing this: a listener failure must never fail authentication. onEvent is invoked on the authentication transaction path, so external publishing must be asynchronous/buffered and exceptions must be swallowed. A design where company-wide login stops because Kafka is down is the worst possible outcome.
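One way to make the "asynchronous/buffered" part explicit is a bounded hand-off queue between onEvent and a background publisher thread. The sketch below is an illustration under assumptions, not part of Keycloak or Kafka: BoundedEventBuffer is a hypothetical helper, and when the buffer is full it drops the event and counts the drop rather than making the login wait.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class BoundedEventBuffer {
  private final BlockingQueue<String> queue;
  private final AtomicLong dropped = new AtomicLong();

  BoundedEventBuffer(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Called on the authentication path: returns immediately and drops the payload when full. */
  boolean offer(String payload) {
    if (payload == null) {
      return false;
    }
    boolean accepted = queue.offer(payload); // non-blocking, unlike put()
    if (!accepted) {
      dropped.incrementAndGet();
    }
    return accepted;
  }

  /** Called by the background publisher: returns null when nothing is waiting. */
  String poll() {
    return queue.poll();
  }

  /** Worth exporting as a metric so that silent event loss becomes visible. */
  long droppedCount() {
    return dropped.get();
  }
}
```

The trade-off is deliberate: under a prolonged downstream outage some events are lost, but authentication keeps working.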

A lightweight alternative: shipping to Loki

To start light without a custom JAR, a practical setup is to emit the event logs written by the jboss-logging listener as structured JSON and have Promtail/Alloy collect them into Loki.

# JSON log output + raise the event log level
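bin/kc.sh start \
  --log=console,file \
  --log-console-output=json \
  --log-file=/var/log/keycloak/keycloak.log \
  --log-file-output=json \
  --log-level=INFO,org.keycloak.events:DEBUG

# promtail config excerpt: tail the JSON log file written above, label only Keycloak events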
scrape_configs:
  - job_name: keycloak
    static_configs:
      - targets: [localhost]
        labels:
          job: keycloak
          __path__: /var/log/keycloak/*.log
    pipeline_stages:
      - json:
          expressions:
            logger: loggerName
            message: message
      - match:
          selector: '{job="keycloak"} |= "org.keycloak.events"'
          stages:
            - labels:
                logger:

The Kafka path suits SIEM/real-time detection pipelines, while the Loki path suits operational queries and mid-term retention. They are not mutually exclusive; many organizations run both.

The Metrics Endpoint — Micrometer and /metrics

Keycloak exposes micrometer-based metrics at /metrics on the management port (9000 by default).

bin/kc.sh start \
  --metrics-enabled=true \
  --event-metrics-user-enabled=true \
  --event-metrics-user-tags=realm,clientId,idp \
  --http-metrics-histograms-enabled=true

  • metrics-enabled: exposes JVM, DB connection pool (Agroal), HTTP, and Infinispan cache metrics.
  • event-metrics-user-enabled: the user event metrics of 26.x. Events such as login success/failure are aggregated as counters, giving you real-time statistics without scraping the DB event tables.
  • event-metrics-user-tags: controls the tags attached to the metrics. The clientId and idp tags increase cardinality, so choose carefully in environments with many clients.
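To get a feel for the cardinality cost before enabling a tag, multiply the number of distinct values each tag can take. This is a rough upper bound under the assumption that every combination actually occurs, which is rarely the case in practice; the class name and the numbers are hypothetical.

```java
final class SeriesEstimate {
  /** Upper bound on time series: the product of distinct values per tag. */
  static long upperBound(long... distinctValuesPerTag) {
    long product = 1;
    for (long values : distinctValuesPerTag) {
      if (values <= 0) {
        throw new IllegalArgumentException("each tag needs at least one value");
      }
      product = Math.multiplyExact(product, values); // fail loudly on overflow
    }
    return product;
  }

  public static void main(String[] args) {
    // 3 realms x 10 event types
    System.out.println(upperBound(3, 10));         // 30
    // the same, plus 200 clients and 3 identity providers
    System.out.println(upperBound(3, 10, 200, 3)); // 18000
  }
}
```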

A Prometheus scrape configuration example:

# prometheus.yml excerpt
scrape_configs:
  - job_name: keycloak
    metrics_path: /metrics
    static_configs:
      - targets: ['keycloak-0.mgmt:9000', 'keycloak-1.mgmt:9000']
    scheme: https
    tls_config:
      insecure_skip_verify: false

The core metrics to watch

| Area | Metric (example) | Meaning and alert criteria |
|---|---|---|
| Logins | keycloak_user_events_total (event tags login, login_error) | Failure rate surge = attack or outage |
| Tokens | keycloak_user_events_total (event tags refresh_token, code_to_token) | Sudden issuance change = client anomaly |
| HTTP | http_server_requests_seconds (uri, status tags) | p99 latency, 5xx ratio |
| DB pool | agroal_available_count, agroal_blocking_time_average_milliseconds | Pool exhaustion = precursor to total outage |
| Cache | vendor_statistics family (Infinispan sessions, realms caches) | Hit-rate drop, cluster rebalancing |
| JVM | jvm_memory_used_bytes, jvm_gc_pause_seconds | Correlation between GC pause and login latency |

The DB connection pool (agroal) in particular is the front-line indicator of Keycloak outages. Once pool wait times start growing, login timeouts often follow within minutes — set alerts on available_count exhaustion and blocking_time growth.

Building Grafana Dashboards

We recommend splitting the dashboards into two boards: "security operations" and "system health."

+----------------------------------------------------------------+
| Keycloak Security Operations                                   |
+------------------------+---------------------------------------+
| Login success/failure  | Failure rate % (vs success+failure)   |
| (rate), stacked by     | with threshold lines (5%, 20%)        |
| realm/client           |                                       |
+------------------------+---------------------------------------+
| LOGIN_ERROR reasons    | Top 10 failing IPs (Loki query)       |
|invalid_user_credentials| user_not_found surge = enumeration    |
+------------------------+---------------------------------------+
| Registrations/password | Admin event timeline                  |
| changes                |                                       |
+------------------------+---------------------------------------+

+----------------------------------------------------------------+
| Keycloak System Health                                         |
+------------------------+---------------------------------------+
| HTTP p50/p95/p99       | 5xx ratio                             |
+------------------------+---------------------------------------+
| Agroal pool usage/wait | Infinispan cache hit rate / entries   |
+------------------------+---------------------------------------+
| JVM heap / GC pause    | Per-instance CPU / threads            |
+----------------------------------------------------------------+

A few frequently used PromQL queries:

# Login failure rate over a 5-minute window (%)
100 *
  sum(rate(keycloak_user_events_total{event="login_error"}[5m]))
/
  sum(rate(keycloak_user_events_total{event=~"login|login_error"}[5m]))

# p99 latency of the token endpoint
histogram_quantile(0.99,
  sum by (le) (rate(http_server_requests_seconds_bucket{uri=~".*token.*"}[5m])))

# Average DB pool blocking time (ms)
avg(agroal_blocking_time_average_milliseconds)
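The failure-rate query is plain arithmetic over two counters, and the one subtle case is a window with no login traffic at all. In PromQL the division then yields NaN, which never satisfies a comparison such as > 20. The sketch below mirrors the formula and makes that case explicit by returning 0; the class name is hypothetical.

```java
final class FailureRate {
  /** Percentage of failed logins in a window; 0 when there was no login traffic. */
  static double percent(long successes, long failures) {
    if (successes < 0 || failures < 0) {
      throw new IllegalArgumentException("counts must be non-negative");
    }
    long total = successes + failures;
    return total == 0 ? 0.0 : 100.0 * failures / total;
  }

  public static void main(String[] args) {
    System.out.println(percent(950, 50)); // 5.0  -> at the lower dashboard threshold
    System.out.println(percent(80, 20));  // 20.0 -> at the upper dashboard threshold
    System.out.println(percent(0, 0));    // 0.0  -> quiet window, no alert
  }
}
```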

OpenTelemetry Tracing

Distributed tracing is the answer to "login is slow but we do not know where." Keycloak 26.x supports OpenTelemetry tracing natively.

bin/kc.sh start \
  --tracing-enabled=true \
  --tracing-endpoint=http://otel-collector:4317 \
  --tracing-protocol=grpc \
  --tracing-sampler-type=traceidratio \
  --tracing-sampler-ratio=0.05 \
  --tracing-service-name=keycloak-prod

  • Traces include spans for HTTP requests, DB queries, LDAP calls, and more, so you can see at a glance that "of the 2.5-second login, the LDAP bind took 2.1 seconds."
  • In production, start with a low sampling ratio (1–5%). A parent-based sampler type (for example parentbased_traceidratio) honors the sampling decision propagated from the gateway instead of deciding again.
  • Tail-based sampling at the Collector — "keep all slow requests and errors" — pairs especially well with authentication systems.
  • Combine with JSON logs so that trace_id is printed in log lines, and you complete the trinity in Grafana: drill down from metrics to traces to logs.
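Ratio-based sampling is deterministic per trace: the decision is derived from the trace ID, so every service that sees the same ID makes the same choice. The sketch below is similar in spirit to OpenTelemetry's trace-ID-ratio sampler but is not its exact algorithm; RatioSampler is a hypothetical class, and it assumes a 32-character hex trace ID.

```java
final class RatioSampler {
  private final double ratio;
  private final long upperBound;

  RatioSampler(double ratio) {
    if (!(ratio >= 0.0 && ratio <= 1.0)) {
      throw new IllegalArgumentException("ratio must be in [0, 1]");
    }
    this.ratio = ratio;
    // Map the ratio onto the range of non-negative longs.
    this.upperBound = (long) (ratio * Long.MAX_VALUE);
  }

  /** Decides from the low 64 bits of the trace ID, so the result is stable per trace. */
  boolean shouldSample(String traceId) {
    if (ratio >= 1.0) {
      return true;
    }
    long low64 = Long.parseUnsignedLong(traceId.substring(16), 16);
    long value = low64 & Long.MAX_VALUE; // clear the sign bit: 0..Long.MAX_VALUE
    return value < upperBound;
  }
}
```

With this model, `new RatioSampler(0.05)` corresponds to the `--tracing-sampler-ratio=0.05` setting: roughly one trace in twenty is kept, assuming trace IDs are uniformly random.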

Anomaly Detection — Brute Force and Credential Stuffing

Built-in brute force detection

A realm-level built-in defense:

./kcadm.sh update realms/myrealm \
  -s bruteForceProtected=true \
  -s failureFactor=5 \
  -s waitIncrementSeconds=60 \
  -s maxFailureWaitSeconds=900 \
  -s maxDeltaTimeSeconds=43200 \
  -s permanentLockout=false

When the failure count crosses the threshold, the account is locked with increasing wait times. But this alone is not enough — the built-in feature is a per-account defense.
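To reason about what these settings mean for a user, it helps to model the wait time. The sketch below is a simplified model under stated assumptions, not Keycloak's implementation: it assumes the wait grows by waitIncrementSeconds for every full multiple of failureFactor and is capped at maxFailureWaitSeconds. The real computation depends on the Keycloak version and the configured strategy, and the failure counter is also reset after maxDeltaTimeSeconds.

```java
final class LockoutWait {
  /** Simplified model: one wait increment per full multiple of failureFactor, capped. */
  static int waitSeconds(int failures, int failureFactor,
                         int waitIncrementSeconds, int maxFailureWaitSeconds) {
    if (failureFactor <= 0) {
      throw new IllegalArgumentException("failureFactor must be positive");
    }
    int multiples = failures / failureFactor; // integer division
    long wait = (long) multiples * waitIncrementSeconds;
    return (int) Math.min(wait, maxFailureWaitSeconds);
  }

  public static void main(String[] args) {
    // Values from the kcadm.sh example: failureFactor=5, increment=60 s, cap=900 s
    System.out.println(waitSeconds(4, 5, 60, 900));   // 0   -> below the threshold
    System.out.println(waitSeconds(5, 5, 60, 900));   // 60
    System.out.println(waitSeconds(10, 5, 60, 900));  // 120
    System.out.println(waitSeconds(100, 5, 60, 900)); // 900 -> capped
  }
}
```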

Credential stuffing has a different pattern

| Attack | Pattern | Built-in defense effectiveness | Additional response |
|---|---|---|---|
| Brute force | Many passwords against one account | Effective (account lockout) | Built-in feature + alerting |
| Credential stuffing | 1–2 attempts each against many accounts | Nearly powerless | IP/ASN-level analysis, WAF, bot detection |
| Password spraying | Same password against many accounts | Nearly powerless | Attempted-password pattern analysis, MFA |

Credential stuffing produces only 1–2 failures per account, so it never trips account lockout. The clue for detection is the global pattern:

  • A surge in the overall LOGIN_ERROR rate (especially user_not_found and invalid_user_credentials rising together)
  • Many distinct usernames attempted from a single IP/range (aggregate by ipAddress in Loki)
  • Unusual geography/ASN, abnormal User-Agent distribution
  • Uniform request intervals during early-morning hours (a bot signature)

If you are streaming events to Kafka, stream processing (Flink/ksqlDB) can evaluate rules in real time such as "more than M distinct usernames from one IP within N minutes." The fundamental countermeasures remain expanding MFA/passkeys (the passkeys login-form integration in 26.6 has lowered the adoption barrier significantly) and breached-password blocking policies.
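The "more than M distinct usernames from one IP within N minutes" rule can be sketched in plain Java as a per-IP sliding window. This is an in-memory illustration of the logic only, not a Flink or ksqlDB job: StuffingDetector is a hypothetical class, it assumes events arrive roughly in time order, and it assumes the attempted username is available on the event (for example from the event details).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class StuffingDetector {
  private static final class Attempt {
    final long timeMillis;
    final String username;

    Attempt(long timeMillis, String username) {
      this.timeMillis = timeMillis;
      this.username = username;
    }
  }

  private final long windowMillis;
  private final int maxDistinctUsernames;
  private final Map<String, Deque<Attempt>> attemptsByIp = new HashMap<>();

  StuffingDetector(long windowMillis, int maxDistinctUsernames) {
    this.windowMillis = windowMillis;
    this.maxDistinctUsernames = maxDistinctUsernames;
  }

  /** Records a failed login; returns true when the IP exceeds the distinct-username threshold. */
  boolean onLoginError(String ip, String username, long timeMillis) {
    Deque<Attempt> attempts = attemptsByIp.computeIfAbsent(ip, key -> new ArrayDeque<>());
    attempts.addLast(new Attempt(timeMillis, username));
    // Evict attempts that have fallen out of the window.
    while (!attempts.isEmpty() && attempts.peekFirst().timeMillis <= timeMillis - windowMillis) {
      attempts.removeFirst();
    }
    Set<String> distinct = new HashSet<>();
    for (Attempt attempt : attempts) {
      distinct.add(attempt.username);
    }
    return distinct.size() > maxDistinctUsernames;
  }
}
```

Repeated failures for the same username do not trip this rule, which is the point: that pattern is already covered by the built-in per-account lockout.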

Alerting Rule Examples (Prometheus Alertmanager)

groups:
  - name: keycloak-security
    rules:
      - alert: KeycloakLoginFailureRateHigh
        expr: |
          100 *
            sum(rate(keycloak_user_events_total{event="login_error"}[5m]))
          /
            sum(rate(keycloak_user_events_total{event=~"login|login_error"}[5m]))
          > 20
        for: 10m
        labels:
          severity: warning
          team: identity
        annotations:
          summary: "Login failure rate above 20% (sustained 10m)"
          description: "Possible attack or authentication-path outage. Check the security dashboard."

      - alert: KeycloakLoginErrorSpike
        expr: |
          sum(rate(keycloak_user_events_total{event="login_error"}[5m]))
          > 4 * sum(rate(keycloak_user_events_total{event="login_error"}[5m] offset 1d))
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Login failures 4x higher than the same time yesterday"

  - name: keycloak-health
    rules:
      - alert: KeycloakDbPoolExhausted
        expr: min(agroal_available_count) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DB connection pool exhausted — full login outage imminent"

      - alert: KeycloakTokenLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_server_requests_seconds_bucket{uri=~".*token.*"}[5m])))
          > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Token endpoint p99 above 1 second"

      - alert: KeycloakInstanceDown
        expr: up{job="keycloak"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Keycloak instance down"

The principle of alert design is to "separate the recipients of security alerts and availability alerts." A failure-rate surge should go to the security team channel, pool exhaustion to the platform on-call, and every alert should carry links to the dashboard and runbook.

Audit Compliance — the ISMS-P / SOC 2 Perspective

Your observability stack doubles as an audit-readiness asset. Mapping it to the items auditors repeatedly request:

| Audit requirement (common to ISMS-P / SOC 2) | Keycloak answer |
|---|---|
| Retain authentication success/failure records | Login events + external long-term storage (SIEM/object storage) |
| Trace administrator actions | Admin events with details enabled (the stored representation of each change) |
| Tamper-proof logs | Ship externally, then immutable storage (WORM, bucket lock) |
| Retention period policy | Short internal + 1+ year external (varies by regulation), documented |
| Periodic access reviews | Regular role/group membership extraction reports via the Admin API |
| Anomalous access monitoring | Failure-rate/geo/IP alert rules + response runbooks |
| Time synchronization | NTP, the precondition for trustworthy event timestamps |

A few practical tips:

  • What auditors demand is not "logs exist" but "evidence that you looked at the logs and acted." Build a workflow (ticket integration) that records alert fired → reviewed → acted.
  • Admin event representations may contain sensitive data. Define masking policies in the export pipeline, and verify with legal that personal-data retention rules do not conflict with log retention periods.
  • If you manage realm settings/policies as IaC (Terraform, keycloak-config-cli), a large part of the "change control" requirement is evidenced by Git history. Admin events then become the safety net that catches out-of-band changes made through the console.
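A masking step in the export pipeline can be as small as a recursive pass over the parsed representation. The sketch below is a minimal illustration: RepresentationMasker is a hypothetical class, the key list is an assumed deny-list that you would replace with your own policy, and only nested maps are handled (lists of objects are passed through unchanged).

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RepresentationMasker {
  // Hypothetical deny-list; derive the real one from your data-protection policy.
  private static final Set<String> SENSITIVE_KEYS =
      new HashSet<>(List.of("email", "firstName", "lastName", "credentials", "secret"));

  /** Returns a copy in which values of sensitive keys are replaced, recursing into nested maps. */
  @SuppressWarnings("unchecked")
  static Map<String, Object> mask(Map<String, Object> representation) {
    Map<String, Object> masked = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : representation.entrySet()) {
      Object value = entry.getValue();
      if (SENSITIVE_KEYS.contains(entry.getKey())) {
        masked.put(entry.getKey(), "***");
      } else if (value instanceof Map) {
        masked.put(entry.getKey(), mask((Map<String, Object>) value));
      } else {
        masked.put(entry.getKey(), value);
      }
    }
    return masked;
  }
}
```

Returning a copy rather than mutating the input keeps the original event intact for any consumer that is allowed to see the full representation.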

Summary of Operational Best Practices

  • Enable events, but keep DB retention short — externalize long-term storage via the SPI/log pipeline.
  • Listener implementations must never block the authentication path: asynchronous + exception isolation.
  • Make metrics-enabled plus event-metrics-user-enabled part of your default start options.
  • Split dashboards into security operations and system health; alerts on agroal pool metrics are mandatory.
  • Start tracing with low sampling and graduate to tail-based sampling.
  • Design per-account brute force defense and global-pattern stuffing detection as separate concerns.
  • Alerts need runbook links, separated security/availability recipients, and periodic alert-fatigue reviews.
  • Audit readiness means not just "collection" but "response evidence" — build the workflow and immutable storage.

Closing

Keycloak observability is complete when designed as four mutually reinforcing pillars: events (what happened), metrics (what state are we in now), traces (why is it slow), and logs (detailed context). With metrics and tracing built in since 26.x, the barrier to entry has dropped sharply; what remains is organizational discipline — retention policies, alert design, response runbooks.

Monitoring an authentication system sits at the intersection of security detection and compliance, beyond plain infrastructure operations. Stack up the components in this article one by one, and you can reach an operational posture where you "know about login outages before your users do" — and "know about attacks before the attackers succeed."
