Distributed Lock Pattern Comparison: Redis Redlock vs ZooKeeper vs etcd — Consistency and Availability Trade-offs


Introduction

In distributed systems, data consistency breaks down when multiple processes access the same resource simultaneously. Operations requiring mutual exclusion, such as inventory deduction, payment processing, and file writes, must be performed by only one process at a time. On a single server, a mutex or semaphore solves this; across processes distributed over multiple servers, a Distributed Lock is needed.

Distributed lock use cases fall into two categories:

  • Efficiency: preventing duplicate execution of the same work. If the lock occasionally fails, some work is duplicated and cost is wasted, but no data is corrupted.
  • Correctness: preventing data corruption from concurrent access. If the lock fails, data is corrupted, so much stricter guarantees are required.

For efficiency purposes, a Redis single instance lock is sufficient. However, for correctness purposes, a Fencing Token must accompany the lock, and this difference is the core of the Redis Redlock debate.

Redis Single Instance Lock

The simplest distributed lock uses Redis's SET NX PX command. The NX (Not eXists) option sets the key only when it doesn't exist, and PX specifies expiration in milliseconds.

Basic Implementation

import redis
import uuid
import time


class RedisSimpleLock:
    """Redis single instance distributed lock"""

    def __init__(self, client: redis.Redis, resource: str, ttl_ms: int = 10000):
        self.client = client
        self.resource = resource
        self.ttl_ms = ttl_ms
        self.lock_value = str(uuid.uuid4())  # Ownership identifier

    def acquire(self, retry_count: int = 3, retry_delay_ms: int = 200) -> bool:
        """Attempt to acquire lock. Retry on failure."""
        for attempt in range(retry_count):
            result = self.client.set(
                self.resource,
                self.lock_value,
                nx=True,
                px=self.ttl_ms
            )
            if result:
                return True
            time.sleep(retry_delay_ms / 1000.0)
        return False

    def release(self) -> bool:
        """Release lock after ownership verification (using Lua script)"""
        lua_script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        else
            return 0
        end
        """
        result = self.client.eval(lua_script, 1, self.resource, self.lock_value)
        return result == 1


# Usage example
client = redis.Redis(host="localhost", port=6379)
lock = RedisSimpleLock(client, "order:12345:lock", ttl_ms=5000)

if lock.acquire():
    try:
        # Perform critical section work
        print("Lock acquired, performing work...")
    finally:
        lock.release()
else:
    print("Failed to acquire lock")

Safe Lock Release with Lua Script

When releasing a lock, ownership must be verified. If GET and DEL are executed as separate commands, another client could acquire the lock between them. Lua scripts execute atomically in Redis, preventing this issue.

-- safe_unlock.lua
-- Safe lock release after ownership verification
-- KEYS[1]: Lock key
-- ARGV[1]: Owner identification value
-- Returns: 1 (success), 0 (ownership mismatch)

if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end

The limitation of the single-instance approach is clear: if the Redis master fails, lock information is lost. Even with failover to a replica, because replication is asynchronous, the master can go down before the lock key is replicated, and two clients can then hold the same lock simultaneously.

Redlock Algorithm

Redis creator Salvatore Sanfilippo (antirez) proposed the Redlock algorithm to overcome single instance limitations. The core idea is to obtain majority consensus from N (typically 5) independent Redis master nodes.

Three-Step Algorithm

  1. Acquisition phase: Record the current time, then sequentially send SET NX PX commands to all N nodes. The timeout for each node is set much shorter than the total TTL.
  2. Validity check: If locks were acquired on a majority (N/2 + 1) or more nodes, and the total elapsed time is less than the TTL, the lock acquisition is successful. Valid remaining time is TTL - elapsed time.
  3. Release phase: Send release commands unconditionally to all N nodes. Release is sent even to nodes where acquisition failed to clean up partially set keys.

Redlock Python Implementation

import redis
import uuid
import time
from typing import List, Optional, Tuple


class Redlock:
    """Redlock distributed lock algorithm implementation"""

    CLOCK_DRIFT_FACTOR = 0.01  # Clock drift correction factor
    RETRY_DELAY_MS = 200
    RETRY_COUNT = 3

    def __init__(self, nodes: List[dict], ttl_ms: int = 10000):
        self.nodes = [
            redis.Redis(host=n["host"], port=n["port"], socket_timeout=0.1)
            for n in nodes
        ]
        self.quorum = len(self.nodes) // 2 + 1
        self.ttl_ms = ttl_ms

    def _acquire_single(self, client: redis.Redis, resource: str,
                        value: str) -> bool:
        try:
            return bool(client.set(resource, value, nx=True, px=self.ttl_ms))
        except redis.RedisError:
            return False

    def _release_single(self, client: redis.Redis, resource: str,
                        value: str) -> None:
        lua = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        else
            return 0
        end
        """
        try:
            client.eval(lua, 1, resource, value)
        except redis.RedisError:
            pass

    def acquire(self, resource: str) -> Optional[Tuple[str, float]]:
        """Acquire lock. Returns (lock_value, validity_time) on success"""
        for _ in range(self.RETRY_COUNT):
            lock_value = str(uuid.uuid4())
            start_time = time.monotonic()
            acquired_count = 0

            # Step 1: Attempt lock acquisition on all nodes
            for client in self.nodes:
                if self._acquire_single(client, resource, lock_value):
                    acquired_count += 1

            # Step 2: Validity check
            elapsed_ms = (time.monotonic() - start_time) * 1000
            drift = self.ttl_ms * self.CLOCK_DRIFT_FACTOR + 2
            validity_time = self.ttl_ms - elapsed_ms - drift

            if acquired_count >= self.quorum and validity_time > 0:
                return (lock_value, validity_time)

            # On failure, release on all nodes
            for client in self.nodes:
                self._release_single(client, resource, lock_value)

            time.sleep(self.RETRY_DELAY_MS / 1000.0)

        return None

    def release(self, resource: str, lock_value: str) -> None:
        """Release lock on all nodes"""
        for client in self.nodes:
            self._release_single(client, resource, lock_value)


# Usage example
nodes = [
    {"host": "redis1.example.com", "port": 6379},
    {"host": "redis2.example.com", "port": 6379},
    {"host": "redis3.example.com", "port": 6379},
    {"host": "redis4.example.com", "port": 6379},
    {"host": "redis5.example.com", "port": 6379},
]

redlock = Redlock(nodes, ttl_ms=10000)
result = redlock.acquire("payment:order:99999")

if result:
    lock_value, validity_ms = result
    try:
        print(f"Lock acquired. Valid for: {validity_ms:.0f}ms")
        # Perform critical section work
    finally:
        redlock.release("payment:order:99999", lock_value)

Using time.monotonic() is important here. System time (time.time()) can go backward due to NTP corrections, but a monotonic clock always moves forward. The CLOCK_DRIFT_FACTOR also compensates for clock drift between nodes.
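A tiny experiment makes the point concrete (the helper name is hypothetical, not part of the implementation above): elapsed time measured with the monotonic clock can never go negative, whereas a wall-clock measurement could be corrupted by an NTP step during the critical interval.

```python
import time


def elapsed_ms_monotonic(work_seconds: float) -> float:
    """Measure elapsed time with time.monotonic(), which only moves
    forward; NTP step corrections cannot make the result negative."""
    start = time.monotonic()
    time.sleep(work_seconds)  # stand-in for the lock acquisition round trips
    return (time.monotonic() - start) * 1000


elapsed = elapsed_ms_monotonic(0.05)
# time.sleep guarantees at least the requested duration, so elapsed >= ~50ms.
# Had we used time.time(), a clock step during the sleep could yield any value,
# and a validity_time computed from it could be wildly wrong.
print(f"elapsed: {elapsed:.1f}ms")
```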

Redlock Critique: Kleppmann vs Antirez Debate

In 2016, Martin Kleppmann identified fundamental problems with the Redlock algorithm in his article "How to do distributed locking." This debate became one of the most famous technical discussions in the distributed systems community.

Kleppmann's Core Critique

1. Timing Assumption Risks

Redlock relies on the timing assumption that processes complete quickly without interruption. But in reality:

  • Client A successfully acquires locks on 3 of 5 nodes
  • A GC (Garbage Collection) pause occurs, freezing the process for tens of seconds
  • Meanwhile, the TTL expires and the lock is released
  • Client B acquires the same lock and performs work
  • Client A returns from GC and (believing it still holds the lock) performs work
  • Both clients perform critical section work simultaneously, causing data corruption

2. Absence of Fencing Token

Kleppmann argued that safe distributed locks must have a Fencing Token. A Fencing Token is a monotonically increasing number transmitted along with resource access (e.g., database). When the resource side rejects requests with tokens lower than previously seen ones, delayed writes from expired lock holders can be safely blocked. Redlock lacks a mechanism to generate such Fencing Tokens.

3. Network Delay and Clock Jumps

If system clocks suddenly jump due to NTP synchronization failure or VM migration, TTL calculations become invalid. Redlock assumes that clocks between nodes are roughly synchronized, but this cannot be guaranteed in asynchronous distributed systems.

Antirez's Response

Salvatore Sanfilippo responded in his article "Is Redlock safe?" with the following counterarguments:

  • The GC pause scenario is not specific to Redlock but applies to all distributed lock systems
  • In reasonable operational environments, clock drift is limited and can be sufficiently corrected with CLOCK_DRIFT_FACTOR
  • Fencing Tokens require support from the resource side, and if the resource supports this, it could potentially handle concurrency control on its own

Comparison Table: Redlock Debate Summary

| Point | Kleppmann (Critique) | Antirez (Defense) |
|---|---|---|
| Timing assumptions | Timing assumptions are dangerous in async systems | Sufficiently valid in reasonable operational environments |
| GC pause | Process can be suspended during lock validity | A common problem for all distributed systems |
| Fencing Token | Redlock cannot generate them; essential | If resource supports them, the lock itself may be unnecessary |
| Clock sync | Clock jump risk during NTP failure | Addressable with drift correction factor |
| Recommended use | Efficiency only; unsuitable for correctness | Sufficiently safe for most practical scenarios |

In the author's view, standalone Redlock use is not recommended when correctness is important. It's safer to use it with storage that supports Fencing Tokens, or to choose consensus-based systems like ZooKeeper or etcd.

ZooKeeper Distributed Lock

Apache ZooKeeper is a dedicated service designed for distributed system coordination. Through the Zab (ZooKeeper Atomic Broadcast) protocol, it guarantees Linearizability and provides built-in primitives needed for distributed lock implementation.

Ephemeral Sequential Node Pattern

ZooKeeper's distributed lock uses Ephemeral Sequential Znodes:

  1. The client creates an ephemeral sequential node at the /locks/resource-name/lock- path.
  2. ZooKeeper automatically assigns a sequence number (e.g., lock-0000000001).
  3. If the created node has the smallest number, the lock is acquired.
  4. Otherwise, a Watch is set on the immediately preceding node and the client waits.
  5. When the client session ends, the ephemeral node is automatically deleted, releasing the lock.

Setting a Watch only on the immediately preceding node is key. If all waiters Watch the smallest node, a Herd Effect occurs when the lock is released, as notifications are sent to all waiters simultaneously.
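The decision at the heart of steps 3-4 is pure logic and can be sketched without a ZooKeeper connection (the function name is hypothetical): given the children of the lock directory and our own node, either we hold the lock or there is exactly one predecessor to watch.

```python
from typing import List, Optional


def node_to_watch(children: List[str], my_node: str) -> Optional[str]:
    """Decide lock state from the lock directory's children.

    Returns None if my_node has the smallest sequence number (lock
    acquired); otherwise returns the immediately preceding node, the
    only node to set a Watch on, which avoids the Herd Effect.
    """
    # Sequence suffixes are zero-padded, so lexicographic sort == numeric sort
    ordered = sorted(children)
    index = ordered.index(my_node)
    if index == 0:
        return None  # smallest number: the lock is ours
    return ordered[index - 1]  # watch only the immediate predecessor


# Example: three waiters in the queue
children = ["lock-0000000003", "lock-0000000001", "lock-0000000002"]
print(node_to_watch(children, "lock-0000000001"))  # None -> holds the lock
print(node_to_watch(children, "lock-0000000003"))  # lock-0000000002
```

When the watched predecessor disappears, the client re-checks the children rather than assuming it now holds the lock, since the predecessor may have crashed without ever holding it.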

ZooKeeper Distributed Lock Python Implementation

from kazoo.client import KazooClient
from kazoo.recipe.lock import Lock
import logging

logging.basicConfig(level=logging.INFO)


class ZooKeeperDistributedLock:
    """ZooKeeper-based distributed lock (using Kazoo library)"""

    def __init__(self, hosts: str, lock_path: str):
        self.zk = KazooClient(hosts=hosts)
        self.zk.start()
        self.lock = Lock(self.zk, lock_path)
        self.lock_path = lock_path

    def acquire(self, timeout: float = 30.0) -> bool:
        """Acquire lock. Returns False if not acquired within timeout seconds"""
        try:
            return self.lock.acquire(timeout=timeout)
        except Exception as e:
            logging.error(f"Lock acquisition failed: {e}")
            return False

    def release(self) -> None:
        """Release lock"""
        try:
            self.lock.release()
        except Exception as e:
            logging.error(f"Lock release failed: {e}")

    def get_fencing_token(self) -> int:
        """Use the acquired lock znode's creation zxid as Fencing Token.

        The token must come from the znode created for this acquisition
        (kazoo's Lock stores its name in self.lock.node); the parent
        path's czxid never changes between acquisitions.
        """
        data, stat = self.zk.get(f"{self.lock_path}/{self.lock.node}")
        return stat.czxid  # Creation transaction ID (monotonically increasing)

    def close(self) -> None:
        self.zk.stop()
        self.zk.close()


# Usage example
zk_lock = ZooKeeperDistributedLock(
    hosts="zk1:2181,zk2:2181,zk3:2181",
    lock_path="/locks/payment/order-99999"
)

if zk_lock.acquire(timeout=10.0):
    try:
        fencing_token = zk_lock.get_fencing_token()
        logging.info(f"Lock acquired, fencing token: {fencing_token}")
        # Write to storage with fencing_token
        # storage.write(data, fencing_token=fencing_token)
    finally:
        zk_lock.release()

zk_lock.close()

ZooKeeper's strength is that zxid (ZooKeeper Transaction ID) can be used as a Fencing Token. Since zxid is globally monotonically increasing, if the storage side rejects writes with previous zxids, delayed writes from expired lock holders can be safely blocked.

Read-Write Lock Recipe

ZooKeeper supports not only exclusive locks but also Read-Write Locks. Read lock nodes use the read- prefix, and write lock nodes use the write- prefix. A read lock can be acquired when there are no write nodes ahead, and a write lock can only be acquired when it has the smallest number. This allows higher read concurrency while ensuring write mutual exclusion.
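The recipe's rules reduce to a small decision function. The sketch below (hypothetical name, assuming the zero-padded node naming described above) applies them to the lock directory's children:

```python
from typing import List


def can_acquire_rw(children: List[str], my_node: str) -> bool:
    """Apply the Read-Write Lock recipe's rules.

    A write lock is granted only to the node with the smallest sequence
    number; a read lock is granted when no write- node precedes it.
    """
    # Zero-padded sequence suffixes make lexicographic order == numeric order
    ordered = sorted(children, key=lambda n: n.rsplit("-", 1)[-1])
    index = ordered.index(my_node)
    if my_node.startswith("write-"):
        return index == 0
    return not any(n.startswith("write-") for n in ordered[:index])


queue = ["read-0000000001", "read-0000000002",
         "write-0000000003", "read-0000000004"]
print(can_acquire_rw(queue, "read-0000000002"))   # True: no writer ahead
print(can_acquire_rw(queue, "read-0000000004"))   # False: a writer is ahead
print(can_acquire_rw(queue, "write-0000000003"))  # False: not the smallest
```

Note how both readers ahead of the writer proceed concurrently, while the reader behind it must wait for the writer to finish.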

etcd Distributed Lock

etcd is a distributed key-value store widely known as the state store for Kubernetes. Based on the Raft consensus algorithm, it guarantees Strong Consistency and provides Lease and Revision mechanisms suitable for distributed lock implementation.

Lease-based TTL Management

etcd's Lease is a temporary token with TTL. When a key-value pair is attached to a Lease, the key is also deleted when the Lease expires. The client periodically calls KeepAlive to renew the Lease, and if the client fails, renewal stops and the lock is automatically released.

Automatic Fencing Token Generation via Revision Numbers

Every key change in etcd is assigned a globally monotonically increasing Revision number. The ability to directly use this Revision as a Fencing Token is a major advantage of etcd distributed locks. No separate token generation mechanism is needed.

etcd Distributed Lock Python Implementation

import etcd3
import threading
import logging
from typing import Optional, Tuple

logging.basicConfig(level=logging.INFO)


class EtcdDistributedLock:
    """etcd Lease-based distributed lock"""

    def __init__(self, host: str = "localhost", port: int = 2379,
                 ttl: int = 10):
        self.client = etcd3.client(host=host, port=port)
        self.ttl = ttl
        self.lease: Optional[etcd3.Lease] = None
        self._keepalive_thread: Optional[threading.Thread] = None
        self._stop_keepalive = threading.Event()

    def acquire(self, lock_key: str,
                timeout: float = 30.0) -> Optional[Tuple[int, int]]:
        """
        Acquire lock. Returns (revision, lease_id) on success.
        Revision can be used as Fencing Token.
        """
        self.lease = self.client.lease(self.ttl)
        self.lock_key = lock_key

        # Atomic lock acquisition via transaction
        # Create only when key doesn't exist (Compare-And-Swap)
        success, responses = self.client.transaction(
            compare=[
                self.client.transactions.create(lock_key) == 0
            ],
            success=[
                self.client.transactions.put(
                    lock_key, "locked", lease=self.lease
                )
            ],
            failure=[]
        )

        if success:
            # Use Revision as Fencing Token
            revision = responses[0].header.revision
            self._start_keepalive()
            logging.info(
                f"Lock acquired. revision(fencing token): {revision}"
            )
            return (revision, self.lease.id)

        logging.warning("Lock acquisition failed: already held by another client")
        self.lease.revoke()
        return None

    def _start_keepalive(self) -> None:
        """Start Lease auto-renewal thread"""
        self._stop_keepalive.clear()

        def keepalive_loop():
            while not self._stop_keepalive.is_set():
                try:
                    self.lease.refresh()
                except Exception as e:
                    logging.error(f"Lease renewal failed: {e}")
                    break
                self._stop_keepalive.wait(timeout=self.ttl / 3.0)

        self._keepalive_thread = threading.Thread(
            target=keepalive_loop, daemon=True
        )
        self._keepalive_thread.start()

    def release(self) -> None:
        """Release lock"""
        self._stop_keepalive.set()
        if self.lease:
            try:
                self.lease.revoke()
                logging.info("Lock released")
            except Exception as e:
                logging.error(f"Lock release failed: {e}")

    def close(self) -> None:
        self.release()
        self.client.close()


# Usage example
lock = EtcdDistributedLock(host="etcd1.example.com", ttl=15)
result = lock.acquire("locks/payment/order-99999")

if result:
    fencing_token, lease_id = result
    try:
        # Write to storage with fencing_token (revision)
        logging.info(f"Performing work, fencing token: {fencing_token}")
    finally:
        lock.release()

Jepsen Test Results and Cautions

In the Jepsen etcd 3.4.3 test, it was confirmed that etcd locks may not be safe in certain network partition scenarios. Specifically, if Lease renewal is delayed during leader changes, the client may still believe it holds the lock while the Lease has actually expired and another client has acquired it. Therefore, etcd locks must also be used with Fencing Tokens.

Three-Way Comparison Analysis

Core Comparison Table

| Item | Redis Redlock | ZooKeeper | etcd |
|---|---|---|---|
| Consensus algorithm | None (independent node majority) | Zab (Atomic Broadcast) | Raft |
| Consistency model | Eventual consistency approximation | Linearizable | Linearizable |
| Fencing Token | Not supported | zxid available | Revision available |
| Fault tolerance | Up to N/2 node failures | Up to N/2 node failures | Up to N/2 node failures |
| Lock release mechanism | TTL expiration | Session expiry + ephemeral node deletion | Lease expiration |
| Performance (acquisition latency) | Very fast (1-5ms) | Medium (5-20ms) | Medium (5-15ms) |
| Throughput | High (10K+ ops/s) | Medium (1-5K ops/s) | Medium (2-8K ops/s) |
| Operational complexity | Low | High (dedicated ensemble) | Medium (already present in K8s) |
| Watch/notification | Pub/Sub (no guarantee) | Watch (ordered) | Watch (Revision-based) |
| Client ecosystem | Very rich | Rich | Rich (especially Go ecosystem) |

Selection Guide by Use Case

Choose Redis Redlock when:

  • Preventing duplicate work for efficiency (cache warming, batch jobs, etc.)
  • Fast millisecond-level lock acquisition is needed
  • Redis is already in use and you don't want to add infrastructure
  • Occasional double execution is acceptable

Choose ZooKeeper when:

  • Correctness is paramount and Fencing Token is essential
  • Various coordination needs like leader election, config management
  • Already operating ZooKeeper-dependent systems (Hadoop, Kafka, etc.)
  • Complex lock patterns like Read-Write Lock are needed

Choose etcd when:

  • Already operating etcd in a Kubernetes environment
  • Want to easily leverage Fencing Token (Revision)
  • Prefer gRPC-based API and Go ecosystem
  • Want strong consistency with relatively low operational burden

Cost-Complexity Matrix

| Criteria | Redis Redlock | ZooKeeper | etcd |
|---|---|---|---|
| Infrastructure cost | Low (reuse Redis) | High (dedicated cluster) | Medium (can be included in K8s) |
| Learning curve | Low | High | Medium |
| Consistency guarantee | Weak | Strong | Strong |
| Debugging ease | High | Medium | Medium |
| Community support | Very active | Mature | Growing |

Failure Cases and Recovery Procedures

Case 1: Dual Lock Acquisition During GC Pause with TTL Expiration

Scenario:

Time ------>

Client A: [Lock acquired] --- [GC pause starts] -------------------- [GC resumes, attempts write]
                                              TTL expires
Client B:                          [Lock acquired] --- [Performing write] --- [Write complete]

Result: Both A and B write -> Data corruption

This scenario can occur not only with Redis Redlock but with all TTL-based locks. The defense pattern is to use Fencing Tokens:

import threading
import logging


class FencingAwareStorage:
    """Storage wrapper that validates Fencing Tokens"""

    def __init__(self):
        self.last_token = 0
        self._lock = threading.Lock()

    def write(self, data: dict, fencing_token: int) -> bool:
        """Allow writes only when fencing_token is greater than previous value"""
        with self._lock:
            if fencing_token <= self.last_token:
                logging.warning(
                    f"Rejected: token {fencing_token} <= "
                    f"last {self.last_token}"
                )
                return False
            self.last_token = fencing_token
            # Perform actual write
            self._do_write(data)
            logging.info(
                f"Write successful: token {fencing_token}"
            )
            return True

    def _do_write(self, data: dict) -> None:
        # Actual storage write logic
        pass

Case 2: Lock Loss Due to ZooKeeper Session Expiry

Scenario:

When the connection between ZooKeeper client and ensemble is severed due to network partition, after session timeout the ephemeral node is deleted and the lock is released. Meanwhile, the client may still be performing work.

Defense pattern:

import logging

from kazoo.client import KazooClient, KazooState

# pause_current_operations, abort_current_operations, and reacquire_lock
# below are application-specific hooks, not kazoo APIs.


def connection_listener(state):
    """ZooKeeper connection state monitoring"""
    if state == KazooState.SUSPENDED:
        # Connection suspended: pause ongoing work
        logging.warning("ZK connection suspended - pausing operations")
        pause_current_operations()
    elif state == KazooState.LOST:
        # Session expired: lock loss confirmed, stop work immediately
        logging.error("ZK session expired - lock lost, aborting operations")
        abort_current_operations()
    elif state == KazooState.CONNECTED:
        # Reconnection successful: attempt lock reacquisition
        logging.info("ZK reconnected - attempting lock reacquisition")
        reacquire_lock()


zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.add_listener(connection_listener)
zk.start()

Recovery Procedure Checklist

  1. Immediate response: Stop current work immediately upon detecting lock loss
  2. State verification: Check data consistency of partially completed work
  3. Idempotency design: Design work to be idempotent ensuring same results on retry
  4. Compensating transactions: Execute compensation logic to revert partially completed state
  5. Alert dispatch: Notify operations team of potential double execution
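Item 3 of the checklist (idempotency) can be sketched as a dedup guard keyed on an operation ID, so that a retry after a suspected double execution changes nothing. The class and names below are hypothetical illustrations, not part of any lock library:

```python
class IdempotentDeduction:
    """Make a balance deduction idempotent by remembering which
    operation IDs have already been applied; replaying the same
    operation is a no-op instead of a double charge."""

    def __init__(self, balance: int):
        self.balance = balance
        self._applied = set()  # operation IDs already applied

    def deduct(self, op_id: str, amount: int) -> int:
        if op_id in self._applied:
            return self.balance  # already applied: retry is a no-op
        self._applied.add(op_id)
        self.balance -= amount
        return self.balance


acct = IdempotentDeduction(balance=100)
print(acct.deduct("order-99999", 30))  # 70
print(acct.deduct("order-99999", 30))  # 70: redelivery has no effect
```

In production the applied-ID set would live in durable storage (e.g., a unique-key constraint), not process memory, so that idempotency survives restarts.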

Operational Considerations

Lock Granularity and Scope Design

Lock scope should be designed as narrow as possible. Broad-scoped locks increase contention and reduce throughput.

Bad:     /locks/orders           (Single lock for all orders)
Medium:  /locks/orders/user-123  (Per-user lock)
Good:    /locks/orders/99999     (Per-order lock)

Deadlock Detection and Timeout Strategy

In distributed environments, deadlocks can occur when two processes wait for each other's locks. Prevention strategies include:

  • Fixed-order acquisition: When locks on multiple resources are needed, always acquire in the same order
  • Timeout setting: Set timeouts on all lock acquisitions to prevent infinite waiting
  • Lock hierarchies: Acquire locks from parent resources to child resources in order
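The fixed-order strategy above can be sketched generically; the acquire/release callbacks stand in for any of the lock clients in this article, and the in-memory lock service exists only for illustration:

```python
from typing import Callable, List


def acquire_in_order(resources: List[str],
                     acquire: Callable[[str], bool],
                     release: Callable[[str], None]) -> bool:
    """Acquire locks on several resources in one globally fixed (sorted)
    order; because every process sorts identically, no A-waits-for-B /
    B-waits-for-A cycle can form. On any failure, roll back held locks."""
    held: List[str] = []
    for resource in sorted(resources):  # identical order for every process
        if not acquire(resource):
            for r in reversed(held):  # release in reverse order
                release(r)
            return False
        held.append(resource)
    return True


# In-memory stand-in for a real lock service
held_keys = set()


def try_lock(key: str) -> bool:
    if key in held_keys:
        return False
    held_keys.add(key)
    return True


ok = acquire_in_order(["order:2", "order:1"], try_lock, held_keys.discard)
print(ok)  # True: both locks taken, always as order:1 then order:2
```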

Monitoring Metrics

Metrics that must be collected when operating distributed locks:

| Metric | Description | Danger Threshold |
|---|---|---|
| lock_acquisition_time_p99 | Lock acquisition latency P99 | Exceeds 50% of TTL |
| lock_contention_rate | Lock contention rate (fails/total) | Exceeds 30% |
| lock_hold_duration_p99 | Lock hold duration P99 | Exceeds 80% of TTL |
| lock_timeout_rate | Timeout occurrence rate | Exceeds 5% |
| fencing_token_reject_rate | Fencing Token rejection rate | Investigate immediately if > 0 |

If fencing_token_reject_rate is greater than 0, it means double execution has occurred, so this metric requires the most urgent response.

Conclusion

Distributed locks are not a simple API call but an architectural decision: you must understand the trade-off between consistency and availability and design for it deliberately.

  • For efficiency purposes, a Redis single instance lock is sufficient. Simple and fast.
  • For efficiency but concerned about Redis single point of failure, consider Redlock. However, consistency guarantees are limited.
  • When correctness is important, choose ZooKeeper or etcd and always use with Fencing Tokens.
  • Regardless of implementation, Fencing Token + idempotency design + monitoring are the three pillars of a complete distributed lock.

No perfect distributed lock exists. What matters is clearly defining the guarantee level your system requires, choosing the appropriate implementation, and having defense strategies for failure scenarios.
