Lock-Free Concurrency Programming Complete Guide 2025: Atomics, Memory Model, CAS, Lock-Free Data Structures
Author: Youngju Kim (@fjvbn20031)
TL;DR
- Lock-Free essence: concurrency without locks, using atomic instructions like CAS (Compare-And-Swap).
- Memory model: x86 provides strong ordering; ARM/RISC-V are weak. Understanding Acquire/Release semantics is essential.
- Pros: no deadlocks, no priority inversion, sometimes faster than locks.
- Cons: extremely hard, debugging nightmare, tiny mistakes corrupt data.
- When to use: simple tasks like counters, sequence IDs, log buffers. For complex structures, use battle-tested libraries.
1. What is Lock-Free?
1.1 Definition
Lock-Free algorithm: when multiple threads operate concurrently, at least one thread is guaranteed to make progress.
Hierarchy:
| Level | Guarantee |
|---|---|
| Wait-Free | All threads complete in bounded time (strongest) |
| Lock-Free | At least one thread makes progress in a finite number of steps (system-wide progress) |
| Obstruction-Free | Progress possible if other threads stop |
| Blocking | Threads may wait indefinitely (uses locks) |
1.2 Why Lock-Free?
Problems with locks:
- Deadlock — two threads waiting for each other's locks.
- Priority inversion — low-priority thread holds lock while high-priority waits.
- Contention — many threads competing for the same lock serializes execution.
- Blocking — lock holder page-faults, GCs, or is preempted, stalling everyone.
Lock-Free promises:
- Deadlock-free (no locks).
- Avoids syscalls (userspace only).
- Cache-friendly (CPU atomic instructions).
1.3 Real Trade-offs
Lock-Free is not a silver bullet:
- Implementation complexity: 10-100x harder than lock-based.
- Verification difficulty: race conditions are extremely subtle.
- Memory reclamation: safe deletion is hard (Hazard Pointers, Epochs).
- Under contention: can be slower than locks when heavily contended.
Use Lock-Free only for simple operations. For complex structures, rely on vetted libraries like crossbeam (Rust) or Intel TBB's concurrent_hash_map (C++).
2. CPU Memory Model
2.1 Why the Memory Model Matters
// Thread 1
x = 1;
y = 2;
// Thread 2
print(y);
print(x);
Intuitively, if print(y) shows 2, then print(x) should show 1. But CPUs and compilers can reorder instructions, so this is not always true.
2.2 x86 vs ARM Memory Models
| Reordering | x86 | ARM | RISC-V |
|---|---|---|---|
| Store-Store | No | Yes | Yes |
| Load-Load | No | Yes | Yes |
| Load-Store | No | Yes | Yes |
| Store-Load | Yes | Yes | Yes |
| Model | TSO (Total Store Order) | Weak | Weak |
Code that works on x86 may fail on ARM. With iPhones, M1/M2 Macs, and AWS Graviton, understanding the ARM memory model is essential.
2.3 Memory Ordering
C++ / Rust std::memory_order:
#include <atomic>
std::atomic<int> x{0};
// Strongest (Sequential Consistency, default)
x.store(1, std::memory_order_seq_cst);
// Release: all prior memory ops complete before this store is visible
x.store(1, std::memory_order_release);
// Acquire: all subsequent memory ops happen after this load
int v = x.load(std::memory_order_acquire);
// Relaxed: atomicity only, no ordering (fastest)
x.store(1, std::memory_order_relaxed);
Release-Acquire pattern:
std::atomic<bool> ready{false};
int data = 0;
// Thread 1 (Producer)
data = 42;
ready.store(true, std::memory_order_release);
// Thread 2 (Consumer)
while (!ready.load(std::memory_order_acquire)) {}
assert(data == 42); // Guaranteed!
All writes before release are visible after acquire.
3. CAS — Compare-And-Swap
3.1 Principle
bool compare_and_swap(int* ptr, int expected, int new_value) {
    if (*ptr == expected) {
        *ptr = new_value;
        return true;
    }
    return false;
}
The entire operation is atomic. Directly supported in hardware by cmpxchg (x86) or load/store-exclusive pairs (ldrex/strex on 32-bit ARM, ldxr/stxr on AArch64).
3.2 CAS-based Counter
std::atomic<int> counter{0};
void increment() {
    int old_value = counter.load();
    int new_value;
    do {
        new_value = old_value + 1;
        // on failure, compare_exchange_weak writes the current value
        // back into old_value, so no explicit reload is needed
    } while (!counter.compare_exchange_weak(old_value, new_value));
}
Why weak? On LL/SC architectures (ARM), compare_exchange_weak may fail spuriously even when the value matches. Inside a retry loop that is harmless, and weak compiles to tighter code than strong.
3.3 std::atomic fetch_add
The code above is simpler with the standard library:
counter.fetch_add(1, std::memory_order_relaxed);
Internally uses LOCK XADD (x86) or LDADD (the ARMv8.1 atomics extension).
3.4 Rust Atomics
use std::sync::atomic::{AtomicUsize, Ordering};
let counter = AtomicUsize::new(0);
// Atomic increment
counter.fetch_add(1, Ordering::Relaxed);
// CAS
let result = counter.compare_exchange(
    0,                 // expected
    1,                 // new
    Ordering::Acquire, // success ordering
    Ordering::Relaxed, // failure ordering
);
Rust's ownership rules catch data races on non-atomic data at compile time. Note the limit: choosing too weak an Ordering is still possible — the compiler checks race freedom, not ordering logic.
4. Lock-Free Data Structures
4.1 Treiber Stack — Simplest Lock-Free
template<typename T>
class LockFreeStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head{nullptr};
public:
    void push(T value) {
        Node* new_node = new Node{value, nullptr};
        Node* old_head;
        do {
            old_head = head.load();
            new_node->next = old_head;
        } while (!head.compare_exchange_weak(old_head, new_node));
    }
    bool pop(T& value) {
        Node* old_head;
        Node* new_head;
        do {
            old_head = head.load();
            if (!old_head) return false;
            new_head = old_head->next;
        } while (!head.compare_exchange_weak(old_head, new_head));
        value = old_head->value;
        delete old_head; // Unsafe! Other threads may still be reading
        return true;
    }
};
delete old_head is unsafe — another thread may be reading the same node.
4.2 ABA Problem
Initial: head -> A -> B -> C
1. Thread 1 reads old_head = A and computes new_head = B, then stalls.
2. Thread 2 pops A (delete A), pops B (delete B), then pushes a new node that happens to reuse A's address: head -> A -> C.
3. Thread 1 resumes: CAS(expected = A, new = B) succeeds — head still "equals" A — even though everything in between changed.
4. Result: head -> B, a dangling pointer into already-freed memory.
Solution 1: Tagged Pointers (add version number)
struct TaggedPtr {
    Node* ptr;
    uint64_t tag; // increment each time
};
std::atomic<TaggedPtr> head;
CAS compares the (ptr, tag) pair; the same ptr with a different tag fails. This requires a double-width CAS (e.g. cmpxchg16b on x86-64) or packing the tag into unused pointer bits.
Solution 2: Hazard Pointers
Each thread announces pointers it's currently using; others won't free them.
Solution 3: Epoch-Based Reclamation
A global epoch counter. Defer free until all threads cross the same epoch. crossbeam-epoch (Rust) is the canonical implementation.
4.3 Michael-Scott Queue
The most famous Lock-Free queue.
struct Node {
    T value;
    std::atomic<Node*> next{nullptr};
};
std::atomic<Node*> head;
std::atomic<Node*> tail;

void enqueue(T value) {
    Node* new_node = new Node{value};
    Node* old_tail;
    Node* old_next;
    while (true) {
        old_tail = tail.load();
        old_next = old_tail->next.load();
        if (old_tail == tail.load()) { // consistency check
            if (old_next == nullptr) {
                // try CAS to set next
                if (old_tail->next.compare_exchange_weak(old_next, new_node)) {
                    break;
                }
            } else {
                // tail is lagging, help it forward
                tail.compare_exchange_weak(old_tail, old_next);
            }
        }
    }
    // Advance tail (OK if it fails — another thread will help)
    tail.compare_exchange_weak(old_tail, new_node);
}
The helping trick: if tail is lagging, another thread helps advance it. That's what makes it Lock-Free — one stalled thread doesn't block others.
5. RCU — Read-Copy-Update
5.1 Linux Kernel's Secret Weapon
RCU makes reads lock-free:
- Read: no locks, just read the pointer.
- Update: copy data, modify, atomically swap pointer.
- Reclaim: wait for all existing readers to finish, then free old data.
// Pseudo-code
// Read side: no locks, just a protected pointer read
struct config* cfg = rcu_dereference(global_config);
int value = cfg->some_value;

// Update side: copy, modify, publish, then reclaim
struct config* old_config = global_config;
struct config* new_config = malloc(sizeof(*new_config));
memcpy(new_config, old_config, sizeof(*new_config));
new_config->some_value = 42;
rcu_assign_pointer(global_config, new_config);
synchronize_rcu(); // wait for all pre-existing readers
free(old_config);
Why fast? Reads have no atomic instructions — just normal memory reads. Cache-friendly.
When to use: when reads vastly outnumber writes (1000:1+). Used by Linux kernel routing tables, configs, etc.
5.2 Userspace RCU
- liburcu (userspace RCU library)
- crossbeam-epoch (Rust)
- folly::rcu (Facebook C++)
6. Language Comparison
6.1 Java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
// Simple counter (low contention)
AtomicLong counter = new AtomicLong(0);
counter.incrementAndGet();
// High-contention counter (Java 8+)
LongAdder adder = new LongAdder();
adder.increment(); // sharded cells avoid contention
long value = adder.sum(); // aggregate
LongAdder spreads writes across shards to reduce contention; reads sum all shards. For counters, LongAdder is almost always faster.
java.util.concurrent.atomic package:
- AtomicInteger, AtomicLong, AtomicReference
- AtomicIntegerArray, AtomicLongArray
- LongAdder, LongAccumulator (high perf)
- ConcurrentHashMap (Lock-Free reads)
6.2 C++
#include <atomic>
std::atomic<int> counter{0};
counter.fetch_add(1, std::memory_order_relaxed);
// Must specify memory order precisely
std::atomic<bool> flag{false};
flag.store(true, std::memory_order_release);
bool v = flag.load(std::memory_order_acquire);
C++20 additions:
- std::atomic_ref — atomic operations on non-atomic objects.
- std::atomic<T>::wait / notify_one / notify_all — efficient alternative to spin-waiting.
6.3 Rust
use std::sync::atomic::{AtomicUsize, AtomicBool, Ordering};
static COUNTER: AtomicUsize = AtomicUsize::new(0);
COUNTER.fetch_add(1, Ordering::Relaxed);
Rust's strength: compile-time data race detection.
crossbeam library:
- crossbeam-channel — Lock-Free MPMC channel
- crossbeam-epoch — epoch-based reclamation
- crossbeam-queue — Lock-Free queue
- crossbeam-skiplist — concurrent skip list
6.4 Go
import "sync/atomic"
var counter int64
atomic.AddInt64(&counter, 1)
value := atomic.LoadInt64(&counter)
Go's memory model is relatively simple — sync.Mutex or channels are recommended for most cases. Use atomics only when performance is critical.
7. Practical Examples
7.1 Lock-Free Logger
#include <array>
#include <atomic>
#include <cstddef>
#include <string>

class LockFreeLogger {
    static constexpr size_t SIZE = 1024;
    std::atomic<size_t> head{0};
    std::array<std::string, SIZE> buffer;
public:
    void log(const std::string& msg) {
        size_t idx = head.fetch_add(1, std::memory_order_relaxed) % SIZE;
        buffer[idx] = msg; // Note: once the index wraps, concurrent writes can hit the same slot
    }
};
Simplified example — real systems need ring buffer plus more careful synchronization.
7.2 Sequence ID Generator
use std::sync::atomic::{AtomicU64, Ordering};
struct IdGenerator {
    next_id: AtomicU64,
}

impl IdGenerator {
    fn next(&self) -> u64 {
        self.next_id.fetch_add(1, Ordering::Relaxed)
    }
}
A single fetch_add handles millions of IDs per second; under contention it can be orders of magnitude faster than taking a mutex per ID.
7.3 Spin Lock (educational)
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // busy wait
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};
Use only when hold time is very short (tens of ns). Otherwise, OS mutex is more efficient.
Improvement: the x86 PAUSE instruction (_mm_pause, from <immintrin.h>) reduces power draw and pipeline flushes while spinning:
while (flag.test_and_set(std::memory_order_acquire)) {
    _mm_pause(); // CPU hint: spin-wait loop
}
8. Performance Measurement and Pitfalls
8.1 False Sharing
struct Counters {
    std::atomic<int> a;
    std::atomic<int> b; // same cache line!
};
If a and b share a 64-byte cache line, one core modifying a invalidates another core's cached b. Performance can drop 10x+.
Fix: pad to separate lines.
struct Counters {
    alignas(64) std::atomic<int> a;
    alignas(64) std::atomic<int> b;
};
8.2 Contention vs Throughput
Benchmark (simple counter, 4 cores):
| Method | 1 thread | 4 threads | 16 threads |
|---|---|---|---|
| Mutex | 30M ops/s | 8M ops/s | 2M ops/s |
| atomic CAS | 100M ops/s | 25M ops/s | 6M ops/s |
| LongAdder | 80M ops/s | 200M ops/s | 800M ops/s |
Under heavy contention, even simple atomics hit limits. Sharded counters like LongAdder win decisively.
8.3 Measurement Tools
- Linux: perf stat -e cache-misses, perf c2c (false sharing detection)
- Intel VTune — microarchitectural analysis
- Java: JMH (Java Microbenchmark Harness)
- Rust: criterion
- C++: Google Benchmark
9. Battle-Tested Libraries
Almost never write your own. Use vetted libraries:
| Language | Library | Provides |
|---|---|---|
| Rust | crossbeam | Lock-Free queue, channel, atomic, epoch |
| Rust | dashmap | Lock-Free HashMap |
| Rust | parking_lot | Fast mutex (alternative) |
| C++ | folly (Facebook) | Various Lock-Free structures |
| C++ | boost::lockfree | queue, stack, spsc_queue |
| C++ | concurrencpp | coroutines + concurrency |
| Java | j.u.c.atomic, j.u.c.ConcurrentHashMap | standard |
| Java | JCTools | high-performance queues |
| Go | sync/atomic, sync.Map | standard |
Quiz
1. What's the difference between Lock-Free and Wait-Free?
Lock-Free guarantees that at least one thread makes progress; other threads may starve. Wait-Free guarantees every thread completes in bounded time. Wait-Free is stronger but much harder to implement. Most practical algorithms are Lock-Free.
2. What is the ABA problem and how is it solved?
A value changes A to B to A; CAS sees the current value matches the expected value (A) and succeeds — but state in between changed. Solutions: (1) Tagged Pointers — add a version number; (2) Hazard Pointers — announce in-use pointers; (3) Epoch-Based Reclamation — free only at safe epochs.
3. How do x86 and ARM memory models differ?
x86 uses Total Store Order (TSO): no Load-Load, Store-Store, or Load-Store reordering (only Store-Load). ARM has a Weak Memory Model: almost all reorderings allowed. Code working on x86 may fail on ARM. In the iPhone, M1/M2 Mac, and AWS Graviton era, understanding Acquire/Release semantics is essential.
4. What is false sharing and how do you prevent it?
Two atomic variables on the same 64-byte cache line cause invalidations across cores when one is modified. Performance drops 10x+. Fix: align to cache line boundaries with alignas(64) (C++) or #[repr(align(64))] (Rust), or add padding. Detect with perf c2c.
5. When should you avoid Lock-Free?
(1) When a simple lock suffices — low-contention mutex may be faster; (2) Complex data structures — use vetted libraries like crossbeam or folly instead of rolling your own; (3) Maintenance cost — subtle bugs if the team doesn't understand the memory model; (4) Debuggability — Lock-Free bugs are extremely hard to reproduce and debug. Use it yourself only for simple counters, sequence IDs, and flags.
References
- The Art of Multiprocessor Programming — Maurice Herlihy
- C++ Concurrency in Action — Anthony Williams
- Memory Models for C/C++ Programmers — Russ Cox
- Preshing on Programming — Lock-Free series
- crossbeam
- folly
- Linux Kernel RCU Documentation
- Java Concurrency in Practice — Brian Goetz
- Mechanical Sympathy — Martin Thompson
- JCTools
- perf c2c