Lock-Free Concurrency Programming Complete Guide 2025: Atomics, Memory Model, CAS, Lock-Free Data Structures
Author: Youngju Kim (@fjvbn20031)
TL;DR
- Lock-Free essence: concurrency without locks, using atomic instructions like CAS (Compare-And-Swap).
- Memory model: x86 provides strong ordering; ARM/RISC-V are weak. Understanding Acquire/Release semantics is essential.
- Pros: no deadlocks, no priority inversion, sometimes faster than locks.
- Cons: extremely hard, debugging nightmare, tiny mistakes corrupt data.
- When to use: simple tasks like counters, sequence IDs, log buffers. For complex structures, use battle-tested libraries.
1. What is Lock-Free?
1.1 Definition
Lock-Free algorithm: when multiple threads operate concurrently, at least one thread is guaranteed to make progress.
Hierarchy:
| Level | Guarantee |
|---|---|
| Wait-Free | All threads complete in bounded time (strongest) |
| Lock-Free | At least one thread makes progress in a finite number of steps (system-wide progress) |
| Obstruction-Free | Progress possible if other threads stop |
| Blocking | Threads may wait indefinitely (uses locks) |
1.2 Why Lock-Free?
Problems with locks:
- Deadlock — two threads waiting for each other's locks.
- Priority inversion — low-priority thread holds lock while high-priority waits.
- Contention — many threads competing for the same lock serializes execution.
- Blocking — lock holder page-faults, GCs, or is preempted, stalling everyone.
Lock-Free promises:
- Deadlock-free (no locks).
- Avoids syscalls (userspace only).
- Cache-friendly (CPU atomic instructions).
1.3 Real Trade-offs
Lock-Free is not a silver bullet:
- Implementation complexity: 10-100x harder than lock-based.
- Verification difficulty: race conditions are extremely subtle.
- Memory reclamation: safe deletion is hard (Hazard Pointers, Epochs).
- Under contention: can be slower than locks when heavily contended.
Use Lock-Free only for simple operations. For complex structures, rely on vetted libraries like crossbeam (Rust) or Intel TBB's concurrent_hash_map (C++).
2. CPU Memory Model
2.1 Why the Memory Model Matters
// Thread 1
x = 1;
y = 2;
// Thread 2
print(y);
print(x);
Intuitively, if print(y) shows 2, then print(x) should show 1. But CPUs and compilers can reorder instructions, so this is not always true.
2.2 x86 vs ARM Memory Models
| Reordering | x86 | ARM | RISC-V |
|---|---|---|---|
| Store-Store | No | Yes | Yes |
| Load-Load | No | Yes | Yes |
| Load-Store | No | Yes | Yes |
| Store-Load | Yes | Yes | Yes |
| Model | TSO (Total Store Order) | Weak | Weak |
Code that works on x86 may fail on ARM. With iPhones, M1/M2 Macs, and AWS Graviton, understanding the ARM memory model is essential.
2.3 Memory Ordering
C++ / Rust std::memory_order:
#include <atomic>
std::atomic<int> x{0};
// Strongest (Sequential Consistency, default)
x.store(1, std::memory_order_seq_cst);
// Release: all prior memory ops complete before this store is visible
x.store(1, std::memory_order_release);
// Acquire: all subsequent memory ops happen after this load
int v = x.load(std::memory_order_acquire);
// Relaxed: atomicity only, no ordering (fastest)
x.store(1, std::memory_order_relaxed);
Release-Acquire pattern:
std::atomic<bool> ready{false};
int data = 0;
// Thread 1 (Producer)
data = 42;
ready.store(true, std::memory_order_release);
// Thread 2 (Consumer)
while (!ready.load(std::memory_order_acquire)) {}
assert(data == 42); // Guaranteed!
All writes before release are visible after acquire.
3. CAS — Compare-And-Swap
3.1 Principle
bool compare_and_swap(int* ptr, int expected, int new_value) {
    if (*ptr == expected) {
        *ptr = new_value;
        return true;
    }
    return false;
}
The entire operation is atomic. Directly supported in hardware by cmpxchg (x86) or load/store-exclusive pairs (ldrex/strex on 32-bit ARM, ldxr/stxr on AArch64).
3.2 CAS-based Counter
std::atomic<int> counter{0};
void increment() {
    int old_value = counter.load();
    int new_value;
    do {
        new_value = old_value + 1;
        // on failure, compare_exchange_weak writes the current value
        // back into old_value, so no explicit reload is needed
    } while (!counter.compare_exchange_weak(old_value, new_value));
}
Why weak? On LL/SC architectures (ARM), compare_exchange_weak may fail spuriously even when the value matches. Inside a retry loop that is harmless, and weak compiles to tighter code than strong.
3.3 std::atomic fetch_add
The code above is simpler with the standard library:
counter.fetch_add(1, std::memory_order_relaxed);
Internally uses LOCK XADD (x86) or LDADD (the ARMv8.1 atomics extension).
3.4 Rust Atomics
use std::sync::atomic::{AtomicUsize, Ordering};
let counter = AtomicUsize::new(0);
// Atomic increment
counter.fetch_add(1, Ordering::Relaxed);
// CAS
let result = counter.compare_exchange(
    0,                 // expected
    1,                 // new
    Ordering::Acquire, // success ordering
    Ordering::Relaxed, // failure ordering
);
Rust's ownership rules catch data races on non-atomic data at compile time. Note the limit: choosing too weak an Ordering is still possible — the compiler checks race freedom, not ordering logic.
4. Lock-Free Data Structures
4.1 Treiber Stack — Simplest Lock-Free
template<typename T>
class LockFreeStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head{nullptr};
public:
    void push(T value) {
        Node* new_node = new Node{value, nullptr};
        Node* old_head;
        do {
            old_head = head.load();
            new_node->next = old_head;
        } while (!head.compare_exchange_weak(old_head, new_node));
    }
    bool pop(T& value) {
        Node* old_head;
        Node* new_head;
        do {
            old_head = head.load();
            if (!old_head) return false;
            new_head = old_head->next;
        } while (!head.compare_exchange_weak(old_head, new_head));
        value = old_head->value;
        delete old_head; // Unsafe! Other threads may still be reading
        return true;
    }
};
delete old_head is unsafe — another thread may be reading the same node.
4.2 ABA Problem
Initial: head -> A -> B -> C
1. Thread 1 reads old_head = A and computes new_head = B, then stalls.
2. Thread 2 pops A (delete A), pops B (delete B), then pushes a new node that happens to reuse A's address: head -> A -> C.
3. Thread 1 resumes: CAS(expected = A, new = B) succeeds — head still "equals" A — even though everything in between changed.
4. Result: head -> B, a dangling pointer into already-freed memory.
Solution 1: Tagged Pointers (add version number)
struct TaggedPtr {
    Node* ptr;
    uint64_t tag; // increment each time
};
std::atomic<TaggedPtr> head;
CAS compares the (ptr, tag) pair; the same ptr with a different tag fails. This requires a double-width CAS (e.g. cmpxchg16b on x86-64) or packing the tag into unused pointer bits.
Solution 2: Hazard Pointers
Each thread announces pointers it's currently using; others won't free them.
Solution 3: Epoch-Based Reclamation
A global epoch counter. Defer free until all threads cross the same epoch. crossbeam-epoch (Rust) is the canonical implementation.
4.3 Michael-Scott Queue
The most famous Lock-Free queue.
struct Node {
    T value;
    std::atomic<Node*> next{nullptr};
};
std::atomic<Node*> head;
std::atomic<Node*> tail;

void enqueue(T value) {
    Node* new_node = new Node{value};
    Node* old_tail;
    Node* old_next;
    while (true) {
        old_tail = tail.load();
        old_next = old_tail->next.load();
        if (old_tail == tail.load()) { // consistency check
            if (old_next == nullptr) {
                // try CAS to set next
                if (old_tail->next.compare_exchange_weak(old_next, new_node)) {
                    break;
                }
            } else {
                // tail is lagging, help it forward
                tail.compare_exchange_weak(old_tail, old_next);
            }
        }
    }
    // Advance tail (OK if it fails — another thread will help)
    tail.compare_exchange_weak(old_tail, new_node);
}
The helping trick: if tail is lagging, another thread helps advance it. That's what makes it Lock-Free — one stalled thread doesn't block others.
5. RCU — Read-Copy-Update
5.1 Linux Kernel's Secret Weapon
RCU makes reads lock-free:
- Read: no locks, just read the pointer.
- Update: copy data, modify, atomically swap pointer.
- Reclaim: wait for all existing readers to finish, then free old data.
// Pseudo-code
// Read side: no locks, just a protected pointer read
struct config* cfg = rcu_dereference(global_config);
int value = cfg->some_value;

// Update side: copy, modify, publish, then reclaim
struct config* old_config = global_config;
struct config* new_config = malloc(sizeof(*new_config));
memcpy(new_config, old_config, sizeof(*new_config));
new_config->some_value = 42;
rcu_assign_pointer(global_config, new_config);
synchronize_rcu(); // wait for all pre-existing readers
free(old_config);
Why fast? Reads have no atomic instructions — just normal memory reads. Cache-friendly.
When to use: when reads vastly outnumber writes (1000:1+). Used by Linux kernel routing tables, configs, etc.
5.2 Userspace RCU
- liburcu (userspace RCU library)
- crossbeam-epoch (Rust)
- folly::rcu (Facebook C++)
6. Language Comparison
6.1 Java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
// Simple counter (low contention)
AtomicLong counter = new AtomicLong(0);
counter.incrementAndGet();
// High-contention counter (Java 8+)
LongAdder adder = new LongAdder();
adder.increment(); // sharded cells avoid contention
long value = adder.sum(); // aggregate
LongAdder spreads writes across shards to reduce contention; reads sum all shards. For counters, LongAdder is almost always faster.
java.util.concurrent.atomic package:
- AtomicInteger, AtomicLong, AtomicReference
- AtomicIntegerArray, AtomicLongArray
- LongAdder, LongAccumulator (high perf)
- ConcurrentHashMap (Lock-Free reads)
6.2 C++
#include <atomic>
std::atomic<int> counter{0};
counter.fetch_add(1, std::memory_order_relaxed);
// Must specify memory order precisely
std::atomic<bool> flag{false};
flag.store(true, std::memory_order_release);
bool v = flag.load(std::memory_order_acquire);
C++20 additions:
- std::atomic_ref — atomic operations on non-atomic objects.
- std::atomic<T>::wait / notify_one / notify_all — efficient alternative to spin-waiting.
6.3 Rust
use std::sync::atomic::{AtomicUsize, AtomicBool, Ordering};
static COUNTER: AtomicUsize = AtomicUsize::new(0);
COUNTER.fetch_add(1, Ordering::Relaxed);
Rust's strength: compile-time data race detection.
crossbeam library:
- crossbeam-channel — Lock-Free MPMC channel
- crossbeam-epoch — epoch-based reclamation
- crossbeam-queue — Lock-Free queue
- crossbeam-skiplist — concurrent skip list
6.4 Go
import "sync/atomic"
var counter int64
atomic.AddInt64(&counter, 1)
value := atomic.LoadInt64(&counter)
Go's memory model is relatively simple — sync.Mutex or channels are recommended for most cases. Use atomics only when performance is critical.
7. Practical Examples
7.1 Lock-Free Logger
#include <array>
#include <atomic>
#include <cstddef>
#include <string>

class LockFreeLogger {
    static constexpr size_t SIZE = 1024;
    std::atomic<size_t> head{0};
    std::array<std::string, SIZE> buffer;
public:
    void log(const std::string& msg) {
        size_t idx = head.fetch_add(1, std::memory_order_relaxed) % SIZE;
        buffer[idx] = msg; // Note: once the index wraps, concurrent writes can hit the same slot
    }
};
Simplified example — real systems need ring buffer plus more careful synchronization.
7.2 Sequence ID Generator
use std::sync::atomic::{AtomicU64, Ordering};
struct IdGenerator {
    next_id: AtomicU64,
}

impl IdGenerator {
    fn next(&self) -> u64 {
        self.next_id.fetch_add(1, Ordering::Relaxed)
    }
}
A single fetch_add handles millions of IDs per second; under contention it can be orders of magnitude faster than taking a mutex per ID.
7.3 Spin Lock (educational)
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // busy wait
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};
Use only when hold time is very short (tens of ns). Otherwise, OS mutex is more efficient.
Improvement: the x86 PAUSE instruction (_mm_pause, from <immintrin.h>) reduces power draw and pipeline flushes while spinning:
while (flag.test_and_set(std::memory_order_acquire)) {
    _mm_pause(); // CPU hint: spin-wait loop
}
8. Performance Measurement and Pitfalls
8.1 False Sharing
struct Counters {
    std::atomic<int> a;
    std::atomic<int> b; // same cache line!
};
If a and b share a 64-byte cache line, one core modifying a invalidates another core's cached b. Performance can drop 10x+.
Fix: pad to separate lines.
struct Counters {
    alignas(64) std::atomic<int> a;
    alignas(64) std::atomic<int> b;
};
8.2 Contention vs Throughput
Benchmark (simple counter, 4 cores):
| Method | 1 thread | 4 threads | 16 threads |
|---|---|---|---|
| Mutex | 30M ops/s | 8M ops/s | 2M ops/s |
| atomic CAS | 100M ops/s | 25M ops/s | 6M ops/s |
| LongAdder | 80M ops/s | 200M ops/s | 800M ops/s |
Under heavy contention, even simple atomics hit limits. Sharded counters like LongAdder win decisively.
8.3 Measurement Tools
- Linux: perf stat -e cache-misses, perf c2c (false sharing detection)
- Intel VTune — microarchitectural analysis
- Java: JMH (Java Microbenchmark Harness)
- Rust: criterion
- C++: Google Benchmark
9. Battle-Tested Libraries
Almost never write your own. Use vetted libraries:
| Language | Library | Provides |
|---|---|---|
| Rust | crossbeam | Lock-Free queue, channel, atomic, epoch |
| Rust | dashmap | Lock-Free HashMap |
| Rust | parking_lot | Fast mutex (alternative) |
| C++ | folly (Facebook) | Various Lock-Free structures |
| C++ | boost::lockfree | queue, stack, spsc_queue |
| C++ | concurrencpp | coroutines + concurrency |
| Java | j.u.c.atomic, j.u.c.ConcurrentHashMap | standard |
| Java | JCTools | high-performance queues |
| Go | sync/atomic, sync.Map | standard |
Quiz
1. What's the difference between Lock-Free and Wait-Free?
Lock-Free guarantees that at least one thread makes progress; other threads may starve. Wait-Free guarantees every thread completes in bounded time. Wait-Free is stronger but much harder to implement. Most practical algorithms are Lock-Free.
2. What is the ABA problem and how is it solved?
A value changes A to B to A; CAS sees the current value matches the expected value (A) and succeeds — but state in between changed. Solutions: (1) Tagged Pointers — add a version number; (2) Hazard Pointers — announce in-use pointers; (3) Epoch-Based Reclamation — free only at safe epochs.
3. How do x86 and ARM memory models differ?
x86 uses Total Store Order (TSO): no Load-Load, Store-Store, or Load-Store reordering (only Store-Load). ARM has a Weak Memory Model: almost all reorderings allowed. Code working on x86 may fail on ARM. In the iPhone, M1/M2 Mac, and AWS Graviton era, understanding Acquire/Release semantics is essential.
4. What is false sharing and how do you prevent it?
Two atomic variables on the same 64-byte cache line cause invalidations across cores when one is modified. Performance drops 10x+. Fix: align to cache line boundaries with alignas(64) (C++) or #[repr(align(64))] (Rust), or add padding. Detect with perf c2c.
5. When should you avoid Lock-Free?
(1) When a simple lock suffices — low-contention mutex may be faster; (2) Complex data structures — use vetted libraries like crossbeam or folly instead of rolling your own; (3) Maintenance cost — subtle bugs if the team doesn't understand the memory model; (4) Debuggability — Lock-Free bugs are extremely hard to reproduce and debug. Use it yourself only for simple counters, sequence IDs, and flags.
References
- The Art of Multiprocessor Programming — Maurice Herlihy
- C++ Concurrency in Action — Anthony Williams
- Memory Models for C/C++ Programmers — Russ Cox
- Preshing on Programming — Lock-Free series
- crossbeam
- folly
- Linux Kernel RCU Documentation
- Java Concurrency in Practice — Brian Goetz
- Mechanical Sympathy — Martin Thompson
- JCTools
- perf c2c