Split View: Lock-Free 동시성 프로그래밍 완전 가이드 2025: Atomics, 메모리 모델, CAS, 무잠금 자료구조

Lock-Free 동시성 프로그래밍 완전 가이드 2025: Atomics, 메모리 모델, CAS, 무잠금 자료구조

TL;DR

Lock-Free의 핵심: 락 없이 동시성 보장. CAS(Compare-And-Swap) 같은 원자 명령 활용
메모리 모델: x86은 강한 순서 보장, ARM/RISC-V는 약함. Acquire/Release semantics 이해 필수
장점: 데드락 없음, 우선순위 역전 없음, 일부 시나리오에서 락보다 빠름
단점: 매우 어려움, 디버깅 악몽, 미세한 실수가 데이터 손상
언제 사용: 카운터, 시퀀스 ID, 로그 버퍼 등 단순 작업. 복잡한 자료구조는 검증된 라이브러리 사용

1. Lock-Free란 무엇인가?

1.1 정의

Lock-Free 알고리즘: 여러 스레드가 동시에 작업할 때, 적어도 하나의 스레드는 항상 진전(progress)을 보장하는 알고리즘.

계층 구조:

수준	보장
Wait-Free	모든 스레드가 유한 시간 내 완료 (가장 강력)
Lock-Free	적어도 하나의 스레드는 유한 시간 내 진전
Obstruction-Free	다른 스레드가 멈추면 진전 가능
Blocking	스레드가 무한 대기 가능 (락 사용)

1.2 왜 Lock-Free인가?

락의 문제점:

데드락 — 두 스레드가 서로의 락을 기다림
우선순위 역전 — 낮은 우선순위 스레드가 락 보유 → 높은 우선순위 스레드가 대기
컨텐션 — 많은 스레드가 같은 락 경쟁 → 직렬화
블로킹 — 락 보유자가 페이지 폴트, GC, 스케줄링 → 다른 스레드 모두 멈춤

Lock-Free의 약속:

데드락 불가능 (락이 없음)
시스템 호출 회피 (사용자 공간만)
캐시 친화적 (CPU atomic 명령)

1.3 현실적 트레이드오프

Lock-Free는 만능이 아닙니다:

구현 복잡도: 락 기반의 10-100배
검증 어려움: 인종 조건이 매우 미묘
메모리 회수: 안전한 객체 삭제가 어려움 (Hazard Pointer, Epoch)
경합 시 성능: 무거운 컨텐션에서 락보다 느릴 수 있음

조언: 단순한 작업에만 Lock-Free 사용. 복잡한 자료구조는 검증된 라이브러리(crossbeam, concurrent_hash_map).

2. CPU 메모리 모델

2.1 왜 메모리 모델이 중요한가?

// Thread 1
x = 1;
y = 2;

// Thread 2
print(y);
print(x);

직관적으로는 print(y)가 2이면 print(x)도 1이어야 합니다. 하지만 CPU와 컴파일러가 명령을 재배치할 수 있어 항상 그렇지 않습니다.

2.2 x86 vs ARM 메모리 모델

	x86	ARM	RISC-V
Store-Store 재배치	❌	✅	✅
Load-Load 재배치	❌	✅	✅
Load-Store 재배치	❌	✅	✅
Store-Load 재배치	✅	✅	✅
모델	TSO (Total Store Order)	Weak	Weak

의미: x86에서 잘 동작하던 코드가 ARM에서 실패할 수 있습니다. iPhone, M1/M2 Mac, AWS Graviton의 등장으로 ARM 메모리 모델 이해는 필수.

2.3 메모리 순서 (Memory Ordering)

C++ / Rust의 std::memory_order:

#include <atomic>

std::atomic<int> x{0};

// 가장 강함 (Sequential Consistency, 기본값)
x.store(1, std::memory_order_seq_cst);

// Release: 이 store 이전의 모든 메모리 작업이 완료된 후 수행
x.store(1, std::memory_order_release);

// Acquire: 이 load 이후의 모든 메모리 작업이 이 load 이후에 수행
int v = x.load(std::memory_order_acquire);

// Relaxed: 원자성만 보장, 순서 X (가장 빠름)
x.store(1, std::memory_order_relaxed);

Release-Acquire 패턴:

std::atomic<bool> ready{false};
int data = 0;

// Thread 1 (Producer)
data = 42;
ready.store(true, std::memory_order_release);

// Thread 2 (Consumer)
while (!ready.load(std::memory_order_acquire)) {}
assert(data == 42);  // 보장됨!

release 이전의 모든 쓰기는 acquire 이후에 보입니다.

3. CAS — Compare-And-Swap

3.1 원리

bool compare_and_swap(int* ptr, int expected, int new_value) {
    if (*ptr == expected) {
        *ptr = new_value;
        return true;
    }
    return false;
}

핵심: 이 전체가 원자적으로 실행됩니다. CPU 명령 cmpxchg(x86) 또는 ldrex/strex(ARM)가 직접 지원.

3.2 CAS 기반 카운터

std::atomic<int> counter{0};

void increment() {
    int old_value, new_value;
    do {
        old_value = counter.load();
        new_value = old_value + 1;
    } while (!counter.compare_exchange_weak(old_value, new_value));
}

왜 weak?: 일부 아키텍처(ARM)에서 spurious failure 가능. 루프 안에서는 weak가 더 빠름.

3.3 std::atomic의 fetch_add

위 코드는 표준 라이브러리로 더 단순:

counter.fetch_add(1, std::memory_order_relaxed);

내부적으로 LOCK XADD (x86) 또는 LDADDX (ARM v8.1+) 사용.

3.4 Rust의 atomics

use std::sync::atomic::{AtomicUsize, Ordering};

let counter = AtomicUsize::new(0);

// 원자적 증가
counter.fetch_add(1, Ordering::Relaxed);

// CAS
let result = counter.compare_exchange(
    0,                     // expected
    1,                     // new
    Ordering::Acquire,     // success ordering
    Ordering::Relaxed      // failure ordering
);

Rust는 컴파일 타임에 데이터 레이스를 검출 — atomics를 잘못 사용하면 컴파일 에러.

4. Lock-Free 자료구조

4.1 Treiber Stack — 가장 단순한 Lock-Free

template<typename T>
class LockFreeStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head{nullptr};

public:
    void push(T value) {
        Node* new_node = new Node{value, nullptr};
        Node* old_head;
        do {
            old_head = head.load();
            new_node->next = old_head;
        } while (!head.compare_exchange_weak(old_head, new_node));
    }

    bool pop(T& value) {
        Node* old_head;
        Node* new_head;
        do {
            old_head = head.load();
            if (!old_head) return false;
            new_head = old_head->next;
        } while (!head.compare_exchange_weak(old_head, new_head));
        value = old_head->value;
        delete old_head;  // 위험! 다른 스레드가 읽고 있을 수 있음
        return true;
    }
};

문제: delete old_head가 안전하지 않음! 다른 스레드가 같은 노드를 읽는 중일 수 있습니다.

4.2 ABA 문제

초기: head → A → B → C

Thread 1:                    Thread 2:
old_head = A
new_head = B
                              pop A → delete A
                              pop B → delete B
                              push A (재할당된 메모리)
                              head → A → C

CAS(old_head=A, new_head=B)  ← 성공! 하지만 잘못됨
head → B (이미 free된 메모리)

해결책 1: Tagged Pointers (버전 번호 추가)

struct TaggedPtr {
    Node* ptr;
    uint64_t tag;  // 매번 증가
};
std::atomic<TaggedPtr> head;

CAS는 (ptr, tag) 쌍을 비교 → 같은 ptr이라도 tag가 다르면 실패.

해결책 2: Hazard Pointers

각 스레드가 "내가 지금 사용 중인 포인터"를 발표(announce). 다른 스레드는 그 포인터를 free하지 않음.

해결책 3: Epoch-Based Reclamation

전역 "에폭(epoch)" 카운터. 모든 스레드가 같은 에폭을 통과할 때까지 free 지연. crossbeam-epoch (Rust)가 표준 구현.

4.3 Michael-Scott Queue

가장 유명한 Lock-Free 큐.

struct Node {
    T value;
    std::atomic<Node*> next{nullptr};
};

std::atomic<Node*> head;
std::atomic<Node*> tail;

void enqueue(T value) {
    Node* new_node = new Node{value};
    Node* old_tail;
    Node* old_next;
    while (true) {
        old_tail = tail.load();
        old_next = old_tail->next.load();
        if (old_tail == tail.load()) {  // 일관성 체크
            if (old_next == nullptr) {
                // CAS로 next 설정 시도
                if (old_tail->next.compare_exchange_weak(old_next, new_node)) {
                    break;
                }
            } else {
                // tail이 뒤처짐, 도와주기 (helping)
                tail.compare_exchange_weak(old_tail, old_next);
            }
        }
    }
    // tail 갱신 (실패해도 OK, 다른 스레드가 도와줌)
    tail.compare_exchange_weak(old_tail, new_node);
}

핵심 트릭 — Helping: tail이 뒤처져 있으면 다른 스레드가 자기 작업 전에 도와줌. 이게 "Lock-Free" — 한 스레드가 멈춰도 다른 스레드가 진전 가능.

5. RCU — Read-Copy-Update

5.1 Linux 커널의 비밀 무기

RCU는 읽기를 락 없이 만드는 기법:

Read: 락 없이, 그냥 포인터 읽기
Update: 데이터를 복사 → 수정 → 원자적 포인터 교체
Reclaim: 모든 기존 readers가 끝나기를 기다린 후 옛 데이터 free

// 의사 코드
struct config* config = rcu_dereference(global_config);
int value = config->some_value;  // 락 없이 읽기

// 업데이트
struct config* new_config = malloc(sizeof(*new_config));
memcpy(new_config, old_config, sizeof(*new_config));
new_config->some_value = 42;
rcu_assign_pointer(global_config, new_config);
synchronize_rcu();  // 모든 readers 완료 대기
free(old_config);

왜 빠른가?: 읽기에 atomic 명령조차 없습니다 — 그냥 일반 메모리 읽기. 캐시 친화적.

언제 사용: 읽기가 쓰기보다 1000:1 이상 많을 때. Linux 커널의 라우팅 테이블, 설정 등.

5.2 사용자 공간 RCU

liburcu (사용자 공간 RCU 라이브러리)
crossbeam-epoch (Rust)
folly::rcu (Facebook C++)

6. 언어별 비교

6.1 Java

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// 단순 카운터 (낮은 컨텐션)
AtomicLong counter = new AtomicLong(0);
counter.incrementAndGet();

// 높은 컨텐션 카운터 (Java 8+)
LongAdder adder = new LongAdder();
adder.increment();  // 분할 셀로 컨텐션 분산
long value = adder.sum();  // 집계

LongAdder의 비밀: 여러 카운터 셀로 분산하여 컨텐션 회피. 읽기는 모든 셀 합산. 카운터에는 거의 항상 LongAdder가 더 빠름.

java.util.concurrent.atomic 패키지:

AtomicInteger, AtomicLong, AtomicReference
AtomicIntegerArray, AtomicLongArray
LongAdder, LongAccumulator (고성능)
ConcurrentHashMap (Lock-Free 읽기)

6.2 C++

#include <atomic>

std::atomic<int> counter{0};
counter.fetch_add(1, std::memory_order_relaxed);

// 메모리 순서를 정확히 지정해야 함
std::atomic<bool> flag{false};
flag.store(true, std::memory_order_release);
bool v = flag.load(std::memory_order_acquire);

C++20 추가:

std::atomic_ref — 비-atomic 값에 atomic 작업
std::atomic_wait/notify — 효율적 spinning 대안

6.3 Rust

use std::sync::atomic::{AtomicUsize, AtomicBool, Ordering};

static COUNTER: AtomicUsize = AtomicUsize::new(0);
COUNTER.fetch_add(1, Ordering::Relaxed);

Rust의 강점: 컴파일 타임에 데이터 레이스 검출. atomics를 잘못 사용하면 컴파일 에러.

crossbeam 라이브러리:

crossbeam-channel — Lock-Free MPMC 채널
crossbeam-epoch — Epoch-based reclamation
crossbeam-queue — Lock-Free 큐
crossbeam-skiplist — 동시성 skip list

6.4 Go

import "sync/atomic"

var counter int64
atomic.AddInt64(&counter, 1)
value := atomic.LoadInt64(&counter)

Go의 메모리 모델은 비교적 단순 — 대부분 sync.Mutex나 channel을 권장. atomics는 성능이 정말 중요한 경우만.

7. 실전 예시

7.1 Lock-Free Logger

class LockFreeLogger {
    static constexpr size_t SIZE = 1024;
    std::atomic<size_t> head{0};
    std::array<std::string, SIZE> buffer;

public:
    void log(const std::string& msg) {
        size_t idx = head.fetch_add(1, std::memory_order_relaxed) % SIZE;
        buffer[idx] = msg;  // 주의: 동시 쓰기 가능, 실제로는 더 정교하게
    }
};

이는 단순화된 예시 — 실제로는 ring buffer + 더 정교한 동기화 필요.

7.2 Sequence ID Generator

use std::sync::atomic::{AtomicU64, Ordering};

struct IdGenerator {
    next_id: AtomicU64,
}

impl IdGenerator {
    fn next(&self) -> u64 {
        self.next_id.fetch_add(1, Ordering::Relaxed)
    }
}

수백만 ID/초 처리 가능. 락보다 100배 빠름.

7.3 Spin Lock (학습용)

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // 바쁘게 대기 (busy wait)
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};

언제 사용: 락을 보유하는 시간이 매우 짧을 때 (수십 ns). 그보다 길면 OS mutex가 더 효율적.

개선: x86의 PAUSE 명령으로 spinning 효율 향상:

while (flag.test_and_set(std::memory_order_acquire)) {
    _mm_pause();  // CPU 힌트
}

8. 성능 측정과 함정

struct Counters {
    std::atomic<int> a;
    std::atomic<int> b;  // 같은 캐시 라인!
};

a와 b가 같은 64바이트 캐시 라인에 있으면, 한 코어가 a를 수정할 때 다른 코어의 b 캐시가 무효화됩니다. 성능 10배+ 저하 가능.

해결: 패딩으로 분리

struct Counters {
    alignas(64) std::atomic<int> a;
    alignas(64) std::atomic<int> b;
};

8.2 컨텐션 vs 처리량

벤치마크 결과 (간단한 카운터, 4 코어):

방법	1 스레드	4 스레드	16 스레드
Mutex	30M ops/s	8M ops/s	2M ops/s
atomic CAS	100M ops/s	25M ops/s	6M ops/s
LongAdder	80M ops/s	200M ops/s	800M ops/s

교훈: 컨텐션이 높을 때 단순 atomic도 한계. LongAdder 같은 분할 카운터가 압도적으로 빠릅니다.

8.3 측정 도구

Linux: perf stat -e cache-misses, perf c2c (false sharing 감지)
Intel VTune — 마이크로아키텍처 분석
Java: JMH (Java Microbenchmark Harness)
Rust: criterion
C++: Google Benchmark

9. 검증된 라이브러리

거의 항상 직접 작성하지 마세요. 검증된 라이브러리 사용:

언어	라이브러리	제공
Rust	`crossbeam`	Lock-Free 큐, 채널, atomic, epoch
Rust	`dashmap`	Lock-Free HashMap
Rust	`parking_lot`	빠른 mutex (대안)
C++	`folly` (Facebook)	Lock-Free 다양한 자료구조
C++	`boost::lockfree`	큐, 스택, spsc_queue
C++	`concurrencpp`	코루틴 + 동시성
Java	`j.u.c.atomic`, `j.u.c.ConcurrentHashMap`	표준
Java	JCTools	고성능 큐
Go	`sync/atomic`, `sync.Map`	표준

퀴즈

1. Lock-Free와 Wait-Free의 차이는?

답: Lock-Free는 여러 스레드 중 적어도 하나가 항상 진전(progress)을 보장합니다. 다른 스레드는 starvation 가능. Wait-Free는 모든 스레드가 유한 시간 내 완료를 보장합니다. Wait-Free가 더 강하지만 구현이 훨씬 어렵습니다. 대부분의 실용적 알고리즘은 Lock-Free입니다.

2. ABA 문제는 무엇이고 어떻게 해결하나요?

답: 값이 A→B→A로 변했을 때, CAS는 현재 값이 기대값(A)과 같으므로 성공한다고 판단합니다. 하지만 그 사이에 변화가 있었을 수 있어 잘못된 결정. 해결: (1) Tagged Pointers — 버전 번호 추가, (2) Hazard Pointers — 사용 중인 포인터 발표, (3) Epoch-Based Reclamation — 안전한 시점에만 free.

3. x86과 ARM의 메모리 모델 차이는?

답: x86은 Total Store Order (TSO) — Load-Load, Store-Store, Load-Store 재배치 안 됨 (Store-Load만 가능). ARM은 Weak Memory Model — 거의 모든 재배치 가능. 결과: x86에서 잘 동작하던 코드가 ARM에서 실패 가능. iPhone, M1/M2 Mac, AWS Graviton 시대에는 Acquire/Release 의미를 정확히 이해해야 합니다.

4. False Sharing이 무엇이고 어떻게 방지하나요?

답: 두 atomic 변수가 같은 64바이트 캐시 라인에 있으면, 한 코어가 한 변수를 수정할 때 다른 코어의 캐시가 무효화됩니다. 성능이 10배+ 저하될 수 있습니다. 해결: alignas(64) (C++) 또는 #[repr(align(64))] (Rust)로 캐시 라인 경계에 정렬하거나 패딩 추가. perf c2c로 감지 가능.

5. 언제 Lock-Free를 사용하면 안 되나요?

답: (1) 단순한 락으로 충분할 때 — 락 컨텐션이 낮으면 mutex가 더 빠를 수 있습니다, (2) 복잡한 자료구조 — 직접 구현하지 말고 crossbeam, folly 같은 검증된 라이브러리 사용, (3) 유지보수 비용 — 팀이 메모리 모델을 이해하지 못하면 미묘한 버그 위험, (4) 디버깅 가능성 — Lock-Free 버그는 재현이 어렵고 디버깅이 매우 어렵습니다. 단순 카운터, 시퀀스 ID, 플래그에만 직접 사용하세요.

참고 자료

The Art of Multiprocessor Programming — Maurice Herlihy
C++ Concurrency in Action — Anthony Williams
Memory Models for C/C++ Programmers — Russ Cox
Preshing on Programming — Lock-Free 시리즈
crossbeam — Rust Lock-Free 라이브러리
folly — Facebook C++ 라이브러리
Linux Kernel RCU Documentation
Java Concurrency in Practice — Brian Goetz
Mechanical Sympathy — Martin Thompson
JCTools — Java 고성능 큐
perf c2c — false sharing 감지

Lock-Free Concurrency Programming Complete Guide 2025: Atomics, Memory Model, CAS, Lock-Free Data Structures

TL;DR

Lock-Free essence: concurrency without locks, using atomic instructions like CAS (Compare-And-Swap).
Memory model: x86 provides strong ordering; ARM/RISC-V are weak. Understanding Acquire/Release semantics is essential.
Pros: no deadlocks, no priority inversion, sometimes faster than locks.
Cons: extremely hard, debugging nightmare, tiny mistakes corrupt data.
When to use: simple tasks like counters, sequence IDs, log buffers. For complex structures, use battle-tested libraries.

1. What is Lock-Free?

1.1 Definition

Lock-Free algorithm: when multiple threads operate concurrently, at least one thread is guaranteed to make progress.

Hierarchy:

Level	Guarantee
Wait-Free	All threads complete in bounded time (strongest)
Lock-Free	At least one thread makes progress in bounded time
Obstruction-Free	Progress possible if other threads stop
Blocking	Threads may wait indefinitely (uses locks)

1.2 Why Lock-Free?

Problems with locks:

Deadlock — two threads waiting for each other's locks.
Priority inversion — low-priority thread holds lock while high-priority waits.
Contention — many threads competing for the same lock serializes execution.
Blocking — lock holder page-faults, GCs, or is preempted, stalling everyone.

Lock-Free promises:

Deadlock-free (no locks).
Avoids syscalls (userspace only).
Cache-friendly (CPU atomic instructions).

1.3 Real Trade-offs

Lock-Free is not a silver bullet:

Implementation complexity: 10-100x harder than lock-based.
Verification difficulty: race conditions are extremely subtle.
Memory reclamation: safe deletion is hard (Hazard Pointers, Epochs).
Under contention: can be slower than locks when heavily contended.

Use Lock-Free only for simple operations. For complex structures, rely on vetted libraries like crossbeam or concurrent_hash_map.

2. CPU Memory Model

2.1 Why the Memory Model Matters

// Thread 1
x = 1;
y = 2;

// Thread 2
print(y);
print(x);

Intuitively, if print(y) shows 2, then print(x) should show 1. But CPUs and compilers can reorder instructions, so this is not always true.

2.2 x86 vs ARM Memory Models

	x86	ARM	RISC-V
Store-Store reorder	No	Yes	Yes
Load-Load reorder	No	Yes	Yes
Load-Store reorder	No	Yes	Yes
Store-Load reorder	Yes	Yes	Yes
Model	TSO (Total Store Order)	Weak	Weak

Code that works on x86 may fail on ARM. With iPhones, M1/M2 Macs, and AWS Graviton, understanding the ARM memory model is essential.

2.3 Memory Ordering

C++ / Rust std::memory_order:

#include <atomic>

std::atomic<int> x{0};

// Strongest (Sequential Consistency, default)
x.store(1, std::memory_order_seq_cst);

// Release: all prior memory ops complete before this store is visible
x.store(1, std::memory_order_release);

// Acquire: all subsequent memory ops happen after this load
int v = x.load(std::memory_order_acquire);

// Relaxed: atomicity only, no ordering (fastest)
x.store(1, std::memory_order_relaxed);

Release-Acquire pattern:

std::atomic<bool> ready{false};
int data = 0;

// Thread 1 (Producer)
data = 42;
ready.store(true, std::memory_order_release);

// Thread 2 (Consumer)
while (!ready.load(std::memory_order_acquire)) {}
assert(data == 42);  // Guaranteed!

All writes before release are visible after acquire.

3. CAS — Compare-And-Swap

3.1 Principle

bool compare_and_swap(int* ptr, int expected, int new_value) {
    if (*ptr == expected) {
        *ptr = new_value;
        return true;
    }
    return false;
}

The entire operation is atomic. Directly supported by cmpxchg (x86) or ldrex/strex (ARM).

3.2 CAS-based Counter

std::atomic<int> counter{0};

void increment() {
    int old_value, new_value;
    do {
        old_value = counter.load();
        new_value = old_value + 1;
    } while (!counter.compare_exchange_weak(old_value, new_value));
}

Why weak? Some architectures (ARM) allow spurious failures. Inside a loop, weak is faster.

3.3 std::atomic fetch_add

The code above is simpler with the standard library:

counter.fetch_add(1, std::memory_order_relaxed);

Internally uses LOCK XADD (x86) or LDADDX (ARM v8.1+).

3.4 Rust Atomics

use std::sync::atomic::{AtomicUsize, Ordering};

let counter = AtomicUsize::new(0);

// Atomic increment
counter.fetch_add(1, Ordering::Relaxed);

// CAS
let result = counter.compare_exchange(
    0,                     // expected
    1,                     // new
    Ordering::Acquire,     // success ordering
    Ordering::Relaxed      // failure ordering
);

Rust detects data races at compile time — misusing atomics is a compile error.

4. Lock-Free Data Structures

4.1 Treiber Stack — Simplest Lock-Free

template<typename T>
class LockFreeStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head{nullptr};

public:
    void push(T value) {
        Node* new_node = new Node{value, nullptr};
        Node* old_head;
        do {
            old_head = head.load();
            new_node->next = old_head;
        } while (!head.compare_exchange_weak(old_head, new_node));
    }

    bool pop(T& value) {
        Node* old_head;
        Node* new_head;
        do {
            old_head = head.load();
            if (!old_head) return false;
            new_head = old_head->next;
        } while (!head.compare_exchange_weak(old_head, new_head));
        value = old_head->value;
        delete old_head;  // Unsafe! Other threads may still be reading
        return true;
    }
};

delete old_head is unsafe — another thread may be reading the same node.

4.2 ABA Problem

Initial: head -> A -> B -> C

Thread 1:                    Thread 2:
old_head = A
new_head = B
                              pop A -> delete A
                              pop B -> delete B
                              push A (reallocated memory)
                              head -> A -> C

CAS(old_head=A, new_head=B)  <- Succeeds! But wrong
head -> B (already freed memory)

Solution 1: Tagged Pointers (add version number)

struct TaggedPtr {
    Node* ptr;
    uint64_t tag;  // increment each time
};
std::atomic<TaggedPtr> head;

CAS compares the (ptr, tag) pair. Same ptr with different tag fails.

Solution 2: Hazard Pointers

Each thread announces pointers it's currently using; others won't free them.

Solution 3: Epoch-Based Reclamation

A global epoch counter. Defer free until all threads cross the same epoch. crossbeam-epoch (Rust) is the canonical implementation.

4.3 Michael-Scott Queue

The most famous Lock-Free queue.

struct Node {
    T value;
    std::atomic<Node*> next{nullptr};
};

std::atomic<Node*> head;
std::atomic<Node*> tail;

void enqueue(T value) {
    Node* new_node = new Node{value};
    Node* old_tail;
    Node* old_next;
    while (true) {
        old_tail = tail.load();
        old_next = old_tail->next.load();
        if (old_tail == tail.load()) {  // consistency check
            if (old_next == nullptr) {
                // try CAS to set next
                if (old_tail->next.compare_exchange_weak(old_next, new_node)) {
                    break;
                }
            } else {
                // tail is lagging, help it forward
                tail.compare_exchange_weak(old_tail, old_next);
            }
        }
    }
    // Advance tail (OK if it fails — another thread will help)
    tail.compare_exchange_weak(old_tail, new_node);
}

The helping trick: if tail is lagging, another thread helps advance it. That's what makes it Lock-Free — one stalled thread doesn't block others.

5. RCU — Read-Copy-Update

5.1 Linux Kernel's Secret Weapon

RCU makes reads lock-free:

Read: no locks, just read the pointer.
Update: copy data, modify, atomically swap pointer.
Reclaim: wait for all existing readers to finish, then free old data.

// Pseudo-code
struct config* config = rcu_dereference(global_config);
int value = config->some_value;  // read without locks

// Update
struct config* new_config = malloc(sizeof(*new_config));
memcpy(new_config, old_config, sizeof(*new_config));
new_config->some_value = 42;
rcu_assign_pointer(global_config, new_config);
synchronize_rcu();  // wait for readers
free(old_config);

Why fast? Reads have no atomic instructions — just normal memory reads. Cache-friendly.

When to use: when reads vastly outnumber writes (1000:1+). Used by Linux kernel routing tables, configs, etc.

5.2 Userspace RCU

liburcu (userspace RCU library)
crossbeam-epoch (Rust)
folly::rcu (Facebook C++)

6. Language Comparison

6.1 Java

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Simple counter (low contention)
AtomicLong counter = new AtomicLong(0);
counter.incrementAndGet();

// High-contention counter (Java 8+)
LongAdder adder = new LongAdder();
adder.increment();  // sharded cells avoid contention
long value = adder.sum();  // aggregate

LongAdder spreads writes across shards to reduce contention; reads sum all shards. For counters, LongAdder is almost always faster.

java.util.concurrent.atomic package:

AtomicInteger, AtomicLong, AtomicReference
AtomicIntegerArray, AtomicLongArray
LongAdder, LongAccumulator (high perf)
ConcurrentHashMap (Lock-Free reads)

6.2 C++

#include <atomic>

std::atomic<int> counter{0};
counter.fetch_add(1, std::memory_order_relaxed);

// Must specify memory order precisely
std::atomic<bool> flag{false};
flag.store(true, std::memory_order_release);
bool v = flag.load(std::memory_order_acquire);

C++20 additions:

std::atomic_ref — atomic ops on non-atomic values.
std::atomic_wait/notify — efficient alternative to spinning.

6.3 Rust

use std::sync::atomic::{AtomicUsize, AtomicBool, Ordering};

static COUNTER: AtomicUsize = AtomicUsize::new(0);
COUNTER.fetch_add(1, Ordering::Relaxed);

Rust's strength: compile-time data race detection.

crossbeam library:

crossbeam-channel — Lock-Free MPMC channel
crossbeam-epoch — epoch-based reclamation
crossbeam-queue — Lock-Free queue
crossbeam-skiplist — concurrent skip list

6.4 Go

import "sync/atomic"

var counter int64
atomic.AddInt64(&counter, 1)
value := atomic.LoadInt64(&counter)

Go's memory model is relatively simple — sync.Mutex or channels are recommended for most cases. Use atomics only when performance is critical.

7. Practical Examples

7.1 Lock-Free Logger

class LockFreeLogger {
    static constexpr size_t SIZE = 1024;
    std::atomic<size_t> head{0};
    std::array<std::string, SIZE> buffer;

public:
    void log(const std::string& msg) {
        size_t idx = head.fetch_add(1, std::memory_order_relaxed) % SIZE;
        buffer[idx] = msg;  // Note: concurrent writes possible, use more care in real code
    }
};

Simplified example — real systems need ring buffer plus more careful synchronization.

7.2 Sequence ID Generator

use std::sync::atomic::{AtomicU64, Ordering};

struct IdGenerator {
    next_id: AtomicU64,
}

impl IdGenerator {
    fn next(&self) -> u64 {
        self.next_id.fetch_add(1, Ordering::Relaxed)
    }
}

Handles millions of IDs per second — 100x faster than locks.

7.3 Spin Lock (educational)

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // busy wait
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};

Use only when hold time is very short (tens of ns). Otherwise, OS mutex is more efficient.

Improvement: x86 PAUSE instruction for better spinning:

while (flag.test_and_set(std::memory_order_acquire)) {
    _mm_pause();  // CPU hint
}

8. Performance Measurement and Pitfalls

struct Counters {
    std::atomic<int> a;
    std::atomic<int> b;  // same cache line!
};

If a and b share a 64-byte cache line, one core modifying a invalidates another core's cached b. Performance can drop 10x+.

Fix: pad to separate lines.

struct Counters {
    alignas(64) std::atomic<int> a;
    alignas(64) std::atomic<int> b;
};

8.2 Contention vs Throughput

Benchmark (simple counter, 4 cores):

Method	1 thread	4 threads	16 threads
Mutex	30M ops/s	8M ops/s	2M ops/s
atomic CAS	100M ops/s	25M ops/s	6M ops/s
LongAdder	80M ops/s	200M ops/s	800M ops/s

Under heavy contention, even simple atomics hit limits. Sharded counters like LongAdder win decisively.

8.3 Measurement Tools

Linux: perf stat -e cache-misses, perf c2c (false sharing detection)
Intel VTune — microarchitectural analysis
Java: JMH (Java Microbenchmark Harness)
Rust: criterion
C++: Google Benchmark

9. Battle-Tested Libraries

Almost never write your own. Use vetted libraries:

Language	Library	Provides
Rust	`crossbeam`	Lock-Free queue, channel, atomic, epoch
Rust	`dashmap`	Lock-Free HashMap
Rust	`parking_lot`	Fast mutex (alternative)
C++	`folly` (Facebook)	Various Lock-Free structures
C++	`boost::lockfree`	queue, stack, spsc_queue
C++	`concurrencpp`	coroutines + concurrency
Java	`j.u.c.atomic`, `j.u.c.ConcurrentHashMap`	standard
Java	JCTools	high-performance queues
Go	`sync/atomic`, `sync.Map`	standard

Quiz

1. What's the difference between Lock-Free and Wait-Free?

Lock-Free guarantees that at least one thread makes progress; other threads may starve. Wait-Free guarantees every thread completes in bounded time. Wait-Free is stronger but much harder to implement. Most practical algorithms are Lock-Free.

2. What is the ABA problem and how is it solved?

A value changes A to B to A; CAS sees the current value matches the expected value (A) and succeeds — but state in between changed. Solutions: (1) Tagged Pointers — add a version number; (2) Hazard Pointers — announce in-use pointers; (3) Epoch-Based Reclamation — free only at safe epochs.

3. How do x86 and ARM memory models differ?

x86 uses Total Store Order (TSO): no Load-Load, Store-Store, or Load-Store reordering (only Store-Load). ARM has a Weak Memory Model: almost all reorderings allowed. Code working on x86 may fail on ARM. In the iPhone, M1/M2 Mac, and AWS Graviton era, understanding Acquire/Release semantics is essential.

4. What is false sharing and how do you prevent it?

Two atomic variables on the same 64-byte cache line cause invalidations across cores when one is modified. Performance drops 10x+. Fix: align to cache line boundaries with alignas(64) (C++) or #[repr(align(64))] (Rust), or add padding. Detect with perf c2c.

5. When should you avoid Lock-Free?

(1) When a simple lock suffices — low-contention mutex may be faster; (2) Complex data structures — use vetted libraries like crossbeam or folly instead of rolling your own; (3) Maintenance cost — subtle bugs if the team doesn't understand the memory model; (4) Debuggability — Lock-Free bugs are extremely hard to reproduce and debug. Use it yourself only for simple counters, sequence IDs, and flags.

References

The Art of Multiprocessor Programming — Maurice Herlihy
C++ Concurrency in Action — Anthony Williams
Memory Models for C/C++ Programmers — Russ Cox
Preshing on Programming — Lock-Free series
crossbeam
folly
Linux Kernel RCU Documentation
Java Concurrency in Practice — Brian Goetz
Mechanical Sympathy — Martin Thompson
JCTools
perf c2c

Lock-Free 동시성 프로그래밍 완전 가이드 2025: Atomics, 메모리 모델, CAS, 무잠금 자료구조

TL;DR

1. Lock-Free란 무엇인가?

1.1 정의

1.2 왜 Lock-Free인가?

1.3 현실적 트레이드오프

2. CPU 메모리 모델

2.1 왜 메모리 모델이 중요한가?

2.2 x86 vs ARM 메모리 모델

2.3 메모리 순서 (Memory Ordering)

3. CAS — Compare-And-Swap

3.1 원리

3.2 CAS 기반 카운터

3.3 std::atomic의 fetch_add

3.4 Rust의 atomics

4. Lock-Free 자료구조

4.1 Treiber Stack — 가장 단순한 Lock-Free

4.2 ABA 문제

4.3 Michael-Scott Queue

5. RCU — Read-Copy-Update

5.1 Linux 커널의 비밀 무기

5.2 사용자 공간 RCU

6. 언어별 비교

6.1 Java

6.2 C++

6.3 Rust

6.4 Go

7. 실전 예시

7.1 Lock-Free Logger

7.2 Sequence ID Generator

7.3 Spin Lock (학습용)

8. 성능 측정과 함정

8.1 false sharing

8.2 컨텐션 vs 처리량

8.3 측정 도구

9. 검증된 라이브러리

퀴즈

참고 자료

Lock-Free Concurrency Programming Complete Guide 2025: Atomics, Memory Model, CAS, Lock-Free Data Structures

TL;DR

1. What is Lock-Free?

1.1 Definition

1.2 Why Lock-Free?

1.3 Real Trade-offs

2. CPU Memory Model

2.1 Why the Memory Model Matters

2.2 x86 vs ARM Memory Models

2.3 Memory Ordering

3. CAS — Compare-And-Swap

3.1 Principle

3.2 CAS-based Counter

3.3 std::atomic fetch_add

3.4 Rust Atomics

4. Lock-Free Data Structures

4.1 Treiber Stack — Simplest Lock-Free

4.2 ABA Problem

4.3 Michael-Scott Queue

5. RCU — Read-Copy-Update

5.1 Linux Kernel's Secret Weapon

5.2 Userspace RCU

6. Language Comparison

6.1 Java

6.2 C++

6.3 Rust

6.4 Go

7. Practical Examples

7.1 Lock-Free Logger

7.2 Sequence ID Generator

7.3 Spin Lock (educational)

8. Performance Measurement and Pitfalls

8.1 False Sharing

8.2 Contention vs Throughput

8.3 Measurement Tools

9. Battle-Tested Libraries

Quiz

References