Embedded Performance Engineering · 36/57

Lock-Free Data Structure Performance: CAS, ABA, Hazard Pointers, Epoch Reclamation

April 26, 2026 · Hawk · 6 min read

#One-Line Summary

“Lock-free guarantees that at least one thread in the system always makes progress; wait-free guarantees that every thread does.”

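#What Problem Does This Solve

With lock-based synchronization, a thread that dies or spins forever while holding a lock stalls every other thread waiting on that lock. A lock-free data structure is designed so that the remaining threads can still make progress no matter where any one thread stalls.

Because there is no lock to hold, lock-induced priority inversion and deadlock go away as well. The price is a hard implementation that invites subtle bugs such as ABA. A flawed design can corrupt data, and even a correct one can end up slower than a lock.

This article starts from CAS, the basic atomic primitive, builds a lock-free stack, and then looks at the ABA problem and the techniques that address it: tagged pointers, hazard pointers, and epoch reclamation.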

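#Progress Property: Grades of Progress Guarantee

Obstruction-free: a thread makes progress when it runs in isolation
Lock-free       : at least one thread in the system always makes progress
Wait-free       : every thread completes its operation in a bounded number of steps

Most practical implementations stop at lock-free. Wait-free is the strongest guarantee in theory, but it is complex enough to implement that it is a heavy burden for general use.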

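#CAS: Compare-And-Swap

/* Semantics only: the hardware performs this entire body as one atomic step. */
bool cas(int *ptr, int *expected, int desired) {
    if (*ptr == *expected) {
        *ptr = desired;       /* value matched: install the new value */
        return true;
    }
    *expected = *ptr;         /* mismatch: report the value actually seen */
    return false;
}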

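On ARM this is implemented with an LDREX/STREX pair (LDXR/STXR on AArch64), or with a single CASAL instruction from the ARMv8.1 atomics extension.

@ expected and desired stand for registers; memory barriers are omitted
loop:
    ldrex r1, [r0]            @ load-exclusive the current value
    cmp   r1, expected
    bne   fail                @ mismatch: the CAS fails
    strex r2, desired, [r0]   @ store-exclusive; r2 = 0 on success
    cmp   r2, #0
    bne   loop                @ exclusive monitor lost: retry
fail:

On x86 a single LOCK CMPXCHG instruction does the job. CAS is the basic building block of almost every lock-free algorithm.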
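
In portable C11 the same primitive is reached through <stdatomic.h>. The sketch below is illustrative: try_transition is a hypothetical helper, not something from the original text, and it only shows how the expected argument behaves on success and on failure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: try to move *state from `from` to `to` exactly once.
 * Returns true on success; *seen receives the value the CAS compared against
 * (equal to `from` on success, the conflicting value on failure). */
static bool try_transition(atomic_int *state, int from, int to, int *seen)
{
    assert(state != NULL && seen != NULL);
    int expected = from;
    /* Strong variant: fails only when *state != expected, never spuriously. */
    bool ok = atomic_compare_exchange_strong(state, &expected, to);
    *seen = expected;   /* left untouched on success, refreshed on failure */
    return ok;
}
```

A caller that loops on failure can reuse the refreshed value instead of issuing a separate load, which is why most CAS loops load only once before the loop.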

#Lock-Free Counter

#include <stdatomic.h>

atomic_int counter;

void inc(void) {
    int old = atomic_load(&counter);
    /* A failed CAS stores the current value into old, so no reload is needed. */
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
    }
}

/* Or atomic_fetch_add: a single call */
void inc_fetch_add(void) {
    atomic_fetch_add(&counter, 1);
}

From ARMv8.1 on, the LDADD instruction turns fetch-and-add into a single instruction. It is far more efficient than a CAS retry loop, and the gap widens under high contention.
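
A CAS loop is still the tool of choice when the update has no C11 fetch_* counterpart. As a hedged illustration, the sketch below keeps a running maximum; atomic_max is a hypothetical helper name, and relaxed ordering is an assumption that fits a statistic nobody synchronizes on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: raise *target to at least `sample`.
 * Returns the value observed just before the update (or the current
 * value when no update was needed). */
static int atomic_max(atomic_int *target, int sample)
{
    assert(target != NULL);
    int cur = atomic_load_explicit(target, memory_order_relaxed);
    /* A failed weak CAS refreshes `cur`, so the loop re-checks the
     * condition against the latest value without an extra load. */
    while (cur < sample &&
           !atomic_compare_exchange_weak_explicit(target, &cur, sample,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* retry */
    }
    return cur;
}
```

The loop exits either because the stored value is already large enough or because this thread installed its sample, so some thread always makes progress.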

#Lock-Free Stack

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

_Atomic(struct node *) top;

bool push(int v) {
    struct node *new_node = malloc(sizeof(*new_node));
    if (!new_node) return false;          /* allocation failed */
    new_node->value = v;
    struct node *old_top = atomic_load(&top);
    do {
        new_node->next = old_top;         /* old_top is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak(&top, &old_top, new_node));
    return true;
}

bool pop(int *out) {
    struct node *old_top = atomic_load(&top);
    do {
        if (!old_top) return false;       /* empty stack */
    } while (!atomic_compare_exchange_weak(&top, &old_top, old_top->next));
    *out = old_top->value;
    free(old_top);   /* ABA and use-after-free hazard */
    return true;
}

This looks like it should work, but it carries two hazards. If another thread pops concurrently, reading old_top->next can be a use-after-free, and the ABA problem can leave top pointing at the wrong next node.

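#The ABA Problem

Thread 1 starts pop:
  old_top = X (stack is X -> Y -> Z, so X->next = Y)
  preempt
Thread 2: pop X, pop Y, push X (X reused)
  top = X (but X->next is now Z)
  resume Thread 1
Thread 1: CAS(top, X, Y) succeeds
  → top = Y (but Y has already been freed)

CAS only compares values, so it cannot tell that a pointer with the same bits now means something different. This is the ABA problem; it was first described in IBM's documentation of the compare-and-swap instruction for System/370.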
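
The interleaving above can be replayed deterministically in a single thread. The following is a sketch under stated assumptions: nodes are caller-owned and never actually freed (so the demonstration itself has no undefined behavior), and replay_aba is a hypothetical function that plays both threads in the order of the trace.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node { int value; struct node *next; };

static _Atomic(struct node *) top;

static void push(struct node *n)
{
    struct node *old = atomic_load(&top);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&top, &old, n));
}

static struct node *pop(void)
{
    struct node *old = atomic_load(&top);
    while (old && !atomic_compare_exchange_weak(&top, &old, old->next)) {
    }
    return old;
}

/* Replays the trace; returns the node that ends up on top. */
static struct node *replay_aba(struct node *x, struct node *y, struct node *z)
{
    atomic_store(&top, NULL);
    push(z); push(y); push(x);                  /* stack: X -> Y -> Z */

    /* "Thread 1" starts pop: reads top and next, then is preempted. */
    struct node *t1_old  = atomic_load(&top);   /* X */
    struct node *t1_next = t1_old->next;        /* Y */

    /* "Thread 2" runs to completion. */
    pop();                                      /* removes X */
    pop();                                      /* removes Y (logically freed) */
    push(x);                                    /* X reused: stack is X -> Z */

    /* "Thread 1" resumes: top is X again, so the stale CAS succeeds. */
    bool ok = atomic_compare_exchange_strong(&top, &t1_old, t1_next);
    assert(ok);
    return atomic_load(&top);                   /* Y: a node already popped */
}
```

After the replay, top points at Y even though Y was popped, and X has silently dropped out of the stack. With a real free(Y) in thread 2, the final state would be a dangling top pointer.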

#Solution 1: Tagged Pointer

struct { void *ptr; uint64_t tag; } top;   /* 128-bit */

push:
  loop:
    old = top;
    new->next = old.ptr;
    new_pair = { new, old.tag + 1 };
    if (!CAS(top, old, new_pair)) goto loop;

Bumping the tag on every update of top (pop as well as push) makes the pair differ even when the pointer is the same, so a stale CAS fails. The 128-bit double-word CAS comes from CMPXCHG16B on x86 and CASP on ARM.

The downsides: a 64-bit pointer plus a 64-bit tag doubles the memory footprint of the head, and some ARM cores do not support 128-bit atomics.
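
Where a 128-bit CAS is unavailable, a common variation (swapped in here, not the 128-bit scheme described above) packs a 32-bit pool index and a 32-bit tag into one 64-bit word. The sketch assumes nodes live in a fixed static pool that is never returned to the allocator; pool, tpush, and tpop are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_SIZE 8u
#define NIL       UINT32_MAX              /* "null" index */

struct slot { int value; _Atomic uint32_t next; };

static struct slot pool[POOL_SIZE];       /* assumption: never freed */
static _Atomic uint64_t head = NIL;       /* low 32 bits: index, high 32: tag */

static uint64_t pack(uint32_t idx, uint32_t tag) { return ((uint64_t)tag << 32) | idx; }
static uint32_t idx_of(uint64_t h) { return (uint32_t)h; }
static uint32_t tag_of(uint64_t h) { return (uint32_t)(h >> 32); }

static void tpush(uint32_t i)
{
    assert(i < POOL_SIZE);
    uint64_t old = atomic_load(&head);
    uint64_t new_head;
    do {
        atomic_store_explicit(&pool[i].next, idx_of(old), memory_order_relaxed);
        new_head = pack(i, tag_of(old) + 1);      /* bump tag on every update */
    } while (!atomic_compare_exchange_weak(&head, &old, new_head));
}

static bool tpop(uint32_t *out)
{
    uint64_t old = atomic_load(&head);
    uint64_t new_head;
    do {
        if (idx_of(old) == NIL) return false;     /* empty */
        /* Reading a slot that was popped meanwhile is harmless: pool memory
         * stays valid, and the tag mismatch makes the CAS below fail. */
        uint32_t next = atomic_load_explicit(&pool[idx_of(old)].next,
                                             memory_order_relaxed);
        new_head = pack(next, tag_of(old) + 1);
    } while (!atomic_compare_exchange_weak(&head, &old, new_head));
    *out = idx_of(old);
    return true;
}
```

The tag can still wrap after 2^32 updates, so this narrows the ABA window rather than closing it outright; whether that is acceptable depends on how long a thread can be preempted between its load and its CAS.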

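#Solution 2: Hazard Pointer

Each thread publishes the pointer it is currently accessing in a shared hazard array. A thread that wants to free a node first checks every other thread's hazard pointer to confirm that the node is safe to free.

/* One published slot per thread; my_id is this thread's index, assumed assigned at startup. */
_Atomic(struct node *) hp_list[MAX_THREADS];
__thread int my_id;

bool pop(int *out) {
    struct node *top_;
    for (;;) {
        top_ = atomic_load(&top);
        if (!top_) return false;                      /* empty stack */
        atomic_store(&hp_list[my_id], top_);          /* publish protection */
        if (top_ != atomic_load(&top)) continue;      /* top moved: protect again */
        if (CAS(&top, top_, top_->next)) break;       /* CAS: compare-exchange helper */
    }
    *out = top_->value;
    atomic_store(&hp_list[my_id], NULL);              /* drop protection */
    retire(top_);   /* do not free yet: move to the retired list */
    return true;
}

void scan_and_free(void) {
    for (each retired r) {
        if (no_hp_points_to(r)) free(r);
    }
}

Maged Michael's hazard pointer work, first published in 2002, is the standard reference. Among high-performance libraries, Folly ships a production implementation.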
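
The retire and scan steps can be made concrete as follows. This is a minimal sketch, assuming one hazard slot per thread, a fixed thread count, and a per-thread retire list; MAX_RETIRED, SCAN_THRESHOLD, and struct retire_list are illustrative names, not part of any library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_THREADS    4    /* assumption: fixed, small thread count */
#define MAX_RETIRED    16   /* assumption: per-thread retire capacity */
#define SCAN_THRESHOLD 8    /* > MAX_THREADS, so every scan frees something */

/* One published hazard pointer per thread, readable by every thread. */
static _Atomic(void *) hazard[MAX_THREADS];

/* Retire list owned by a single thread. */
struct retire_list { void *ptr[MAX_RETIRED]; size_t count; };

static bool is_protected(const void *p)
{
    for (int i = 0; i < MAX_THREADS; i++)
        if (atomic_load(&hazard[i]) == p) return true;
    return false;
}

/* Frees every retired pointer that no thread protects; returns the count freed. */
static size_t scan_and_free(struct retire_list *rl)
{
    size_t kept = 0, freed = 0;
    for (size_t i = 0; i < rl->count; i++) {
        if (is_protected(rl->ptr[i])) {
            rl->ptr[kept++] = rl->ptr[i];   /* still in use: keep for next scan */
        } else {
            free(rl->ptr[i]);
            freed++;
        }
    }
    rl->count = kept;
    return freed;
}

/* Call only after p has been unlinked from the shared structure. */
static void retire(struct retire_list *rl, void *p)
{
    assert(rl->count < MAX_RETIRED);
    rl->ptr[rl->count++] = p;
    if (rl->count >= SCAN_THRESHOLD)
        scan_and_free(rl);
}
```

Because at most MAX_THREADS pointers can be protected at once, a scan triggered at the threshold always leaves room in the list, which bounds the memory held by retired nodes.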

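#Solution 3: Epoch Reclamation

atomic_int global_epoch;
atomic_int local_epoch[MAX_THREADS];   /* one slot per thread, visible to the writer; starts at -1 */
__thread int my_id;                    /* this thread's slot index, assumed assigned at startup */

void reader(void) {
    atomic_store(&local_epoch[my_id], atomic_load(&global_epoch));
    /* access shared ptr */
    atomic_store(&local_epoch[my_id], -1);   /* -1: not inside a read section */
}

void writer(void) {
    /* modify, retire old */
    atomic_fetch_add(&global_epoch, 1);
    wait_until_all_threads_pass_epoch();     /* every slot is -1 or the new epoch */
    free(retired);
}

While any thread may still be inside the epoch in which a node was retired, the node is not freed; once every thread has moved on to a later epoch, it is freed safely. RCU in the Linux kernel is built on the same idea.

Where a hazard pointer protects per pointer, an epoch protects per thread. Epochs generally have lower read overhead, but a retired object can occupy memory longer before it is freed, and a single stalled reader holds back all reclamation.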
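
A non-blocking variant of the writer side tags each retired object with the epoch at retirement and frees it once every active reader is pinned to a later epoch. The sketch below assumes a single writer, default seq_cst atomics, and no epoch wraparound; limbo, read_enter, read_exit, and reclaim are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_THREADS 4
#define MAX_RETIRED 16

static atomic_uint global_epoch = 1;           /* 0 is reserved for "inactive" */
static atomic_uint local_epoch[MAX_THREADS];   /* zero-initialized: all inactive */

struct retired { void *ptr; unsigned epoch; };
static struct retired limbo[MAX_RETIRED];      /* single writer's retire list */
static size_t limbo_count;

static void read_enter(int tid) { atomic_store(&local_epoch[tid], atomic_load(&global_epoch)); }
static void read_exit(int tid)  { atomic_store(&local_epoch[tid], 0); }

/* Call only after p has been unlinked from the shared structure. */
static void retire(void *p)
{
    assert(limbo_count < MAX_RETIRED);
    limbo[limbo_count].ptr = p;
    /* Tag with the current epoch and advance it, so later readers pin a newer one. */
    limbo[limbo_count].epoch = atomic_fetch_add(&global_epoch, 1);
    limbo_count++;
}

/* Oldest epoch any active reader is pinned to, or UINT_MAX when none is active. */
static unsigned min_active_epoch(void)
{
    unsigned min = UINT_MAX;
    for (int i = 0; i < MAX_THREADS; i++) {
        unsigned e = atomic_load(&local_epoch[i]);
        if (e != 0 && e < min) min = e;
    }
    return min;
}

/* Frees objects retired before every active reader's epoch; returns the count freed. */
static size_t reclaim(void)
{
    unsigned safe = min_active_epoch();
    size_t kept = 0, freed = 0;
    for (size_t i = 0; i < limbo_count; i++) {
        if (limbo[i].epoch < safe) { free(limbo[i].ptr); freed++; }
        else                       { limbo[kept++] = limbo[i]; }
    }
    limbo_count = kept;
    return freed;
}
```

A reader that entered before the retirement keeps the object alive no matter how many newer readers come and go, which is the memory-retention cost described above.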

#Lock-Free SPSC Queue

#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* T: element type. CAPACITY: a power of two, so h % CAPACITY stays contiguous when h wraps. */
struct spsc {
    alignas(64) atomic_size_t head;   /* producer */
    alignas(64) atomic_size_t tail;   /* consumer */
    alignas(64) T buf[CAPACITY];
};

bool push(struct spsc *q, T v) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == CAPACITY) return false;          /* full */
    q->buf[h % CAPACITY] = v;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);   /* publish the slot */
    return true;
}

bool pop(struct spsc *q, T *out) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (h == t) return false;                     /* empty */
    *out = q->buf[t % CAPACITY];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);   /* hand the slot back */
    return true;
}

When a single producer and a single consumer are guaranteed, not even CAS is needed; release-acquire memory ordering is enough. This is the standard pattern for things like a logging pipe that hands data from an ISR to a task.
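
The queue relies on head and tail being free-running counters that are never reset. A small worked example of that arithmetic, with hypothetical helper names spsc_used and spsc_slot and an assumed capacity of 8:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAPACITY 8u
static_assert((CAPACITY & (CAPACITY - 1)) == 0, "CAPACITY must be a power of two");

/* Items currently queued. Unsigned subtraction is modular, so the result
 * stays correct even after head has wrapped past SIZE_MAX and tail has not. */
static size_t spsc_used(size_t head, size_t tail) { return head - tail; }

/* Buffer slot for a free-running counter. With a power-of-two capacity the
 * mask equals counter % CAPACITY and the slot sequence has no gap at the wrap. */
static size_t spsc_slot(size_t counter) { return counter & (CAPACITY - 1); }
```

With a capacity that is not a power of two, such as 6, the slot sequence would jump at the wrap from SIZE_MAX to 0, because 2^64 is not a multiple of 6. That is the reason for the static_assert.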

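#MPMC: Multi-Producer Multi-Consumer

Concurrent access from multiple producers and consumers needs a more elaborate algorithm. Vyukov's bounded MPMC queue is the best-known example.

struct cell { atomic_size_t sequence; T data; };
struct mpmc {
    alignas(64) atomic_size_t enqueue_pos;
    alignas(64) atomic_size_t dequeue_pos;
    struct cell buf[CAPACITY];
};

Each cell carries a sequence number. An enqueue or dequeue claims its cell by advancing the position counter with CAS, and the sequence number tells it whether the cell is ready. The queue uses no mutex, although a thread that stalls after claiming a cell can hold up the operations queued behind that cell, so it is not lock-free in the strict sense. Folly's MPMCQueue follows a similar per-slot ticket design.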
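
The enqueue and dequeue paths over this layout can be sketched as below. This follows the published Vyukov algorithm as I understand it, with int standing in for T, a tiny capacity for illustration, and mpmc_init, mpmc_enqueue, and mpmc_dequeue as assumed names.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAPACITY 4u
static_assert((CAPACITY & (CAPACITY - 1)) == 0, "CAPACITY must be a power of two");

struct cell { atomic_size_t sequence; int data; };
struct mpmc {
    alignas(64) atomic_size_t enqueue_pos;
    alignas(64) atomic_size_t dequeue_pos;
    struct cell buf[CAPACITY];
};

static void mpmc_init(struct mpmc *q)
{
    for (size_t i = 0; i < CAPACITY; i++)
        atomic_init(&q->buf[i].sequence, i);   /* cell i is ready for position i */
    atomic_init(&q->enqueue_pos, 0);
    atomic_init(&q->dequeue_pos, 0);
}

static bool mpmc_enqueue(struct mpmc *q, int v)
{
    size_t pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
    for (;;) {
        struct cell *c = &q->buf[pos & (CAPACITY - 1)];
        size_t seq = atomic_load_explicit(&c->sequence, memory_order_acquire);
        intptr_t diff = (intptr_t)(seq - pos);
        if (diff == 0) {                       /* cell is free for this position */
            if (atomic_compare_exchange_weak_explicit(&q->enqueue_pos, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                c->data = v;
                atomic_store_explicit(&c->sequence, pos + 1, memory_order_release);
                return true;
            }                                  /* failed CAS refreshed pos: retry */
        } else if (diff < 0) {
            return false;                      /* queue is full */
        } else {
            pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
        }
    }
}

static bool mpmc_dequeue(struct mpmc *q, int *out)
{
    size_t pos = atomic_load_explicit(&q->dequeue_pos, memory_order_relaxed);
    for (;;) {
        struct cell *c = &q->buf[pos & (CAPACITY - 1)];
        size_t seq = atomic_load_explicit(&c->sequence, memory_order_acquire);
        intptr_t diff = (intptr_t)(seq - (pos + 1));
        if (diff == 0) {                       /* cell holds data for this position */
            if (atomic_compare_exchange_weak_explicit(&q->dequeue_pos, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                *out = c->data;
                /* Mark the cell free for the producer one lap ahead. */
                atomic_store_explicit(&c->sequence, pos + CAPACITY, memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;                      /* queue is empty */
        } else {
            pos = atomic_load_explicit(&q->dequeue_pos, memory_order_relaxed);
        }
    }
}
```

The CAS on the position counter only hands out a ticket; the release store to sequence is what makes the cell visible to the other side, which is why a thread stalled between the two steps delays whoever needs that cell next.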

#Lock-Free Lists and Trees

Harris-Michael linked list: insert and delete with CAS; a marked pointer flags logically deleted nodes
Lock-free skip list: complex, but with a large payoff
Lock-free B-tree: still close to a research topic

Complexity climbs quickly and the code is hard to verify. Real systems more often use fine-grained locks combined with striping.

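#Common Pitfalls and Anti-Patterns

⚠️ Lock-free stack that ignores ABA

The simple lock-free stack above is exposed to ABA. A protection mechanism such as tagged pointers or hazard pointers is mandatory.

⚠️ Missing memory order

atomic_store_explicit(&flag, 1, memory_order_relaxed);   /* producer */
/* consumer is not guaranteed to see the data written before the flag */

Without a release-acquire pair, another thread is not guaranteed to observe the data written before the flag. Part 4-08 covers this in detail.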
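
The corrected pairing looks like this. It is a minimal sketch with hypothetical names (payload, ready, publish, try_consume), assuming one producer that publishes once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;         /* plain, non-atomic data */
static atomic_bool ready;   /* publication flag */

/* Producer: write the data first, then publish it with a release store. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: returns true and fills *out only once the flag has been observed. */
static bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    /* The acquire load synchronizes with the release store, so every write
     * the producer made before publishing is visible here. */
    *out = payload;
    return true;
}
```

Downgrading either side to relaxed removes the synchronizes-with edge, and the read of payload becomes a data race even though the flag itself is atomic.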

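⚠️ malloc/free on the critical path

push: malloc(node);   /* no longer lock-free: the allocator may take a lock */

glibc malloc can take internal locks (for example on an arena), so calling it from a lock-free algorithm undermines the progress guarantee. Allocate up front with an object pool and a lock-free free list instead.

⚠️ Assuming the default seq_cst

In C++ (and C11), atomic operations default to seq_cst, the most expensive memory order. Specify relaxed, acquire, release, or acq_rel deliberately, where the algorithm allows it, to realize the performance benefit of lock-free code.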

#Measurement: Measured Results

Results for 4 threads each incrementing a counter 1M times on a 4-core Cortex-A72.

Method                     Latency    Throughput
Mutex (uncontended)          30 ns        13 M/s
Mutex (contended)             2 µs         2 M/s
Spinlock (contended)        200 ns        20 M/s
CAS retry loop              100 ns        40 M/s
atomic_fetch_add (LDADD)     50 ns       320 M/s
SPSC queue                   25 ns       400 M/s

Cases where a single-instruction atomic such as LDADD applies are where lock-free performs best. A CAS retry loop loses throughput as contention pushes the retry rate up, so it has to be measured.
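
One low-cost way to measure the retry rate is to count failed CAS attempts in thread-local counters. The sketch below is an assumption about how such instrumentation might look, not the harness behind the table above; inc_counted and retry_rate are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static atomic_int counter;

/* Thread-local, so the bookkeeping adds no shared-cache-line traffic. */
static _Thread_local uint64_t attempts;
static _Thread_local uint64_t retries;

static void inc_counted(void)
{
    int old = atomic_load_explicit(&counter, memory_order_relaxed);
    attempts++;
    /* Strong CAS on purpose: spurious failures of the weak form would be
     * counted as contention and distort the metric. */
    while (!atomic_compare_exchange_strong_explicit(&counter, &old, old + 1,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed)) {
        retries++;
        attempts++;
    }
}

/* Fraction of CAS attempts by the calling thread that failed. */
static double retry_rate(void)
{
    assert(retries <= attempts);
    return attempts ? (double)retries / (double)attempts : 0.0;
}
```

Each thread reports its own rate after the run; summing retries and attempts across threads gives the system-wide figure to set against the throughput numbers.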

#Summary

Progress guarantees strengthen in the order obstruction-free, lock-free, wait-free.
CAS is the basic primitive of almost every lock-free algorithm.
ABA is a false positive caused by pointer reuse; tagged pointers or hazard pointers resolve it.
Epoch reclamation has lower read overhead than hazard pointers and is the basis of Linux RCU.
An SPSC queue can be implemented with release-acquire alone, without CAS.
Retry rate is the key metric for lock-free performance.

The next part covers Memory Ordering: acquire-release semantics.
