Embedded Performance Engineering · 14/57

Cache Miss 3C Model Analysis: Compulsory, Capacity, Conflict

April 24, 2026 · Hawk · 5 min read

cache miss compulsory capacity conflict 3c

#One-Line Summary

Cache misses fall into the **“3C” categories: Compulsory, Capacity, and Conflict**. Each has a different cause and calls for a different remedy.

#Compulsory Miss (Cold Miss)

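The first access to a cache line always misses, because that line has never been loaded into the cache.

int data[1000];   // freshly allocated, not in the cache yet
sum += data[0];   // ← compulsory miss

These misses are essentially unavoidable, but prefetching can hide their latency.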

for (i = 0; i < N; i++) {
    __builtin_prefetch(&data[i + 16], 0, 0);   // prefetch 16 elements ahead (read access, no temporal reuse)
    sum += data[i];
}

On ARM, the corresponding instruction is PLD (preload data); AArch64 uses PRFM.

#Capacity Miss

When the working set exceeds the cache size, data is repeatedly evicted and reloaded.

// L1 D-cache: 32 KB
int buf[16384];   // 64 KB → does not fit in the cache

for (iter = 0; iter < 100; iter++) {
    for (i = 0; i < 16384; i++) {
        sum += buf[i];   // capacity misses on every pass
    }
}

The remedy is blocking (tiling).

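#define BLOCK 4096   // 4096 ints = 16 KB, fits in L1
for (b = 0; b < 16384; b += BLOCK) {
    for (iter = 0; iter < 100; iter++) {
        for (i = b; i < b + BLOCK; i++) {
            sum += buf[i];
        }
    }
}

All 100 passes over a BLOCK run while that BLOCK is still resident in the cache. The reordering is valid here because the passes are independent of one another.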
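
To make the equivalence concrete, here is a minimal sketch that computes the same reduction both ways. The function names `sum_naive` and `sum_blocked` are hypothetical, chosen only for this illustration; the blocked version also handles a short final block, which the snippet above does not need because 4096 divides 16384.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: every pass streams over the whole buffer. */
long sum_naive(const int *buf, size_t n, int iters)
{
    long sum = 0;
    for (int iter = 0; iter < iters; iter++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
    return sum;
}

/* Blocked: finish all passes over one block before moving to the next. */
long sum_blocked(const int *buf, size_t n, int iters, size_t block)
{
    assert(block > 0);
    long sum = 0;
    for (size_t b = 0; b < n; b += block) {
        size_t end = (b + block < n) ? b + block : n;   /* short last block */
        for (int iter = 0; iter < iters; iter++)
            for (size_t i = b; i < end; i++)
                sum += buf[i];
    }
    return sum;
}
```

Both functions return the same value for any input; only the order in which memory is touched differs, and that order is what decides whether the passes hit in L1.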

#Conflict Miss

Because associativity is limited, lines that map to the same set evict one another once more of them are live than the set has ways.

// 4-way set-associative cache, 64-byte lines
int A[2048];   // 8 KB
int B[2048];   // 8 KB
int C[2048];   // 8 KB

// If A, B, and C map to the same sets
for (i = 0; i < 2048; i++) {
    A[i] += B[i] * C[i];   // A[i], B[i], C[i] compete for the ways of one set
}

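Remedies:

Padding: adjust array sizes to a non-power-of-two
Loop fission: touch fewer variables per loop
Cache line padding: insert padding between variables

// Note: the linker may reorder separate globals; wrap these in one struct to pin the layout
int A[2048];
char pad1[64];   // shifts B by one line, so B[i] lands in a different set than A[i]
int B[2048];
char pad2[64];
int C[2048];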

#Measurement: PMU Events

Cortex-A53 perf events:

Event	Name	Meaning
0x03	L1D_CACHE_REFILL	L1 D-cache misses (refill count)
0x04	L1D_CACHE	L1 D-cache accesses
0x01	L1I_CACHE_REFILL	L1 I-cache misses
0x14	L1I_CACHE	L1 I-cache accesses
0x17	L2D_CACHE_REFILL	L2 D-cache misses
0x16	L2D_CACHE	L2 D-cache accesses

$\text{L1 miss rate} = \frac{\text{L1D\_CACHE\_REFILL}}{\text{L1D\_CACHE}}, \quad \text{L2 miss rate} = \frac{\text{L2D\_CACHE\_REFILL}}{\text{L2D\_CACHE}}$

perf stat -e r03,r04,r17,r16 ./prog

# Good (rule of thumb): L1 miss < 5%, L2 miss < 30%
# Bad: L1 miss > 15%
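
The two ratios are simple to compute from the raw counts that perf prints. A minimal sketch, assuming hypothetical helper names (`miss_rate_percent`, `classify_l1`) and treating the 5% and 15% figures above as rules of thumb rather than hard limits:

```c
#include <assert.h>
#include <stdint.h>

/* Miss rate in percent: 100 * refills / accesses. Returns 0 for an empty sample. */
double miss_rate_percent(uint64_t refills, uint64_t accesses)
{
    if (accesses == 0)
        return 0.0;
    return 100.0 * (double)refills / (double)accesses;
}

typedef enum { L1_GOOD, L1_BORDERLINE, L1_BAD } l1_verdict;

/* Applies the thresholds quoted in the text: below 5% good, above 15% bad. */
l1_verdict classify_l1(double miss_pct)
{
    assert(miss_pct >= 0.0);
    if (miss_pct < 5.0)
        return L1_GOOD;
    if (miss_pct > 15.0)
        return L1_BAD;
    return L1_BORDERLINE;
}
```

For example, feeding in the r03 and r04 counts from one run gives the L1 D-cache miss rate, and the r17 and r16 counts give the L2 rate with the same function.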

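#Telling Cold, Capacity, and Conflict Apart

Symptom	Cause
Slow only on the first run, fast from the second run on	Compulsory (already resolved)
Consistently slow, but fast on small data	Capacity (fix by shrinking the working set)
Slow only at particular strides	Conflict (fix with padding)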

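#Stride Patterns and Conflict

// L1 D = 32 KB, 4-way set-associative, 64-byte lines
// → 128 sets, 4 lines per set
// → addresses repeat the same set every 32 KB / 4 = 8 KB

int A[2048];  // 8 KB: covers every set exactly once
int B[2048];  // 8 KB: if placed 8 KB after A, B[i] maps to the same set as A[i]

for (i = 0; i < 2048; i++) {
    A[i] = B[i];   // ← A[i] and B[i] share a set; with more such streams than ways, every access conflicts
}

malloc does not help either: large allocations are commonly page-aligned, so big buffers tend to sit at round distances such as 8 KB or 64 KB from one another, and this kind of set aliasing is frequent.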
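
The set arithmetic in the comments above can be written out directly. This is a sketch under the same assumed geometry (32 KB, 4-way, 64-byte lines); the names `set_index` and `same_set` are hypothetical, and the code works on plain addresses, ignoring virtual-to-physical translation.

```c
#include <assert.h>
#include <stdint.h>

enum {
    LINE_SIZE  = 64,
    NUM_WAYS   = 4,
    CACHE_SIZE = 32 * 1024,
    NUM_SETS   = CACHE_SIZE / (LINE_SIZE * NUM_WAYS)
};

static_assert(NUM_SETS == 128, "32 KB / (64 B * 4 ways) should give 128 sets");

/* Set index of a byte address: drop the line offset, keep the low set bits. */
unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Nonzero when two addresses compete for the same set. */
int same_set(uintptr_t a, uintptr_t b)
{
    return set_index(a) == set_index(b);
}
```

Two addresses exactly 8 KB apart always land in the same set, while shifting one of them by a single 64-byte line moves it to the neighbouring set, which is why a small pad between arrays is enough to break the aliasing.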

#False Sharing (on SMP)

struct {
    int counter_a;   // used by CPU 0
    int counter_b;   // used by CPU 1
} stats;

When the two variables share a cache line, a write by one CPU invalidates that line in the other CPU's cache. The symptom resembles a conflict miss, but the cause is cache coherence.

The remedy is line padding.

struct {
    int counter_a;
    char pad1[60];   // pushes counter_b onto its own 64-byte line
    int counter_b;
    char pad2[60];
} stats;

In C++17, use alignas(std::hardware_destructive_interference_size) (declared in <new>).
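
In plain C11 the same separation can be expressed with `alignas` from `<stdalign.h>` instead of manual pad arrays. A minimal sketch, assuming a 64-byte line (the `CACHE_LINE` constant and the `padded_stats` name are hypothetical; check the line size of your core):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

struct padded_stats {
    alignas(CACHE_LINE) int counter_a;   /* written by CPU 0 */
    alignas(CACHE_LINE) int counter_b;   /* written by CPU 1 */
};

/* Compile-time checks: each counter starts its own line. */
static_assert(offsetof(struct padded_stats, counter_b) % CACHE_LINE == 0,
              "counter_b must start on a line boundary");
static_assert(sizeof(struct padded_stats) == 2 * CACHE_LINE,
              "struct should span exactly two lines");
```

Letting the compiler insert the padding avoids hand-counted pad sizes, and the static assertions fail the build if a later edit puts the two counters back on one line.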

#Loop Tiling in Practice

/* Avoid: N=1024, 4-byte elements, three 4 MB matrices = 12 MB working set */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

/* Good: 64×64 blocks, working set 3 × 16 KB = 48 KB */
#define TILE 64   /* named TILE so the macro does not clash with matrix B */
for (ii = 0; ii < N; ii += TILE)
    for (jj = 0; jj < N; jj += TILE)
        for (kk = 0; kk < N; kk += TILE)
            for (i = ii; i < ii+TILE; i++)
                for (j = jj; j < jj+TILE; j++)
                    for (k = kk; k < kk+TILE; k++)
                        C[i][j] += A[i][k] * B[k][j];

This is the standard optimization for GEMM (matrix multiply). BLIS, MKL, and OpenBLAS all use multi-level tiling.
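
Tiling changes only the order of the multiply-adds, not the result. The sketch below checks that on flat row-major integer matrices; `matmul_naive` and `matmul_tiled` are hypothetical names for this illustration, and the tiled version assumes the tile size divides `n` (no edge tiles).

```c
#include <assert.h>
#include <stddef.h>

/* Reference: plain i-j-k order on row-major n x n matrices. */
void matmul_naive(size_t n, const int *A, const int *B, int *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            int acc = 0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

/* Tiled: same arithmetic, visited one tile x tile block at a time. */
void matmul_tiled(size_t n, size_t tile, const int *A, const int *B, int *C)
{
    assert(tile > 0 && n % tile == 0);   /* this sketch skips edge tiles */
    for (size_t x = 0; x < n * n; x++)
        C[x] = 0;
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t jj = 0; jj < n; jj += tile)
            for (size_t kk = 0; kk < n; kk += tile)
                for (size_t i = ii; i < ii + tile; i++)
                    for (size_t j = jj; j < jj + tile; j++)
                        for (size_t k = kk; k < kk + tile; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

With integer elements the two results match exactly; with floating point the different summation order can change the last bits, so compare with a tolerance there.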

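#Prefetch: Software vs Hardware

#Hardware Prefetcher

Cortex-A57 and later detect strides automatically: when accesses repeat at a constant interval, the hardware prefetches ahead on its own.

#Software Prefetch

for (i = 0; i < N; i++) {
    __builtin_prefetch(&A[i + 16]);   // 16 elements ahead
    sum += A[i];
}

Set the distance to roughly the cache miss latency divided by the time per loop iteration. If an L1 miss costs 100 cycles and one iteration takes 1 cycle, prefetch about 100 elements ahead.

Too short a distance and the data does not arrive in time; too long and the prefetched data is evicted before it is used.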
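
The distance rule can be written as a small helper. This is a sketch, not a tuned implementation: `prefetch_distance` and `sum_with_prefetch` are hypothetical names, the latency figures must come from your own measurements, and `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Distance in iterations: ceil(miss_latency / cycles_per_iter). */
size_t prefetch_distance(unsigned miss_latency_cycles, unsigned cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    return (miss_latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

/* Sum an array while prefetching `dist` elements ahead. */
long sum_with_prefetch(const int *a, size_t n, size_t dist)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)                         /* stay inside the array */
            __builtin_prefetch(&a[i + dist], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

With the numbers from the text, `prefetch_distance(100, 1)` gives 100; a heavier loop body of 8 cycles per iteration would need only 13. In practice the value is a starting point to refine by measurement.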

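#Impact of Inclusive vs Exclusive Caches

Intel (inclusive):

L2 miss → both L1 and L2 are filled → L1 conflicts are possible

ARM (exclusive):

L2 miss → only L1 is filled; the line does not enter L2
L1 evict → the line moves into L2, which acts as a victim cache

With an exclusive hierarchy, L1 and L2 never hold duplicate copies, so even a small L2 is used efficiently and retains more distinct data. The policy varies by core and generation, so check the documentation for the specific part.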

#Common Mistakes

⚠️ Misconception: “random access is cache-friendly”

for (i = 0; i < N; i++) {
    sum += data[random_index[i]];   // ← random: the prefetcher cannot help
}

Hardware prefetchers only detect strides. Once the data no longer fits in the cache, nearly every random access misses.

⚠️ False sharing inside nested unions and structs

struct {
    atomic_int next_id;   // hot
    char name[16];
    atomic_int counter;   // hot, at offset 20: same 64-byte line as next_id
} thing;

When the two atomics share a line, false sharing occurs on SMP. Keep them on separate lines.

⚠️ Power-of-two array strides

int matrix[1024][1024];   // 4 MB
sum += matrix[i][j];      // walking down a column (i innermost): stride 4096 bytes = page size → TLB misses and conflicts

→ Add padding, for example int matrix[1024][1025] or [1024][1024 + padding].
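
The effect of the extra column is easy to quantify for a column walk. The sketch below counts how many distinct cache sets such a walk touches, assuming the 128-set, 64-byte-line L1 used earlier in this article; `distinct_sets` is a hypothetical helper and the walk is assumed to start at offset 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { LINE = 64, SETS = 128 };   /* assumed L1 geometry: 32 KB, 4-way */

/* Count the distinct sets touched by `rows` accesses spaced `row_bytes` apart. */
unsigned distinct_sets(size_t rows, size_t row_bytes)
{
    unsigned char seen[SETS];
    unsigned count = 0;

    assert(row_bytes > 0);
    memset(seen, 0, sizeof seen);
    for (size_t r = 0; r < rows; r++) {
        size_t set = (r * row_bytes / LINE) % SETS;
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

With 4096-byte rows (1024 ints), all 1024 accesses fall into just 2 sets, so only 8 lines of the cache are usable for the walk. With 4100-byte rows (1025 ints), the same walk spreads over all 128 sets.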

⚠️ Ignoring cold misses

init_huge_array(arr);   // runs once: every line is a cold miss
process(arr);            // small working set: fast

The cold-miss cost of initializing a large array is not negligible. Consider lazy initialization.

#Summary

The 3Cs are Compulsory, Capacity, and Conflict.
Compulsory: hide the latency with prefetch.
Capacity: shrink the working set with blocking/tiling.
Conflict: fix with padding or data rearrangement.
Measure with the PMU's L1D_CACHE_REFILL event.
False sharing is a separate, coherence-driven problem on SMP that looks like a conflict.

Next in the series: Cache Line Optimization.
