Embedded Performance Engineering · 13/57

CPU Cache Basics: L1, L2, L3, Set Associative, Replacement Policy

April 24, 2026 · Hawk · 6 min read

#One-Line Summary

**“Cache = exploiting locality.”** Frequently used data is kept close to the core.

#Memory Hierarchy

Level	Access time	Size (Cortex-A72)	Location
Register	0 cycles	31 × 64-bit (AArch64 general-purpose)	Inside the CPU
L1 Cache	3-4 cycles	48 KB I + 32 KB D	Next to the CPU core
L2 Cache	10-15 cycles	1 MB shared (4-core)	Cluster
L3 Cache (if present)	30-50 cycles	4-8 MB	Shared across the SoC
DRAM	100-300 cycles	GB	External

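Going by the table, each level is roughly 3× slower than the one above it and often about 10× larger. The figure shows the width and latency of each level at a glance.

Figure: L1/L2/L3/DRAM memory hierarchy; each level down is larger and slower.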

#L1 — Split I/D (Harvard)

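L1 I-cache: instructions only (read-only, no write-back)
L1 D-cache: data (read + write)

Figure: L1 split I/D cache; fetch and load/store are handled in parallel.

Because instruction fetch and load/store can proceed in the same cycle, the split design avoids a structural hazard.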

#L2/L3 — Unified

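From L2 upward, instructions and data share one unified cache. The relationship to the level above is either inclusive, exclusive, or neither.

Policy	Meaning	Used by
Inclusive	Data in L1 is also in L2	Intel (snoop efficiency)
Exclusive	Data in L1 is not in L2	Some ARM Cortex-A designs (capacity efficiency)
NINE (Non-Inclusive Non-Exclusive)	Data in L1 may or may not be in the outer level; no guarantee	Recent Intel L3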

#Cache Line

Typical line sizes:
64 bytes (Cortex-A, x86)
32 bytes (Cortex-M7)
16 bytes (vendor-added flash caches on some Cortex-M4 MCUs)

Even when the CPU reads a single byte, the whole line is fetched, which is how the cache exploits spatial locality.

int arr[1024];
arr[0];   // 64-byte line fetch: arr[0]..arr[15] enter the cache (4-byte int, line-aligned arr)
arr[1];   // ← same line, hit!
arr[16];  // ← different line, miss
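As a rough sketch of why `arr[1]` hits and `arr[16]` misses: the line an address falls in is simply the address divided by the line size. The helper names `line_index` and `same_cache_line` are made up for this illustration, and the 64-byte line size is an assumption to adjust per core.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed line size; 64 bytes is typical for Cortex-A and x86. */
#define CACHE_LINE_BYTES 64u

/* Number of the line-sized memory block that contains the address. */
uintptr_t line_index(const void *p)
{
    return (uintptr_t)p / CACHE_LINE_BYTES;
}

/* True when both addresses fall inside the same cache line. */
bool same_cache_line(const void *a, const void *b)
{
    return line_index(a) == line_index(b);
}
```

With a 64-byte-aligned `int arr[1024]` and 4-byte `int`, `same_cache_line(&arr[0], &arr[15])` holds while `same_cache_line(&arr[0], &arr[16])` does not.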

#Direct Mapped Cache

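Address bits:

[Tag] [Index] [Offset]
18 8 6 (32-bit address, 16 KB cache, 64-byte line)

The 6 offset bits select a byte within the 64-byte line, and the 8 index bits select one of 256 lines (256 × 64 B = 16 KB). Each memory address maps to exactly one cache line.

addr 0x1000 → line 64
addr 0x5000 → line 64 (conflict! evicts 0x1000)
addr 0x9000 → line 64 (conflict!)

Conflict misses are frequent. The hardware is simple, but the hit rate is low.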
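The mapping can be checked with a few shifts and masks. This is a minimal sketch assuming the 8-bit index and 6-bit offset shown above (256 lines of 64 bytes) and a 32-bit address; `cache_addr_t` and `split_address` are hypothetical names.

```c
#include <stdint.h>

#define OFFSET_BITS 6u   /* 64-byte line */
#define INDEX_BITS  8u   /* 256 lines    */

typedef struct {
    uint32_t tag;     /* identifies which memory block occupies the line */
    uint32_t index;   /* which cache line the address maps to            */
    uint32_t offset;  /* byte position inside the line                   */
} cache_addr_t;

/* Split a 32-bit address into tag, index, and offset fields. */
cache_addr_t split_address(uint32_t addr)
{
    cache_addr_t f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1u);
    f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}
```

For 0x1000, 0x5000, and 0x9000 the index is 64 in every case while the tags are 0, 1, and 2, which is exactly the conflict pattern listed above.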

#N-Way Set Associative

[Tag][Index][Offset]
Index → set (N ways per set)

Set 0 (16 KB, 4-way, 64-byte line → 64 sets, tag = addr >> 12):
   way 0: [tag=0x1, data=...]
   way 1: [tag=0x5, data=...]
   way 2: [tag=0x9, data=...]
   way 3: [tag=...]

On Cortex-A72, the L1 D-cache is 2-way set associative and the L2 is 16-way.

addr 0x1000 → set X, way 0
addr 0x5000 → set X, way 1   ← coexist!

The hit rate goes up, and so does hardware complexity: every lookup compares N tags in parallel.
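To see the effect on the 0x1000 / 0x5000 / 0x9000 pattern, here is a small software model. It is a sketch under stated assumptions, not a description of any real core: 64-byte lines, round-robin replacement chosen only for brevity, and hypothetical names (`cache_model_t`, `cache_init`, `cache_access`, `count_misses`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u
#define MAX_SETS   256u
#define MAX_WAYS   4u

typedef struct {
    uint32_t num_sets;                 /* power of two, <= MAX_SETS      */
    uint32_t num_ways;                 /* 1 = direct mapped, <= MAX_WAYS */
    bool     valid[MAX_SETS][MAX_WAYS];
    uint32_t tag[MAX_SETS][MAX_WAYS];
    uint32_t next_victim[MAX_SETS];    /* round-robin pointer per set    */
} cache_model_t;

void cache_init(cache_model_t *c, uint32_t num_sets, uint32_t num_ways)
{
    assert(num_sets > 0u && num_sets <= MAX_SETS);
    assert((num_sets & (num_sets - 1u)) == 0u);
    assert(num_ways > 0u && num_ways <= MAX_WAYS);
    memset(c, 0, sizeof *c);
    c->num_sets = num_sets;
    c->num_ways = num_ways;
}

/* Returns true on a hit; on a miss, fills the way chosen round-robin. */
bool cache_access(cache_model_t *c, uint32_t addr)
{
    uint32_t block = addr / LINE_BYTES;
    uint32_t set   = block % c->num_sets;
    uint32_t tag   = block / c->num_sets;

    for (uint32_t w = 0; w < c->num_ways; w++) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            return true;
        }
    }
    uint32_t victim = c->next_victim[set];
    c->valid[set][victim] = true;
    c->tag[set][victim]   = tag;
    c->next_victim[set]   = (victim + 1u) % c->num_ways;
    return false;
}

/* Counts the misses produced by a sequence of addresses. */
uint32_t count_misses(cache_model_t *c, const uint32_t *addrs, uint32_t n)
{
    uint32_t misses = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (!cache_access(c, addrs[i])) {
            misses++;
        }
    }
    return misses;
}
```

Running the sequence 0x1000, 0x5000, 0x9000 twice through a 16 KB direct-mapped model (`cache_init(&c, 256, 1)`) gives 6 misses, while the same 16 KB organized as 4-way (`cache_init(&c, 64, 4)`) gives 3 misses followed by 3 hits. Setting `num_sets` to 1 turns the same model into a tiny fully associative cache.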

#Fully Associative

Any cache line can hold any memory block. Small parts of the Translation Lookaside Buffer (TLB), such as the micro-TLB, are typically fully associative.

Lookup: compare tag with ALL entries; expensive, but zero conflict misses

It is used only when the capacity is small (8-32 entries).

#Replacement Policy

#LRU (Least Recently Used)

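Conceptually, each way stores a timestamp of its last access. On eviction, the oldest one is discarded.

Way:  0  1  2  3
Time: 5  8  3  10   → on the next miss, way 2 (the oldest) is evicted

Exact LRU is practical up to about 4 ways; beyond that, tracking the full access order gets expensive.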
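The victim choice in the example reduces to finding the smallest timestamp. This is a minimal sketch of that conceptual model only; real hardware usually tracks ordering bits rather than timestamps, counter wraparound is ignored here, and `lru_victim` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the way with the smallest (oldest) timestamp.
 * n_ways must be at least 1; ties resolve to the lowest way number. */
size_t lru_victim(const uint32_t last_access[], size_t n_ways)
{
    size_t victim = 0;
    for (size_t w = 1; w < n_ways; w++) {
        if (last_access[w] < last_access[victim]) {
            victim = w;
        }
    }
    return victim;
}
```

For the timestamps `{5, 8, 3, 10}` the function returns 2, matching the example.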

#Pseudo-LRU (PLRU): the Standard

A binary tree approximates LRU with N-1 bits per set (the tree is log2 N levels deep).

4-way PLRU (3 bits):
       [root: 0/1]
        /        \
   [0/1]         [0/1]
   /    \        /    \
 way 0  way 1  way 2  way 3

Each node's bit points away from the most recent access, and eviction walks down the tree following those bits.

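It is only an approximation, but it is simple and fast. PLRU and similar approximations are common in commercial cores; the exact policy depends on the core and the cache level (some ARM caches are pseudo-random instead), so check the Technical Reference Manual.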
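The 3-bit tree can be written out directly. This is an illustrative sketch of one common tree-PLRU convention, not any specific core's implementation: a node bit of 0 means "evict from the left child next" and 1 means "right". The names `plru4_t`, `plru4_touch`, and `plru4_victim` are hypothetical.

```c
#include <assert.h>

/* bit 0: root, bit 1: left node (ways 0/1), bit 2: right node (ways 2/3). */
typedef struct {
    unsigned bits;
} plru4_t;

/* Record an access: flip every node on the path to point away from `way`. */
void plru4_touch(plru4_t *s, unsigned way)
{
    assert(way < 4u);
    if (way < 2u) {
        s->bits |= 1u;                  /* root now points right       */
        if (way == 0u) {
            s->bits |= 2u;              /* left node points to way 1   */
        } else {
            s->bits &= ~2u;             /* left node points to way 0   */
        }
    } else {
        s->bits &= ~1u;                 /* root now points left        */
        if (way == 2u) {
            s->bits |= 4u;              /* right node points to way 3  */
        } else {
            s->bits &= ~4u;             /* right node points to way 2  */
        }
    }
}

/* Follow the bits from the root down to the way to evict. */
unsigned plru4_victim(const plru4_t *s)
{
    if ((s->bits & 1u) == 0u) {
        return (s->bits & 2u) ? 1u : 0u;
    }
    return (s->bits & 4u) ? 3u : 2u;
}
```

After touching ways 0, 1, 2, 3 in order, the victim is way 0, the same as exact LRU. Touching way 0 again makes PLRU pick way 2, whereas exact LRU would pick way 1: that gap is the price of storing 3 bits instead of the full access order.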

#Random

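The victim is picked either pseudo-randomly or round-robin, which needs almost no per-set state. It is often chosen in automotive certification work where WCET must be bounded, because it avoids LRU history that is hard to track across execution paths.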

#Write Policy

#Write-Through

Write → cache and memory are both updated immediately

The advantage is simple coherence. The drawback is heavy write traffic.

#Write-Back

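Write → cache only (dirty bit set)
Eviction or flush → memory is updated

The advantage is low write traffic. The drawback is that coherence with DMA and across SMP cores becomes more complex.

ARM L1 D-caches typically run as write-back with write-allocate; the policy is selected per memory region through memory attributes.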
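The traffic difference is easy to see with a one-line model. This is a sketch under simplifying assumptions (a single cached line, every store hits); `line_model_t`, `line_store`, and `line_evict` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     write_back;   /* false = write-through                 */
    bool     dirty;        /* only meaningful in write-back mode    */
    uint32_t mem_writes;   /* number of writes that reached memory  */
} line_model_t;

/* A store that hits the cached line. */
void line_store(line_model_t *l)
{
    if (l->write_back) {
        l->dirty = true;       /* defer the memory update         */
    } else {
        l->mem_writes++;       /* write-through: update memory now */
    }
}

/* Eviction or explicit clean: a dirty line is written back once. */
void line_evict(line_model_t *l)
{
    if (l->write_back && l->dirty) {
        l->mem_writes++;
        l->dirty = false;
    }
}
```

Sixteen stores followed by an eviction cost 16 memory writes in write-through mode and 1 in write-back mode. The same model also shows the DMA hazard: until `line_evict` runs, memory does not contain the data the CPU wrote.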

#Write-Allocate vs No-Write-Allocate

*ptr = 42;   // write miss

Write-allocate: on a miss, fetch the line, then write into the cache
No-write-allocate: on a miss, bypass the cache and write straight to memory

For streaming writes (data written once and never read back), no-write-allocate is more efficient.

#Cortex-M Cache

MCU	Cache
Cortex-M0/M3/M4	None in the core (TCM or direct execution from flash)
Cortex-M7	L1 I + L1 D (optional, enabled by software)
Cortex-M33	optional
Cortex-M55	optional

Cortex-M7 cache enable:

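SCB_EnableICache();
SCB_EnableDCache();

Cache maintenance is mandatory when DMA is in use: call SCB_CleanDCache_by_Addr before the DMA engine reads a buffer the CPU wrote, and SCB_InvalidateDCache_by_Addr before the CPU reads a buffer the DMA engine wrote.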

#TCM (Tightly Coupled Memory)

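__attribute__((section(".dtcm"))) uint8_t fast_buf[4096];
__attribute__((section(".itcm"))) void critical_isr(void) { ... }

TCM is deterministic memory with no cache misses. Cores such as Cortex-M7 and Cortex-R52 typically integrate 32-256 KB of it, and it is used for critical loops in automotive and avionics systems. The .dtcm and .itcm section names must match the project's linker script.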

#Measurement: Cache Miss Rate

perf stat -e cache-references,cache-misses ./prog

# miss-rate = misses / references
# < 5%: good
# 10-20%: average
# > 30%: a problem

PMU event:

L1D_CACHE L1D_CACHE_REFILL (Cortex-A)
L1I_CACHE L1I_CACHE_REFILL
L2D_CACHE L2D_CACHE_REFILL
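The same ratio applies to each PMU pair, for example L1D_CACHE_REFILL divided by L1D_CACHE. A minimal helper, assuming the two counts have already been read from perf or the PMU (`miss_rate_percent` is a hypothetical name):

```c
#include <stdint.h>

/* Miss rate in percent; returns 0 when there were no references at all. */
double miss_rate_percent(uint64_t misses, uint64_t references)
{
    if (references == 0u) {
        return 0.0;
    }
    return 100.0 * (double)misses / (double)references;
}
```

For example, 50 refills out of 1000 accesses is 5%.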

#Common Mistakes

⚠️ Ignoring cache lines when traversing a large array

struct Big { int id; char name[56]; int value; };   // 4 + 56 + 4 = 64 bytes
for (int i = 0; i < N; i++) sum += arr[i].value;
// 64-byte struct → one element per line → only 4 useful bytes out of every 64 fetched

→ Use SoA (Structure of Arrays), or move value into its own array.
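A sketch of the two layouts side by side. The struct is sized to exactly 64 bytes (assuming a 4-byte `int`), and `struct big_soa`, `sum_aos`, `sum_soa`, and `lines_touched` are names invented for this example. `lines_touched` assumes the array starts on a line boundary and that no field straddles two lines.

```c
#include <stddef.h>

#define N_ITEMS 1024

/* Array of Structures: each element fills one 64-byte line. */
struct big { int id; char name[56]; int value; };

/* Structure of Arrays: the hot field is packed contiguously. */
struct big_soa {
    int  id[N_ITEMS];
    char name[N_ITEMS][56];
    int  value[N_ITEMS];
};

long sum_aos(const struct big *arr, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i].value;        /* stride: sizeof(struct big) bytes */
    }
    return sum;
}

long sum_soa(const struct big_soa *s, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += s->value[i];         /* stride: sizeof(int) bytes */
    }
    return sum;
}

/* Cache lines touched when reading one field from n elements that sit
 * stride_bytes apart. */
size_t lines_touched(size_t n, size_t stride_bytes, size_t line_bytes)
{
    if (stride_bytes >= line_bytes) {
        return n;                   /* every element is on its own line */
    }
    return (n * stride_bytes + line_bytes - 1u) / line_bytes;
}
```

Both functions return the same sum, but for 1024 elements and 64-byte lines the AoS loop touches 1024 lines while the SoA loop touches 64.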

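⚠️ Stride access

for (j = 0; j < W; j++)
    for (i = 0; i < H; i++)
        sum += matrix[i][j];   // ← column-order walk over a row-major array
                               // stride is one full row: a miss on nearly every access once a row exceeds a line

→ Walk in row-major order: make j the inner loop so that consecutive matrix[i][j] accesses are adjacent in memory.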
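Both loop orders, written out for comparison. This is an illustrative sketch with an assumed 64×64 `int` matrix; `ROWS`, `COLS`, `sum_column_order`, and `sum_row_major` are hypothetical names. The two functions compute the same result and differ only in memory access order.

```c
#define ROWS 64
#define COLS 64

/* Inner loop over i: consecutive accesses are COLS * sizeof(int) bytes apart. */
long sum_column_order(int m[ROWS][COLS])
{
    long sum = 0;
    for (int j = 0; j < COLS; j++) {
        for (int i = 0; i < ROWS; i++) {
            sum += m[i][j];
        }
    }
    return sum;
}

/* Inner loop over j: consecutive accesses are sizeof(int) bytes apart,
 * so one fetched line serves several iterations. */
long sum_row_major(int m[ROWS][COLS])
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            sum += m[i][j];
        }
    }
    return sum;
}
```

With 4-byte `int` and 64-byte lines, the row-major version uses all 16 values of each fetched line before moving on, while the column-order version uses one value per line and has to come back for the rest later.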

⚠️ Not invalidating the cache after DMA

HAL_UART_Receive_DMA(&huart, buf, len);
// DMA has finished, but
printf("%s", buf);   // ← the cache still shows stale data

Call SCB_InvalidateDCache_by_Addr(buf, len); after the transfer completes and before reading buf.
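Maintenance operations act on whole lines, so it helps to know which lines a buffer really covers. The following is a host-testable sketch assuming the 32-byte Cortex-M7 D-cache line; `cache_range_t` and `cache_line_range` are hypothetical helpers, not part of CMSIS.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_BYTES 32u   /* Cortex-M7 D-cache line size */

typedef struct {
    uintptr_t start;   /* rounded down to a line boundary  */
    size_t    size;    /* rounded up to whole cache lines  */
} cache_range_t;

/* Expand [addr, addr + len) to the cache lines it overlaps. */
cache_range_t cache_line_range(uintptr_t addr, size_t len)
{
    cache_range_t r;
    uintptr_t mask = (uintptr_t)DCACHE_LINE_BYTES - 1u;
    uintptr_t end  = (addr + len + mask) & ~mask;

    r.start = addr & ~mask;
    r.size  = (size_t)(end - r.start);
    return r;
}
```

A 40-byte buffer at 0x20000010 overlaps two lines, 0x20000000 through 0x2000003F. If the computed range is larger than the buffer, invalidating it also discards whatever neighbouring variables share those lines, which is why DMA buffers are usually given their own aligned lines. On target, the range would be passed to the CMSIS call as `(void *)r.start` and `(int32_t)r.size`; the exact parameter types vary slightly between CMSIS versions.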

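⚠️ Ignoring variable alignment after enabling the cache

A DMA buffer must be cache-line aligned, and its size should be a multiple of the line size so that no other variable shares its lines. On Cortex-M7 that means 32-byte alignment.

__attribute__((aligned(32))) uint8_t dma_buf[256];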
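When the payload size is not already a multiple of 32, the storage can be padded at compile time. This is a sketch using the GCC/Clang attribute syntax from the example above; `ROUND_UP_TO_LINE` and `rx_dma_buf` are hypothetical names, and the 200-byte payload is an arbitrary illustration.

```c
#include <stdint.h>

#define DCACHE_LINE_BYTES 32u

/* Round n up to a whole number of cache lines. */
#define ROUND_UP_TO_LINE(n) \
    (((n) + DCACHE_LINE_BYTES - 1u) & ~(DCACHE_LINE_BYTES - 1u))

/* 200 payload bytes requested; storage is padded to 224 bytes (7 lines),
 * so the buffer owns every cache line it touches. */
__attribute__((aligned(32))) uint8_t rx_dma_buf[ROUND_UP_TO_LINE(200u)];
```

With both the start address and the size line-aligned, a clean or invalidate over the buffer cannot affect unrelated data.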

#Summary

The hierarchy runs from split L1 (I + D) through unified L2/L3 to DRAM.
A cache line is 64 bytes on Cortex-A and 32 bytes on Cortex-M7.
N-way set associative caches on Cortex-A typically use 2 to 16 ways.
Replacement policies are LRU, PLRU, and Random (for WCET).
Write-back + write-allocate is the standard.
On Cortex-M7, cache maintenance is mandatory whenever the cache and DMA are used together.

The next post covers cache miss analysis with the 3C model.
