Embedded Performance Engineering · 11/57

Branch Prediction Analysis: Static, 2-bit, BTB, BHT, and Mispredict Cost

April 24, 2026 · Hawk · 5 min read

#One-line summary

Branch prediction is the trick that keeps the pipeline full. On a mispredict, every instruction fetched down the wrong path is flushed.

#Mispredict cost

Pipeline depth: 8
Branch resolved at: stage 5
Misprediction → 5-cycle flush → 5 cycles lost

CPU	Pipeline stages	Mispredict penalty
Cortex-M0+	2	1 cycle
Cortex-M3/M4	3	2 cycles
Cortex-M7	6	4-5 cycles
Cortex-A53	8	8 cycles
Cortex-A72	15	15 cycles
Intel Skylake	14-19	15+ cycles

The deeper the pipeline, the more a misprediction costs. On a modern CPU a mispredict costs roughly as much as an L1 cache miss that hits in L2, on the order of 10 to 20 cycles.
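To put the penalty column in perspective, the usual back-of-the-envelope model adds the expected stall per instruction to the base CPI. The sketch below assumes that simple additive model; the function name and the sample inputs are illustrative, not measurements.

```c
#include <assert.h>

/* Average cycles per instruction once mispredict stalls are added.
 * base_cpi        : CPI with perfect prediction (assumed input)
 * branch_frac     : fraction of instructions that are branches, 0..1
 * mispredict_rate : fraction of branches that are mispredicted, 0..1
 * penalty_cycles  : flush cost per mispredict, e.g. from the table above */
double effective_cpi(double base_cpi, double branch_frac,
                     double mispredict_rate, double penalty_cycles)
{
    assert(branch_frac >= 0.0 && branch_frac <= 1.0);
    assert(mispredict_rate >= 0.0 && mispredict_rate <= 1.0);
    return base_cpi + branch_frac * mispredict_rate * penalty_cycles;
}
```

With 20% branches and a 5% miss rate, `effective_cpi(1.0, 0.2, 0.05, 8.0)` comes to about 1.08 for the 8-cycle Cortex-A53 row, and about 1.15 with a 15-cycle penalty.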

#Static prediction: BTFNT

Backward Taken, Forward Not-Taken.

for (int i = 0; i < N; i++) {   // backward branch: predict taken
    // loop body
}

if (rare_error) {                // forward branch: predict not-taken
    handle_error();
}

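Loops are mostly taken and error paths are mostly not taken, so the statistical hit rate is around 70%.

The ARM Cortex-M0 and M3 are purely static, with no history tables. A static rule needs no history: the sign of the PC-relative offset in the branch instruction already says whether the branch goes backward or forward.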
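The BTFNT rule fits in one comparison. The following is a minimal sketch of it as a software model, assuming the branch displacement is available as a signed offset; the function names and the trace format are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* BTFNT: predict taken when the branch jumps to a lower address.
 * offset is the signed PC-relative displacement of the branch. */
bool btfnt_predict_taken(int32_t offset)
{
    return offset < 0;   /* backward (loop back-edge) => taken */
}

/* Count how many outcomes in a trace BTFNT gets right.
 * offsets[i] is the displacement, taken[i] the real outcome. */
unsigned btfnt_hits(const int32_t *offsets, const bool *taken, size_t n)
{
    assert((offsets != NULL && taken != NULL) || n == 0);
    unsigned hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (btfnt_predict_taken(offsets[i]) == taken[i])
            hits++;
    }
    return hits;
}
```

A trace of three back-edge executions (taken, taken, not taken at loop exit) followed by two untaken forward branches gives 4 hits out of 5: the only miss is the loop exit.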

#Dynamic: 1-bit predictor

last_branch_taken → 1
last_branch_not_taken → 0
predict = last result

for (int i = 0; i < 10; i++) { ... }   // 9 taken + 1 not-taken

Each pass through the loop mispredicts twice: once at the exit (taken → not-taken) and once when the loop starts again (not-taken → taken). Accuracy is 70-80%.
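The double miss is easy to reproduce in a small model. This sketch assumes the loop branch is a bottom-tested back-edge, so each run produces (iters - 1) taken outcomes followed by one not-taken; the helper names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 1-bit predictor: predict whatever the branch did last time.
 * Returns the number of mispredictions over the outcome trace. */
unsigned mispredicts_1bit(const bool *taken, size_t n, bool initial_state)
{
    assert(taken != NULL || n == 0);
    bool state = initial_state;
    unsigned misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (state != taken[i])
            misses++;
        state = taken[i];   /* remember only the latest outcome */
    }
    return misses;
}

/* Fill a trace for a loop back-edge: (iters - 1) taken, then 1 not-taken,
 * repeated `runs` times. out must hold iters * runs entries. */
void fill_loop_trace(bool *out, size_t iters, size_t runs)
{
    for (size_t r = 0; r < runs; r++)
        for (size_t i = 0; i < iters; i++)
            out[r * iters + i] = (i + 1 < iters);
}
```

For three runs of the 10-iteration loop, starting in the taken state, this counts 5 mispredicts: one in the first run (the exit) and two in each later run (re-entry and exit).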

#2-bit saturating counter: the standard

[Strongly Not Taken] ←→ [Weakly Not Taken] ←→ [Weakly Taken] ←→ [Strongly Taken]
       00                     01                  10                 11

Predict:
  00, 01 → Not Taken
  10, 11 → Taken

Update on actual outcome:
  Taken → counter++  (saturate at 11)
  Not Taken → counter--  (saturate at 00)

Each pass through the loop now mispredicts only once, at the exit, because a single exception does not flip the prediction. Accuracy is 85-95%.
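The state machine above translates almost line for line into code. The sketch below is one possible software model, with the counter encoded as 0 (strongly not taken) through 3 (strongly taken); the loop driver and its parameters are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counter: 0 = strongly not taken ... 3 = strongly taken. */
bool counter2_predict(uint8_t c)
{
    return c >= 2;
}

uint8_t counter2_update(uint8_t c, bool taken)
{
    assert(c <= 3);
    if (taken)
        return (uint8_t)(c < 3 ? c + 1 : 3);   /* saturate at 11 */
    return (uint8_t)(c > 0 ? c - 1 : 0);       /* saturate at 00 */
}

/* Mispredictions on a loop back-edge that is taken (iters - 1) times and
 * then falls through, with the whole loop executed `runs` times. */
unsigned mispredicts_2bit_loop(unsigned iters, unsigned runs, uint8_t start)
{
    uint8_t c = start;
    unsigned misses = 0;
    for (unsigned r = 0; r < runs; r++) {
        for (unsigned i = 0; i < iters; i++) {
            bool taken = (i + 1 < iters);
            if (counter2_predict(c) != taken)
                misses++;
            c = counter2_update(c, taken);
        }
    }
    return misses;
}
```

Starting from weakly taken, `mispredicts_2bit_loop(10, 3, 2)` returns 3: one miss per run, at the loop exit, because the single not-taken outcome only drops the counter from 11 to 10.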

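#BHT (Branch History Table)

A table of 2-bit counters indexed by the branch PC.

PC (12-bit hash) → index → 2-bit counter

Typical sizes are 1k to 16k entries. For the ARM Cortex-A53, the spec table later in this article lists a 6144-entry BHT; the 256-entry, 4-way set-associative structure is its BTB.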
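As a rough model, a BHT is just an array of those counters plus an index function. The sketch below assumes 4-byte aligned instructions and a trivial hash (low PC bits); real cores use their own, usually undocumented, hash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BHT_BITS    12u                 /* 12-bit index, as in the text */
#define BHT_ENTRIES (1u << BHT_BITS)    /* 4096 counters */

typedef struct {
    uint8_t counter[BHT_ENTRIES];       /* each holds 0..3 */
} bht_t;

void bht_init(bht_t *t)
{
    memset(t->counter, 1, sizeof t->counter);   /* start weakly not taken */
}

/* Hypothetical hash: drop the 2 alignment bits, keep the low 12 bits. */
uint32_t bht_index(uint32_t pc)
{
    return (pc >> 2) & (BHT_ENTRIES - 1u);
}

bool bht_predict(const bht_t *t, uint32_t pc)
{
    return t->counter[bht_index(pc)] >= 2;
}

void bht_update(bht_t *t, uint32_t pc, bool taken)
{
    uint8_t *c = &t->counter[bht_index(pc)];
    assert(*c <= 3);
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

Because there is no tag, two branches whose PCs hash to the same index share one counter. With this hash, 0x8000 and 0xC000 alias, which is the kind of interference a larger table or a better hash reduces.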

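#BTB (Branch Target Buffer)

Caches the target address along with the branch's own address.

PC → BTB entry:

{ target_addr, predict_bits }

If the BTB hits when the branch instruction is fetched, fetch is redirected to the target immediately. When the prediction is correct, not a single cycle is lost.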
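A direct-mapped software model makes the lookup concrete. This is a sketch under assumptions: 256 entries, the full PC as tag, 4-byte instructions, and allocation only on taken branches. The entry fields follow the `{ target_addr, predict_bits }` layout above; everything else is illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BTB_ENTRIES 256u

typedef struct {
    bool     valid;
    uint32_t tag;            /* PC of the branch that owns this entry */
    uint32_t target_addr;
    uint8_t  predict_bits;   /* 2-bit counter, 0..3 */
} btb_entry_t;

typedef struct {
    btb_entry_t e[BTB_ENTRIES];
} btb_t;

uint32_t btb_index(uint32_t pc)
{
    return (pc >> 2) % BTB_ENTRIES;
}

/* At fetch: returns true and the cached target on a hit predicted taken,
 * otherwise false and the fall-through address. */
bool btb_lookup(const btb_t *b, uint32_t pc, uint32_t *next_pc)
{
    assert(b != NULL && next_pc != NULL);
    const btb_entry_t *e = &b->e[btb_index(pc)];
    if (e->valid && e->tag == pc && e->predict_bits >= 2) {
        *next_pc = e->target_addr;
        return true;
    }
    *next_pc = pc + 4;
    return false;
}

/* At branch resolution: train the entry with the real outcome. */
void btb_update(btb_t *b, uint32_t pc, uint32_t target, bool taken)
{
    btb_entry_t *e = &b->e[btb_index(pc)];
    if (!e->valid || e->tag != pc) {
        if (!taken)
            return;                      /* allocate only on taken branches */
        e->valid = true;
        e->tag = pc;
        e->target_addr = target;
        e->predict_bits = 2;             /* weakly taken */
        return;
    }
    if (taken) {
        e->target_addr = target;
        if (e->predict_bits < 3) e->predict_bits++;
    } else if (e->predict_bits > 0) {
        e->predict_bits--;
    }
}
```

The first execution of a branch misses and falls through to `pc + 4`; after one taken outcome is recorded, the next fetch of the same PC returns the cached target directly.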

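#Global history: two-level adaptive (gshare)

last 8 branches taken? → 8-bit history register

index = (PC ^ history) mod table_size

This learns correlations between branches. In a pattern such as if (a) {} if (b) {} if (a && b) {}, the outcome of the third branch is predicted from the outcomes of the first two.

The Cortex-A72 uses a tournament predictor: it keeps both a local and a global predictor and chooses between them dynamically.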
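The index formula above is enough to build a working model. The sketch below assumes an 8-bit history, a 256-entry table of 2-bit counters, and 4-byte instructions; the type and function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GSHARE_ENTRIES 256u   /* table_size; matches the 8-bit history */

typedef struct {
    uint8_t history;                   /* last 8 outcomes, newest in bit 0 */
    uint8_t counter[GSHARE_ENTRIES];   /* 2-bit counters, 0..3 */
} gshare_t;

void gshare_init(gshare_t *g)
{
    g->history = 0;
    memset(g->counter, 1, sizeof g->counter);   /* start weakly not taken */
}

/* index = (PC ^ history) mod table_size */
uint32_t gshare_index(const gshare_t *g, uint32_t pc)
{
    return ((pc >> 2) ^ g->history) % GSHARE_ENTRIES;
}

/* Predict, then train on the real outcome. Returns true on a mispredict. */
bool gshare_step(gshare_t *g, uint32_t pc, bool taken)
{
    uint8_t *c = &g->counter[gshare_index(g, pc)];
    assert(*c <= 3);
    bool mispredicted = ((*c >= 2) != taken);
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    g->history = (uint8_t)((g->history << 1) | (taken ? 1u : 0u));
    return mispredicted;
}
```

Feeding one branch a strictly alternating taken / not-taken pattern shows the benefit: a lone 2-bit counter mispredicts that pattern at least half the time, while this model stops mispredicting after a short warm-up, because the two history values select two different counters.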

#Indirect branch: function pointers and vtables

void (*handler)(int) = handlers[type];
handler(arg);    // indirect branch: the target varies

A BTB entry remembers only the most recent target. If the target keeps changing, mispredicts become frequent.

The fix is an indirect branch predictor (separate hardware on the Cortex-A72).
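The effect of a single remembered target can be modeled in a few lines. This sketch assumes one BTB entry for the call site and represents each target by a non-negative handler index; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Model one BTB entry that stores only the last target of an indirect call.
 * targets[i] is the handler index chosen on call i (must be >= 0). */
unsigned last_target_mispredicts(const int *targets, size_t n)
{
    assert(targets != NULL || n == 0);
    int last = -1;        /* -1: entry empty, so the first call always misses */
    unsigned misses = 0;
    for (size_t i = 0; i < n; i++) {
        assert(targets[i] >= 0);
        if (targets[i] != last)
            misses++;
        last = targets[i];
    }
    return misses;
}
```

A call site that always dispatches to the same handler misses once, on the cold entry. One that alternates between two handlers misses on every call, which is the case a history-based indirect predictor is meant to fix.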

#Return address stack

function_a();   // call → push the return address onto the stack
/* ... */
return;         // pop the stack → predicted return target

The core keeps a dedicated return stack (8 to 16 entries). Function calls and returns are predicted almost perfectly.

Recursion deeper than the stack overflows it and causes mispredicts.

#Cortex-M3/M4: limited prediction

; Cortex-M3 prediction
beq label       ; static prediction only (BTFNT by default)
bx lr           ; return: *no prediction*, always flushes

The Cortex-M3 relies on its prefetch buffer for branches. A mispredict costs 2 cycles; a correct prediction costs 0 cycles.

#Cortex-A53: branch predictor spec

Item	Spec
BTB	256 entries, 4-way set associative
BHT	6144 entries
Return stack	8 entries
Predictions per cycle	1
Mispredict penalty	8 cycles

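#Measurement: PMU events

Event	Meaning
`0x12` BR_PRED	predictable branches speculatively executed
`0x10` BR_MIS_PRED	branches mispredicted or not predicted
BR_RETURN_MIS_PRED	mispredicted returns; implementation-defined, not in the Arm common event list (where `0x18` is L2D_CACHE_WB), so take the number from the core's TRM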

# Linux perf
perf stat -e branches,branch-misses ./prog
# branch miss rate = branch-misses / branches
# typical code: < 5%
# well-tuned code: < 1%
# branch-heavy worst case: 15-20%
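A small kernel with a data-dependent branch is a convenient thing to put under `perf stat`. The sketch below is one such kernel plus helpers to prepare random and sorted input; all names are hypothetical, and the expectation that sorted input mispredicts far less is the usual behavior, not a measured result from this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Counts elements >= threshold. The `if` is the branch under test:
 * on random data its outcome is unpredictable, on sorted data it
 * flips only once. */
size_t count_at_least(const uint8_t *data, size_t n, uint8_t threshold)
{
    assert(data != NULL || n == 0);
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            hits++;
    }
    return hits;
}

/* Deterministic pseudo-random fill (xorshift32), so runs are repeatable. */
void fill_random(uint8_t *data, size_t n, uint32_t seed)
{
    uint32_t x = seed ? seed : 1u;
    for (size_t i = 0; i < n; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        data[i] = (uint8_t)(x >> 24);
    }
}

static int cmp_u8(const void *a, const void *b)
{
    return (int)*(const uint8_t *)a - (int)*(const uint8_t *)b;
}

void sort_bytes(uint8_t *data, size_t n)
{
    qsort(data, n, sizeof data[0], cmp_u8);
}
```

Calling `count_at_least` in a loop over the same buffer, once shuffled and once sorted, and comparing `branch-misses` between the two runs isolates the predictor's contribution. At higher optimization levels the compiler may turn the `if` into a conditional select or vectorize the loop, in which case the difference disappears, so it is worth checking the disassembly first.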

#Branchless code

// Avoid
int max(int a, int b) {
    if (a > b) return a;
    else return b;
}

// Good: branchless
// Assumes 32-bit int, arithmetic right shift of negative values,
// and no overflow in a - b.
int max(int a, int b) {
    int diff = a - b;
    int mask = diff >> 31;   // -1 if a < b, 0 if a >= b
    return b + (diff & ~mask);
}

Or an ARM Thumb-2 IT block:

cmp r0, r1
it lt
movlt r0, r1    ; conditional move: r0 = max(r0, r1)

These techniques avoid mispredicts. On a modern OoO core, however, a well-predicted branch can be just as fast as the branchless version, so measure first.
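The subtraction-based max above overflows when `a - b` does not fit in an int. One common alternative builds the mask from the comparison result instead; the sketch below assumes two's complement integers, and the function names are illustrative.

```c
#include <assert.h>
#include <limits.h>

/* Branchless max without computing a - b, so no signed overflow.
 * (a < b) is 0 or 1; negating it gives an all-zeros or all-ones mask. */
int max_branchless(int a, int b)
{
    int mask = -(a < b);           /* -1 if a < b, 0 otherwise */
    return a ^ ((a ^ b) & mask);   /* a ^ (a ^ b) == b when mask is -1 */
}

/* Quick self-check covering the extremes where a - b would overflow. */
void max_branchless_selfcheck(void)
{
    assert(max_branchless(3, 7) == 7);
    assert(max_branchless(INT_MIN, INT_MAX) == INT_MAX);
    assert(max_branchless(INT_MAX, INT_MIN) == INT_MAX);
}
```

Whether this ends up branch-free in the binary is the compiler's decision: it may emit a conditional select for the plain `if` version as well, so the generated code and a measurement should settle which form to keep.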

#__builtin_expect: a compiler hint

if (__builtin_expect(rare_error, 0)) {
    handle();
}

// As macros
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

if (unlikely(rare_error)) {
    handle();
}

The compiler arranges the code according to the hint: the likely path becomes the fall-through and the unlikely path is reached by a forward branch. On some ARM cores this layout also affects static prediction.

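#Spectre: the dark side of branch prediction

if (x < array_size) {            // mispredict → speculative execution
    y = array[secret_offset];    // executes but never commits
                                  // yet the *cache state changes* → side channel
}

This is the Spectre family. The bounds check bypass shown here is Spectre variant 1 (CVE-2017-5753); CVE-2017-5754 is Meltdown, a related but separate attack. ARM is affected too: Arm's advisory lists speculative Cortex-A cores such as the Cortex-A57, A72, A73, and A75 for Spectre, and the Cortex-A75 for Meltdown. Mitigations include the CSDB barrier for Spectre variant 1 and KAISER (KPTI) for Meltdown.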
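Besides barriers, a common software idea against the bounds check bypass is to clamp the index with a data dependency instead of relying on the branch alone. The sketch below shows only the idea, in portable C with hypothetical names. An optimizing compiler is free to fold the mask away or turn it back into a branch, which is why production code (for example the Linux kernel's `array_index_nospec`) relies on inline assembly or compiler support rather than plain C like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns idx when idx < size, otherwise 0, computed as a mask rather
 * than a conditional branch in the source.
 * NOTE: illustration only; the compiler may optimize this differently. */
size_t clamp_index_nospec(size_t idx, size_t size)
{
    size_t mask = (size_t)0 - (size_t)(idx < size);   /* all ones if in bounds */
    return idx & mask;
}

/* Bounds-checked load where even a mispredicted check can only
 * reach array[0], never an out-of-bounds address. */
uint8_t load_checked(const uint8_t *array, size_t array_size, size_t x)
{
    assert(array != NULL && array_size > 0);
    if (x < array_size)
        return array[clamp_index_nospec(x, array_size)];
    return 0;
}
```

The point of the design is that the address fed to the load depends on the comparison result as data, so a wrong branch prediction alone is not enough to produce an out-of-bounds access.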

#Common mistakes

⚠️ Trying to avoid mispredicts with volatile

volatile only stops the compiler from eliding or reordering accesses to that object. It has nothing to do with branch prediction.

⚠️ Making every short if-else branchless

if (rare_case) special_path();   // ← taken 1% of the time

If the prediction is right 99% of the time, the mispredict cost is small. Branchless is not always faster.

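⚠️ Overusing indirect calls

op_table[op_code]();   // function pointer: frequent indirect branch mispredicts

A switch does not remove the problem by itself. A dense switch usually compiles to a jump table, which is also an indirect branch, while a small switch may become a chain of compares and direct branches that the BHT predicts well. Computed goto (&&label, a GCC/Clang extension) is another option, since it can give each handler its own dispatch branch.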

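⚠️ Hand-written branches in inline assembly

asm volatile ("b label");   // direct jump hidden from the compiler: it cannot lay out or optimize the code around it

This is usually less efficient than the branches the compiler generates. Avoid it unless there is an explicit reason.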

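#Summary

The cost of a mispredict grows with pipeline depth: the cycles lost are roughly the number of stages up to branch resolution.
Predictors evolved from static BTFNT to 2-bit saturating counters, then to BTB + BHT + return stack.
Cortex-A cores go as far as tournament and indirect predictors.
Measure with the PMU event BR_MIS_PRED; the target is a miss rate below 5%.
Use __builtin_expect, branchless code, and jump tables where measurement shows they help.

Next up: Speculative Execution.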
