HBM·GDDR Deep Dive · 5/12

Memory Bandwidth Bottleneck Analysis: Theoretical vs Achievable·Roofline·Memory Wall

May 16, 2026 · Hawk · 8 min read

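#One-line summary

**“Only 70~85% of nominal bandwidth is actually usable.”** Refresh, row activation, and bank conflicts eat into efficiency. Most AI workloads are memory bound, and LLM inference commonly shows *MFU 30~50%* alongside *MBU 60~85%*. This chapter works through, quantitatively, the core of the memory wall: compute has grown faster than memory bandwidth, so memory is the bottleneck.

Ch 4 covered the evolution of GDDR signaling. This chapter is about how bandwidth actually gets used. The key question is where the loss goes when a datasheet's 819 GB/s turns into about 700 GB/s in a real application, a gap of roughly 15%.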

#Calculating theoretical BW

Nominal bandwidth is a simple calculation.

Formula: BW = clock × bus_width × DDR(2)
            = pin_rate × bus_width / 8

HBM3 stack:
  6.4 Gbps × 1024-bit / 8 = 819 GB/s

HBM3E stack (9.6 Gbps):
  9.6 Gbps × 1024-bit / 8 = 1228 GB/s ≈ 1.23 TB/s

HBM4 stack (8.0 Gbps × 2048-bit):
  8.0 Gbps × 2048-bit / 8 = 2048 GB/s ≈ 2.0 TB/s

GDDR6X chip (21 Gbps × 32-bit):
  21 Gbps × 32-bit / 8 = 84 GB/s

NVIDIA H100 (5 stack × HBM3):
  5 × 819 = 4096 GB/s ≈ 4 TB/s at the full 6.4 Gbps pin rate
  actual spec: 3.35 TB/s (5120-bit × ~5.2 Gbps / 8)

The H100 datasheet lists 3.35 TB/s rather than the 4 TB/s computed above. That figure corresponds to about 5.2 Gbps per pin across the 5120-bit bus, below the 6.4 Gbps HBM3 maximum. It is still a peak number: the efficiency losses described in the rest of this chapter come on top of it.
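
The formula above can be checked with a few lines of Python. This is a minimal sketch; the function name `peak_bw_gbs` and its `stacks` parameter are illustrative choices, not something taken from a vendor tool.

```python
def peak_bw_gbs(pin_rate_gbps: float, bus_width_bits: int, stacks: int = 1) -> float:
    """Peak bandwidth in GB/s: pin rate (Gbit/s per pin) x bus width (bits) / 8."""
    return pin_rate_gbps * bus_width_bits / 8 * stacks


# Pin rates and bus widths as listed in the worked examples above.
configs = {
    "HBM3 stack": (6.4, 1024),
    "HBM3E stack": (9.6, 1024),
    "HBM4 stack": (8.0, 2048),
    "GDDR6X chip": (21.0, 32),
}

for name, (rate, width) in configs.items():
    print(f"{name}: {peak_bw_gbs(rate, width):.1f} GB/s")

# Five HBM3 stacks at the full 6.4 Gbps pin rate.
print(f"5 x HBM3: {peak_bw_gbs(6.4, 1024, stacks=5):.1f} GB/s")
```

Keeping the pin rate and bus width as separate arguments makes it easy to see that HBM4 gets its gain from the wider bus while HBM3E gets it from the faster pin.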

#Efficiency losses: where do they go

The gap between nominal and effective bandwidth comes from five factors.

Bandwidth loss sources

theoretical BW
  │
  ├── refresh overhead         ─── -3~5%
  │   (auto refresh, REF/REFsb)
  │
  ├── row activation           ─── -5~10%
  │   (RAS → CAS, tRCD)
  │
  ├── bank conflict            ─── -5~15%
  │   (queue stall, schedule miss)
  │
  ├── command/address overhead ─── -2~5%
  │   (precharge, mode register)
  │
  └── ECC redundancy           ─── -0~5%
      (ECC bits occupy part of the data path)

→ sustained BW ≈ 70~85% of theoretical
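
To see how the individual ranges combine, the sketch below multiplies the per-factor efficiencies. It assumes the factors are independent and compound multiplicatively, which is a simplification introduced here rather than a claim from the breakdown itself; the dictionary and function names are illustrative.

```python
from math import prod

# (best-case loss, worst-case loss) per factor, taken from the breakdown above.
LOSSES = {
    "refresh": (0.03, 0.05),
    "row activation": (0.05, 0.10),
    "bank conflict": (0.05, 0.15),
    "command/address": (0.02, 0.05),
    "ECC": (0.00, 0.05),
}


def sustained_fraction(losses: dict, worst: bool) -> float:
    """Multiply the per-factor efficiencies (1 - loss), assuming independence."""
    idx = 1 if worst else 0
    return prod(1.0 - pair[idx] for pair in losses.values())


best = sustained_fraction(LOSSES, worst=False)
worst = sustained_fraction(LOSSES, worst=True)
print(f"best case:  {best:.1%} of theoretical")
print(f"worst case: {worst:.1%} of theoretical")
```

Under this assumption the best case lands close to the 85% ceiling, and the all-worst-case corner falls somewhat below the 70% floor, which is plausible if the individual worst cases rarely coincide.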

We now put numbers on each factor.

#Refresh overhead

DRAM must refresh its rows periodically to retain data. For HBM3, every row is refreshed once every 64 ms.

HBM3 refresh budget

Total rows (24 Gb DRAM × 12 = 36 GB stack):
  about 65536 rows per bank × 32 bank × 16 channel
  = 33M rows per stack

Refresh commands (tREFI = 3.9 μs):
  64 ms / 3.9 μs ≈ 16384 commands per refresh window

One command refreshes 8 rows (REFab):
  → 16384 × 8 = 131K rows / 64 ms

tRFC per refresh = 350 ns
  → 16384 × 350 ns = 5.7 ms of bus busy time
  → 5.7 ms out of 64 ms = 8.9% busy

In practice, *per-bank refresh (REFpb)* or fine-grained refresh spreads out the bus occupancy, so the effective refresh overhead is around 3~5%.
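
The all-bank refresh budget above reduces to one ratio. The following sketch reproduces it using the same 16384 commands, 350 ns tRFC, and 64 ms window; the function name is illustrative.

```python
def refresh_busy_fraction(commands: int, t_rfc_ns: float, window_ms: float) -> float:
    """Fraction of the refresh window during which the bus is blocked by refresh."""
    busy_ms = commands * t_rfc_ns * 1e-6  # ns -> ms
    return busy_ms / window_ms


COMMANDS = 16384   # refresh commands per window, as in the budget above
T_RFC_NS = 350.0   # time one refresh command blocks the bus
WINDOW_MS = 64.0   # refresh window

busy_ms = COMMANDS * T_RFC_NS * 1e-6
frac = refresh_busy_fraction(COMMANDS, T_RFC_NS, WINDOW_MS)
print(f"bus busy: {busy_ms:.2f} ms of {WINDOW_MS:.0f} ms ({frac:.2%})")
```

The result is close to 9%, which is the upper bound that per-bank or fine-grained refresh then brings down to the 3~5% range.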

#Row activation latency

A DRAM access must activate a row first; only then can columns be read or written.

DRAM access sequence: row open (ACT) → consecutive RDs within the same row → row close (PRE) → open a new row. A row hit bursts efficiently; a row miss incurs the 42 ns overhead of ACT+PRE+ACT

*Accesses within the same row (row hits)* are efficient, but random access has a high row-miss ratio and loses bandwidth. With a well-designed controller, row-buffer locality is typically around 70~90%.

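#Bank conflict

When several outstanding requests target the same bank, they are serialized.

Bank conflict: while requests for different rows pile up on bank 3 and are serialized, bank 7 is serviced immediately, so overall utilization drops

HBM3 has 16 channel × 2 PC × 16 bank = 512 independent banks. If the controller's address mapping is well designed with an XOR hash, bank conflicts fall to 5% or less.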
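
A toy model shows why XOR hashing helps. The sketch below is an assumption-laden illustration, not a real controller's mapping: the 64-byte access granularity, the choice of which address bits to fold, and the function names are all hypothetical. It compares a plain modulo mapping with one that XORs the low bank bits with the next-higher address bits, under a stride that is pathological for the plain mapping.

```python
from collections import Counter

NUM_BANKS = 512   # 16 channels x 2 pseudo-channels x 16 banks
BANK_BITS = 9     # log2(512)
LINE_BYTES = 64   # assumed access granularity (hypothetical)


def bank_plain(addr: int) -> int:
    """Bank index taken directly from the low line-address bits."""
    return (addr // LINE_BYTES) % NUM_BANKS


def bank_xor(addr: int) -> int:
    """Bank index with the next-higher address bits XOR-folded in."""
    line = addr // LINE_BYTES
    return (line ^ (line >> BANK_BITS)) % NUM_BANKS


def max_bank_load(addrs, mapping) -> int:
    """Largest number of requests that land on any single bank."""
    return max(Counter(mapping(a) for a in addrs).values())


# Stride equal to NUM_BANKS lines: every access has the same low bank bits.
stride = NUM_BANKS * LINE_BYTES
strided = [i * stride for i in range(NUM_BANKS)]
sequential = [i * LINE_BYTES for i in range(NUM_BANKS)]

print("plain, strided:    ", max_bank_load(strided, bank_plain))
print("xor,   strided:    ", max_bank_load(strided, bank_xor))
print("xor,   sequential: ", max_bank_load(sequential, bank_xor))
```

With the plain mapping all 512 strided requests queue on one bank, while the XOR mapping spreads them one per bank without hurting the sequential case.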

#Command/address overhead

Commands that move no data, such as precharge, activate, and mode register write, also occupy the command bus. In HBM3, C/A overhead is *2~5% of all cycles*.

#Measuring achievable BW

The simplest way to measure real bandwidth is the STREAM benchmark.

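// STREAM Triad: a[i] = b[i] + scalar * c[i]
// memory traffic = 3N elements (2 reads, 1 write) × sizeof(double)
#include <cstdio>
#include <cuda_runtime.h>

#define N (1 << 28)  // 256M elements, 2 GiB per array

__global__ void triad(double *a, const double *b, const double *c, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + s * c[i];
}

int main() {
    double *a, *b, *c;
    const double scalar = 3.0;
    cudaMalloc(&a, N * sizeof(double));
    cudaMalloc(&b, N * sizeof(double));
    cudaMalloc(&c, N * sizeof(double));

    const int threads = 256, blocks = (N + threads - 1) / threads;
    triad<<<blocks, threads>>>(a, b, c, scalar, N);  // warm-up run, not timed

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    triad<<<blocks, threads>>>(a, b, c, scalar, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // measure time → BW = 3 * N * 8 / time
    printf("Triad: %.1f GB/s\n", 3.0 * N * sizeof(double) / (ms * 1e-3) / 1e9);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}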

On an H100 (3.35 TB/s spec), STREAM triad typically lands at around 2.5~2.8 TB/s, which is 75~84% of spec.

Card	Workload	Sustained BW	Efficiency
A100 80GB (2.0 TB/s spec)	STREAM triad	1.65 TB/s	82%
A100 80GB	cuBLAS sgemm	1.50 TB/s	75%
A100 80GB	random access	0.55 TB/s	28% (worst case)
H100 80GB (3.35 TB/s spec)	STREAM triad	2.75 TB/s	82%
H100 80GB	LLM inference	2.40 TB/s	72%
MI300X (5.3 TB/s spec)	STREAM triad	4.20 TB/s	79%

The drop to 28% efficiency under random access is striking. It is the worst case, where a low row-hit ratio and frequent bank conflicts combine.
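
The efficiency column is just sustained bandwidth divided by spec, and the triad figure comes from bytes moved over elapsed time. The sketch below recomputes both. The 2.5 ms timing in the last lines is a hypothetical input chosen for illustration, not a measurement.

```python
def triad_bandwidth_gbs(n_elems: int, seconds: float, elem_bytes: int = 8) -> float:
    """STREAM triad moves 3 arrays (2 reads, 1 write) of n_elems elements."""
    return 3 * n_elems * elem_bytes / seconds / 1e9


def efficiency_pct(sustained_tbs: float, spec_tbs: float) -> float:
    """Sustained bandwidth as a percentage of the datasheet figure."""
    return 100.0 * sustained_tbs / spec_tbs


# (card, workload, sustained TB/s, spec TB/s) from the table above.
ROWS = [
    ("A100 80GB", "STREAM triad", 1.65, 2.0),
    ("A100 80GB", "cuBLAS sgemm", 1.50, 2.0),
    ("A100 80GB", "random access", 0.55, 2.0),
    ("H100 80GB", "STREAM triad", 2.75, 3.35),
    ("H100 80GB", "LLM inference", 2.40, 3.35),
    ("MI300X", "STREAM triad", 4.20, 5.3),
]

for card, workload, sustained, spec in ROWS:
    print(f"{card:10s} {workload:14s} {efficiency_pct(sustained, spec):5.1f}%")

# Hypothetical timing: a 2^28-element triad that took 2.5 ms.
example_gbs = triad_bandwidth_gbs(1 << 28, 2.5e-3)
print(f"example triad: {example_gbs:.0f} GB/s")
```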

#Roofline model

Roofline shows the relationship between compute and memory bandwidth in a single plot.

Roofline model: memory bound and compute bound regions, and the knee point

knee = peak_compute / peak_BW. For NVIDIA H100, 1000 TFLOPS / 3.35 TB/s = 298 FLOPS/Byte.

Arithmetic intensity is the number of FLOPs performed per byte moved. By AI workload:

Operation	Intensity (FLOPS/Byte)	Verdict
Vector add (`a[i]+b[i]`)	0.25	strongly memory bound
Sparse matvec	0.5~2	memory bound
GEMM 1024×1024	30~50	memory bound (H100)
GEMM 16384×16384	100~300	near the knee
Conv 3×3 dense	10~30	memory bound
LLM attention (KV cache reads)	5~15	strongly memory bound
LLM MLP (matmul)	50~100	memory bound (H100)
TPU TPC integer ops	200+	compute bound

The H100's knee is 298 FLOPS/Byte, while most AI workloads sit at 10~100 FLOPS/Byte. In other words, most of them are memory bound.
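
The roofline itself is a one-line formula: attainable throughput is the smaller of peak compute and intensity times peak bandwidth. A minimal sketch using the H100 figures from this section follows; the function names are illustrative.

```python
PEAK_FLOPS = 1000e12   # 1000 TFLOPS, as used for the knee above
PEAK_BW = 3.35e12      # 3.35 TB/s


def knee(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOP per byte) where the two roofs meet."""
    return peak_flops / peak_bw


def attainable_flops(intensity: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline: min(compute roof, intensity x bandwidth roof)."""
    return min(peak_flops, intensity * peak_bw)


def regime(intensity: float, peak_flops: float, peak_bw: float) -> str:
    """Which roof limits a kernel of the given intensity."""
    return "memory bound" if intensity < knee(peak_flops, peak_bw) else "compute bound"


for ai in (0.25, 10, 100, 400):
    tflops = attainable_flops(ai, PEAK_FLOPS, PEAK_BW) / 1e12
    print(f"AI={ai:>6}: {tflops:8.2f} TFLOPS, {regime(ai, PEAK_FLOPS, PEAK_BW)}")
```

A kernel at 10 FLOPS/Byte can reach at most 33.5 TFLOPS on this roofline, about 3% of peak compute, no matter how well its arithmetic is tuned.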

#Memory wall

The gap between the growth rates of compute and memory bandwidth keeps widening.

Memory wall: compute grew 25×, memory BW only 9×, widening the gap

Over the same span of GPU generations, compute grew 25× while memory BW grew only 9×. The knee moves to the right, and more workloads fall into the memory bound region.
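
The size of that shift follows directly from the two growth factors, since the knee is peak compute divided by peak bandwidth. In the sketch below, `knee_then` is a back-projection from the H100 knee used purely for illustration; it does not describe any specific earlier GPU, and the 150 FLOPS/Byte kernel is hypothetical.

```python
def knee_shift(compute_growth: float, bw_growth: float) -> float:
    """Factor by which the roofline knee moves right."""
    return compute_growth / bw_growth


def is_memory_bound(intensity: float, knee: float) -> bool:
    return intensity < knee


shift = knee_shift(25.0, 9.0)
knee_now = 1000e12 / 3.35e12      # H100 knee from the Roofline section
knee_then = knee_now / shift      # illustrative back-projection only

kernel_intensity = 150.0          # hypothetical kernel, FLOP per byte
print(f"knee moved right by {shift:.2f}x ({knee_then:.0f} -> {knee_now:.0f})")
print("then:", "memory bound" if is_memory_bound(kernel_intensity, knee_then) else "compute bound")
print("now: ", "memory bound" if is_memory_bound(kernel_intensity, knee_now) else "compute bound")
```

The same kernel, with unchanged code, crosses from the compute bound side to the memory bound side simply because the hardware balance moved.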

There are three approaches.

Grow on-chip cache: the H100 has 50 MB of L2, the B200 has 82 MB. Against 100 GB of LLM weights, though, that is nowhere near enough.
Advance HBM generations quickly: HBM3E (1.2 TB/s) → HBM4 (2 TB/s) → HBM4E (2.5 TB/s+).
Move AI toward compute bound on the algorithm side: flash attention, mixture-of-experts (MoE) sparsity.

#MBU in LLM inference

In LLM inference, *Memory Bandwidth Utilization (MBU)* has become a core metric.

LLM inference phases

1. Prefill (prompt encoding)
   - reads every weight once
   - larger batch means more reuse
   - tends to be compute bound

2. Decode (token by token)
   - re-reads the weights for every token
   - additional KV cache reads (proportional to sequence length)
   - strongly memory bound

MBU = actual BW used / peak BW
MFU = actual FLOPS / peak FLOPS

LLaMA 70B inference on H100 (batch=1):
  weight read = 140 GB / 3.35 TB/s = 42 ms/token (floor at peak BW)
  decode latency ≈ 42 ms / 0.80 ≈ 52 ms/token
  MBU ≈ 80%
  MFU ≈ 5%

LLaMA 70B decode runs at *MBU 80%, MFU 5%*. Compute sits nearly idle while memory does all the work. Raising the batch size is the way to lift MFU, but latency rises with it.
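
The decode numbers can be reproduced from the weight size and peak bandwidth alone. The sketch below makes two simplifying assumptions that are not from the text: KV cache traffic is ignored, and each weight contributes 2 FLOPs (one multiply, one add) per sequence in the batch. Function names are illustrative.

```python
WEIGHT_BYTES = 140e9   # 70B parameters x 2 bytes
PEAK_BW = 3.35e12      # bytes per second


def decode_floor_ms(weight_bytes: float, peak_bw: float) -> float:
    """Fastest possible token time if every weight is read once at peak BW."""
    return weight_bytes / peak_bw * 1e3


def mbu(weight_bytes: float, latency_ms: float, peak_bw: float) -> float:
    """Memory Bandwidth Utilization implied by a per-token latency."""
    return weight_bytes / (latency_ms * 1e-3) / peak_bw


def decode_intensity(batch: int, bytes_per_param: float) -> float:
    """Approximate FLOP per byte in decode: weights are read once per step
    and reused across the batch (KV cache traffic ignored)."""
    return 2.0 * batch / bytes_per_param


floor_ms = decode_floor_ms(WEIGHT_BYTES, PEAK_BW)
latency_at_80 = floor_ms / 0.80
print(f"floor: {floor_ms:.1f} ms/token, at MBU 80%: {latency_at_80:.1f} ms/token")
print(f"check MBU: {mbu(WEIGHT_BYTES, latency_at_80, PEAK_BW):.0%}")

for batch in (1, 64, 512):
    print(f"batch={batch:4d}: ~{decode_intensity(batch, 2.0):.0f} FLOP/Byte")
```

At batch 1 the intensity is about 1 FLOP per byte, far left of the 298 knee, which is why larger batches are the lever for MFU: the same weight bytes serve more sequences per step.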

#Measurement tools

There are tools for seeing where the bottleneck is on a GPU or NPU.

# NVIDIA: Nsight Compute (GPU profiling)
ncu --set full --target-processes all ./inference
# In the report, see the "Memory Workload Analysis" section
# - DRAM Throughput (% of peak)
# - L2 Cache Hit Rate
# - Memory Pipes Busy

# AMD: ROCm Profiler
rocprof --hsa-trace --hip-trace ./inference
# rocprof-compute analyze

# Linux generic: perf
perf stat -e cache-references,cache-misses,LLC-load-misses ./app

# Intel Gaudi: habana-tools
hl-prof-config -gaudi -trace=memory

At datacenter scale, *DCGM (NVIDIA Data Center GPU Manager)* is used to monitor fleet-wide BW utilization.

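# DCGM API example (pydcgm bindings shipped with DCGM;
# verify method names against your installed DCGM version)
import time

import pydcgm
import dcgm_fields
import dcgm_structs

handle = pydcgm.DcgmHandle()  # no address given: embedded host engine
# DCGM_GROUP_DEFAULT fills the group with all GPUs on the node
group = pydcgm.DcgmGroup(handle, groupName="my_gpus",
                         groupType=dcgm_structs.DCGM_GROUP_DEFAULT)

fields = [
    dcgm_fields.DCGM_FI_DEV_MEM_COPY_UTIL,      # memory utilization, %
    dcgm_fields.DCGM_FI_PROF_DRAM_ACTIVE,       # DRAM active ratio, 0.0~1.0
    dcgm_fields.DCGM_FI_PROF_PIPE_FP16_ACTIVE,  # FP16 pipe active ratio, 0.0~1.0
]
fg = pydcgm.DcgmFieldGroup(handle, "my_fields", fields)

# sample every 1 s (interval in usec), keep up to 1 hour of samples
group.samples.WatchFields(fg, 1000000, 3600.0, 0)
time.sleep(2)  # wait for at least one sample

latest = group.samples.GetLatest(fg)
for gpu_id, per_field in latest.values.items():
    mem = per_field[dcgm_fields.DCGM_FI_DEV_MEM_COPY_UTIL][-1].value
    fp16 = per_field[dcgm_fields.DCGM_FI_PROF_PIPE_FP16_ACTIVE][-1].value
    print(f"GPU {gpu_id} DRAM: {mem}%")
    print(f"GPU {gpu_id} FP16: {fp16 * 100:.1f}%")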

#Common mistakes

#“Using spec BW as-is for capacity planning”

If you assume the 3.35 TB/s spec as-is, you will actually get about 2.5 TB/s. Set latency targets with roughly 25% headroom.

#“More BW will fix it, so buy the faster chip”

Upgrading BW without profiling can cost money and deliver nothing. In a compute bound phase, adding BW does not raise throughput. Use Nsight Compute first to see which phase is limited by which resource.

#“STREAM is a sufficient benchmark”

STREAM is a sequential access pattern. Real LLM serving mixes in random KV cache access. Measure efficiency with a representative workload; MLPerf Inference is a realistic baseline.

#“MFU alone tells you system efficiency”

MFU shows compute utilization but says nothing about memory utilization. You need MBU alongside it to see the real bottleneck. *MFU 30%, MBU 90%* means memory bound; *MFU 80%, MBU 40%* means compute bound.
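
As a rule of thumb, this reading of the two metrics can be written down as a small classifier. The sketch below is a simplification: the `low` threshold of 0.5 and the "underutilized" label for the case where both metrics are low are assumptions added for illustration, not part of the definitions above.

```python
def classify(mfu: float, mbu: float, low: float = 0.5) -> str:
    """Label the bottleneck from MFU and MBU, both given as fractions in [0, 1].

    `low` is a hypothetical threshold: if neither resource is busy,
    the limiter is probably elsewhere (latency, launch overhead, I/O).
    """
    if mfu < low and mbu < low:
        return "underutilized"
    return "memory bound" if mbu > mfu else "compute bound"


for mfu, mbu in [(0.30, 0.90), (0.80, 0.40), (0.05, 0.80), (0.20, 0.25)]:
    print(f"MFU {mfu:.0%}, MBU {mbu:.0%}: {classify(mfu, mbu)}")
```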

#“Bank conflicts are the controller's responsibility alone”

Even with a good controller, conflicts accumulate if the application's access pattern is adversarial. Software-side optimizations such as coalesced access in CUDA or wavefront access in ROCm are decisive.
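
The cost of a non-coalesced pattern can be estimated by counting how many memory sectors one warp touches. The sketch below assumes a 32-thread warp and a 32-byte transaction granularity; both are stated assumptions for this model, and the function names are illustrative.

```python
SECTOR_BYTES = 32   # assumed memory transaction granularity
WARP_SIZE = 32      # threads issuing one load together


def sectors_touched(stride_elems: int, elem_bytes: int = 4, base: int = 0) -> int:
    """Distinct sectors hit when thread t reads element t * stride_elems."""
    addrs = [base + t * stride_elems * elem_bytes for t in range(WARP_SIZE)]
    return len({a // SECTOR_BYTES for a in addrs})


def bus_efficiency(stride_elems: int, elem_bytes: int = 4) -> float:
    """Useful bytes divided by bytes actually moved for one warp-wide load."""
    useful = WARP_SIZE * elem_bytes
    moved = sectors_touched(stride_elems, elem_bytes) * SECTOR_BYTES
    return useful / moved


for stride in (1, 2, 8, 32):
    print(f"stride {stride:2d}: {sectors_touched(stride):2d} sectors, "
          f"{bus_efficiency(stride):.1%} of moved bytes are useful")
```

A unit-stride load uses every byte it moves, while a stride of 8 floats drags in a full sector per thread and wastes seven eighths of the traffic, which no controller-side scheduling can recover.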

#Summary

Refresh, activation, bank conflicts, and C/A overhead shave *15~30%* off nominal bandwidth.
Sustained BW is typically *70~85% of nominal*. Random access can fall below 30%.
The Roofline model shows whether a workload is compute bound or memory bound based on its arithmetic intensity.
Most AI workloads are memory bound. Patterns like MBU 80%, MFU 5% are common in LLM decode.
The memory wall is the difference in growth rates: compute 25× versus memory BW 9×.
The three approaches are larger on-chip cache, newer HBM generations, and algorithm-side sparsity.
For measurement, use NVIDIA Nsight, AMD rocprof, and Intel hl-prof to look at DRAM utilization and cache hit rate separately.
The next chapter covers the thermal and power burden of HBM, another side of bandwidth efficiency.

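#Next

Ch 6: Thermal Design and Power Management looks at how small an area a single HBM stack dissipates 12 W from, when thermal throttling kicks in, and how liquid cooling differs from direct-to-die.

#Related

Ch 3: HBM Generation Comparison
Ch 6: Thermal Design and Power Management
Ch 8: Using NPUs and GPUs
CXL Ch 8: ML Accelerators, a complementary path around the memory wall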