PCIe Deep Dive · 17/19

Ch 17: Performance — Bandwidth·Latency·Tuning

May 19, 2026 · Hawk · 8 min read

pcie performance bandwidth latency max-payload tuning

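#One-Line Summary

"PCIe performance is determined by theoretical BW × encoding overhead × MaxPayload efficiency × NUMA locality × the Posted/Non-Posted ratio." Achieved BW is typically 70 to 90% of the theoretical rate. Tuning MaxPayload and MaxReadReq can raise throughput by 1.5 to 3x. Latency accumulates across the RC, switch, and EP hops. A NUMA mismatch costs 5 to 30%. ASPM saves power at low load but adds latency to bursts.

Ch 16 Troubleshooting covered the underperformance scenarios. This chapter breaks down the full picture of performance measurement and tuning.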

#Theoretical vs Effective Bandwidth

Generation	Per-lane raw	Per-lane effective	x16 raw	x16 effective
Gen 3	8 GT/s	~7.88 Gb/s	128 Gb/s	~126 Gb/s (~15.75 GB/s)
Gen 4	16 GT/s	~15.75 Gb/s	256 Gb/s	~252 Gb/s (~31.5 GB/s)
Gen 5	32 GT/s	~31.5 Gb/s	512 Gb/s	~504 Gb/s (~63 GB/s)
Gen 6	64 GT/s	~63 Gb/s	1024 Gb/s	~1008 Gb/s (~126 GB/s)
Gen 7	128 GT/s	~126 Gb/s	2048 Gb/s	~2016 Gb/s (~252 GB/s)

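The effective figures are the raw rate after encoding overhead: 8b/10b (Gen 1 and 2) costs 20%, 128b/130b (Gen 3 to 5) about 1.5%, and PAM4 with FEC (Gen 6 and later) about 3%.

Measured BW is 70 to 90% of the effective rate. The rest goes to TLP header overhead, ACK/NAK and UpdateFC DLLPs, and idle gaps.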
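The arithmetic behind the table can be checked with a short sketch. It covers only the 128b/130b generations, and the function names are illustrative rather than taken from any tool. The 70 to 90% band is the rule of thumb quoted above, not a measurement.

```python
# Raw signalling rate per lane in GT/s for the 128b/130b generations.
RAW_GT_PER_LANE = {3: 8, 4: 16, 5: 32}
# 128 payload bits are carried in every 130 bits on the wire.
ENCODING_EFFICIENCY = 128 / 130


def effective_gbytes_per_s(gen: int, lanes: int) -> float:
    """Per-direction bandwidth in GB/s after line encoding only."""
    gbits = RAW_GT_PER_LANE[gen] * ENCODING_EFFICIENCY * lanes
    return gbits / 8


def expected_measured_range(gen: int, lanes: int, low: float = 0.70, high: float = 0.90):
    """Apply the 70 to 90% rule of thumb to the effective rate."""
    effective = effective_gbytes_per_s(gen, lanes)
    return effective * low, effective * high


for gen in (3, 4, 5):
    eff = effective_gbytes_per_s(gen, 16)
    lo, hi = expected_measured_range(gen, 16)
    print(f"Gen {gen} x16: {eff:.2f} GB/s effective, {lo:.1f} to {hi:.1f} GB/s expected")
```

A benchmark result that falls well below the lower bound suggests looking at link width, MPS/MRRS, or NUMA placement before blaming the device.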

#MaxPayloadSize (MPS): Maximum TLP Payload Size

MPS	Header overhead ratio
128 bytes	~12.5% (16-byte header / 128-byte payload)
256 bytes	~6%
512 bytes	~3%
1024 bytes	~1.5%
4096 bytes	~0.4%

System software sets MPS to the smallest value supported by every device on the path, so a single low-MPS device drags the whole path down.
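The ratios in the table follow from dividing a 16-byte header by the payload size. The sketch below reproduces them and also reports payload efficiency (payload over total bytes). The `link_bytes` parameter is a hypothetical knob for per-TLP framing and LCRC bytes, left at 0 so the output matches the header-only table.

```python
def header_ratio(mps_bytes: int, header_bytes: int = 16) -> float:
    """Header size relative to payload size, the figure used in the table."""
    return header_bytes / mps_bytes


def payload_efficiency(mps_bytes: int, header_bytes: int = 16, link_bytes: int = 0) -> float:
    """Fraction of transmitted bytes that are payload for a full-size TLP.

    link_bytes is an assumed extra per-TLP cost (framing, sequence number,
    LCRC); it defaults to 0 so the result reflects header overhead only.
    """
    return mps_bytes / (mps_bytes + header_bytes + link_bytes)


for mps in (128, 256, 512, 1024, 4096):
    print(f"MPS {mps:4d}: header ratio {header_ratio(mps):.2%}, "
          f"payload efficiency {payload_efficiency(mps):.2%}")
```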

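# Check the current MPS (run as root so lspci can read the capability list)
sudo lspci -vv | grep "MaxPayload"

# Change it (needs driver and BIOS support; 01:00.0 is an example address)
# Device Control bits 7:5 hold MPS, and 001b means 256 bytes.
# The value:mask form changes only the masked bits and leaves the rest of the register alone.
setpci -s 01:00.0 CAP_EXP+8.w=0020:00e0  # MaxPayload 256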

NVMe drives and Mellanox NICs commonly support 256 or 512 bytes. A legacy device limited to 128 bytes becomes the bottleneck.
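Hand-typed register values are easy to get wrong, so it can help to compute the Device Control fields instead. The sketch below assumes the standard layout of the register at PCIe capability offset 0x08 (MPS in bits 7:5, MRRS in bits 14:12, sizes encoded as 128 << code). The helper names are made up for this example.

```python
# Device Control register (PCIe capability offset 0x08) field layout.
MPS_SHIFT, MPS_MASK = 5, 0x00E0     # bits 7:5
MRRS_SHIFT, MRRS_MASK = 12, 0x7000  # bits 14:12
SIZE_CODES = {128: 0, 256: 1, 512: 2, 1024: 3, 2048: 4, 4096: 5}


def setpci_arg(mps=None, mrrs=None) -> str:
    """Build a setpci value:mask argument that touches only the chosen fields."""
    value = mask = 0
    if mps is not None:
        value |= SIZE_CODES[mps] << MPS_SHIFT
        mask |= MPS_MASK
    if mrrs is not None:
        value |= SIZE_CODES[mrrs] << MRRS_SHIFT
        mask |= MRRS_MASK
    if mask == 0:
        raise ValueError("specify mps, mrrs, or both")
    return f"CAP_EXP+8.w={value:04x}:{mask:04x}"


def decode_devctl(devctl: int):
    """Return (MPS, MRRS) in bytes for a raw Device Control value."""
    mps = 128 << ((devctl & MPS_MASK) >> MPS_SHIFT)
    mrrs = 128 << ((devctl & MRRS_MASK) >> MRRS_SHIFT)
    return mps, mrrs


print(setpci_arg(mps=256))      # argument for MPS 256
print(setpci_arg(mrrs=4096))    # argument for MRRS 4096
print(decode_devctl(0x5030))    # decode a raw register value
```

Decoding a value before writing it is a cheap sanity check, because a full-word write also sets or clears the error-reporting and ordering enable bits in the same register.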

#MaxReadRequestSize (MRRS)

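MRRS caps how many bytes a single outbound Memory Read request may ask for. In other words, it sets how large a read the device can request at once:

MRRS	Effect
Small (128 bytes)	Many read TLPs, so more header overhead
Large (4096 bytes)	Fewer read TLPs, so more efficient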

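# Check the current MRRS
sudo lspci -vv | grep "MaxReadReq"

# Change it: Device Control bits 14:12 hold MRRS, and 101b means 4096 bytes.
# The value:mask form leaves MPS and the other bits untouched.
setpci -s 01:00.0 CAP_EXP+8.w=5000:7000  # MRRS 4096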

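4096 bytes is common for NVMe drives and NICs. MRRS is not the same as MPS: MPS limits the payload a single TLP may carry, while MRRS limits the size of a read request. A large read is still returned as multiple Completions, each no larger than MPS.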
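How MRRS and MPS interact can be seen by counting TLPs for one transfer. This is a simplified sketch: it assumes aligned buffers and that the completer always returns full MPS-sized Completions, whereas a real completer may split them further at Read Completion Boundaries.

```python
import math


def read_tlp_counts(transfer_bytes: int, mrrs: int, mps: int):
    """Return (read requests, completions) needed to fetch transfer_bytes."""
    requests = 0
    completions = 0
    remaining = transfer_bytes
    while remaining > 0:
        chunk = min(mrrs, remaining)            # one read request
        requests += 1
        completions += math.ceil(chunk / mps)   # completions for that request
        remaining -= chunk
    return requests, completions


one_mib = 1 << 20
for mrrs in (128, 512, 4096):
    reqs, cpls = read_tlp_counts(one_mib, mrrs=mrrs, mps=256)
    print(f"MRRS {mrrs:4d}, MPS 256: {reqs} requests, {cpls} completions")
```

Raising MRRS cuts the number of requests the device has to issue, while the number of Completions on the return path is governed by MPS.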

#Latency Breakdown

End-to-end latency:

Segment	Typical latency
CPU → Root Complex	< 100 ns
Root Complex → Switch	50~100 ns
Switch → Endpoint	~50 ns/hop
Endpoint internal	Device dependent (NVMe ~10 µs, NIC on the order of µs)
Completion return	Same path in reverse

Latency = 2 × (hop count × hop latency) + endpoint processing. *Direct attach (RC → EP)* is faster than a switch traversal.
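The formula can be written as a small function to compare topologies. The per-hop numbers below are illustrative values picked from the ranges in the table, and the 75 ns RC → EP figure for direct attach is an assumption, not a measured value.

```python
def round_trip_ns(hop_latencies_ns, endpoint_ns: int) -> int:
    """2 x (sum of one-way hop latencies) + endpoint processing time."""
    return 2 * sum(hop_latencies_ns) + endpoint_ns


# Assumed one-way hop latencies in ns: CPU→RC, then RC→EP or RC→Switch→EP.
direct = round_trip_ns([100, 75], endpoint_ns=10_000)
switched = round_trip_ns([100, 75, 50], endpoint_ns=10_000)

print(f"direct attach: {direct} ns")
print(f"via one switch: {switched} ns")
print(f"switch penalty: {switched - direct} ns")
```

With an NVMe-class endpoint time of about 10 µs, the hop cost is a small fraction of the total. It matters far more for endpoints that respond in hundreds of nanoseconds.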

#Completion Combining·Coalescing

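A large Read is split into multiple Completions (Ch 2). Coalescing handles those Completions, and similar events, in batches:

Feature	Effect
TLP coalescing	The RC batches several small Completions, reducing software overhead
NIC interrupt coalescing	Many packets raise a single interrupt
NVMe Q-depth	Depth of the submission queue

NIC-level tuning is done with commands such as ethtool -C eth0 rx-usecs 100.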
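The trade-off behind an rx-usecs setting can be estimated with a simple model. The sketch assumes purely timer-based coalescing on a single queue, so the NIC raises at most one interrupt per rx-usecs window. Real NICs also apply frame-count limits and adaptive modes, which this ignores.

```python
def coalescing_estimate(packets_per_s: float, rx_usecs: int):
    """Return (interrupts per second, packets per interrupt, worst-case added delay in µs)."""
    if packets_per_s <= 0:
        return 0, 0.0, 0
    if rx_usecs <= 0:
        # Coalescing disabled: every packet raises its own interrupt.
        return packets_per_s, 1.0, 0
    max_irq_rate = 1_000_000 / rx_usecs          # one interrupt per window at most
    irq_rate = min(packets_per_s, max_irq_rate)
    return irq_rate, packets_per_s / irq_rate, rx_usecs


for pps in (1_000, 1_000_000):
    irqs, per_irq, delay = coalescing_estimate(pps, rx_usecs=100)
    print(f"{pps:>9} pps: {irqs:.0f} irq/s, {per_irq:.0f} packets per irq, up to {delay} µs added")
```

At 1,000 packets per second the setting batches nothing and only adds delay. At a million packets per second it amortizes each interrupt over about a hundred packets.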

#NUMA Locality

On a multi-socket system, it matters which socket the device hangs off:

Scenario	Latency·BW
Same NUMA (local)	Optimal
Cross-NUMA (traverses UPI/Infinity Fabric)	5 to 30% loss

# NUMA node of the device (-1 means the platform reported no affinity)
cat /sys/bus/pci/devices/0000:01:00.0/numa_node

# Pin the process to that node's CPUs and memory
numactl --cpunodebind=0 --membind=0 ./app

Use the NVMe drive or NIC from the CPUs and memory of its own NUMA node. On an 8-socket server, a NUMA mismatch makes a large performance difference.

#Posted vs Non-Posted Ratio
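The two commands above can be chained so the node number is never typed by hand. This is a minimal sketch: the helper names are illustrative, the device address is the example one used above, and `read_numa_node` only works on a Linux host that exposes the sysfs attribute.

```python
from pathlib import Path


def read_numa_node(bdf: str) -> int:
    """Read the NUMA node of a PCI device from sysfs (Linux only)."""
    return int(Path(f"/sys/bus/pci/devices/{bdf}/numa_node").read_text().strip())


def numactl_argv(node: int, app_argv):
    """Prefix app_argv with numactl pinning, or leave it unpinned if node is unknown."""
    if node < 0:
        # The kernel reports -1 when firmware gave no affinity information.
        return list(app_argv)
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *app_argv]


# On real hardware: numactl_argv(read_numa_node("0000:01:00.0"), ["./app"])
print(" ".join(numactl_argv(1, ["./app"])))
print(" ".join(numactl_argv(-1, ["./app"])))
```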

Workload	Posted·Non-Posted
NIC RX (DMA write to host)	Almost entirely Posted
NVMe I/O	Command and write-data fetches are Non-Posted (Read request + Completion); read data returned to the host is Posted
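NIC TX (host → NIC DMA)	The NIC issues Reads and waits for the Completions
GPU compute	Mostly Posted Writes

A high share of Non-Posted traffic exposes latency, because each Read has to wait for its Completion. A larger MRRS reduces the number of Non-Posted requests.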
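Why read-heavy traffic is latency-bound can be shown with a bandwidth-delay calculation: to sustain a target read rate, enough requests must be in flight to cover one round trip. The sketch below is a back-of-the-envelope model, and the 25 GB/s and 1 µs figures are illustrative assumptions.

```python
import math


def outstanding_reads(target_gbytes_per_s: int, round_trip_ns: int, mrrs_bytes: int) -> int:
    """Read requests that must be in flight to sustain the target rate.

    GB/s multiplied by ns gives bytes in flight (1 GB/s is 1 byte per ns).
    """
    bytes_in_flight = target_gbytes_per_s * round_trip_ns
    return math.ceil(bytes_in_flight / mrrs_bytes)


for mrrs in (128, 512, 4096):
    n = outstanding_reads(target_gbytes_per_s=25, round_trip_ns=1000, mrrs_bytes=mrrs)
    print(f"MRRS {mrrs:4d}: {n} outstanding reads needed")
```

If the device cannot track that many outstanding requests, read throughput is capped by latency rather than by link speed. Posted Writes have no such limit because they do not wait for a Completion.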

#Relaxed Ordering·No Snoop

Attribute bits have a large effect on throughput:

Attr	Effect
RO (Relaxed Ordering)	Relaxes strict ordering → parallelism ↑
NS (No Snoop)	Skips the cache snoop at the RC → less load on the host CPU

GPUs and HPC accelerators enable RO and NS for maximum throughput. The driver controls the resulting data-corruption risk with barriers.

#ASPM·LTR Impact

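Setting	Idle power	Burst latency
ASPM Off	High	Low and steady
ASPM L0s	Medium	µs-scale jitter
ASPM L1	Low	µs to tens of µs of jitter
L1.2 substates	Very low	Hundreds of µs

*Latency-critical devices (NVMe, NIC)* usually run with ASPM off. Mobile devices and laptops enable everything down to L1.2.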
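On Linux, the system-wide ASPM policy is exposed through the pcie_aspm module parameter, with the active policy shown in square brackets. The sketch below parses that format. The sample string stands in for the file contents, and the function name is illustrative.

```python
import re
from pathlib import Path

# Readable on Linux kernels built with ASPM support.
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(policy_text: str) -> str:
    """Return the bracketed (active) entry from the policy file contents."""
    match = re.search(r"\[(\w+)\]", policy_text)
    if match is None:
        raise ValueError(f"no active policy marked in: {policy_text!r}")
    return match.group(1)


# On real hardware: active_aspm_policy(POLICY_PATH.read_text())
sample = "default [performance] powersave powersupersave"
print(active_aspm_policy(sample))
```

The "performance" policy disables ASPM, which matches the latency-critical case above. "powersupersave" additionally allows the L1 substates.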

#P2P DMA

Direct DMA from EP A to EP B (without going through the RC):

Scenario	Effect
NVMe → GPU direct copy	Bypasses host memory
GPU ↔ GPU (NCCL)	Inter-GPU NVLink + PCIe
Same switch	Switch-level peer-to-peer

Check for support under /sys/bus/pci/.../p2pdma. The driver, the BIOS, and the switch ACS settings must all allow P2P.

#DDIO — Direct Data I/O

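An Intel Xeon feature that DMAs NIC traffic directly into the LLC:

Scenario	Effect
DDIO enabled	NIC RX lands in the LLC, so the CPU hits in L3
DDIO disabled	Goes through DRAM, so latency is higher

Saves 200 to 300 ns of latency. Useful for 100 GbE, NVMe, and CXL.

#Measurement Tools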

Tool	Purpose
`lspci -vv | grep "Lnk"`	Link speed and width (LnkCap·LnkSta)
`fio`	NVMe IOPS·throughput·latency
`iperf3·netperf`	NIC throughput
`perf c2c`	cache contention
`intel-pmu-tools`	PCIe BW counter
`pcm-pcie` (Intel PCM)	PCIe BW per device
`nvidia-smi dmon`	GPU PCIe BW
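A link that trained below its capability is one of the first things to rule out, and the lspci output can be checked mechanically. The sketch below parses the LnkCap and LnkSta lines of a single device. The sample text imitates typical `lspci -vv -s <BDF>` output and is not taken from a real machine, and the exact formatting can vary between pciutils versions.

```python
import re

# Matches "LnkCap: ... Speed 16GT/s, Width x16" and
# "LnkSta: Speed 8GT/s (downgraded), Width x8"; LnkCap2/LnkSta2 lines are skipped.
LINK_RE = re.compile(r"Lnk(Cap|Sta):.*?Speed ([\d.]+)GT/s(?: \(\w+\))?, Width x(\d+)")

SAMPLE = """\
\t\tLnkCap:\tPort #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
\t\tLnkSta:\tSpeed 8GT/s (downgraded), Width x8 (downgraded)
"""


def link_summary(lspci_text: str) -> dict:
    """Return {'Cap': (GT/s, width), 'Sta': (GT/s, width)} for one device."""
    return {kind: (float(speed), int(width))
            for kind, speed, width in LINK_RE.findall(lspci_text)}


def is_downgraded(lspci_text: str) -> bool:
    """True if the negotiated speed or width is below the advertised capability."""
    links = link_summary(lspci_text)
    cap, sta = links["Cap"], links["Sta"]
    return sta[0] < cap[0] or sta[1] < cap[1]


print(link_summary(SAMPLE))
print("downgraded:", is_downgraded(SAMPLE))
```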

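#Common Mistakes

#"Gen 5 NVMe is twice as fast as Gen 4"

That holds for effective link BW, but the NVMe device has its own internal throughput limit. If the NVMe controller or the flash bandwidth is the bottleneck, the PCIe upgrade buys little.

#"A bigger MPS is always faster"

Latency-sensitive small transfers can favor a smaller MPS. The optimum for MPS and MRRS depends on the workload.

#"ASPM Off is always best"

For lightly loaded devices, ASPM off costs electricity and heat. Even on servers, using the deeper power states can be the total-cost optimum.

#"NUMA is decided by software"

The NUMA node of a PCIe slot is fixed in hardware by the board layout. It cannot be rebound, so fit the workload to that node instead.

#"Coalescing only costs latency"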

That holds only at low load, where coalescing mostly adds delay. Under high load, the work done per ISR rises sharply, so throughput improves. Tune it per workload.

#Summary

Achieved BW is 70 to 90% of theoretical; encoding, headers, and ACK/NAK are all overhead.
MaxPayload and MaxReadReq are the first things to tune for throughput.
Latency is hop count × hop latency + endpoint time; direct attach is faster.
NUMA locality makes a 5 to 30% difference and is fixed in hardware.
The Posted/Non-Posted ratio and the RO·NS attributes determine parallelism.
ASPM·LTR is a power vs latency trade-off.
P2P DMA bypasses host memory; DDIO delivers directly into the LLC.
Measurement: lspci·fio·perf·pcm-pcie·nvidia-smi.

#Next

Ch 18: Register Maps (Configuration Space bit reference) provides a bit-level reference for Configuration, PCIe Cap, AER, MSI/MSI-X, and SR-IOV.

#Related Topics

Ch 1: PCIe Fundamentals — Gen·encoding
Ch 2: TLP — Posted·Non-Posted·attribute
Ch 6: Power Management — ASPM
Ch 11: DMA·IOMMU — P2P DMA

#Series Sources

For this post's primary sources and sourcing policy, see the Ch 1 footer.
