Embedded Performance Engineering · 25/57

MMIO Access Performance: Cache Policy, Write-Combining, Volatile, Barrier

April 25, 2026 · Hawk · 4 min read

mmio register cache-policy volatile barrier

#One-Line Summary

“MMIO is a way of treating peripheral registers like memory.” Caching and reordering, however, are off-limits.

#MMIO Cache Policy

ARM v7/v8 Memory Type:

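Type	Cache	Order	Speculation	Use
Normal (cacheable)	yes	weak	yes	DRAM
Normal (non-cacheable)	no	weak	yes	DMA buffer
Device (nGnRnE)	no	strict	no	MMIO register (strictest)
Device (nGnRE)	no	strict	no (early write acknowledgement allowed)	MMIO where buffered writes are acceptable
Strongly-Ordered	no	strict (fully serialized)	no	critical config

nGnRnE stands for non-Gathering, non-Reordering, no Early write acknowledgement. Strongly-Ordered is the ARMv7 name; ARMv8 expresses the same behavior as Device-nGnRnE.

MMIO usually uses Device-nGnRE: reads are never cached or prefetched, and writes to the same device stay in program order.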
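On ARMv8-A, the memory types in the table are selected through 8-bit attribute encodings stored in a MAIR register and referenced by index from page-table entries. The following is a minimal sketch of packing those encodings; the enum and the `mair_set` helper are hypothetical names for illustration, not part of any vendor header.

```c
#include <assert.h>
#include <stdint.h>

/* Attribute byte encodings used in MAIR_ELx on ARMv8-A. */
enum mair_attr {
    MAIR_DEVICE_nGnRnE = 0x00, /* no gather, no reorder, no early write ack */
    MAIR_DEVICE_nGnRE  = 0x04, /* as above, but early write ack allowed */
    MAIR_NORMAL_NC     = 0x44, /* Normal memory, non-cacheable */
    MAIR_NORMAL_WB     = 0xFF  /* Normal memory, write-back cacheable */
};

/* Store one attribute byte into slot idx (0..7) of a 64-bit MAIR value. */
static uint64_t mair_set(uint64_t mair, unsigned idx, enum mair_attr attr)
{
    assert(idx < 8);
    mair &= ~(UINT64_C(0xFF) << (idx * 8));  /* clear the slot */
    mair |= (uint64_t)attr << (idx * 8);     /* insert the new attribute */
    return mair;
}
```

A page-table entry then selects a slot by index, so an MMIO mapping would point at the slot holding one of the Device encodings.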

#Linux ioremap

void __iomem *mmio;
u32 val;

mmio = ioremap(0xC0000000, 0x1000);   /* physical address, size */
if (!mmio)
        return -ENOMEM;
/* On arm64, ioremap() returns a virtual address mapped as Device-nGnRE */
iowrite32(0x12345678, mmio + 0x10);
val = ioread32(mmio + 0x20);
iounmap(mmio);

ioremap creates an MMIO mapping. ioremap_wc creates a write-combining mapping, used for PCIe BARs and similar regions.

#Write-Combining

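Ordinary MMIO write:

each store → an immediate transaction
→ 32 × 4-byte writes = 32 transactions

Write-combining:

the store buffer merges stores to adjacent addresses
→ 32 × 4-byte → 4 × 32-byte burst transactions

For large sequential writes through a PCIe BAR, such as a GPU framebuffer, this example cuts the transaction count by 8x. Bandwidth can rise by up to a similar factor when per-transaction overhead dominates.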

mmio = ioremap_wc(phys, size);   // Write-combining
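The transaction arithmetic above can be sketched as a small calculation. This is only a counting model under the assumption that the stores are contiguous and start on a burst boundary; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Without combining, every store becomes its own bus transaction. */
static size_t txn_uncombined(size_t stores)
{
    return stores;
}

/* With combining, contiguous stores are merged into bursts of burst_bytes.
 * Assumes the first store is burst-aligned. */
static size_t txn_combined(size_t stores, size_t store_bytes, size_t burst_bytes)
{
    assert(store_bytes > 0 && burst_bytes >= store_bytes);
    size_t total = stores * store_bytes;
    return (total + burst_bytes - 1) / burst_bytes;  /* round up */
}
```

For the numbers in the text, `txn_uncombined(32)` gives 32 and `txn_combined(32, 4, 32)` gives 4. A trailing partial burst still costs one transaction.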

#Volatile - Blocking the Compiler

/* Bad: the compiler may eliminate one of the two reads */
uint32_t *reg = (uint32_t *)0x40000000;
uint32_t a = *reg;
uint32_t b = *reg;   // same address, so the compiler may reuse the value it already holds in a register

/* Good */
volatile uint32_t *vreg = (volatile uint32_t *)0x40000000;
uint32_t c = *vreg;
uint32_t d = *vreg;   // two reads are guaranteed to be emitted

volatile only blocks compiler optimizations. It has no effect on the CPU's out-of-order execution or write buffer, so barriers are needed separately.
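One common way to keep every register access volatile is to route it through small accessor functions instead of scattering casts. The sketch below uses a RAM-backed struct as a stand-in for hardware so it can run on a host; `fake_regs`, `reg_write32`, and `reg_read32` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register block. On real hardware this would live at a
 * fixed MMIO address instead of ordinary RAM. */
struct fake_regs {
    uint32_t ctrl;
    uint32_t status;
};

/* Every access goes through a volatile-qualified pointer, so the compiler
 * must emit exactly one store or load per call. */
static inline void reg_write32(volatile uint32_t *reg, uint32_t val)
{
    assert(((uintptr_t)reg & 3u) == 0);  /* 32-bit registers are word aligned */
    *reg = val;
}

static inline uint32_t reg_read32(const volatile uint32_t *reg)
{
    assert(((uintptr_t)reg & 3u) == 0);
    return *reg;
}
```

Calling `reg_read32(&regs.status)` twice produces two loads. The accessors still say nothing about hardware ordering.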

#ARM Memory Barrier

__DMB();   // Data Memory Barrier - orders the memory accesses before and after it
__DSB();   // Data Synchronization Barrier - waits until all earlier memory accesses complete
__ISB();   // Instruction Synchronization Barrier - flushes the pipeline so later instructions are refetched

#Which Barrier, and When?

/* Using a peripheral right after enabling it */
RCC->APB1ENR |= RCC_APB1ENR_TIM2EN;
__DSB();   // wait for the clock-enable write to complete
TIM2->CR1 = 1;   // now safe

/* Self-modifying code */
flash_write(code_buf);
__DSB(); __ISB();   // wait for the writes to complete, then refill the pipeline (with caches enabled, also clean the D-cache and invalidate the I-cache)
call_new_code();

/* Clean the cache before starting DMA */
SCB_CleanDCache_by_Addr(buf, len);
__DSB();
HAL_DMA_Start(...);

#Read-Modify-Write Race

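GPIO->ODR |= (1 << 5);   // read + OR + write

In ARM instructions:

ldr r0, [r1]        ; read
orr r0, r0, #0x20   ; modify
str r0, [r1]        ; write

If an ISR changes another bit of the same register between the ldr and the str, the str overwrites it and the ISR's change is lost.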

The fixes are:

Atomic set/clear through the bit-band alias (ARM Cortex-M3/M4)
BSRR (Bit Set/Reset Register) on STM32 GPIO

GPIO->BSRR = (1 << 5);          // atomic set
GPIO->BSRR = (1 << 5) << 16;    // atomic reset
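The lost update and the BSRR fix can be replayed deterministically in software. The sketch below models a GPIO port in plain RAM and hard-codes the unlucky interleaving; `fake_gpio`, `bsrr_write`, `racy_rmw`, and `safe_bsrr` are hypothetical names, and the "ISR" is simply a statement placed between the read and the write.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a GPIO output register. Not real hardware. */
struct fake_gpio {
    uint32_t odr;
};

/* Models one write to BSRR: the low 16 bits set pins, the high 16 bits
 * reset pins. If both are given for a pin, set wins. */
static void bsrr_write(struct fake_gpio *g, uint32_t v)
{
    assert(g);
    uint32_t set   = v & 0xFFFFu;
    uint32_t reset = (v >> 16) & 0xFFFFu;
    g->odr = (g->odr & ~reset) | set;
}

/* Replays the racy order: main reads, ISR updates, main writes back. */
static uint32_t racy_rmw(struct fake_gpio *g)
{
    uint32_t tmp = g->odr;   /* main: ldr */
    g->odr |= (1u << 7);     /* ISR fires here and sets bit 7 */
    tmp |= (1u << 5);        /* main: orr */
    g->odr = tmp;            /* main: str overwrites the ISR's bit 7 */
    return g->odr;
}

/* Same two updates through BSRR: each is a single write, so neither
 * side can clobber the other. */
static uint32_t safe_bsrr(struct fake_gpio *g)
{
    bsrr_write(g, 1u << 7);  /* ISR */
    bsrr_write(g, 1u << 5);  /* main */
    return g->odr;
}
```

Starting from `odr == 0`, the racy path ends with only bit 5 set, while the BSRR path ends with bits 5 and 7 both set.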

#Bit-Banding (Cortex-M3·M4)

SRAM bit-band region: 0x20000000 - 0x200FFFFF
Alias region:         0x22000000 - 0x23FFFFFF

bit_addr = alias_base + (byte_offset × 32) + (bit × 4)

#define BITBAND(addr, bit) \
    ((__IO uint32_t*)(0x22000000 + ((((uint32_t)(addr)) - 0x20000000) << 5) + ((bit) << 2)))

*BITBAND(&flag_byte, 3) = 1;   // atomic set bit 3

Bit-banding is absent from Cortex-M7 onward (it complicates cache behavior).
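The alias formula can be checked on a host with plain integer arithmetic. This sketch covers both the SRAM and the peripheral bit-band regions of Cortex-M3/M4 and returns 0 for addresses outside them; `bitband_alias` and the `BB_*` constants are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define BB_SRAM_BASE    0x20000000u
#define BB_SRAM_ALIAS   0x22000000u
#define BB_PERIPH_BASE  0x40000000u
#define BB_PERIPH_ALIAS 0x42000000u
#define BB_REGION_SIZE  0x00100000u  /* each bit-band region is 1 MB */

/* Returns the alias word address for `bit` of the byte at `addr`,
 * or 0 if `addr` is outside both bit-band regions. */
static uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    assert(bit < 8);
    /* Unsigned subtraction wraps for addr < base, so one compare suffices. */
    if (addr - BB_SRAM_BASE < BB_REGION_SIZE)
        return BB_SRAM_ALIAS + ((addr - BB_SRAM_BASE) << 5) + (bit << 2);
    if (addr - BB_PERIPH_BASE < BB_REGION_SIZE)
        return BB_PERIPH_ALIAS + ((addr - BB_PERIPH_BASE) << 5) + (bit << 2);
    return 0;
}
```

For example, bit 3 of the byte at 0x20000001 maps to 0x22000000 + 1 × 32 + 3 × 4 = 0x2200002C. Checking for a 0 return before dereferencing avoids writing through an alias computed from an address outside the region.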

#Strongly-Ordered vs Device

/* GIC register, system control */
- Strongly-Ordered: every access fully serialized (the previous one completes before the next starts)
- Device:           no gather, no reorder, but writes may be acknowledged early (buffered)

Map the GIC and CPU control registers as Strongly-Ordered, and ordinary peripherals as Device.

#PCIe MMIO Specifics

/* Posted vs Non-posted */
- Memory write (PCIe) - posted (no completion, fast)
- Memory read         - non-posted (round-trip latency)
- Config read/write   - non-posted (sequenced)

To flush a PCIe MMIO write, do the following.

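iowrite32(val, mmio);   // posted
(void)ioread32(mmio + STATUS);   // read back to flush

A read does not return until the writes posted to the same device before it have been delivered, so once the read completes the write has taken effect.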
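The flush-by-read rule can be illustrated with a toy model. This is a deliberately simplified software queue, not a PCIe implementation: posted writes sit in a queue until a non-posted read to the same device forces them out. All names (`fake_link`, `posted_write`, `nonposted_read`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_DEPTH 8u

/* Toy model: writes in flight on the link, plus one device register. */
struct fake_link {
    uint32_t queue[QUEUE_DEPTH];
    unsigned count;
    uint32_t device_reg;
};

/* A posted write returns immediately; the device has not seen it yet. */
static void posted_write(struct fake_link *l, uint32_t val)
{
    assert(l && l->count < QUEUE_DEPTH);
    l->queue[l->count++] = val;
}

/* A read may not pass earlier posted writes, so the model delivers all
 * queued writes in order before producing the read value. */
static uint32_t nonposted_read(struct fake_link *l)
{
    assert(l);
    for (unsigned i = 0; i < l->count; i++)
        l->device_reg = l->queue[i];
    l->count = 0;
    return l->device_reg;
}
```

After `posted_write(&link, 0xAB)` the modeled device register is still unchanged. It takes the new value only once `nonposted_read(&link)` has returned.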

#DMA and MMIO Ordering

/* Prepare the buffer to send */
fill_buf(tx_buf, len);
__DSB();   // buffer writes to memory are complete

/* DMA setup register */
DMA->SRC = (uint32_t)tx_buf;
DMA->LEN = len;
DMA->CR  = DMA_EN;   // start

Without the DSB, the DMA engine may start before the buffer writes have reached memory and read stale data.
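The same ordering can be sketched in a form that also compiles on a host. The barrier macro uses `dsb sy` on ARMv7 and later and falls back to a C11 fence elsewhere, which is only an approximation for testing; the DMA register block is a hypothetical RAM-backed struct, not a real controller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) || (defined(__ARM_ARCH) && __ARM_ARCH >= 7)
#  define data_sync_barrier() __asm__ volatile("dsb sy" ::: "memory")
#else
/* Host fallback: a full fence stands in for DSB. */
#  define data_sync_barrier() atomic_thread_fence(memory_order_seq_cst)
#endif

/* Hypothetical DMA register block. */
struct fake_dma {
    volatile uintptr_t src;
    volatile uint32_t  len;
    volatile uint32_t  cr;
};
#define FAKE_DMA_EN 1u

static void dma_send(struct fake_dma *dma, uint8_t *buf,
                     const uint8_t *data, uint32_t len)
{
    assert(dma && buf && data);
    memcpy(buf, data, len);     /* 1. write the payload to memory */
    data_sync_barrier();        /* 2. payload writes complete first */
    dma->src = (uintptr_t)buf;  /* 3. program the transfer */
    dma->len = len;
    dma->cr  = FAKE_DMA_EN;     /* 4. start bit is written last */
}
```

The key point is the position of step 2: the barrier sits between the last payload write and the first register write, so the start bit can never become visible before the data.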

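#Enforcing Word-Sized Access

/* A 32-bit register can misbehave under 8-bit access */
volatile uint8_t *reg8 = (volatile uint8_t *)0x40000010;
*reg8 = 0x12;   // some chips treat this as a full 4-byte write and zero the other bytes, or fault

On ARMv7, the safe default is an aligned access at the register's native width (for a 32-bit register, a word-aligned word access). State the width explicitly with iowrite32 / iowrite16 / iowrite8.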
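When only one byte of a 32-bit register needs to change, one option is to update it with a full-width read and a full-width write. A minimal sketch, run against an ordinary variable standing in for the register; `reg_write_byte_lane` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Replace byte lane `lane` (0 = bits 7:0, 3 = bits 31:24) of a 32-bit
 * register using only 32-bit accesses. */
static void reg_write_byte_lane(volatile uint32_t *reg, unsigned lane, uint8_t val)
{
    assert(lane < 4);
    unsigned shift = lane * 8u;
    uint32_t word = *reg;                 /* one 32-bit read */
    word &= ~(UINT32_C(0xFF) << shift);   /* clear the target lane */
    word |= (uint32_t)val << shift;       /* insert the new byte */
    *reg = word;                          /* one 32-bit write */
}
```

This is itself a read-modify-write, so it inherits the race described in the Read-Modify-Write section if an ISR touches the same register.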

#STM32 Register Bit Macros

GPIO->MODER &= ~(GPIO_MODER_MODER5_Msk);   // clear
GPIO->MODER |=  (0x2U << GPIO_MODER_MODER5_Pos);   // set AF mode (binary 10)

_Msk denotes the mask and _Pos the shift amount. This is the CMSIS naming convention.
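The clear-then-set pair can be folded into one helper that performs a single read and a single write, so the field never passes through an intermediate value. A sketch under the assumption of a 2-bit field at bits 11:10; the `DEMO_*` macros are stand-ins for the device-header definitions, and `reg_modify` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for a device header's _Pos/_Msk pair: a 2-bit field at bits 11:10. */
#define DEMO_MODER5_Pos 10u
#define DEMO_MODER5_Msk (UINT32_C(0x3) << DEMO_MODER5_Pos)

/* Update one field of a register: read once, clear the field, insert the
 * new value, write once. */
static void reg_modify(volatile uint32_t *reg, uint32_t msk, unsigned pos, uint32_t val)
{
    assert(((val << pos) & ~msk) == 0);   /* value must fit inside the field */
    *reg = (*reg & ~msk) | (val << pos);
}
```

Usage would look like `reg_modify(&GPIO->MODER, GPIO_MODER_MODER5_Msk, GPIO_MODER_MODER5_Pos, 0x2U)`. It remains a read-modify-write and is not atomic with respect to interrupts.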

#Common Mistakes

⚠️ Register access without volatile

*(uint32_t*)0x40000000 = 1;
/* The compiler may treat this as a dead store and delete it */

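⚠️ Enabling the cache and forgetting about MMIO

On Cortex-M7, once the D-cache is enabled, an MMIO region can end up cacheable, for example when its memory attributes treat it as Normal memory. Register reads then return stale values.

Use the MPU to mark the MMIO region explicitly as non-cacheable (Device).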

⚠️ Missing barrier

clock_enable();
peripheral_use();   // access before the clock enable has taken effect can fail or fault

Put a DSB between the two.

⚠️ Using bit-band outside the bit-band region

*BITBAND(&heap_var, 0) = 1;   // heap_var is not in the bit-band region, so the behavior is undefined

Bit-banding works only in 0x20000000-0x200FFFFF (SRAM) and 0x40000000-0x400FFFFF (peripherals).

#Summary

MMIO is mapped with a Device memory type: uncached, with strict ordering.
Linux provides ioremap (Device) and ioremap_wc (write-combining).
volatile is for the compiler; barriers (DMB, DSB, ISB) are for the hardware.
Use BSRR or bit-banding to avoid RMW races.
Prepare the DMA buffer, issue a DSB, then start the transfer.
On PCIe, flush posted writes with a read.

The next post covers Peripheral Clock.
