Modern Embedded Recipes · 44/152

Embedded DMA Basics: Memory-to-Memory, Peripheral, and Circular Mode

April 13, 2026 · Hawk · 6 min read

#One-line summary

"DMA = moving data between memory and memory/peripheral without the CPU." You only need to decide four things: source, destination, length, and trigger.

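#When to use it

Receiving 1 KB over UART, collecting 1024 ADC samples, or pushing a framebuffer to an LCD over SPI can cost tens of milliseconds of CPU time if the CPU moves every byte itself. With DMA, data flows between the peripheral and memory while the CPU does other work. Throughput goes up, the CPU responds to other events sooner, and power consumption tends to drop because the CPU can idle during the transfer.

This article uses the STM32F4 DMA controller as the reference and covers all three patterns: peripheral-to-memory, memory-to-peripheral, and memory-to-memory.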

#Core concepts

#DMA architecture

Block	Role
DMA Controller	Stream0 to Stream7; each stream selects its trigger source through a channel (UART, SPI, ADC, TIM, etc.)
AHB bus	Connects the DMA to the bus matrix
Bus matrix	Routes accesses to SRAM and peripherals
SRAM	Where the data buffers live
Peripheral	Source or destination

The STM32F4 has two controllers, DMA1 and DMA2, with 8 streams each, and each stream picks its trigger source from 8 channels. Which peripheral uses which stream and channel is listed in the DMA request mapping table of the reference manual.

#Three transfer modes

Peripheral → Memory: ADC results, UART RX, SPI RX.
Memory → Peripheral: UART TX, SPI TX, DAC waveform.
Memory → Memory: memcpy acceleration (started by software, with no peripheral trigger). On the STM32F4 only DMA2 supports this mode.
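As a rough illustration of how the direction and trigger choice end up in a stream's configuration, the sketch below packs a channel number and a direction into a value shaped like DMA_SxCR. The bit positions (CHSEL at bits 27:25, DIR at bits 7:6) follow the STM32F4 layout used by the code examples in this article; the enum and function names are hypothetical, and the code only computes a number, so it runs on a host as well.

```c
#include <stdint.h>

/* Direction codes for the DIR field (bits 7:6 of DMA_SxCR on STM32F4). */
typedef enum {
    DMA_DIR_P2M = 0u,  /* peripheral -> memory */
    DMA_DIR_M2P = 1u,  /* memory -> peripheral */
    DMA_DIR_M2M = 2u   /* memory -> memory, started by software */
} dma_dir_t;

#define DMA_CR_DIR_POS    6u
#define DMA_CR_CHSEL_POS  25u

/* Hypothetical helper: build only the routing part of a CR value. */
uint32_t dma_cr_route(unsigned channel, dma_dir_t dir)
{
    uint32_t cr = (uint32_t)dir << DMA_CR_DIR_POS;

    if (dir != DMA_DIR_M2M) {
        /* A channel is only selected when a peripheral request
           is what starts each transfer. */
        cr |= ((uint32_t)channel & 7u) << DMA_CR_CHSEL_POS;
    }
    return cr;
}
```

For example, `dma_cr_route(4, DMA_DIR_P2M)` yields the `(4u << 25)` term used for USART1 RX, and `dma_cr_route(0, DMA_DIR_M2M)` yields the `(2u << 6)` term used for the software-triggered copy.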

#Circular vs Normal

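Normal: stops automatically after transferring the number of items set in NDTR.
Circular: when NDTR counts down to zero, it reloads and the transfer wraps back to the start, repeating indefinitely. Well suited to ADC sampling and UART RX buffers.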
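The difference between the two modes can be sketched with a small host-side model of the item counter. This is not a register map and not how the hardware is implemented; `dma_model_t` and its functions are hypothetical names that only mimic the observable behaviour described above (count down, then stop or reload).

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of one stream's counter; not a hardware register map. */
typedef struct {
    uint16_t ndtr;      /* items still to transfer */
    uint16_t reload;    /* value programmed before enabling */
    bool     circular;  /* models the CIRC bit */
    bool     enabled;   /* models the EN bit */
} dma_model_t;

void dma_model_start(dma_model_t *d, uint16_t items, bool circular)
{
    d->ndtr     = items;
    d->reload   = items;
    d->circular = circular;
    d->enabled  = (items != 0u);
}

/* Move one item: decrement the counter, then stop or wrap at zero. */
void dma_model_step(dma_model_t *d)
{
    if (!d->enabled) {
        return;
    }
    d->ndtr--;
    if (d->ndtr == 0u) {
        if (d->circular) {
            d->ndtr = d->reload;   /* wrap to the start of the buffer */
        } else {
            d->enabled = false;    /* normal mode: stream stops itself */
        }
    }
}

void dma_model_run(dma_model_t *d, unsigned steps)
{
    while (steps-- > 0u) {
        dma_model_step(d);
    }
}
```

Running the model for more steps than the buffer length leaves a normal-mode stream disabled with a counter of zero, while a circular stream stays enabled with the counter partway through its next pass.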

#Half/Full complete interrupt

The DMA can raise an IRQ when it reaches half of NDTR (HT) and when it reaches the full count (TC). The standard pattern, double buffering, splits a circular buffer into two halves and processes one half while the other is being filled.

offset	region	trigger event	CPU action
0	first half (512 B)	start	(DMA starts filling)
512	(HT reached)	HT IRQ	process first half (A)
1024	second half end	TC IRQ	process second half (B), wrap
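The table can be restated as two small functions: one that says which event fires after a given number of items, and one that returns the half that is safe to read for that event. This is a sketch under the assumption of an even, non-zero buffer length; `dma_evt_t`, `dma_event_after`, and `dma_ready_half` are hypothetical names for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { DMA_EVT_NONE, DMA_EVT_HT, DMA_EVT_TC } dma_evt_t;

/* Which event fires once `items_done` items in total have been written
   into a circular buffer of `len` items (len assumed even and non-zero). */
dma_evt_t dma_event_after(uint32_t items_done, uint32_t len)
{
    if (items_done == 0u) {
        return DMA_EVT_NONE;
    }
    uint32_t pos = items_done % len;
    if (pos == len / 2u) {
        return DMA_EVT_HT;   /* first half just completed */
    }
    if (pos == 0u) {
        return DMA_EVT_TC;   /* second half just completed, DMA wraps */
    }
    return DMA_EVT_NONE;
}

/* Start of the half that is safe to read after an event; NULL if none. */
const uint8_t *dma_ready_half(const uint8_t *buf, size_t len, dma_evt_t evt)
{
    switch (evt) {
    case DMA_EVT_HT: return buf;             /* DMA is now filling the second half */
    case DMA_EVT_TC: return buf + len / 2u;  /* DMA has wrapped to the first half */
    default:         return NULL;
    }
}
```

The CPU has to finish with the returned half before the DMA comes back around to it, which is one half-buffer period later.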

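#Data width and burst

PSIZE / MSIZE = 0 (byte), 1 (halfword), 2 (word)
PBURST / MBURST = single (0), incr4 (1), incr8 (2), incr16 (3)

Bursts reduce bus arbitration overhead and raise throughput. Both the source and the destination must be burst-capable, and on the STM32F4 bursts are only available in FIFO mode; in direct mode every transfer is a single.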
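To make the field codes concrete, here is a minimal sketch that converts them to byte and beat counts and packs them at their DMA_SxCR positions. The positions (PSIZE 12:11, MSIZE 14:13, PBURST 22:21, MBURST 24:23) are the STM32F4 layout; the type and function names are hypothetical.

```c
#include <stdint.h>

/* Field codes as listed above. */
typedef enum { DMA_SIZE_BYTE = 0u, DMA_SIZE_HALFWORD = 1u, DMA_SIZE_WORD = 2u } dma_size_t;
typedef enum { DMA_BURST_SINGLE = 0u, DMA_BURST_INCR4 = 1u,
               DMA_BURST_INCR8 = 2u, DMA_BURST_INCR16 = 3u } dma_burst_t;

/* Bytes moved per data item: 1, 2 or 4. */
unsigned dma_item_bytes(dma_size_t s)
{
    return 1u << (unsigned)s;
}

/* Beats per burst: 1, 4, 8 or 16. */
unsigned dma_burst_beats(dma_burst_t b)
{
    return (b == DMA_BURST_SINGLE) ? 1u : (4u << ((unsigned)b - 1u));
}

/* Pack the four fields at their DMA_SxCR positions (STM32F4 layout assumed). */
uint32_t dma_cr_width_burst(dma_size_t psize, dma_size_t msize,
                            dma_burst_t pburst, dma_burst_t mburst)
{
    return ((uint32_t)psize  << 11) | ((uint32_t)msize  << 13)
         | ((uint32_t)pburst << 21) | ((uint32_t)mburst << 23);
}
```

With halfword on both sides and single transfers, the result is `(1u << 11) | (1u << 13)`, the same two terms that appear in the ADC example.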

#Cache coherency (Cortex-M7)

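The STM32H7 and F7 have a D-cache. If the CPU writes data that is still sitting in the cache and the DMA then reads SRAM, the DMA can see stale data. In the other direction, the CPU can read a stale cached copy after the DMA has written SRAM. You have to call cache clean/invalidate explicitly.

SCB_CleanDCache_by_Addr(buf, len);   // before the DMA reads what the CPU wrote
SCB_InvalidateDCache_by_Addr(buf, len);   // before the CPU reads what the DMA wrote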
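Cache maintenance by address works on whole cache lines, and the Cortex-M7 D-cache line is 32 bytes. The sketch below computes the line-aligned span that a clean or invalidate of a given buffer actually touches. If that span is larger than the buffer, neighbouring variables share its cache lines, which is why DMA buffers on these parts are usually aligned to 32 bytes and padded to a multiple of 32. `cache_span_t` and both functions are hypothetical helpers, and the code is plain arithmetic that runs on a host.

```c
#include <stdint.h>

#define DCACHE_LINE 32u   /* Cortex-M7 D-cache line size in bytes */

typedef struct {
    uintptr_t start;  /* first byte of the first affected line */
    uint32_t  size;   /* total bytes covered, a multiple of DCACHE_LINE */
} cache_span_t;

/* Round [addr, addr + len) outwards to whole cache lines. */
cache_span_t cache_span(uintptr_t addr, uint32_t len)
{
    const uintptr_t mask = (uintptr_t)(DCACHE_LINE - 1u);
    cache_span_t s;
    uintptr_t end = (addr + len + mask) & ~mask;

    s.start = addr & ~mask;
    s.size  = (uint32_t)(end - s.start);
    return s;
}

/* Non-zero when the buffer owns its cache lines exclusively. */
int cache_span_is_exact(uintptr_t addr, uint32_t len)
{
    cache_span_t s = cache_span(addr, len);
    return s.start == addr && s.size == len;
}
```

For a 40-byte buffer at 0x20000010 the affected span is 64 bytes starting at 0x20000000, so an invalidate would also discard cached data belonging to whatever sits in the other 24 bytes.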

#Code examples

#1. ADC → memory (continuous + circular)

#include <stdint.h>
#include "stm32f4xx.h"   // CMSIS device header (register and bit definitions)

#define ADC_BUF_LEN 256
static uint16_t adc_buf[ADC_BUF_LEN];

void process(const uint16_t *samples, uint32_t count);   // application-defined consumer

void adc_dma_init(void) {
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;
    RCC->APB2ENR |= RCC_APB2ENR_ADC1EN;

    // DMA2 Stream0, Channel 0 = ADC1
    DMA2_Stream0->CR = 0;                         // disable the stream before reconfiguring
    while (DMA2_Stream0->CR & DMA_SxCR_EN);       // wait until it is really off
    DMA2->LIFCR = DMA_LIFCR_CTCIF0 | DMA_LIFCR_CHTIF0 | DMA_LIFCR_CTEIF0
                | DMA_LIFCR_CDMEIF0 | DMA_LIFCR_CFEIF0;   // clear stale Stream0 flags

    DMA2_Stream0->PAR  = (uint32_t)&ADC1->DR;
    DMA2_Stream0->M0AR = (uint32_t)adc_buf;
    DMA2_Stream0->NDTR = ADC_BUF_LEN;
    DMA2_Stream0->CR   = (0u << 25)              // channel 0
                       | DMA_SxCR_MINC           // memory increment
                       | (1u << 11)              // PSIZE = halfword
                       | (1u << 13)              // MSIZE = halfword
                       | DMA_SxCR_CIRC           // circular
                       | DMA_SxCR_HTIE | DMA_SxCR_TCIE
                       | DMA_SxCR_EN;
    NVIC_EnableIRQ(DMA2_Stream0_IRQn);

    // ADC channel/sequence setup is omitted here
    ADC1->CR2 |= ADC_CR2_DMA | ADC_CR2_DDS | ADC_CR2_CONT | ADC_CR2_ADON;
    ADC1->CR2 |= ADC_CR2_SWSTART;
}

volatile int adc_half_ready, adc_full_ready;

void DMA2_Stream0_IRQHandler(void) {
    // Only touch Stream0 flags; LISR also holds the flags of Streams 1 to 3.
    uint32_t flags = DMA2->LISR & (DMA_LISR_HTIF0 | DMA_LISR_TCIF0);
    DMA2->LIFCR = flags;                          // same bit positions in LIFCR
    if (flags & DMA_LISR_HTIF0) adc_half_ready = 1;
    if (flags & DMA_LISR_TCIF0) adc_full_ready = 1;
}

void process_adc(void) {
    if (adc_half_ready) {
        adc_half_ready = 0;
        process(adc_buf, ADC_BUF_LEN/2);          // first half
    }
    if (adc_full_ready) {
        adc_full_ready = 0;
        process(adc_buf + ADC_BUF_LEN/2, ADC_BUF_LEN/2);   // second half
    }
}

The main loop only needs the HT/TC flags to know which half the DMA has finished and which half it is filling now.

#2. UART RX circular buffer

#include <stddef.h>
#include <stdint.h>
#include "stm32f4xx.h"   // CMSIS device header

#define RX_DMA_LEN 256
static uint8_t rx_dma[RX_DMA_LEN];
static uint16_t rx_dma_read_pos;

void uart_dma_rx_init(void) {
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;        // DMA2 clock; USART1 itself is assumed to be set up

    DMA2_Stream2->CR = 0;                      // disable the stream before reconfiguring
    while (DMA2_Stream2->CR & DMA_SxCR_EN);

    DMA2_Stream2->PAR  = (uint32_t)&USART1->DR;
    DMA2_Stream2->M0AR = (uint32_t)rx_dma;
    DMA2_Stream2->NDTR = RX_DMA_LEN;
    DMA2_Stream2->CR   = (4u << 25)            // channel 4 = USART1_RX
                       | DMA_SxCR_MINC
                       | DMA_SxCR_CIRC
                       | DMA_SxCR_EN;          // PSIZE/MSIZE left at 0 = byte

    USART1->CR3 |= USART_CR3_DMAR;             // let USART1 raise RX DMA requests
}

// Polling: work out how far the DMA has written from NDTR
size_t uart_dma_rx_available(void) {
    uint16_t write_pos = RX_DMA_LEN - (uint16_t)DMA2_Stream2->NDTR;
    if (write_pos >= rx_dma_read_pos)
        return write_pos - rx_dma_read_pos;
    else
        return RX_DMA_LEN - rx_dma_read_pos + write_pos;
}

// Call only when uart_dma_rx_available() > 0
uint8_t uart_dma_rx_get(void) {
    uint8_t c = rx_dma[rx_dma_read_pos];
    rx_dma_read_pos = (rx_dma_read_pos + 1) % RX_DMA_LEN;
    return c;
}

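This pattern keeps receiving indefinitely without any IRQ. The CPU drains the buffer only when it has time. The catch is that nothing reports an overrun: if the CPU falls a full buffer behind, the DMA silently overwrites unread bytes.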
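Reading one byte at a time works, but a consumer often wants everything that has arrived so far. The sketch below copies the unread bytes out of a circular buffer in at most two contiguous chunks. `ring_read` is a hypothetical helper; the write position is passed in as a parameter (on the target it would be derived from NDTR, as in the polling function above) so that the logic can be exercised on a host.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy up to `max` unread bytes out of a circular buffer.
   write_pos is the index the producer writes next (0 .. ring_len - 1).
   Returns the number of bytes copied and advances *read_pos. */
size_t ring_read(const uint8_t *ring, size_t ring_len,
                 size_t *read_pos, size_t write_pos,
                 uint8_t *out, size_t max)
{
    size_t avail = (write_pos >= *read_pos)
                 ? write_pos - *read_pos
                 : ring_len - *read_pos + write_pos;
    size_t n = (avail < max) ? avail : max;

    /* First chunk: from read_pos towards the end of the ring. */
    size_t first = ring_len - *read_pos;
    if (first > n) {
        first = n;
    }
    memcpy(out, ring + *read_pos, first);

    /* Second chunk: whatever wrapped around to the start. */
    memcpy(out + first, ring, n - first);

    *read_pos = (*read_pos + n) % ring_len;
    return n;
}
```

Like the byte-wise version, this cannot tell an empty buffer from one that has been lapped exactly once, so the caller still has to poll often enough.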

#3. Memory-to-memory (software trigger)

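#include <stddef.h>
#include <stdint.h>
#include "stm32f4xx.h"   // CMSIS device header

// Assumes the DMA2 clock is enabled (only DMA2 supports mem→mem on the F4),
// src/dst are 4-byte aligned, n is a multiple of 4, and n / 4 <= 65535
// because NDTR is a 16-bit counter.
void dma_memcpy(void *dst, const void *src, size_t n) {
    DMA2_Stream0->CR = 0;
    while (DMA2_Stream0->CR & DMA_SxCR_EN);
    DMA2->LIFCR = DMA_LIFCR_CTCIF0 | DMA_LIFCR_CHTIF0 | DMA_LIFCR_CTEIF0
                | DMA_LIFCR_CDMEIF0 | DMA_LIFCR_CFEIF0;   // clear stale Stream0 flags

    DMA2_Stream0->PAR  = (uint32_t)src;         // in mem→mem mode PAR is the source
    DMA2_Stream0->M0AR = (uint32_t)dst;         // and M0AR is the destination
    DMA2_Stream0->NDTR = (uint32_t)(n / 4);     // count in 32-bit words
    DMA2_Stream0->CR   = (2u << 6)              // DIR = mem→mem
                       | DMA_SxCR_MINC
                       | DMA_SxCR_PINC
                       | (2u << 11) | (2u << 13)   // PSIZE = MSIZE = word
                       | DMA_SxCR_EN;

    while (!(DMA2->LISR & DMA_LISR_TCIF0));     // busy-wait for transfer complete
    DMA2->LIFCR = DMA_LIFCR_CTCIF0;
}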

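A memory-to-memory transfer can be slightly faster than a CPU memcpy (the DMA reads and writes through its own separate bus ports) and takes the copy off the CPU. That second benefit only materialises if the CPU does something else in the meantime; the version above busy-waits, so it would need a TC interrupt or a later poll instead. For small copies the setup overhead makes DMA a net loss.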
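One way to act on that trade-off is to decide per call whether DMA is worth it, and to account for the bytes that a word-based transfer cannot cover. The sketch below only does the planning arithmetic, so it runs on a host. `plan_copy` and `copy_plan_t` are hypothetical names, and the 64-byte threshold is a placeholder assumption, not a measured break-even point.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_COPY_MIN_BYTES 64u     /* hypothetical break-even point; measure on your part */
#define DMA_MAX_ITEMS      65535u  /* NDTR is a 16-bit counter */

typedef struct {
    bool   use_dma;     /* false: copy everything with the CPU */
    size_t dma_words;   /* 32-bit items handed to the DMA */
    size_t tail_bytes;  /* bytes left over for the CPU */
} copy_plan_t;

copy_plan_t plan_copy(const void *dst, const void *src, size_t n)
{
    copy_plan_t p = { false, 0u, n };

    /* Word transfers need both addresses on a 4-byte boundary. */
    bool aligned = ((((uintptr_t)dst) | ((uintptr_t)src)) & 3u) == 0u;
    if (!aligned || n < DMA_COPY_MIN_BYTES) {
        return p;
    }

    size_t words = n / 4u;
    if (words > DMA_MAX_ITEMS) {
        words = DMA_MAX_ITEMS;   /* the rest stays with the CPU in this sketch */
    }
    p.use_dma    = true;
    p.dma_words  = words;
    p.tail_bytes = n - words * 4u;
    return p;
}
```

For a 103-byte copy between aligned buffers the plan is 25 words by DMA plus a 3-byte tail for the CPU; a 16-byte copy stays entirely on the CPU.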

#Measuring / verifying operation

To check that the DMA is running, poll NDTR.

(gdb) p/d DMA2_Stream0->NDTR
$1 = 178           ← counting down from 256 to 178 (78 items transferred so far)
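NDTR counts data items rather than bytes, so turning a reading into a byte count needs the configured width. A minimal sketch of that arithmetic, with hypothetical names, assuming the reading is taken within a single pass over the buffer:

```c
#include <stdint.h>

typedef struct {
    uint32_t items;  /* data items transferred so far in this pass */
    uint32_t bytes;  /* the same amount expressed in bytes */
} dma_progress_t;

/* len_items:  value written to NDTR at setup
   ndtr_now:   value read back from NDTR
   item_bytes: 1, 2 or 4 depending on the configured data width */
dma_progress_t dma_progress(uint32_t len_items, uint32_t ndtr_now,
                            uint32_t item_bytes)
{
    dma_progress_t p;
    p.items = len_items - ndtr_now;
    p.bytes = p.items * item_bytes;
    return p;
}
```

With the halfword configuration of the ADC example, a reading of 178 out of 256 means 78 samples, or 156 bytes, have landed in the buffer.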

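For timer-triggered ADC sampling, watch the ADC trigger pin (TIM2 ETR or similar) on an oscilloscope and confirm that the sample rate matches:

TIM2 → ADC EXT trigger:

PA0 trigger: 1 µs period (1 MHz)
adc_buf fill rate: 256 samples / 256 µs = 1 sample/µs ✓

The most common reasons a DMA transfer never starts are a missing DMA enable bit on the peripheral and a missing clock enable.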
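The same arithmetic also gives the processing deadline for double buffering: the CPU has one half-buffer period to finish each half before the DMA comes back to it. A small sketch with hypothetical function names:

```c
#include <stdint.h>

/* Time for the DMA to fill `samples` items at `rate_hz`, in microseconds. */
uint32_t fill_time_us(uint32_t samples, uint32_t rate_hz)
{
    return (uint32_t)(((uint64_t)samples * 1000000u) / rate_hz);
}

/* With HT/TC double buffering, each half must be processed within
   the time it takes to fill the other half. */
uint32_t half_deadline_us(uint32_t samples, uint32_t rate_hz)
{
    return fill_time_us(samples / 2u, rate_hz);
}
```

At 1 MHz with a 256-sample buffer, the buffer fills in 256 µs and each half has to be processed within 128 µs.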

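#Common pitfalls

⚠️ Wrong channel selected

The DMA request mapping in the reference manual differs from family to family. On the F4, SPI1 RX = DMA2 Stream0/Channel 3; the F7 is similar, but other SoCs differ.

⚠️ Two peripherals on the same stream at once

A stream serves only one source at a time. If two peripherals map to the same stream, one of them has to give way (move it to another stream that also carries its request).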
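A conflict like this is easy to catch with a small table check at design time. The sketch below uses only the three mappings quoted in this article (ADC1, USART1 RX, SPI1 RX); anything else would have to come from the reference manual. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *request;     /* peripheral request name */
    unsigned    controller;  /* 1 = DMA1, 2 = DMA2 */
    unsigned    stream;      /* 0..7 */
    unsigned    channel;     /* 0..7 */
} dma_route_t;

/* Only the mappings mentioned in this article. */
static const dma_route_t routes[] = {
    { "ADC1",      2u, 0u, 0u },
    { "USART1_RX", 2u, 2u, 4u },
    { "SPI1_RX",   2u, 0u, 3u },
};

static const dma_route_t *find_route(const char *request)
{
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        if (strcmp(routes[i].request, request) == 0) {
            return &routes[i];
        }
    }
    return NULL;
}

/* True if both requests are known and land on the same controller and stream. */
bool dma_routes_conflict(const char *a, const char *b)
{
    const dma_route_t *ra = find_route(a);
    const dma_route_t *rb = find_route(b);

    if (ra == NULL || rb == NULL) {
        return false;
    }
    return ra->controller == rb->controller && ra->stream == rb->stream;
}
```

With these entries, ADC1 and SPI1 RX collide on DMA2 Stream0 even though they use different channels, while ADC1 and USART1 RX can run side by side.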

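⚠️ Width mismatch

An 8-bit PSIZE with a 16-bit MSIZE does not behave the way it looks. In direct mode the STM32F4 uses the PSIZE width on both sides, and different widths only work in FIFO mode, where data is packed and unpacked. Using the same width on both sides is the safe default.

⚠️ Buffer on the stack

If a local array is the DMA target, the DMA keeps writing into that stack memory after the function returns and the memory has been reused. Always use a static or global buffer.

⚠️ Buffer alignment

uint32_t transfers need a 4-byte aligned buffer: __attribute__((aligned(4))).

⚠️ Cortex-M7 cache coherency

On the H7 and F7, the CPU can hit in the cache and read stale data after the DMA has written SRAM. SCB_InvalidateDCache_by_Addr is mandatory, or use the MPU to mark just the buffer region as non-cacheable.

⚠️ Half-word access crossing a boundary

A 16-bit DMA write to an odd address is a misaligned access and typically ends in a bus or transfer error. Align the address to the transfer width as well.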

#Summary

DMA = source + destination + length + trigger. Start by setting those four.
Circular mode plus HT/TC IRQs is the standard double-buffer pattern.
For ADC, UART, and SPI alike, also set the peripheral's own DMA enable bit.
Buffers should be static or global, and aligned.
On Cortex-M7, use cache clean/invalidate or an MPU non-cacheable region.

The next article covers low-power modes: Sleep/Stop/Standby, wake-up sources, and µA measurement.
