Modern Embedded Recipes · 127/152

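Zynq PS-PL Communication: Choosing Between the GP, HP, and ACP Interfaces

April 20, 2026 · Hawk · 5 min read

#One-line summary

“Zynq's GP, HP, and ACP interfaces each occupy a different position along three axes: latency, throughput, and cache coherence.” Use GP for small control accesses, HP for high throughput, and ACP for cache-coherent shared data.

#When to use it

On SoC FPGAs such as Zynq-7000, Zynq UltraScale+, and Versal, you have to choose an interface whenever the *ARM PS (Processing System)* and the *PL (Programmable Logic)* exchange data. A wrong choice can cut throughput to a tenth.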

#Core concepts: interface types

The following applies to Zynq-7000 (UltraScale+ offers more variety).

PS → PL (PS is master, PL is slave):
  M_AXI_GP0, GP1                - General Purpose, 32-bit, 250 MHz
                                  → access to control registers in the PL

PL → PS (PL is master, PS is slave):
  S_AXI_GP0, GP1                - 32-bit, general access
  S_AXI_HP0..HP3                - High Performance, 64-bit, ~600 MB/s each
                                  → direct DDR access (cache bypass)
  S_AXI_ACP                     - Accelerator Coherency Port, 64-bit
                                  → coherent with the L2 cache

Interface	Width (bits)	Cache coherent	Use
M_AXI_GP	32	No	PS → PL register access
S_AXI_GP	32	No	PL → PS, small transfers
S_AXI_HP	64	No (cache bypass)	PL → DDR, high throughput
S_AXI_ACP	64	Yes (L2)	PL ↔ PS data shared through the cache

#GP — Control Register

// AXI-Lite slave on the PL side (target of M_AXI_GP)
module ctrl_regs (
    input  wire        s_aclk,
    input  wire        s_aresetn,
    // ... AXI-Lite slave signals
    output reg  [31:0] cmd,
    output reg  [31:0] args[0:7],   // array ports require SystemVerilog
    input  wire [31:0] status,
    input  wire [31:0] res[0:7]
);

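From the Cortex-A cores in the PS, PL registers are accessed with ordinary pointer writes:

volatile uint32_t *regs = (volatile uint32_t *)0x43C00000;   // AXI base address of the PL slave
regs[0] = OP_PROCESS;          // cmd
regs[1] = buf_addr;
regs[2] = buf_len;

GP suits small control accesses. Each transfer costs on the order of a microsecond, so it cannot move bulk data efficiently.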
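Indexing a raw `regs[]` array hides which offset means what. One alternative is to overlay a struct on the mapped region. The sketch below assumes a hypothetical layout that mirrors the `ctrl_regs` ports in order (`cmd`, eight `args`, `status`, eight `res`, all 32-bit); the real offsets are whatever the PL slave decodes. The `ctrl_submit` helper and the choice to write the command word last are also assumptions, reasonable only if the command write is what triggers the PL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout; the real offsets are defined by the PL design. */
typedef struct {
    volatile uint32_t cmd;      /* 0x00: command written by the PS  */
    volatile uint32_t args[8];  /* 0x04: command arguments          */
    volatile uint32_t status;   /* 0x24: status reported by the PL  */
    volatile uint32_t res[8];   /* 0x28: results reported by the PL */
} ctrl_regs_t;

/* Catch accidental padding or reordering at compile time. */
_Static_assert(offsetof(ctrl_regs_t, args) == 0x04, "args offset");
_Static_assert(offsetof(ctrl_regs_t, status) == 0x24, "status offset");
_Static_assert(offsetof(ctrl_regs_t, res) == 0x28, "res offset");
_Static_assert(sizeof(ctrl_regs_t) == 0x48, "register block size");

/* Write the arguments first and the command word last, so the PL never
 * sees a command paired with stale arguments. */
static void ctrl_submit(ctrl_regs_t *regs, uint32_t op,
                        uint32_t buf_addr, uint32_t buf_len)
{
    assert(regs != NULL);
    regs->args[0] = buf_addr;
    regs->args[1] = buf_len;
    regs->cmd = op;
}
```

On hardware, `regs` would point at the mapped base address (for example `(ctrl_regs_t *)0x43C00000` on bare-metal); in a host-side unit test it can point at an ordinary struct in RAM.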

#HP — DDR Bulk Throughput

The PL reads and writes DDR directly, bypassing the PS L1/L2 caches.

S_AXI_HP path:
  PL AXI master → HP port → DDR controller → DDR

throughput: 64-bit × 150 MHz × 4 ports = ~4.8 GB/s theoretical
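The theoretical figure is simply bus width times clock times port count. A small helper makes the arithmetic reproducible; the function name is made up for this illustration, and the result is a peak for one direction that ignores DDR efficiency and arbitration.

```c
#include <assert.h>
#include <stdint.h>

/* Peak bandwidth in MB/s (1 MB = 10^6 bytes) for `ports` AXI ports, each
 * moving `width_bits` per clock cycle at `clk_mhz`. */
static uint32_t axi_peak_mb_per_s(uint32_t width_bits, uint32_t clk_mhz,
                                  uint32_t ports)
{
    assert(width_bits % 8u == 0u);
    return (width_bits / 8u) * clk_mhz * ports;
}
```

For example, `axi_peak_mb_per_s(64, 150, 4)` gives 4800 MB/s for the four HP ports, and `axi_peak_mb_per_s(64, 150, 1)` gives 1200 MB/s for a single one.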

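// PS side: prepare the buffer (SIZE must be a multiple of 64 for aligned_alloc)
uint8_t *buf = aligned_alloc(64, SIZE);
// After the PS writes the buffer, clean (flush) it so the data reaches DDR.
// Bare-metal: Xil_DCacheFlushRange(); Linux driver: dma_sync_single_for_device().
Xil_DCacheFlushRange((INTPTR)buf, SIZE);

// Pass the buffer's physical address to the PL.
// virt_to_phys() stands in for the platform's address translation
// (identity-mapped on bare-metal).
regs[0] = OP_DMA;
regs[1] = virt_to_phys(buf);
regs[2] = SIZE;

// The PL reads from DDR through HP
// ... wait for the completion IRQ ...

// The PL writes its results to DDR. The PS must invalidate before reading.
// Bare-metal: Xil_DCacheInvalidateRange(); Linux driver: dma_sync_single_for_cpu().
Xil_DCacheInvalidateRange((INTPTR)buf, SIZE);
process(buf);

Forget the flush and the PL reads stale DDR contents; forget the invalidate and the PS reads stale cache lines.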
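Cache maintenance works on whole cache lines, so a DMA buffer that shares a line with unrelated data is a hazard: invalidating that line can discard a neighbor's dirty data. The sketch below expands a byte range to the lines it touches and checks whether a buffer owns all of them. It assumes the 32-byte line size of the Cortex-A9 caches in Zynq-7000; the type and function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* Cortex-A9 cache line size in bytes */

typedef struct {
    uintptr_t start;  /* first byte of the first affected line */
    uintptr_t end;    /* one past the last byte of the last affected line */
} line_range_t;

/* Expand [addr, addr + len) outward to whole cache lines. */
static line_range_t cache_line_range(uintptr_t addr, size_t len)
{
    line_range_t r;
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1u);

    assert(len > 0);
    r.start = addr & mask;
    r.end = (addr + len + CACHE_LINE - 1u) & mask;
    return r;
}

/* A DMA buffer is safe to invalidate only if it owns every line it touches. */
static int dma_buffer_is_line_aligned(uintptr_t addr, size_t len)
{
    line_range_t r = cache_line_range(addr, len);
    return r.start == addr && r.end == addr + len;
}
```

A 48-byte buffer at `0x1010`, for instance, touches the lines from `0x1000` up to `0x1040` and is therefore not safe to invalidate on its own.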

#ACP — Cache-Coherent

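The PL accesses memory coherently with the CPU caches, so no cache flush is needed. This holds only when the PL master marks its transactions as coherent and the memory is mapped as shareable.

S_AXI_ACP path:
  PL → ACP port → SCU (snoops L1) → L2 cache → DDR (on a miss)

throughput: *lower* than HP (~400 MB/s)
latency:    *lower* than HP (on a cache hit)

// PS side: no cache flush needed
uint8_t *buf = aligned_alloc(64, SIZE);
prepare(buf);

regs[0] = OP_ACP_DMA;
regs[1] = virt_to_phys(buf);   // placeholder for the platform's address translation
regs[2] = SIZE;

// PL reads through ACP: data comes from the cache on a hit
// PL writes through ACP: snooping keeps the caches coherent
// When the PS reads buf again, it sees coherent data automatically
process(buf);

ACP suits small, frequent transactions. Large bulk transfers risk polluting the cache.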

#Selection guide

Use case	Interface
Polling PL registers	M_AXI_GP
Up to 1 KB, mostly control	M_AXI_GP + S_AXI_GP
More than 1 KB, throughput-bound	S_AXI_HP
Frequent access, shared through the cache	S_AXI_ACP
Streaming (camera, network)	S_AXI_HP + descriptors over GP
Real-time DMA	S_AXI_HP (predictable latency)
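The table can be read as a small decision function. The sketch below is one possible encoding, not a rule from any vendor documentation: the enum, the function name, and especially the 64 KiB cutoff for ACP are assumptions chosen for illustration, and the right cutoff depends on how much of the L2 cache the CPU workload can spare.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { IF_GP, IF_HP, IF_ACP } ps_pl_if_t;

#define GP_MAX_BYTES  1024u           /* "up to 1 KB" row of the table */
#define ACP_MAX_BYTES (64u * 1024u)   /* assumed cutoff, tune per design */

/* Streaming or real-time traffic goes to HP, cache-shared data that is
 * small enough goes to ACP, control-sized transfers go to GP, and
 * everything else is bulk data for HP. */
static ps_pl_if_t choose_interface(size_t bytes, int shared_via_cache,
                                   int streaming_or_realtime)
{
    assert(bytes > 0);
    if (streaming_or_realtime)
        return IF_HP;
    if (shared_via_cache && bytes <= ACP_MAX_BYTES)
        return IF_ACP;
    if (bytes <= GP_MAX_BYTES)
        return IF_GP;
    return IF_HP;
}
```

Under these assumptions a 64-byte command block maps to GP, a 4 KB structure that both sides touch often maps to ACP, and a 1 MB frame maps to HP regardless of the sharing flag.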

#AXI-Lite slave example (M_AXI_GP side)

module mbox_lite #(
    parameter ADDR_W = 8
)(
    input  wire        clk,
    input  wire        rstn,
    // AXI-Lite slave
    input  wire [ADDR_W-1:0] s_awaddr,
    input  wire        s_awvalid,
    output wire        s_awready,
    input  wire [31:0] s_wdata,
    input  wire [3:0]  s_wstrb,
    input  wire        s_wvalid,
    output wire        s_wready,
    output wire [1:0]  s_bresp,
    output wire        s_bvalid,
    input  wire        s_bready,
    input  wire [ADDR_W-1:0] s_araddr,
    input  wire        s_arvalid,
    output wire        s_arready,
    output wire [31:0] s_rdata,
    output wire [1:0]  s_rresp,
    output wire        s_rvalid,
    input  wire        s_rready,
    // User signals
    output reg  [31:0] cmd,
    output reg  [31:0] doorbell,
    input  wire [31:0] status
);
    // ... the body gets very long; the Vivado IP wizard can generate it
endmodule

Written by hand, a single module runs to 100+ lines. Vivado's Create and Package New IP → AXI Peripheral wizard generates the boilerplate for you.

#AXI master (S_AXI_HP side)

module dma_engine (
    input  wire        clk,
    input  wire        rstn,
    // AXI master
    output reg  [31:0] m_araddr,
    output reg  [7:0]  m_arlen,        // burst length - 1 (8-bit in AXI4; AXI3 uses 4 bits)
    output reg  [2:0]  m_arsize,
    output reg  [1:0]  m_arburst,      // INCR / WRAP
    output reg         m_arvalid,
    input  wire        m_arready,
    input  wire [63:0] m_rdata,
    input  wire        m_rvalid,
    output reg         m_rready,
    input  wire        m_rlast,
    // Control
    input  wire        start,
    input  wire [31:0] src_addr,
    input  wire [31:0] len_bytes
);
    // burst read state machine
endmodule

Larger bursts give better throughput. AXI4 allows up to 256 beats per INCR burst, but the Zynq-7000 PS ports are AXI3 and accept at most 16 beats, so longer AXI4 bursts are split by the protocol converter in the interconnect.
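The address-generation part of a burst read state machine can be prototyped in software before committing it to RTL. The sketch below computes the size of each INCR burst under three constraints: the 64-bit data width of the HP port, the 16-beat AXI3 limit that applies to the Zynq-7000 PS ports, and the AXI rule that a burst must not cross a 4 KB address boundary. The function names are hypothetical, and the model assumes beat-aligned addresses and lengths.

```c
#include <assert.h>
#include <stdint.h>

#define BEAT_BYTES   8u     /* 64-bit data bus */
#define MAX_BEATS    16u    /* AXI3 burst limit on the Zynq-7000 PS ports */
#define AXI_BOUNDARY 4096u  /* a burst must not cross a 4 KB boundary */

/* Beats in the next INCR burst starting at `addr` with `remaining` bytes
 * left. Both arguments must be multiples of BEAT_BYTES. */
static uint32_t next_burst_beats(uint32_t addr, uint32_t remaining)
{
    uint32_t to_boundary, bytes;

    assert(addr % BEAT_BYTES == 0u && remaining % BEAT_BYTES == 0u);
    to_boundary = AXI_BOUNDARY - (addr % AXI_BOUNDARY);
    bytes = MAX_BEATS * BEAT_BYTES;
    if (bytes > remaining)
        bytes = remaining;
    if (bytes > to_boundary)
        bytes = to_boundary;
    return bytes / BEAT_BYTES;
}

/* Number of bursts needed to cover a whole transfer. */
static uint32_t count_bursts(uint32_t addr, uint32_t len)
{
    uint32_t n = 0;

    while (len > 0u) {
        uint32_t bytes = next_burst_beats(addr, len) * BEAT_BYTES;
        addr += bytes;
        len -= bytes;
        n++;
    }
    return n;
}
```

In RTL, the value driven on `m_arlen` would be the beat count minus one, since AXI encodes burst length that way.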

#Communicating with the PL from Linux

#Register access via /dev/mem

int fd = open("/dev/mem", O_RDWR | O_SYNC);
volatile uint32_t *regs = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0x43C00000);
regs[0] = 0x1234;

This is fine for testing. Use UIO in production.

#UIO

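int fd = open("/dev/uio0", O_RDWR);
volatile uint32_t *regs = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);

uint32_t irq_on = 1;
write(fd, &irq_on, sizeof irq_on);        /* unmask the IRQ; uio_pdrv_genirq masks it after each event */
regs[0] = OP_START;

uint32_t irq_count;
read(fd, &irq_count, sizeof irq_count);   /* blocks until the IRQ fires */

Register the PL device in the device tree:

my_accel@43c00000 {
    compatible = "generic-uio";
    reg = <0x43c00000 0x1000>;
    interrupt-parent = <&intc>;
    interrupts = <0 29 4>;    /* SPI 29 (IRQ_F2P[0]), level-high */
};

The `generic-uio` compatible string matches only when the kernel is booted with `uio_pdrv_genirq.of_id=generic-uio`.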

#DMA buffer

// Coherent buffer, typically backed by CMA
void *buf = dma_alloc_coherent(dev, SIZE, &dma_handle, GFP_KERNEL);
// dma_handle is the DMA (bus) address; pass it to the PL

The kernel maps this buffer so that no explicit cache flush or invalidate is needed.

#Measurements: throughput per interface

Measured on a Zynq Z-7020 with a 100 MHz fabric clock and 533 MHz DDR3.

Interface	Bandwidth (theoretical)	Measured (sustained)
`M_AXI_GP`	32-bit × 100 MHz	~80 MB/s (write), ~50 MB/s (read)
`S_AXI_HP`	64-bit × 150 MHz	~600 MB/s
`S_AXI_ACP`	64-bit × 150 MHz	~400 MB/s (cache hit), ~150 MB/s (miss)
4× `S_AXI_HP`	parallel	~2 GB/s aggregate

A 1080p60 camera stream (~370 MB/s) fits in a single HP port. 4K60 raw 12-bit (~3 GB/s) needs all four HP ports, and because the measured four-port aggregate is only ~2 GB/s, compression as well.
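The video figures follow from width × height × frame rate × bits per pixel. The helper below reproduces them under stated assumptions: 24 bits per pixel gives the ~370 MB/s 1080p60 figure, and ~3 GB/s for 4K60 corresponds to 48 bits per pixel, which would be three 12-bit samples each stored in a 16-bit word. That storage format is a guess at what the figure represents; a packed 12-bit Bayer stream at the same resolution would need only about 750 MB/s.

```c
#include <assert.h>
#include <stdint.h>

/* Uncompressed video bandwidth in bytes per second. */
static uint64_t video_bytes_per_sec(uint32_t width, uint32_t height,
                                    uint32_t fps, uint32_t bits_per_pixel)
{
    assert(bits_per_pixel > 0u);
    return (uint64_t)width * height * fps * bits_per_pixel / 8u;
}
```

Comparing these results against the measured column of the table, rather than the theoretical one, is what decides how many HP ports a stream needs.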

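#Common pitfalls

Missing cache invalidate

dma_read_to(buf, SIZE);    /* the PL fills buf */
process(buf);              /* the PS sees stale cache contents */

Invalidate the range before reading (`Xil_DCacheInvalidateRange` on bare-metal, `dma_sync_single_for_cpu` in a Linux driver), or use a non-cacheable region.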

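HP without write coalescing

// 100 writes of 1 byte each → 100 transactions, each paying the full handshake overhead
// one 32-byte burst ≈ 25 cycles, a difference of about 100×

Have the PL DMA group data into bursts. Issuing a separate transaction for every beat ruins throughput.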
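A rough cost model shows why coalescing matters: every transaction pays a fixed overhead for the address and response phases, while data beats cost one cycle each. The sketch below is only a model; the overhead value passed in is an assumption to be replaced with a number measured on the actual design, and the function name is invented here. It slightly overestimates when the last transaction is partial.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate cycles to move `bytes` when each transaction carries
 * `bytes_per_txn` over a bus `bus_bytes` wide and pays `overhead`
 * fixed cycles for its address and response phases. */
static uint32_t transfer_cycles(uint32_t bytes, uint32_t bytes_per_txn,
                                uint32_t bus_bytes, uint32_t overhead)
{
    uint32_t txns, beats_per_txn;

    assert(bytes_per_txn > 0u && bus_bytes > 0u);
    txns = (bytes + bytes_per_txn - 1u) / bytes_per_txn;
    beats_per_txn = (bytes_per_txn + bus_bytes - 1u) / bus_bytes;
    return txns * (overhead + beats_per_txn);
}
```

With an assumed overhead of 20 cycles on a 64-bit bus, moving 128 bytes one byte at a time costs 2688 cycles, while a single 16-beat burst costs 36.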

Using ACP for large bulk transfers

Reading 1 MB through ACP → pollutes the entire L2 cache (512 KB) → PS performance collapses

Use ACP for small, frequent accesses and HP for bulk transfers.

Sending bulk data over GP

for (int i = 0; i < 1024; i++) regs[i] = data[i];
/* → 1024 MMIO writes: 4 KB at ~80 MB/s is ~50 µs, with the CPU tied up throughout */

GP is for control only. Move bulk data with DMA over HP or ACP.

Contention between multiple masters

HP0: camera DMA
HP1: network packets
HP2: video encoder
HP3: audio

→ accesses are *serialized* at the DDR bandwidth limit, and jitter increases

Total HP throughput ≤ DDR bandwidth × 0.7 (efficiency). When the four HP ports together demand more than that, the masters see backpressure.
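This budget rule is easy to check at design time. The sketch below computes the DDR peak from clock and bus width and tests whether the summed demand of the masters fits within 70% of it. The 32-bit DDR bus width and the per-master demand figures in the usage note are assumptions for illustration; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Peak DDR bandwidth in MB/s: two transfers per clock (double data rate). */
static uint32_t ddr_peak_mb_per_s(uint32_t clk_mhz, uint32_t bus_bits)
{
    assert(bus_bits % 8u == 0u);
    return clk_mhz * 2u * (bus_bits / 8u);
}

/* Returns 1 if the summed demand fits within 70% of the DDR peak. */
static int hp_budget_ok(const uint32_t *demand_mb_per_s, size_t n,
                        uint32_t ddr_peak)
{
    uint64_t total = 0;

    assert(demand_mb_per_s != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += demand_mb_per_s[i];
    return total * 10u <= (uint64_t)ddr_peak * 7u;
}
```

For 533 MHz DDR3 on an assumed 32-bit bus the peak is 4264 MB/s, so the usable budget is about 2985 MB/s. A mix of 370, 125, 400, and 1 MB/s fits comfortably; three masters at 1200 MB/s each do not.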

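AXI handshake protocol violation

Once VALID is asserted, it must stay asserted, with the payload held stable, until READY completes the handshake. Dropping VALID midway can hang the bus.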
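The rule is easier to see in a cycle-level model of a VALID/READY source. The sketch below is a software model for reasoning and testbench checking, not RTL, and its names are invented for this example. It shows the two properties that matter: the source asserts VALID without waiting for READY, and while stalled it changes neither VALID nor the payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      valid;  /* VALID output of the source */
    uint32_t data;   /* payload, stable while valid && !ready */
} axi_src_t;

/* Advance the source by one clock edge. `ready` is the sink's READY as
 * sampled on this edge; `have_next` and `next` describe new data the
 * source could offer. Returns 1 if a transfer completed on this edge. */
static int axi_src_clock(axi_src_t *s, int ready, int have_next, uint32_t next)
{
    int fired;

    assert(s != NULL);
    fired = s->valid && ready;
    if (s->valid && !ready)
        return 0;            /* stalled: hold VALID and data unchanged */
    /* idle, or the transfer just completed: new data may be presented */
    s->valid = have_next;
    if (have_next)
        s->data = next;
    return fired;
}
```

A source that cleared `valid` in the stalled branch would be exactly the violation described above.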

#Summary

Zynq PS-PL communication uses three kinds of interface: GP (control), HP (throughput), and ACP (coherent).
GP is 32-bit at 250 MHz, for small register accesses.
HP is 64-bit at 150 MHz × 4 ports, ~2 GB/s aggregate, and bypasses the cache.
ACP is cache-coherent and suits small, frequent accesses.
With HP, flush and invalidate the cache explicitly.
Larger bursts mean higher throughput.
On Linux, UIO plus dma_alloc_coherent is the standard approach.
Always keep the DDR bandwidth limit in mind.

The next installment covers the Mailbox Protocol.
