Modern Embedded Recipes · 134/152

Vitis AI Analysis: DPU, xmodel, VART

April 20, 2026 · Hawk · 5 min read

#One-Line Summary

**“Vitis AI is Xilinx's INT8 inference engine, the DPU (Deep Learning Processor Unit), together with the toolchain built for it.”** A TensorFlow or PyTorch model is quantized, compiled into an xmodel, and run through VART.

#When to Use It

It is the de facto standard for running neural network inference on Zynq UltraScale+ MPSoC, the Kria K26 SoM, and Versal AI devices. Dev kits such as the ZCU104 and KV260 are the typical platforms.

The point is to do deep learning on FPGA fabric, without a GPU or NPU, while getting INT8 throughput at low power. ResNet-50 runs on the KV260 at roughly 150 fps at about 5 W.

#Core Concept: DPU

DPU = Deep Learning Processor Unit
- Xilinx IP block (RTL)
- Instantiated in the FPGA fabric
- INT8 acceleration (convolution, pooling, FC, activation, ...)
- DPUCZDX8G: Zynq UltraScale+
- DPUCAHX8H: Alveo
- DPUCVDX8G: Versal

The DPU is a soft-core accelerator with a fixed instruction set. An xmodel carries the DPU instruction stream, so it relates to the DPU much as a compiled binary relates to a CPU.

DPU options:

Option	Performance	Use case
B512	~256 GOPS	Small models
B1024	-	-
B2304	~1100 GOPS	-
B4096	~2000 GOPS	Standard on ZCU104 and KV260
B8192	~4000 GOPS	Large devices

LUT, DSP, and BRAM usage varies with the option. The KV260 usually uses B4096.
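The number in each option name is the peak count of operations the DPU core issues per clock cycle, so peak throughput is that number times the DPU clock. The sketch below is a rough sanity check of the table under that reading. The 0.5 GHz clock is an assumption inferred from the table's own figures, not a value the article states, and real designs may clock the DPU differently.

```python
# Peak throughput = (operations per cycle) x (clock frequency in GHz).
ASSUMED_CLOCK_GHZ = 0.5  # assumption: back-computed from the table, not a datasheet value


def parse_option(name: str) -> int:
    """Extract the per-cycle operation count from an option name like 'B4096'."""
    if not name.startswith("B") or not name[1:].isdigit():
        raise ValueError(f"unexpected DPU option name: {name!r}")
    return int(name[1:])


def peak_gops(ops_per_cycle: int, clock_ghz: float = ASSUMED_CLOCK_GHZ) -> float:
    """Peak giga-operations per second at the given clock."""
    return ops_per_cycle * clock_ghz


if __name__ == "__main__":
    for option in ("B512", "B2304", "B4096", "B8192"):
        print(option, peak_gops(parse_option(option)), "GOPS")
```

These are peak figures only; sustained throughput also depends on memory bandwidth and how well the layers map onto the core.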

#Overall Flow

1. Train the model (TF/PyTorch)
2. Quantize (FP32 → INT8); a calibration dataset is required
3. Compile (quantized model → xmodel); the DPU arch is specified here
4. Deploy: run through the VART API

#Step 1: Quantize

# PyTorch
import torch
import torchvision
from pytorch_nndct.apis import torch_quantizer

model = torchvision.models.resnet50(pretrained=True)
model.eval()

# Dummy input that fixes the input shape
input_shape = (1, 3, 224, 224)
dummy = torch.randn(input_shape)
device = torch.device('cpu')

# Calibration mode
quantizer = torch_quantizer('calib', model, dummy, device=device)
quant_model = quantizer.quant_model

# Run inference over 100 to 1000 calibration images
# (calib_loader is a user-supplied loader that yields image batches)
with torch.no_grad():
    for img in calib_loader:
        quant_model(img)

# Writes the calibration result (quant_info.json) to the quantizer output directory
quantizer.export_quant_config()

# Re-create the quantizer in test mode; it picks up the calibration
# result from the same output directory
quantizer = torch_quantizer('test', model, dummy, device=device)
quant_model = quantizer.quant_model

# Verify accuracy (evaluate and val_loader are user-supplied)
acc = evaluate(quant_model, val_loader)
print(f"Quantized accuracy: {acc}")

# Export the xmodel
quantizer.export_xmodel()

Calibrate with 100 to 1000 representative images. The closer they are to the deployment data distribution, the smaller the accuracy loss.
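To see why the calibration set matters, here is a simplified sketch of power-of-two INT8 quantization, where each tensor is scaled by 2^fix_point. It is an illustration under assumptions, not the quantizer's actual algorithm: `choose_fix_point` is a hypothetical helper that picks the largest fix_point whose INT8 range still covers the largest calibration value.

```python
import math

INT8_MIN, INT8_MAX = -128, 127


def choose_fix_point(calib_values):
    """Largest fix_point such that max|x| * 2**fix_point still fits in INT8."""
    peak = max(abs(v) for v in calib_values)
    if peak == 0:
        return 7  # arbitrary choice for an all-zero tensor
    return math.floor(math.log2(INT8_MAX / peak))


def quantize(x, fix_point):
    """Scale by 2**fix_point, round, and saturate to the INT8 range."""
    q = round(x * 2 ** fix_point)
    return max(INT8_MIN, min(INT8_MAX, q))


def dequantize(q, fix_point):
    """Map an INT8 value back to a float."""
    return q / 2 ** fix_point


if __name__ == "__main__":
    typical = [0.2, -1.0, 0.5]            # calibration data matching deployment
    with_outlier = [0.2, -1.0, 8.0]       # one unrepresentative sample
    for name, data in (("typical", typical), ("outlier", with_outlier)):
        fp = choose_fix_point(data)
        err = abs(dequantize(quantize(0.3, fp), fp) - 0.3)
        print(f"{name}: fix_point={fp}, error on 0.3 = {err:.4f}")
    # A deployment value outside the calibrated range saturates:
    print("8.0 at fix_point 6 ->", dequantize(quantize(8.0, 6), 6))
```

An outlier in the calibration set coarsens the step size for every value, while a calibration set that misses the real range causes saturation at deployment. Both effects show up as accuracy loss.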

#Step 2: Compile

vai_c_xir \
    --xmodel    ResNet50_int.xmodel \
    --arch      /opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json \
    --output_dir compiled \
    --net_name  resnet50

arch.json describes the DPU instance being targeted, so it differs per board: ZCU104, KV260, or a custom board.

Output: compiled/resnet50.xmodel, which holds the DPU instruction stream plus the weights.
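When the same model is compiled for several boards, it can help to assemble the vai_c_xir command line in a script. The sketch below only builds the argument list from the flags shown above; `build_compile_cmd` and `compiled_path` are hypothetical helpers, and actually running the command (for example with `subprocess.run`) assumes the Vitis AI environment is active.

```python
import shlex
from pathlib import Path


def build_compile_cmd(xmodel, arch_json, output_dir, net_name):
    """Return the vai_c_xir invocation as an argument list (nothing is executed)."""
    return [
        "vai_c_xir",
        "--xmodel", str(xmodel),
        "--arch", str(arch_json),
        "--output_dir", str(output_dir),
        "--net_name", net_name,
    ]


def compiled_path(output_dir, net_name):
    """Path of the compiled xmodel, following the <output_dir>/<net_name>.xmodel pattern above."""
    return Path(output_dir) / f"{net_name}.xmodel"


if __name__ == "__main__":
    cmd = build_compile_cmd(
        "ResNet50_int.xmodel",
        "/opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json",
        "compiled",
        "resnet50",
    )
    print(shlex.join(cmd))
    print(compiled_path("compiled", "resnet50"))
```

Keeping the arch.json path as a parameter makes it harder to deploy an xmodel compiled for the wrong board.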

#Step 3: Run with VART

C++ API:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>
#include <vart/runner.hpp>
#include <xir/graph/graph.hpp>

// load_image, make_buf, and argmax_int8 are application helpers (not shown).

int main() {
    // Load the compiled model and locate the DPU subgraph
    auto graph = xir::Graph::deserialize("compiled/resnet50.xmodel");
    auto subgraphs = graph->get_root_subgraph()->children_topological_sort();
    auto dpu_sg = std::find_if(subgraphs.begin(), subgraphs.end(),
        [](const xir::Subgraph *s) {
            return s->has_attr("device") &&
                   s->get_attr<std::string>("device") == "DPU";
        });
    if (dpu_sg == subgraphs.end()) {
        fprintf(stderr, "no DPU subgraph in xmodel\n");
        return 1;
    }

    auto runner = vart::Runner::create_runner(*dpu_sg, "run");

    // Input / output tensor info
    auto in_tensors  = runner->get_input_tensors();
    auto out_tensors = runner->get_output_tensors();

    // Allocate buffers (released when the vectors go out of scope)
    std::vector<int8_t> in_buf(in_tensors[0]->get_data_size());
    std::vector<int8_t> out_buf(out_tensors[0]->get_data_size());

    // Preprocess + load
    load_image("test.jpg", in_buf.data());

    // Run
    std::vector<vart::TensorBuffer*> ins  = { make_buf(in_buf.data(), in_tensors[0]) };
    std::vector<vart::TensorBuffer*> outs = { make_buf(out_buf.data(), out_tensors[0]) };
    auto job = runner->execute_async(ins, outs);
    runner->wait(static_cast<int>(job.first), -1);

    // Postprocess
    int top = argmax_int8(out_buf.data(), 1000);
    printf("class %d\n", top);
    return 0;
}

The Python API is nearly identical.

import vart
import xir
import numpy as np

g = xir.Graph.deserialize("compiled/resnet50.xmodel")
sg = next(s for s in g.get_root_subgraph().toposort_child_subgraph()
          if s.has_attr("device") and s.get_attr("device") == "DPU")
runner = vart.Runner.create_runner(sg, "run")

# One INT8 buffer per input and output tensor
inputs  = [np.empty(tuple(t.dims), dtype=np.int8) for t in runner.get_input_tensors()]
outputs = [np.empty(tuple(t.dims), dtype=np.int8) for t in runner.get_output_tensors()]

# preprocess() is an application helper; it must return INT8 data in the input layout
inputs[0][:] = preprocess(img)

job = runner.execute_async(inputs, outputs)
runner.wait(job)

print("top:", outputs[0].argmax())
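One detail the snippets above leave to the preprocessing helper is scaling. The DPU consumes INT8 data, and each tensor exposes its fixed-point position as a `fix_point` attribute, so float inputs are multiplied by 2^fix_point on the way in and raw outputs divided by 2^fix_point on the way out. The sketch below shows that arithmetic in plain Python. The mean, std, and fix_point values are made up for illustration; on a real runner the fix_point would come from `tensor.get_attr("fix_point")`.

```python
def preprocess_pixel(pixel, mean, std, fix_point):
    """uint8 pixel -> normalized float -> INT8 value at the given fix_point."""
    normalized = (pixel - mean) / std
    q = round(normalized * 2 ** fix_point)
    return max(-128, min(127, q))


def dequantize_outputs(raw, fix_point):
    """Raw INT8 outputs -> float values."""
    return [q / 2 ** fix_point for q in raw]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    in_fix_point = 6    # made-up value for illustration
    out_fix_point = 2   # made-up value for illustration

    row = [0, 128, 192]
    int8_row = [preprocess_pixel(p, 128.0, 128.0, in_fix_point) for p in row]
    print(int8_row)

    raw_logits = [12, -4, 40]
    logits = dequantize_outputs(raw_logits, out_fix_point)
    # Dividing by a positive constant does not change the ordering,
    # so argmax on the raw INT8 values gives the same class.
    print(argmax(raw_logits), argmax(logits))
```

This is why taking `argmax` directly on the INT8 output works for classification, while anything that needs real values (softmax probabilities, box coordinates) has to dequantize first.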

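#Step 4: Multi-thread Throughput

On the DPU, one model instance can keep several frames in flight at the same time. Submitting jobs from multiple threads raises throughput.

// running is an application-owned flag (for example std::atomic<bool>);
// queue_pop and result_push are application helpers.
auto runner = vart::Runner::create_runner(*dpu_sg, "run");

std::vector<std::thread> ts;
for (int i = 0; i < 4; i++) {
    ts.emplace_back([&]() {
        while (running) {
            auto img = queue_pop();
            // Per-thread input/output buffers (elided in the original)
            std::vector<vart::TensorBuffer*> ins{...}, outs{...};
            auto job = runner->execute_async(ins, outs);
            runner->wait(job.first, -1);
            result_push(outs);
        }
    });
}

// Join the workers before the runner is destroyed
for (auto &t : ts) t.join();

4 threads × ResNet-50 reaches roughly 600 fps on the KV260, about 4× the single-thread figure.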
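The same worker-pool shape can be sketched in Python with the standard library. `FakeRunner` is a stand-in invented for this example so the code runs without DPU hardware; it mimics only the execute_async/wait call pattern and says nothing about real performance.

```python
import queue
import threading


class FakeRunner:
    """Hypothetical stand-in for a VART runner (not the real API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_id = 0

    def execute_async(self, inputs, outputs):
        with self._lock:              # job ids must be unique across threads
            job_id = self._next_id
            self._next_id += 1
        outputs[0] = inputs[0] * 2    # pretend "inference"
        return job_id

    def wait(self, job_id):
        return 0                      # the fake job is already complete


def worker(runner, frames, results):
    """Pull frames until a None sentinel arrives, run each one, push the result."""
    while True:
        item = frames.get()
        if item is None:
            break
        frame_id, frame = item
        ins, outs = [frame], [None]   # per-thread buffers
        job = runner.execute_async(ins, outs)
        runner.wait(job)
        results.put((frame_id, outs[0]))


def run_pool(frames_in, num_threads=4):
    """Process all frames with a shared runner and return results in input order."""
    runner = FakeRunner()
    frames, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(runner, frames, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for i, frame in enumerate(frames_in):
        frames.put((i, frame))
    for _ in threads:
        frames.put(None)              # one sentinel per worker
    for t in threads:
        t.join()
    collected = {}
    while not results.empty():
        i, r = results.get()
        collected[i] = r
    return [collected[i] for i in range(len(frames_in))]


if __name__ == "__main__":
    print(run_pool([1, 2, 3, 4, 5]))
```

Results are tagged with a frame id because workers finish out of order; a real pipeline needs the same bookkeeping if downstream code expects frames in sequence.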

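#YOLOv5 Example: Detection Model

from pytorch_nndct.apis import torch_quantizer
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.model[-1].export = True   # detach the detect layer
model.eval()

dummy = torch.randn(1, 3, 640, 640)
quantizer = torch_quantizer('calib', model, dummy)
quant_model = quantizer.quant_model

# calib_loader is a user-supplied loader that yields image batches
for img in calib_loader:
    quant_model(img)

quantizer.export_quant_config()

# Switch to test mode before exporting, as in Step 1
quantizer = torch_quantizer('test', model, dummy)
quant_model = quantizer.quant_model
quant_model(dummy)              # one forward pass before export
quantizer.export_xmodel()

vai_c_xir --xmodel YOLOv5s_int.xmodel \
          --arch /opt/.../KV260/arch.json \
          --output_dir compiled --net_name yolov5s

After VART execution, postprocessing (NMS, box decode) runs on the CPU.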
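As an illustration of that CPU-side postprocessing, here is a minimal greedy NMS in plain Python. It is a generic textbook version, not YOLOv5's own implementation; boxes are assumed to be (x1, y1, x2, y2) tuples and the 0.45 IoU threshold is only an example value.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

This version is quadratic in the number of boxes, which is one reason postprocessing can dominate on an embedded CPU when the score threshold lets too many candidates through.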

#Division of Work Between DPU and CPU

[CPU]  Preprocess (resize, normalize, BGR→RGB)
       ↓
[DPU]  Backbone, head (conv, pool, activation, ...)
       ↓
[CPU]  Postprocess (NMS, sigmoid, box decode)

If CPU pre/post-processing is the bottleneck, speed it up with optimized OpenCV, OpenMP, or NEON SIMD.

#Profile

# DPU runtime profile
xdputil benchmark resnet50.xmodel 1   # 1 thread
xdputil benchmark resnet50.xmodel 4   # 4 threads

ResNet-50 KV260:
1 thread:  150 fps, 6.6 ms/frame
4 threads: 600 fps, 6.7 ms/frame (same latency, 4× throughput)
8 threads: 750 fps, ~13 ms (queue backing up)

Think of the thread count as queue depth, not as a number of cores.
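The queue-depth reading can be checked with Little's law: steady-state throughput is roughly the number of jobs in flight divided by the latency per job. The sketch below applies it to the three rows above, assuming each thread keeps exactly one job in flight. The 1- and 4-thread rows line up closely. The 8-thread row works out to about 615 fps at 13 ms rather than 750, which suggests the ~13 ms figure is rounded; 750 fps with 8 jobs in flight corresponds to about 10.7 ms.

```python
def throughput_fps(in_flight, latency_ms):
    """Little's law: throughput = jobs in flight / latency."""
    return in_flight * 1000.0 / latency_ms


def implied_latency_ms(in_flight, fps):
    """Little's law rearranged: latency = jobs in flight / throughput."""
    return in_flight * 1000.0 / fps


if __name__ == "__main__":
    # (threads, reported latency in ms, reported fps) taken from the benchmark rows
    rows = [(1, 6.6, 150), (4, 6.7, 600), (8, 13.0, 750)]
    for threads, latency_ms, reported in rows:
        predicted = throughput_fps(threads, latency_ms)
        print(f"{threads} threads: predicted {predicted:.0f} fps, reported {reported} fps")
    print(f"8 threads at 750 fps implies {implied_latency_ms(8, 750):.1f} ms per frame")
```

Once adding threads increases latency instead of throughput, the queue is full and extra threads only add waiting time.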

#DPU IP Instantiation

Instantiating the DPU IP in the fabric with Vivado:

1. Add the Vitis AI DPU IP to Vivado
2. Configure: B4096, single or multiple DPU cores, RAM/URAM usage
3. Connect AXI: four M_AXI_HP ports (DDR), M_AXI_GP (control)
4. Generate the bitstream
5. Build the PetaLinux image with the DPU driver included

Pre-built KV260 and ZCU104 images are available, so start with those unchanged.

#Common Pitfalls

Too little calibration data

Calibrating with 100 images or fewer can cost 5 to 10% accuracy. 500 to 1000 images are recommended.

Unsupported op

quantizer = torch_quantizer('calib', model, dummy)
# WARNING: op 'LayerNorm' not supported by DPU → CPU subgraph

Ops the DPU does not support fall back to the CPU. Make the model DPU-friendly (for example, BatchNorm instead of LayerNorm) or fuse the op into its neighbors.
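To see why a single unsupported op hurts, the toy partitioner below groups a linear op sequence into consecutive DPU and CPU runs. It is a deliberately simplified model: the real compiler partitions a graph rather than a flat list, and the `DPU_OPS` set is an illustrative assumption, not the DPU's actual supported-op list.

```python
from itertools import groupby

# Assumption for illustration only; consult the DPU documentation for the real list.
DPU_OPS = {"conv2d", "relu", "maxpool", "batchnorm", "add"}


def partition(ops, supported=DPU_OPS):
    """Group consecutive ops by whether the DPU supports them."""
    return [
        ("DPU" if on_dpu else "CPU", list(group))
        for on_dpu, group in groupby(ops, key=lambda op: op in supported)
    ]


def count_handoffs(subgraphs):
    """Number of DPU/CPU boundaries, each implying a data transfer."""
    return max(0, len(subgraphs) - 1)


if __name__ == "__main__":
    model = ["conv2d", "relu", "layernorm", "conv2d", "relu"]
    parts = partition(model)
    for device, ops in parts:
        print(device, ops)
    print("handoffs:", count_handoffs(parts))
```

One unsupported op in the middle turns a single DPU subgraph into three subgraphs with two handoffs, so the cost is the extra transfers and scheduling as well as the op itself running on the CPU.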

Fixed input shape

The DPU supports only fixed shapes. For a dynamic-shape model, fix the input at the maximum shape and pad smaller inputs up to it.
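A minimal sketch of the pad-to-maximum approach, using nested lists so it runs without extra libraries. `pad_to_fixed` is a hypothetical helper; padding on the bottom and right with zeros is one common choice, not something the article prescribes.

```python
def pad_to_fixed(image, target_h, target_w, fill=0):
    """Pad a 2-D image (list of rows) to target_h x target_w.

    Returns the padded image and the original (height, width), which is
    needed later to map results back to the unpadded region.
    """
    h = len(image)
    w = len(image[0]) if h else 0
    if h > target_h or w > target_w:
        raise ValueError("input exceeds the fixed shape; resize or tile it first")
    padded = [list(row) + [fill] * (target_w - w) for row in image]
    padded += [[fill] * target_w for _ in range(target_h - h)]
    return padded, (h, w)


if __name__ == "__main__":
    small = [[1, 2], [3, 4]]
    padded, original_shape = pad_to_fixed(small, 3, 4)
    for row in padded:
        print(row)
    print("original shape:", original_shape)
```

The padded area still costs DPU time, so choosing a maximum shape far above the typical input wastes throughput.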

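Not using multiple DPUs

The KV260 has a single DPU, but a larger device such as the ZU19EG can host two. Assign the two instances to different streams.

Memory bandwidth

DDR bandwidth is the DPU's bottleneck. Use all four AXI HP ports. Throughput drops when other masters (camera, network) contend for the same bandwidth.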

CPU pre/post bottleneck

DPU: 6 ms inference
CPU: 20 ms preprocess + 10 ms NMS = 30 ms

total: 36 ms → 28 fps (the 6 ms DPU time barely matters)

Measure with a profiler. Accelerate preprocessing with GStreamer or a dedicated fabric block.
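The arithmetic above assumes the three stages run back to back for each frame. If they are overlapped in a pipeline instead (for example, one thread per stage), the slowest stage sets the frame rate. A small sketch of both cases, assuming perfect overlap and no queueing overhead, which real systems only approximate:

```python
def serial_fps(stage_ms):
    """Frame rate when every stage runs back to back for each frame."""
    return 1000.0 / sum(stage_ms.values())


def pipelined_fps(stage_ms):
    """Frame rate when stages overlap; the slowest stage is the limit."""
    return 1000.0 / max(stage_ms.values())


def bottleneck(stage_ms):
    """Name of the slowest stage."""
    return max(stage_ms, key=stage_ms.get)


if __name__ == "__main__":
    stages = {"preprocess": 20.0, "dpu": 6.0, "nms": 10.0}
    print(f"serial:    {serial_fps(stages):.1f} fps")
    print(f"pipelined: {pipelined_fps(stages):.1f} fps")
    print("bottleneck:", bottleneck(stages))
```

Even with ideal pipelining the 20 ms preprocess caps the system at 50 fps, far below what the DPU alone could sustain, which is why the preprocess stage is the first thing to accelerate.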

#Summary

- Vitis AI = DPU IP + toolchain (quantize, compile, runtime).
- Quantize: FP32 → INT8; needs a calibration dataset of 100 to 1000 images.
- Compile: quantized model → xmodel (DPU instructions).
- Runtime: VART API (C++/Python).
- Multi-threading typically gives about 4× throughput.
- The DPU is a soft core (RTL IP) instantiated in the FPGA fabric.
- B4096 is the standard option on the KV260 and ZCU104.
- Always profile for CPU pre/post bottlenecks.

The next installment covers OpenCL on FPGA.
