Modern Embedded Recipes · 141/152

TFLite Micro Analysis: Op Resolver, Tensor Arena, Cortex-M

April 21, 2026 · Hawk · 5 min read

recipes edge-ai tflite-micro mcu cortex-m

#One-line summary

"TFLite Micro is the ML runtime that fits on an MCU." It runs in roughly 100 KB of memory, uses no malloc, has no OS dependency, and defaults to INT8 quantized models.

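#When to use it

It is the standard choice for running small neural networks (keyword spotting, person detection, gesture recognition, anomaly detection) on MCUs with anywhere from kilobytes to tens of megabytes of RAM: Cortex-M4/M7/M33/M55, RISC-V MCUs, ESP32, and similar parts.

Paired with an ARM Ethos-U NPU, for example Cortex-M55 + Ethos-U55, it delivers MCU-class AI inference, and 24/7 always-on inference on battery power becomes feasible.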

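#Core concepts: design principles

No dynamic memory after init (no malloc/free)
A single .tflite model stored in flash as a binary
One contiguous tensor arena, reused throughout inference
An op resolver links only the ops in use, which saves flash
C++ static linking

These principles suit environments that are real-time, deterministic, and short on memory.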

#Model conversion flow

import tensorflow as tf

# Train
model = tf.keras.Sequential([...])
model.compile(...)
model.fit(...)

# Quantize to INT8 (full-integer quantization)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# rep_data_gen: generator that yields calibration samples (defined elsewhere)
converter.representative_dataset = rep_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type  = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

A .tflite file is a FlatBuffer, so it can be placed in MCU flash as-is and read in place.

#.tflite → C array

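xxd -i model.tflite > model.h

unsigned char model_tflite[] = {
    0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33,
    /* ... */
};
unsigned int model_tflite_len = 24512;

This array is what ends up in MCU flash. Note that xxd emits a non-const array. On most toolchains it has to be declared const (and is usually aligned as well, for example with alignas(16)) so that the linker keeps it in flash instead of copying it into RAM at startup.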
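A quick sanity check on the embedded array can catch a truncated or wrong file before the interpreter ever sees it. The sketch below looks for the "TFL3" FlatBuffer file identifier at byte offset 4, which is where the sample bytes above show it (0x54 0x46 0x4c 0x33). The helper name has_tflite_identifier is hypothetical and is not part of TFLite Micro.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns true when the buffer carries the "TFL3" file identifier at
// byte offset 4. Hypothetical helper; not part of TFLite Micro.
bool has_tflite_identifier(const unsigned char* data, std::size_t len) {
    assert(data != nullptr);
    if (len < 8) return false;  // too short for root offset + identifier
    return std::memcmp(data + 4, "TFL3", 4) == 0;
}
```

Under these assumptions it could be called as has_tflite_identifier(model_tflite, model_tflite_len) at the top of setup(). It only checks the header bytes, so it complements the schema version check rather than replacing it.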

#TFLite Micro Basics

#include <cstddef>
#include <cstdint>

// Note: some recent upstream tflite-micro versions may no longer ship
// AllOpsResolver; MicroMutableOpResolver is the explicit alternative.
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_log.h"   // MicroPrintf
#include "tensorflow/lite/schema/schema_generated.h"
#include "model.h"

// Input samples, provided elsewhere by the application
extern const int8_t sample_input[];

// 1. Memory arena
constexpr int kArenaSize = 100 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

// 2. Model
static const tflite::Model *model = nullptr;

// 3. Interpreter
static tflite::MicroInterpreter *interp = nullptr;

void setup(void) {
    model = tflite::GetModel(model_tflite);
    if (model->version() != TFLITE_SCHEMA_VERSION) {
        MicroPrintf("Model schema version mismatch\n");
        return;
    }

    static tflite::AllOpsResolver resolver;
    static tflite::MicroInterpreter static_interp(
        model, resolver, tensor_arena, kArenaSize);

    TfLiteStatus status = static_interp.AllocateTensors();
    if (status != kTfLiteOk) {
        MicroPrintf("AllocateTensors failed\n");
        return;
    }
    interp = &static_interp;   // published only after a successful setup
}

void loop(void) {
    if (interp == nullptr) return;   // setup() did not complete

    // Get input tensor
    TfLiteTensor *input = interp->input(0);

    // Fill input (INT8)
    for (size_t i = 0; i < input->bytes; i++) {
        input->data.int8[i] = sample_input[i];
    }

    // Invoke
    if (interp->Invoke() != kTfLiteOk) return;

    // Read output (assumes shape [1, num_classes])
    TfLiteTensor *output = interp->output(0);
    int8_t max_value = output->data.int8[0];
    int max_index = 0;
    for (int i = 1; i < output->dims->data[1]; i++) {
        if (output->data.int8[i] > max_value) {
            max_value = output->data.int8[i];
            max_index = i;
        }
    }
    MicroPrintf("class=%d score=%d\n", max_index, max_value);
}

That is the whole program. No malloc, no OS.
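The score printed above is a raw INT8 value, not a probability. A minimal sketch of turning it back into a real number follows, using the affine mapping real = scale * (q - zero_point). In TFLite Micro the two parameters would come from output->params.scale and output->params.zero_point; here they are plain arguments so the block stays self-contained. The helper names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// real = scale * (q - zero_point): the affine mapping used by quantized tensors.
float dequantize(std::int8_t q, float scale, int zero_point) {
    return scale * static_cast<float>(static_cast<int>(q) - zero_point);
}

// Index of the largest INT8 score; ties keep the first occurrence.
std::size_t argmax_int8(const std::int8_t* scores, std::size_t n) {
    assert(scores != nullptr && n > 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i) {
        if (scores[i] > scores[best]) best = i;
    }
    return best;
}
```

For example, assuming an output tensor with scale 1/256 and zero_point -128 (an assumption, check your own model), a raw score of 127 maps to 255/256, just under 1.0.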

#Tensor arena size

The arena is a single region that holds intermediate tensors and scratch buffers.

To choose the size:

Start generously (for example, 200 KB)
Call interp->arena_used_bytes() after AllocateTensors
Check the actual usage
Shrink to that figure plus about 10%

Examples:

Model	Arena
Person Detection 96×96	~70 KB
Speech Commands tiny	~10 KB
Keyword Spotter	~15 KB
Visual Wake Word	~50 KB
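The "measured usage plus about 10%" rule can be written down as a small helper. This is a sketch, not a TFLite Micro API: the function name, the default margin, and the 16-byte rounding (chosen to match the alignas(16) arena used earlier) are all assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Measured usage plus a percentage margin, rounded up to `align` bytes.
// Hypothetical helper; feed it the value reported by arena_used_bytes().
std::size_t recommended_arena_size(std::size_t used_bytes,
                                   std::size_t margin_percent = 10,
                                   std::size_t align = 16) {
    assert(align > 0);
    // Ceiling division so that a small margin never rounds down to zero.
    const std::size_t with_margin =
        (used_bytes * (100 + margin_percent) + 99) / 100;
    return (with_margin + align - 1) / align * align;
}
```

With a measured usage of 70 KB (71680 bytes) this gives 78848 bytes, which is exactly 77 KB. The result still has to be rechecked on the target, since a different kernel backend can change scratch buffer needs.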

#Op resolver optimization

AllOpsResolver links every op, so flash usage is large. Register only the ops the model uses:

#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

// Template argument = maximum number of ops that can be registered
static tflite::MicroMutableOpResolver<8> resolver;

// Call once during setup, before constructing the interpreter
resolver.AddConv2D();
resolver.AddDepthwiseConv2D();
resolver.AddFullyConnected();
resolver.AddSoftmax();
resolver.AddRelu();
resolver.AddMaxPool2D();
resolver.AddReshape();
resolver.AddQuantize();

If AllOpsResolver costs about 200 KB of flash, a MicroMutableOpResolver with only the needed ops can bring that down to around 30 KB.
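The template argument matters because the resolver reserves its table at compile time and cannot grow. The sketch below shows the same idea in miniature: a fixed-capacity table that refuses a registration once it is full instead of allocating. FixedOpTable is a made-up class for illustration and is not how TFLite Micro is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustration only: a fixed-capacity op table with no heap use.
// NOT TFLite Micro's implementation, just the same idea in miniature.
template <std::size_t kMaxOps>
class FixedOpTable {
public:
    // Returns false instead of allocating when the table is full.
    bool Add(const char* name) {
        assert(name != nullptr);
        if (count_ == kMaxOps) return false;
        names_[count_++] = name;
        return true;
    }

    // Linear search is fine for a handful of entries.
    bool Has(const char* name) const {
        for (std::size_t i = 0; i < count_; ++i) {
            if (std::strcmp(names_[i], name) == 0) return true;
        }
        return false;
    }

    std::size_t size() const { return count_; }

private:
    const char* names_[kMaxOps] = {};
    std::size_t count_ = 0;
};
```

The practical takeaway is to check the return status of each registration call and to bump the template argument whenever an op is added to the model.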

#CMSIS-NN: optimized kernels for Cortex-M

CMSIS-NN is Arm's library of optimized neural-network kernels for Cortex-M. It plugs into TFLite Micro as an optimized kernel backend.

# Makefile build
make -f tensorflow/lite/micro/tools/make/Makefile \
     TARGET=cortex_m_generic \
     TARGET_ARCH=cortex-m7+fp \
     OPTIMIZED_KERNEL_DIR=cmsis_nn \
     hello_world

Enabling CMSIS-NN makes INT8 convolution roughly 5 to 10 times faster (on an M7). It uses the SIMD instructions of the Cortex-M4 DSP extension and Helium (MVE) on the M55.
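To see where the speedup comes from, it helps to look at the scalar loop that sits at the heart of an INT8 layer. The sketch below is a plain multiply-accumulate with a 32-bit accumulator, one element per iteration. It is written for this article and is neither the TFLite Micro reference kernel nor CMSIS-NN source; the input_offset parameter (the negated input zero point) follows the usual quantized-kernel convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar INT8 dot product with a 32-bit accumulator.
// Illustrative only; not taken from TFLite Micro or CMSIS-NN.
std::int32_t dot_int8(const std::int8_t* input, const std::int8_t* weights,
                      std::size_t n, std::int32_t input_offset) {
    assert(input != nullptr && weights != nullptr);
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // input_offset is the negated input zero point, applied before the multiply.
        acc += (static_cast<std::int32_t>(input[i]) + input_offset) *
               static_cast<std::int32_t>(weights[i]);
    }
    return acc;
}
```

SIMD-capable cores can process several of these multiply-accumulates per instruction, which is the kind of work an optimized kernel library hands to the hardware.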

#Ethos-U Delegate

To use an Ethos-U NPU, compile the model with Vela:

vela model.tflite --accelerator-config ethos-u55-256
# → produces model_vela.tflite (NPU commands plus some CPU fallback ops)

The Vela-compiled model contains an NPU custom op. Register it with the op resolver:

#include "tensorflow/lite/micro/kernels/ethosu.h"

resolver.AddEthosU();

NPU-targetable layers now run on the Ethos-U55, and fallback layers run on the Cortex-M55 with CMSIS-NN.

#Case study: Person Detection on STM32

Using STM32 Cube AI or a direct build:

Model: MobileNetV1 0.25 96×96 grayscale
Input: 96 × 96 × 1 (int8)
Flash: 350 KB (model + runtime + ops)
RAM:   90 KB (arena) + 10 KB (other)

// Convert the camera frame to 96×96 grayscale
camera_capture(rgb_buffer);
resize_grayscale(rgb_buffer, 320, 240, gray_buffer, 96, 96);

// Fill the input
TfLiteTensor *in = interp->input(0);
for (int i = 0; i < 96*96; i++)
    in->data.int8[i] = static_cast<int8_t>(gray_buffer[i] - 128);   // shift 0..255 to -128..127

// Inference
if (interp->Invoke() != kTfLiteOk) return;

// Output: [no_person_prob, person_prob]
TfLiteTensor *out = interp->output(0);
int8_t person = out->data.int8[1];
if (person > THRESHOLD) led_on();

On a 480 MHz Cortex-M7 this takes roughly 200 ms per inference with the reference kernels and roughly 30 ms with CMSIS-NN. On a part with an Ethos-U55 it drops to roughly 5 ms.
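The snippet above calls resize_grayscale without defining it. One possible shape for that helper is sketched below, assuming the camera delivers interleaved RGB888 with no row padding (the article does not say which pixel format is used). It picks one source pixel per output pixel and converts it with an integer luma approximation, so there is no floating point and no heap use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Downscale an interleaved RGB888 frame to 8-bit grayscale by sampling one
// source pixel per output pixel. Assumed layout: 3 bytes per pixel, rows
// packed with no padding. Hypothetical helper for illustration.
void resize_grayscale(const std::uint8_t* rgb, int src_w, int src_h,
                      std::uint8_t* gray, int dst_w, int dst_h) {
    assert(rgb && gray && src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
    for (int y = 0; y < dst_h; ++y) {
        const int sy = y * src_h / dst_h;        // source row (rounded down)
        for (int x = 0; x < dst_w; ++x) {
            const int sx = x * src_w / dst_w;    // source column (rounded down)
            const std::uint8_t* p =
                rgb + 3 * (static_cast<std::size_t>(sy) * src_w + sx);
            // Integer luma approximation: (77 R + 150 G + 29 B) / 256.
            gray[static_cast<std::size_t>(y) * dst_w + x] =
                static_cast<std::uint8_t>((77 * p[0] + 150 * p[1] + 29 * p[2]) >> 8);
        }
    }
}
```

Point sampling is the cheapest option and may alias on detailed scenes; averaging a block of source pixels is a common upgrade when the cycle budget allows it.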

#Memory Layout

Flash:

Code: ~ 200 KB (TFLite Micro + kernels)
Model: ~ 100~500 KB (.tflite as C array)
Other code, libs: ~ 100 KB

RAM:

Tensor arena: ~ 50~200 KB
System (stack, etc.): ~ 50 KB

Cortex-M4 boards commonly have 128 KB of RAM and 512 KB of flash, which is enough for a modest model. An M7 part with 1 MB of RAM can take a larger one.
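The layout above is simple enough to check with a few additions before committing to a part. The sketch below encodes it as a struct and a fits() predicate; the struct and field names are made up for this example, and all sizes are in KB.

```cpp
#include <cassert>
#include <cstdint>

// All sizes in KB. Field names are illustrative, not from any SDK.
struct MemoryBudget {
    std::uint32_t flash_total, ram_total;
    std::uint32_t runtime_code, model, other_code;  // flash consumers
    std::uint32_t arena, system;                    // RAM consumers
};

// True when both the flash and the RAM totals are within the part's limits.
bool fits(const MemoryBudget& b) {
    assert(b.flash_total > 0 && b.ram_total > 0);
    const std::uint32_t flash_used = b.runtime_code + b.model + b.other_code;
    const std::uint32_t ram_used = b.arena + b.system;
    return flash_used <= b.flash_total && ram_used <= b.ram_total;
}
```

Using the article's low-end numbers on a 512 KB / 128 KB board, fits({512, 128, 200, 100, 100, 50, 50}) is true (400 KB flash, 100 KB RAM). Raising the model to 500 KB or the arena to 200 KB makes it false.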

#Profiling

#include "tensorflow/lite/micro/micro_profiler.h"

static tflite::MicroProfiler profiler;
static tflite::MicroInterpreter static_interp(
    model, resolver, tensor_arena, kArenaSize,
    nullptr /* no resource variables */, &profiler);

static_interp.AllocateTensors();
static_interp.Invoke();
profiler.LogTicksPerTagCsv();

This prints the ticks spent in each op. Use it to find the bottleneck and optimize there.
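Profiler output is in ticks, so it usually needs converting before it can be compared with a latency budget. The sketch below converts ticks to microseconds and computes an op's share of the total. The tick rate is platform-dependent, so it is passed in as a parameter rather than assumed; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Converts a tick count to microseconds. `ticks_per_second` depends on the
// platform timer, so the caller supplies it.
std::uint64_t ticks_to_us(std::uint32_t ticks, std::uint32_t ticks_per_second) {
    assert(ticks_per_second > 0);
    return static_cast<std::uint64_t>(ticks) * 1000000u / ticks_per_second;
}

// Share of total time spent in one op, in whole percent (rounded down).
std::uint32_t percent_of_total(std::uint32_t op_ticks, std::uint32_t total_ticks) {
    assert(total_ticks > 0);
    return static_cast<std::uint32_t>(
        static_cast<std::uint64_t>(op_ticks) * 100u / total_ticks);
}
```

If the timer runs at the 480 MHz core clock, for instance, 480000 ticks correspond to 1000 µs. Sorting ops by percent_of_total usually shows one or two convolution layers dominating.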

#Measured comparison

Speech Commands (Tiny CNN) on Cortex-M4 (STM32F4 168 MHz):
  Reference kernel (no SIMD):  120 ms
  CMSIS-NN INT8:                25 ms
  → ~5× (120 / 25 = 4.8)

Person Detection (MobileNetV1 0.25) on Cortex-M7 (STM32H7 480 MHz):
  Reference:                    300 ms
  CMSIS-NN INT8:                 35 ms
  → ~9× (300 / 35 ≈ 8.6)

MobileNetV1 0.25 on Cortex-M55 + Ethos-U55:
  CMSIS-NN only:                 35 ms
  Ethos-U55 delegate:             5 ms
  → 7× over CMSIS-NN

#Common pitfalls

Shipping the Float32 model as-is

Float32 MobileNet on Cortex-M4:  > 2 seconds
INT8 quantized:                  ~ 100 ms

Floating-point inference is slow on an MCU. Always quantize to INT8.

Arena too small

AllocateTensors → kTfLiteError

When the arena is too small, AllocateTensors does not fail silently; it returns an error code. Start with a generous size and shrink it afterwards.

Using AllOpsResolver

Flash usage: 600 KB (model 50 KB), most of it pulled in by the op resolver

Register only the ops you need with MicroMutableOpResolver.

Forgetting to enable CMSIS-NN

# build flag
OPTIMIZED_KERNEL_DIR=cmsis_nn

The default build uses the reference kernels. The difference is 5 to 10×.

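Unsupported op

Unsupported op: SOFTMAX_V2

TFLite Micro does not support every TFLite op. Check the Vela report or the build errors. An op that is supported but missing from the resolver typically surfaces as an error from AllocateTensors instead.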

Missing input quantization

// Assumes the input is float
in->data.f[i] = pixel / 255.0f;
// → wrong format for an INT8 model

A quantized model takes INT8 input. Convert with q = round(x / scale) + zero_point, using the input tensor's scale and zero_point.
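A minimal sketch of that conversion, with the clamp that keeps out-of-range values from wrapping around, could look like this. In TFLite Micro the scale and zero point would come from the input tensor's params field; the function name quantize_int8 is an illustrative choice.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// q = round(x / scale) + zero_point, clamped to the int8 range.
std::int8_t quantize_int8(float x, float scale, int zero_point) {
    assert(scale > 0.0f);
    long q = std::lround(x / scale) + zero_point;
    if (q < -128) q = -128;   // saturate instead of wrapping
    if (q > 127) q = 127;
    return static_cast<std::int8_t>(q);
}
```

The float snippet above would then become something like in->data.int8[i] = quantize_int8(pixel / 255.0f, in->params.scale, in->params.zero_point). For example, with scale 0.5 and zero_point 3, an input of 2.0 maps to 7, and anything far out of range saturates at -128 or 127.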

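#Summary

TFLite Micro = TFLite for MCUs: no malloc, no OS required.
The model goes from .tflite to a C array stored in flash.
Intermediate tensors reuse a single tensor arena.
MicroMutableOpResolver links only the ops in use.
CMSIS-NN speeds up INT8 on Cortex-M by roughly 5 to 10×.
Vela converts the model for Ethos-U, used together with the Ethos-U delegate.
INT8 quantization is essential. Float32 is too heavy for an MCU.
Size the arena by measuring with arena_used_bytes(), then adjust.

The next installment covers ONNX Runtime.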
