Series · 57 posts

Embedded Performance Engineering

Embedded Performance Engineering: Series Introduction

Why is it slow? From cache misses, pipeline stalls, and bus contention to hands-on use of profiling tools: everything about performance analysis for embedded systems.

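Embedded Performance Analysis Methodology: The Measure → Analyze → Optimize Cycle

No optimizing by gut feeling. The USE and RED methods, applied to embedded systems. Where a scientific approach begins.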

Defining Performance Metrics: Latency, Throughput, and Utilization

The three core metrics, plus two embedded-specific additions: jitter and deadline. Service time vs. response time.

Performance Measurement Basics: Wall-Clock Time, CPU Cycles, and Instruction Count

How to measure: DWT, PMU, clock_gettime, and GPIO with a logic analyzer. Reducing measurement overhead.

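Statistical Analysis of Performance Data: Percentiles, Histograms, and the Pitfalls of the Mean

The mean lies. p99, p999, max, and the long tail. HdrHistogram, and fixed-bucket histograms for embedded targets.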

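Real-Time Performance Analysis: Measuring WCET, Jitter, and Deadline Misses

Measurement for real-time systems targets the worst case, not the average. Four approaches to WCET, plus jitter and tardiness analysis.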

Embedded Benchmarking Basics: Reproducibility, Warmup, and Noise Removal

A trustworthy benchmark needs warmup, isolation, and multiple runs. Covers CoreMark, Dhrystone, and SPEC.

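Performance Modeling: Applying Amdahl, Gustafson, and the Roofline Model

Mathematical models that predict the limits of optimization. The serial fraction determines the achievable speedup. Memory-bound vs. compute-bound.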

Profiling Techniques Overview: Sampling vs. Instrumentation, PGO, and LTO

Compares the two main approaches. Sampling, as in perf, is lightweight; instrumentation, as in gprof, is more precise but heavier. Also covers optimizing with PGO.

CPU Pipeline Analysis: Comparing 5-Stage, Cortex-M, and Cortex-A

Walks through the classic 5-stage pipeline: Fetch, Decode, Execute, Memory, Writeback. Cortex-M3/M4 use 3 stages; Cortex-A cores use 8 to 15.

Pipeline Stall Analysis: Data, Structural, and Control Hazards, and Forwarding

A stall inserts a bubble into the pipeline. Covers RAW, WAR, and WAW hazards, forwarding, and the PMU STALL counter.

Branch Prediction Analysis: Static, 2-bit, BTB, BHT, and Mispredict Cost

BTFNT, the 2-bit saturating counter, BTB and BHT. A mispredict costs 10-20 cycles. PMU BR_MIS_PRED.

Speculative Execution Analysis: OoO, Reorder Buffer, and Register Renaming

Out-of-order execution. ROB, issue queue, and rename. The Spectre angle. Cortex-A examples.

CPU Cache Basics: L1, L2, L3, Set Associativity, and Replacement Policy

The cache hierarchy. Direct-mapped vs. N-way set-associative. LRU, PLRU, and random replacement.

Cache Miss Analysis with the 3C Model: Compulsory, Capacity, Conflict

Cold/compulsory, capacity (working set > cache), and conflict (limited associativity).

Cache Line Optimization: Alignment, Prefetch, and Handling False Sharing

64-byte line alignment, software prefetch, avoiding false sharing, and choosing between SoA and AoS.

Memory Bandwidth Analysis: Measuring with STREAM, Roofline, and Bus Saturation

The STREAM benchmark (Copy, Scale, Add, Triad). Roofline. PMU BUS_ACCESS and DDR bandwidth.

Using SIMD and NEON: 128-bit Vectors, Auto-Vectorization, and SVE/SVE2

128-bit ARM NEON and variable-width SVE. Auto-vectorization (-O3). Intrinsics. Cortex-M Helium (MVE).

PMU and HPM Hardware Counter Analysis: Precise Performance Diagnosis

ARMv8 PMU with 6+ counters, RISC-V HPM. CYCLE, INST_RETIRED, CACHE, and BRANCH events. Using perf.

Embedded Bus Architecture: The Evolution of AHB, AXI, and CHI, and the Five Channels

ARM AMBA: AHB, APB, AXI, ACE, and CHI. The five AXI channels, bursts, and outstanding transactions.

Diagnosing Bus Contention: Measuring Arbitration, QoS, and Starvation

Round-robin, priority, and QoS arbitration. Starvation when there are many masters. AXI QoS and BUSY counters.

DMA Performance Optimization: Burst, Scatter-Gather, Chaining, and Cache Coherency

Tuning burst size. Scatter-gather and chaining. Cache clean/invalidate and double buffering.

DMA vs. CPU Copy Performance: Measuring Break-Even and Setup Overhead

DMA setup overhead. Optimizing CPU memcpy. The break-even size. Measured data.

Interrupt Latency Analysis: Entry, Exit, Tail-Chaining, and Late Arrival

Cortex-M 12-cycle latency. 6-cycle tail-chaining. Late arrival, lazy stacking, and FreeRTOS hooks.

Handling Interrupt Storms: NAPI, Rate Limiting, and Switching to Polling

IRQ flooding blocks the main loop. The NAPI pattern, rate limiting, and interrupt coalescing.

MMIO Access Performance: Cache Policy, Write-Combining, Volatile, and Barriers

MMIO is uncached and strongly ordered. Write-combining for PCIe BARs. volatile, and the DMB, DSB, and ISB barriers.

Peripheral Clock Analysis: PLL, Divider, Gating, and DVFS

How PLLs, dividers, and gating produce peripheral clocks. STM32 RCC and the Linux CCF. Power vs. performance.

Power vs. Performance Trade-offs: DVFS, Race-to-Idle, and big.LITTLE

DVFS governors, race-to-idle, big.LITTLE, CPU core hotplug, and measurement and tuning.

Thermal Throttling Analysis: Junction Temperature, Trip Points, and Cooling

Junction temperature, thermal sensors, trip points, throttling behavior, and automotive and space environments.

CXL Interconnect Analysis: Expanding Memory Bandwidth in the AI Era

The cache-coherent interconnect built by CXL 2.0/3.1 and Neoverse V2. The three protocols (CXL.io, CXL.cache, CXL.mem), Type 1/2/3 devices, and the reality of latency and bandwidth.

Concurrency Basics: Concurrency vs. Parallelism, Races, and the Memory Model

Concurrency vs. parallelism (Rob Pike). Race conditions. An introduction to memory models.

Diagnosing False Sharing: Cache Line Ping-Pong, Padding, and Measurement

What causes false sharing. Cache-coherence ping-pong. Separating lines with padding. How to measure it.

Lock Contention Analysis: Wait, Hold, Convoys, and Measurement Techniques

Measuring wait time, hold time, and contention ratio, and how to avoid lock convoys.

Spinlock Performance Analysis: Spin-Wait vs. Context Switch, Ticket, and MCS

Spinlock cost analysis, and how ticket locks and MCS locks differ in scalability.

Mutex Performance Analysis: Futex, Adaptive, and Priority Inheritance

The cost of mutex blocking, the two-stage Linux futex, adaptive mutexes, and priority inheritance overhead.

Reader-Writer Lock Performance: Reader/Writer Priority, RCU, and Seqlock

Types of RW locks and reader/writer priority, a comparison with RCU, and seqlock as a read-mostly alternative.

Lock-Free Data Structure Performance: CAS, ABA, Hazard Pointers, and Epoch Reclamation

CAS-based lock-free structures and the ABA problem, with a comparison of hazard pointers and epoch reclamation.

Memory Ordering Analysis: Acquire, Release, Seq-Cst, and the ARM Relaxed Model

C11/C++11 memory_order, acquire-release pairs, the cost of seq-cst, and ARM ldar/stlr.

Cache Coherency Protocols: MESI, MOESI, Snoop, and Directory

The MESI and MOESI protocols, snoop-based vs. directory-based schemes, and measuring coherency overhead.

SMP Performance Analysis: Per-Core, Affinity, Load Balancing, and Scalability

Per-core utilization, CPU affinity, NUMA, migration cost, and the Amdahl limit.

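Linux perf Basics: Using stat, record, and report

The three core commands of perf, the standard Linux profiling tool. Installation, permissions, and the path from a first measurement to hotspot analysis.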

Advanced Linux perf: Raw Events, Tracepoints, and perf script

Custom analysis using perf raw events, tracepoints, and perf script with Python.

Using ftrace: function, function_graph, and the Latency Tracers

Using the ftrace function tracer, function_graph, and the irqsoff and preemptoff latency tracers.

Dynamic Tracing with eBPF and bpftrace: Observing the Kernel Without Modifying It

The eBPF VM and verifier, bpftrace one-liners, BCC tools, and a comparison of kprobe, uprobe, and USDT.

Flamegraph Analysis: On-CPU, Off-CPU, and Differential

On-CPU, off-CPU, and differential analysis with Brendan Gregg's flamegraphs. Collecting stacks with perf and BCC.

ARM DS and Lauterbach Analysis: Specialized Hardware Trace Tools

ARM Development Studio Streamline, Lauterbach TRACE32, and ETM/PTM hardware trace.

Bare-Metal Profiling: Using GPIO, DWT, SysTick, and ITM

Measuring with GPIO, the DWT cycle counter, SysTick, and ITM in environments with neither an OS nor perf.

NVIDIA Nsight Systems: System-Wide Analysis Including GPU and NPU

Unified CPU, GPU, and memory timeline analysis with NVIDIA Nsight Systems. Using it on embedded Jetson platforms.

Modern Profilers Compared: Tracy, Hotspot, uftrace, and Coz

Tracy's nanosecond instrumentation, the Hotspot GUI, uftrace function tracing, and the Coz causal profiler.

Continuous Profiling: Parca, Pixie, Pyroscope, and Tetragon

eBPF-based continuous profiling. Around-the-clock analysis with Parca, Pixie, Pyroscope, and Cilium Tetragon.

Case Study: Tracking Down an ISR Latency 100 µs Deadline Miss

Sporadic ISR latency spikes on an industrial sensor board. After working through two hypotheses, the SD card driver was confirmed as the culprit.

Case Study: Why Matrix Multiply Ran 10x Slower Than Expected

A 1024×1024 matrix multiply ran 10x slower than the theoretical figure. SIMD was the first suspect, but the real culprit was a 90% cache miss rate.

Case Study: Why Throughput Drops Past 4 Cores on an 8-Core Machine

On an 8-core server, throughput fell as more threads were added. A case where a single global mutex set off a flood of cache invalidations.

Case Study: Why a 1080p 60 fps Camera Drops to 30 fps

Camera capture on a Cortex-A board was dropping frames. The CPU was idle; the real culprits were the DMA burst size and AXI bus efficiency.

Measuring CXL.mem Latency and Bandwidth: Comparing Direct, Switch, and Pooled Topologies

Measurements per CXL.mem topology: the latency and bandwidth cost of direct attach, a single switch, and a multi-host pool.

CXL Performance Profiling Tools: Using cxl-cli, DAMON, and perf-mem

Performance tools for CXL.mem environments: topology with cxl-cli, page activity with DAMON, CXL traffic as seen through perf-mem, and numastat statistics.

Case Study: Recovering LLM Inference KV Cache Throughput by Adding CXL.mem

A real-world case in which a 70B model's KV cache outgrew the HBM limit and throughput collapsed, then recovered after a 256 GB CXL.mem pool was added.