Postmortem Debugging · 5/6

CXL Device Core Dump Analysis: Device State, Mailbox Log, NUMA Topology

June 18, 2026 · Hawk · 3 min read

cxl postmortem core-dump drgn mailbox-log vmcore


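#Why CXL postmortems are different

For an ordinary process core dump, CPU registers, memory, and thread state are what matter. When a CXL device fails, you also need device-side state:

Device mailbox command history: which command failed
HDM Decoder mapping table: where memory was mapped
Poison list changes: how bad media accumulated over time
NUMA topology: which node was the CXL node
Region and decoder object state: sysfs paths and mappings

Most of this information is in the vmcore, but extracting it takes separate tooling.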

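#Analyzing the vmcore with drgn

drgn can inspect CXL structures in a kdump core as if the kernel were still live. CXL-specific helper modules are still on their way into the standard drgn distribution, so if your version lacks them, write your own struct walker.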

# drgn session
$ drgn -c /var/crash/vmcore -s /usr/lib/debug/.../vmlinux

>>> # List all CXL ports (conceptual: walk_cxl_ports is your own walker or a future helper)
>>> ports = list(walk_cxl_ports(prog))
>>> print(f"{len(ports)} CXL ports")
3 CXL ports

>>> # Region state at crash time (cxl_regions and its fields are illustrative names)
>>> for region in prog["cxl_regions"]:
...     name = region.name.string_().decode()
...     # drgn Objects do not accept integer format specs, so convert with value_()
...     print(f"region {name}: size={region.size.value_():#x} state={region.state}")
region region0: size=0x4000000000 state=COMMIT
region region1: size=0x2000000000 state=COMMIT
region region2: size=0x0          state=ERROR     ← suspect

>>> # Decoder state of the suspect region
>>> # (find_cxl_region is a hypothetical helper; drgn cannot call kernel functions in a vmcore)
>>> r2 = find_cxl_region(prog, "region2")
>>> for dec in r2.decoders:
...     print(f"  decoder{dec.id}: hpa={hex(dec.hpa_range.start)}")
...     print(f"    flags={dec.flags.value_():#x}")
  decoder3.0: hpa=0x80000000000
    flags=0x4   ← ERROR flag set (illustrative bit layout)

The ERROR-state region and its decoder flag show the state just before the crash.

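#Recovering the mailbox command history

This section assumes a cxl_core that keeps a ring buffer of mailbox commands; a kernel without that logging needs a patch first. To recover the log from the vmcore: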

>>> # cxl_memdev, mbox, and cmd_log are illustrative names for the command log
>>> mbox = prog["cxl_memdev"][0].mbox
>>> for cmd in mbox.cmd_log:
...     print(f"[{cmd.timestamp.value_()}] opcode={cmd.opcode.value_():#06x} ret={cmd.ret.value_()}")
[1719724823501] opcode=0x4400 ret=0      # Sanitize OK
[1719724824112] opcode=0x4300 ret=0      # Get Poison List OK
[1719724824890] opcode=0x4302 ret=-110   # Clear Poison TIMEOUT ← suspect
[1719724825101] opcode=0x0000 ret=-19    # device removed

A timeout (-110, ETIMEDOUT) followed by -19 (ENODEV) signals that the device stopped responding.
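The timeout-then-ENODEV check can be automated once the log entries are pulled out of the vmcore. The following is a minimal sketch in plain Python, assuming the entries have already been copied into tuples; `MboxEntry`, `first_failure`, and `stalled_then_removed` are hypothetical names, not part of drgn or the kernel.

```python
from typing import NamedTuple

# Linux errno values; the kernel returns them negated.
ETIMEDOUT = 110
ENODEV = 19


class MboxEntry(NamedTuple):
    """One recovered mailbox log record (hypothetical layout)."""
    timestamp: int
    opcode: int
    ret: int


def first_failure(log):
    """Return the first entry with a negative return code, or None."""
    return next((e for e in log if e.ret < 0), None)


def stalled_then_removed(log):
    """True if a timeout is later followed by ENODEV."""
    timed_out = False
    for e in log:
        if e.ret == -ETIMEDOUT:
            timed_out = True
        elif timed_out and e.ret == -ENODEV:
            return True
    return False


# Values copied from the sample log above.
log = [
    MboxEntry(1719724823501, 0x4400, 0),
    MboxEntry(1719724824112, 0x4300, 0),
    MboxEntry(1719724824890, 0x4302, -110),
    MboxEntry(1719724825101, 0x0000, -19),
]

bad = first_failure(log)
print(f"first failure: opcode={bad.opcode:#06x} ret={bad.ret}")
print(f"timeout then ENODEV: {stalled_then_removed(log)}")
```

On the sample log this reports opcode 0x4302 as the first failure and confirms the timeout-then-ENODEV pattern.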

#Recovering the NUMA topology

Check what state the CXL node was in at crash time:

>>> # Per-node page counters
>>> for node in prog["node_data"]:
...     if node:
...         # node_spanned_pages is the node's full page span, holes included
...         print(f"node {node.node_id.value_()}: ", end="")
...         print(f"present={node.node_present_pages.value_()}, ", end="")
...         print(f"spanned={node.node_spanned_pages.value_()}")
node 0: present=33554432, spanned=33554432    # DDR socket 0
node 1: present=33554432, spanned=33554432    # DDR socket 1
node 2: present=67108864, spanned=0           # CXL: offline at crash!

If node 2 turns out to be offline, the CXL device disappeared just before the crash.
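The page counts are easier to sanity-check as capacities. The sketch below converts them to GiB and flags any node whose two counters disagree. It assumes 4 KiB base pages, which the dump does not state, and `pages_to_gib` and `mismatched_nodes` are hypothetical helper names.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages

# Counters copied from the sample output above
# (present = node_present_pages, spanned = node_spanned_pages).
nodes = {
    0: {"present": 33554432, "spanned": 33554432},
    1: {"present": 33554432, "spanned": 33554432},
    2: {"present": 67108864, "spanned": 0},
}


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


def mismatched_nodes(nodes):
    """Return node ids whose two page counters differ."""
    return [nid for nid, n in sorted(nodes.items())
            if n["present"] != n["spanned"]]


for nid, n in sorted(nodes.items()):
    print(f"node {nid}: present={pages_to_gib(n['present']):.0f} GiB")
print("mismatched:", mismatched_nodes(nodes))
```

Under the 4 KiB assumption, the two DDR nodes come out at 128 GiB each and the CXL node at 256 GiB, with node 2 as the only mismatch.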

#Clues from the kernel log

The dmesg buffer in the vmcore:

$ crash /usr/lib/debug/.../vmlinux /var/crash/vmcore
crash> log | tail -30
[12345.6789] cxl_pci 0000:5e:00.0: mailbox timeout (opcode 0x4302)
[12345.7892] cxl_pci 0000:5e:00.0: device went offline
[12345.7893] pci 0000:5e:00.0: AER: Multiple Uncorrectable error received
[12345.7895] cxl_mem mem0: removing memory device
[12345.8001] BUG: kernel NULL pointer dereference, address: 00000018
[12345.8002] RIP: 0010:cxl_region_access+0x2a/0x80

The chain is visible: mailbox timeout → device offline → AER UE → NULL dereference in the driver.
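To see how tightly these events are spaced, the log lines can be reduced to a timeline of offsets from the first event. This is a minimal sketch, assuming the marker substrings below match your kernel's wording; they are taken from the sample log only, and `timeline` is a hypothetical helper.

```python
import re
from decimal import Decimal

# Matches "[12345.6789] message" and captures timestamp and message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Assumed marker substrings, taken from the sample log; adjust as needed.
MARKERS = [
    ("mailbox timeout", "mailbox timeout"),
    ("device went offline", "device offline"),
    ("AER:", "AER error"),
    ("BUG:", "kernel BUG"),
]


def timeline(lines):
    """Return (timestamp, label) for each line that matches a marker."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts, msg = Decimal(m.group(1)), m.group(2)
        for needle, label in MARKERS:
            if needle in msg:
                events.append((ts, label))
                break
    return events


sample = [
    "[12345.6789] cxl_pci 0000:5e:00.0: mailbox timeout (opcode 0x4302)",
    "[12345.7892] cxl_pci 0000:5e:00.0: device went offline",
    "[12345.7893] pci 0000:5e:00.0: AER: Multiple Uncorrectable error received",
    "[12345.7895] cxl_mem mem0: removing memory device",
    "[12345.8001] BUG: kernel NULL pointer dereference, address: 00000018",
    "[12345.8002] RIP: 0010:cxl_region_access+0x2a/0x80",
]

events = timeline(sample)
start = events[0][0]
for ts, label in events:
    print(f"+{ts - start:.4f}s {label}")
```

Timestamps are parsed as `Decimal` so the offsets are exact rather than subject to float rounding. In the sample, the whole chain from mailbox timeout to the kernel BUG spans 0.1212 seconds.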

#Common pitfalls

Symptom	Cause
drgn has no cxl helper	Needs drgn 0.0.24+, or write your own helper
`cxl_regions` is empty	All regions were already released at crash time
HDM Decoder address is 0	Early crash, before decoder programming
No mailbox log in the vmcore	The `cxl_core` module keeps no log; a kernel patch is needed
Node `node_present_pages` ≠ `node_spanned_pages`	Device offline, or hot-remove in progress
`RIP` is inside a cxl module	NULL dereference; suspect a race condition
AER events but no cxl messages	Died at the PCI level, before the cxl driver was called
`Multiple UE` reported but no disconnect	Booted with `pci=noaer`?
Unrealistic timestamps in the mailbox cmd_log	Uncalibrated TSC, or a reset after hot-plug
Region state ERROR while sysfs looks normal	sysfs and kernel state disagree; recovery needed