[SRU][Q][PATCH 0/2] BUG: kernel NULL pointer dereference in amdgpu(regression)
AceLan Kao
acelan.kao at canonical.com
Tue Apr 7 07:53:41 UTC 2026
From: "Chia-Lin Kao (AceLan)" <acelan.kao at canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2144577
[Impact]
System freezes during boot on machines with AMD Southern Islands (SI) GPUs
using the amdgpu driver.
The amdgpu driver calls flush_gpu_tlb_pasid() in a workqueue, but on SI
hardware this function pointer is NULL. The kernel hits a NULL pointer
dereference in amdgpu_gmc_flush_gpu_tlb_pasid() and crashes.
Error log:
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
kernel: Workqueue: events amdgpu_tlb_fence_work [amdgpu]
kernel: RIP: 0010:0x0
kernel: Call Trace:
kernel: amdgpu_gmc_flush_gpu_tlb_pasid+0xfd/0x480 [amdgpu]
kernel: amdgpu_tlb_fence_work+0x77/0x110 [amdgpu]
Occurs on every boot on affected hardware. Regression between 6.17.0-14 and 6.17.0-19.
[Fix]
Two patches fix this together:
1. f4db9913e4d3 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()")
Adds a NULL check for flush_gpu_tlb_pasid before calling it.
Upstream in v7.0-rc1.
2. e3a6eff92bbd ("drm/amdgpu: Fix validating flush_gpu_tlb_pasid()")
Fixes the first patch: its early return skipped the unlock, causing
a deadlock. Changes the bare return to a goto that unlocks first.
Upstream in v7.0-rc1.
Fixes: f4db9913e4d3
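For reviewers, the combined effect of the two patches can be pictured with
a small userspace sketch. This is not the amdgpu code: every demo_* name is
hypothetical, the lock is modelled as a plain counter, and returning 0 on
the skipped path is an assumption made only for illustration. It shows the
pattern the patches describe: check the optional callback for NULL, and
leave through a label that releases the lock instead of returning directly.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real amdgpu structures. */
struct demo_gmc_funcs {
    int (*flush_pasid)(int pasid);   /* optional: NULL on hardware without it */
};

struct demo_dev {
    const struct demo_gmc_funcs *funcs;
    int lock_depth;                  /* models the lock: 0 means released */
};

static int demo_flush_calls;         /* counts how often the callback ran */

static int demo_hw_flush(int pasid)
{
    (void)pasid;
    demo_flush_calls++;
    return 0;
}

static const struct demo_gmc_funcs demo_with_cb = { .flush_pasid = demo_hw_flush };
static const struct demo_gmc_funcs demo_without_cb = { .flush_pasid = NULL };

static void demo_lock(struct demo_dev *d)   { d->lock_depth++; }
static void demo_unlock(struct demo_dev *d) { d->lock_depth--; }

static int demo_flush_pasid(struct demo_dev *d, int pasid)
{
    int r = 0;                       /* assumed result when the flush is skipped */

    demo_lock(d);

    /* Patch 1 idea: never call through a NULL function pointer. */
    if (!d->funcs->flush_pasid)
        goto out_unlock;             /* Patch 2 idea: a bare return here would leak the lock */

    r = d->funcs->flush_pasid(pasid);

out_unlock:
    demo_unlock(d);
    return r;
}
```

Calling demo_flush_pasid() on a device built with demo_without_cb skips the
callback and still leaves lock_depth at 0; with demo_with_cb the callback
runs once and the lock is released the same way.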
[Test Plan]
On a machine with an AMD SI GPU (Tahiti, Pitcairn, Verde, Oland, Hainan)
booted with amdgpu.si_support=1:
$ sudo reboot
Without patches: kernel NULL pointer dereference during boot, system freezes.
With patches: system boots normally, no crash or error in dmesg.
Check dmesg after boot:
$ dmesg | grep -i "BUG\|NULL pointer\|amdgpu"
Without patches: "BUG: kernel NULL pointer dereference" present.
With patches: no BUG or NULL pointer lines.
[Where problems could occur]
Could break TLB flushing on amdgpu.
If the NULL check gates too broadly, TLB flushes could be skipped on GPUs
that do have flush_gpu_tlb_pasid. This would cause stale TLB entries and
GPU page faults or rendering corruption.
The unlock path change in the second patch touches the reset/lock logic in
amdgpu_gmc_flush_gpu_tlb_pasid(). A wrong goto target could leave the
reset domain lock held, deadlocking the GPU.
[Other Info]
Both patches are upstream in v7.0-rc1.
Prike Liang (1):
drm/amdgpu: validate the flush_gpu_tlb_pasid()
Timur Kristóf (1):
drm/amdgpu: Fix validating flush_gpu_tlb_pasid()
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 6 ++++++
1 file changed, 6 insertions(+)
--
2.53.0