[SRU][Q][PATCH 0/2] BUG: kernel NULL pointer dereference in amdgpu (regression)

AceLan Kao acelan.kao at canonical.com
Tue Apr 7 07:53:41 UTC 2026


From: "Chia-Lin Kao (AceLan)" <acelan.kao at canonical.com>

BugLink: https://bugs.launchpad.net/bugs/2144577

[Impact]
System freezes during boot on machines with AMD Southern Islands (SI) GPUs
using the amdgpu driver.

The amdgpu driver calls flush_gpu_tlb_pasid() in a workqueue, but on SI
hardware this function pointer is NULL. The kernel hits a NULL pointer
dereference in amdgpu_gmc_flush_gpu_tlb_pasid() and crashes.

Error log:
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
kernel: Workqueue: events amdgpu_tlb_fence_work [amdgpu]
kernel: RIP: 0010:0x0
kernel: Call Trace:
kernel: amdgpu_gmc_flush_gpu_tlb_pasid+0xfd/0x480 [amdgpu]
kernel: amdgpu_tlb_fence_work+0x77/0x110 [amdgpu]
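
The RIP of 0x0 in the trace is what an indirect call through a NULL
function pointer looks like. As a rough userspace illustration (not the
driver code), the sketch below models a per-generation callback table in
which one generation leaves the PASID flush callback unset. All names
here (gmc_funcs_sketch, si_like_funcs, newer_funcs, has_pasid_flush) are
hypothetical, and how the real SI table ends up with a NULL member is an
assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a per-generation callback table. */
struct gmc_funcs_sketch {
	void (*flush_gpu_tlb_pasid)(uint16_t pasid);
};

/* Counts how often the stub callback below has run. */
static int flush_count;

/* Stub callback for a generation that implements the PASID flush. */
static void newer_flush(uint16_t pasid)
{
	(void)pasid;
	flush_count++;
}

/* A generation that provides the callback. */
static const struct gmc_funcs_sketch newer_funcs = {
	.flush_gpu_tlb_pasid = newer_flush,
};

/*
 * Assumption: the SI-like table simply has no entry for this callback.
 * An explicit NULL is used here; an omitted member would also be NULL.
 */
static const struct gmc_funcs_sketch si_like_funcs = {
	.flush_gpu_tlb_pasid = NULL,
};

/* True if the callback can be called without jumping to address 0. */
static bool has_pasid_flush(const struct gmc_funcs_sketch *funcs)
{
	return funcs->flush_gpu_tlb_pasid != NULL;
}
```

Calling si_like_funcs.flush_gpu_tlb_pasid(pasid) without first checking
has_pasid_flush() is the userspace analogue of the crash above.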

The crash occurs on every boot on affected hardware. Regression introduced
between 6.17.0-14 and 6.17.0-19.

[Fix]
Two patches fix this together:
1. f4db9913e4d3 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()")
   Adds a NULL check for flush_gpu_tlb_pasid before calling it.
   Upstream in v7.0-rc1.
2. e3a6eff92bbd ("drm/amdgpu: Fix validating flush_gpu_tlb_pasid()")
   Fixes the first patch: its early return skipped the unlock, causing
   a deadlock. Changes the bare return to a goto that unlocks first.
   Upstream in v7.0-rc1.
   Fixes: f4db9913e4d3
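
How the two patches interact can be sketched in userspace C. This is
not the upstream diff: the lock is a plain counter standing in for the
reset domain lock, and every name (dev_sketch, lock_sketch, flush_buggy,
flush_fixed, stub_flush) is hypothetical. Returning 0 when the callback
is missing is also an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Counter standing in for a real lock; held > 0 means "still locked". */
struct lock_sketch {
	int held;
};

static void lock_acquire(struct lock_sketch *l) { l->held++; }
static void lock_release(struct lock_sketch *l) { l->held--; }

/* Hypothetical device: a lock plus an optional flush callback. */
struct dev_sketch {
	struct lock_sketch reset_lock;
	int (*flush_pasid)(uint16_t pasid);	/* NULL on SI-like hardware */
	int flushes;				/* callbacks actually issued */
};

/* Stub callback for hardware that implements the flush. */
static int stub_flush(uint16_t pasid)
{
	(void)pasid;
	return 0;
}

/* Shape of patch 1 alone: NULL check added, but the lock leaks. */
static int flush_buggy(struct dev_sketch *dev, uint16_t pasid)
{
	int r;

	lock_acquire(&dev->reset_lock);
	if (!dev->flush_pasid)
		return 0;	/* BUG: returns with reset_lock still held */
	r = dev->flush_pasid(pasid);
	dev->flushes++;
	lock_release(&dev->reset_lock);
	return r;
}

/* Shape after patch 2: the NULL case jumps to the common unlock path. */
static int flush_fixed(struct dev_sketch *dev, uint16_t pasid)
{
	int r = 0;

	lock_acquire(&dev->reset_lock);
	if (!dev->flush_pasid)
		goto out_unlock;
	r = dev->flush_pasid(pasid);
	dev->flushes++;
out_unlock:
	lock_release(&dev->reset_lock);
	return r;
}
```

With flush_pasid == NULL, flush_buggy() leaves the counter at 1, so the
next acquirer of a real lock would block; flush_fixed() leaves it at 0.
With a callback present, both variants issue the flush and release.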

[Test Plan]
On a machine with an AMD SI GPU (Tahiti, Pitcairn, Verde, Oland, Hainan)
booted with amdgpu.si_support=1:

$ sudo reboot

Without patches: kernel NULL pointer dereference during boot, system freezes.
With patches: system boots normally, no crash or error in dmesg.

Check dmesg after boot:
$ sudo dmesg | grep -i "BUG\|NULL pointer\|amdgpu"

Without patches: "BUG: kernel NULL pointer dereference" present.
With patches: no BUG or NULL pointer lines.

[Where problems could occur]
Both patches change the TLB flush path in amdgpu_gmc.c, so a regression
would show up as broken TLB flushing on amdgpu GPUs.

If the NULL check gates too broadly, TLB flushes could be skipped on GPUs
that do have flush_gpu_tlb_pasid. This would cause stale TLB entries and
GPU page faults or rendering corruption.

The unlock path change in the second patch touches the reset/lock logic in
amdgpu_gmc_flush_gpu_tlb_pasid(). A wrong goto target could leave the
reset domain lock held, deadlocking the GPU.

[Other Info]
Both patches are upstream in v7.0-rc1.

Prike Liang (1):
  drm/amdgpu: validate the flush_gpu_tlb_pasid()

Timur Kristóf (1):
  drm/amdgpu: Fix validating flush_gpu_tlb_pasid()

 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 6 ++++++
 1 file changed, 6 insertions(+)

-- 
2.53.0



