[SRU][N][PATCH 0/1] Fix qxl driver crash causing VM console freeze
ghadi.rahme at canonical.com
Tue Jul 15 17:12:22 UTC 2025
From: Ghadi Elie Rahme <ghadi.rahme at canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2065153
[ Impact ]
* The qxl driver currently has a bug that causes console freezes on qxl paravirtualized GPUs. This issue does not cause a full system hang since the system is still accessible via other means such as SSH, but it does cause the virtual console output to hang. The following dmesg output is seen when the issue occurs:
[ 280.618452] [TTM] Buffer eviction failed
[ 280.618463] qxl 0000:00:01.0: object_init failed for (3149824, 0x00000001)
[ 280.618466] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
* The issue was introduced by commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait"), which adds no new functionality and only tries to simplify the existing function.
Because of the problems it caused, this commit has been reverted upstream by 07ed11afb68d ("Revert "drm/qxl: simplify qxl_fence_wait""). The revert also adds back the DMA_FENCE_WARN macro because the restored functions use it. The macro was originally removed by d72277b6c37d ("dma-buf: nuke DMA_FENCE_TRACE macros v2").
[ Test Plan ]
To reproduce the bug, follow the steps below:
1. Install an Ubuntu version with an affected kernel in a VM and make sure that the QXL video driver is in use instead of virtio. The server edition is enough for the reproducer; no desktop environment (DE) needs to be installed. The issue is reproducible on Jammy 5.15 and above, except Plucky, since the fix is included in kernel 6.14.
2. Create a script with the following content and make it executable:
```
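#!/bin/bash
# Switch to virtual terminal 3, which receives the output written below.
chvt 3
for j in $(seq 80); do
    echo "$(date) starting round $j"
    # Stop as soon as the failure appears in the journal for this boot.
    if journalctl --boot | grep -q "failed to allocate VRAM BO"; then
        echo "bug was reproduced after $j tries"
        exit 1
    fi
    # Write the kernel log to tty3 repeatedly to stress the qxl console.
    for i in $(seq 100); do
        dmesg > /dev/tty3
    done
done
echo "bug could not be reproduced"
exit 0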
```
3. Execute the script from the virtual console. From a separate SSH session, monitor the dmesg logs until you see the following:
[ 280.618452] [TTM] Buffer eviction failed
[ 280.618463] qxl 0000:00:01.0: object_init failed for (3149824, 0x00000001)
[ 280.618466] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
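The monitoring in step 3 can also be scripted. The following is a minimal sketch, assuming the same failure string that the reproducer script greps for; `has_qxl_failure` is a hypothetical helper name and not part of any package. It reads log text on stdin, so it can be fed from dmesg or journalctl.

```shell
#!/bin/bash
# Hypothetical helper (name chosen for illustration only): reads kernel
# log text on stdin and succeeds if the qxl allocation failure is present.
has_qxl_failure() {
    grep -q "failed to allocate VRAM BO"
}

# Demonstration on one of the log lines quoted above.
sample='[ 280.618466] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO'
if printf '%s\n' "$sample" | has_qxl_failure; then
    echo "qxl failure signature found"
fi
```

On the VM itself this would be used as `dmesg | has_qxl_failure` or `journalctl --boot | has_qxl_failure` from the SSH session. To confirm beforehand that the qxl driver is bound to the display device, `lspci -k` lists the kernel driver in use for each PCI device.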
[ Where problems could occur ]
* Virtual displays might still freeze or hang
* Warning messages related to the qxl driver might occur.
[ Other Info ]
* The patch does cause a warning message to show up on boot when using the qxl video driver. The warning itself is harmless and does not seem to have any negative effects in my testing:
[ 5.011445] WARNING: CPU: 15 PID: 822 at kernel/workqueue.c:2985 check_flush_dependency.part.0+0xde/0x140
[ 5.011449] Modules linked in: qrtr cfg80211 binfmt_misc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm snd_hda_codec_generic irqbypass snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi rapl snd_hda_codec snd_hda_core snd_hwdep snd_pcm joydev snd_timer snd qxl i2c_i801 soundcore drm_ttm_helper i2c_smbus lpc_ich ttm input_leds mac_hid serio_raw sch_fq_codel dm_multipath msr efi_pstore nfnetlink dmi_sysfs qemu_fw_cfg ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 hid_generic usbhid hid crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 ahci sha1_ssse3 libahci psmouse virtio_rng xhci_pci xhci_pci_renesas aesni_intel crypto_simd cryptd
[ 5.011493] CPU: 15 PID: 822 Comm: kworker/u65:1 Not tainted 6.8.0-999-generic #70
[ 5.011495] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 5.011496] Workqueue: ttm ttm_bo_delayed_delete [ttm]
[ 5.011501] RIP: 0010:check_flush_dependency.part.0+0xde/0x140
[ 5.011502] Code: 24 18 4d 89 f0 49 8d 8d b0 00 00 00 48 c7 c7 e0 8f e6 8a c6 05 f3 90 8c 02 01 48 8b 70 08 48 81 c6 b0 00 00 00 e8 a2 5e fd ff <0f> 0b eb 91 0f b6 1d d9 90 8c 02 80 fb 01 0f 87 38 57 0a 01 83 e3
[ 5.011503] RSP: 0018:ffffbd85c0ce7c28 EFLAGS: 00010046
[ 5.011505] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 5.011506] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 5.011506] RBP: ffffbd85c0ce7c48 R08: 0000000000000000 R09: 0000000000000000
[ 5.011507] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9f308158a540
[ 5.011508] R13: ffff9f30801cea00 R14: ffffffffc0946570 R15: 0000000000000000
[ 5.011509] FS: 0000000000000000(0000) GS:ffff9f31f7d80000(0000) knlGS:0000000000000000
[ 5.011510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.011510] CR2: 000000c000a02000 CR3: 0000000108cf8000 CR4: 0000000000750ef0
[ 5.011514] PKRU: 55555554
[ 5.011514] Call Trace:
[ 5.011516] <TASK>
[ 5.011518] ? show_regs+0x6d/0x80
[ 5.011521] ? __warn+0x89/0x160
[ 5.011523] ? check_flush_dependency.part.0+0xde/0x140
[ 5.011524] ? report_bug+0x17e/0x1b0
[ 5.011527] ? handle_bug+0x6e/0xb0
[ 5.011529] ? exc_invalid_op+0x18/0x80
[ 5.011532] ? asm_exc_invalid_op+0x1b/0x20
[ 5.011535] ? __pfx_qxl_gc_work+0x10/0x10 [qxl]
[ 5.011539] ? check_flush_dependency.part.0+0xde/0x140
[ 5.011540] ? check_flush_dependency.part.0+0xde/0x140
[ 5.011541] start_flush_work+0xba/0x340
[ 5.011543] flush_work+0x5f/0xb0
[ 5.011545] qxl_queue_garbage_collect+0x8c/0x90 [qxl]
[ 5.011548] qxl_fence_wait+0xa3/0x1b0 [qxl]
[ 5.011552] dma_fence_wait_timeout+0x64/0x140
[ 5.011555] dma_resv_wait_timeout+0x7f/0xf0
[ 5.011556] ttm_bo_delayed_delete+0x2a/0xc0 [ttm]
[ 5.011560] process_one_work+0x181/0x3a0
[ 5.011562] worker_thread+0x306/0x440
[ 5.011563] ? __pfx_worker_thread+0x10/0x10
[ 5.011565] kthread+0xef/0x120
[ 5.011569] ? __pfx_kthread+0x10/0x10
[ 5.011572] ret_from_fork+0x44/0x70
[ 5.011574] ? __pfx_kthread+0x10/0x10
[ 5.011578] ret_from_fork_asm+0x1b/0x30
[ 5.011581] </TASK>
[ 5.011582] ---[ end trace 0000000000000000 ]---
* The Jammy version of the patch (5.15) does not need to re-introduce the DMA_FENCE_WARN macro since it already exists there.
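Whether a given kernel tree already carries the macro can be checked before applying the patch. This is a rough sketch, assuming the macro is defined in include/linux/dma-fence.h (the header listed in the diffstat below); `has_dma_fence_warn` and the `KERNEL_TREE` variable are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the kernel tree given as $1 defines
# DMA_FENCE_WARN in include/linux/dma-fence.h.
has_dma_fence_warn() {
    grep -q '#define DMA_FENCE_WARN' "$1/include/linux/dma-fence.h" 2>/dev/null
}

# KERNEL_TREE is an assumed variable; it defaults to the current directory.
tree="${KERNEL_TREE:-.}"
if has_dma_fence_warn "$tree"; then
    echo "DMA_FENCE_WARN already defined in $tree"
else
    echo "DMA_FENCE_WARN not found in $tree"
fi
```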
Alex Constantino (1):
Revert "drm/qxl: simplify qxl_fence_wait"
drivers/gpu/drm/qxl/qxl_release.c | 50 +++++++++++++++++++++++++++----
include/linux/dma-fence.h | 7 +++++
2 files changed, 52 insertions(+), 5 deletions(-)
--
2.43.0