ACK: [SRU][N][PATCH 0/3] Incorrect backport for CVE-2025-21861 causes kernel hangs

Tue Aug 12 07:57:23 UTC 2025

On 12.08.25 09:38, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2120330
> 
> [Impact]
> 
> The patch for CVE-2025-21861 was incorrectly backported to the noble 6.8
> kernel, leading to hangs when freeing device memory.
> 
> commit 41cddf83d8b00f29fd105e7a0777366edc69a5cf
> Author: David Hildenbrand <david at redhat.com>
> Date:   Mon Feb 10 17:13:17 2025 +0100
> Subject: mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize()
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=41cddf83d8b00f29fd105e7a0777366edc69a5cf
> ubuntu-noble: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/commit/?id=3858edb1146374f3240d1ec769ba857186531b17
> 
> An incorrect backport was performed, causing the old page to be placed
> back instead of the new page, e.g.:
> 
>                  src = page_folio(page);
>                  dst = page_folio(newpage);
> +               if (!is_zone_device_page(page))
> +                       putback_lru_page(page);
> 
> when in 41cddf83d8b00f29fd105e7a0777366edc69a5cf we have:
> 
> +               if (!folio_is_zone_device(dst))
> +                       folio_add_lru(dst);
> 
> in which case, we should really have had the backport as:
> 
> +               if (!folio_is_zone_device(newpage))
> +                       folio_add_lru(newpage);
> 
> This keeps references alive to the old memory pages, preventing them from being
> released and freed.
> 
> Stack traces of stuck processes:
> 
> PID: 871438 TASK: ffff007d4d668200 CPU: 95 COMMAND: "nvbandwidth"
>   #0 [ffff80010e8ef840] __switch_to at ffffc0f22798c550
>   #1 [ffff80010e8ef8a0] __schedule at ffffc0f22798c89c
>   #2 [ffff80010e8ef900] schedule at ffffc0f22798cd40
>   #3 [ffff80010e8ef930] schedule_preempt_disabled at ffffc0f22798d388
>   #4 [ffff80010e8ef9c0] rwsem_down_write_slowpath at ffffc0f227990dc8
>   #5 [ffff80010e8efa20] down_write at ffffc0f2279912d0
>   #6 [ffff80010e8efaa0] uvm_va_space_mm_shutdown at ffffc0f1c2a451ec [nvidia_uvm]
>   #7 [ffff80010e8efb00] uvm_va_space_mm_unregister at ffffc0f1c2a457a0 [nvidia_uvm]
>   #8 [ffff80010e8efb30] uvm_release at ffffc0f1c2a226d4 [nvidia_uvm]
>   #9 [ffff80010e8efc00] uvm_release_entry.part.0 at ffffc0f1c2a227dc [nvidia_uvm]
> #10 [ffff80010e8efc20] uvm_release_entry at ffffc0f1c2a22850 [nvidia_uvm]
> #11 [ffff80010e8efc30] __fput at ffffc0f2269a5760
> #12 [ffff80010e8efc70] ____fput at ffffc0f2269a5a80
> #13 [ffff80010e8efc80] task_work_run at ffffc0f2265ceedc
> #14 [ffff80010e8efcc0] do_exit at ffffc0f2265a0bc8
> #15 [ffff80010e8efcf0] do_group_exit at ffffc0f2265a0fec
> #16 [ffff80010e8efd50] get_signal at ffffc0f2265b8750
> #17 [ffff80010e8efe10] do_signal at ffffc0f22650166c
> #18 [ffff80010e8efe40] do_notify_resume at ffffc0f2265018f0
> #19 [ffff80010e8efe70] el0_interrupt at ffffc0f227985564
> #20 [ffff80010e8efe90] __el0_irq_handler_common at ffffc0f2279855f0
> #21 [ffff80010e8efea0] el0t_64_irq_handler at ffffc0f227986080
> #22 [ffff80010e8effe0] el0t_64_irq at ffffc0f2264f17fc
> 
> PID: 871467 TASK: ffff007f6aa66000 CPU: 66 COMMAND: "UVM GPU4 BH"
>   #0 [ffff80015ddef580] __switch_to at ffffc0f22798c550
>   #1 [ffff80015ddef5e0] __schedule at ffffc0f22798c89c
>   #2 [ffff80015ddef640] schedule at ffffc0f22798cd40
>   #3 [ffff80015ddef670] io_schedule at ffffc0f22798cec4
>   #4 [ffff80015ddef6e0] migration_entry_wait_on_locked at ffffc0f22686e3f0
>   #5 [ffff80015ddef740] migration_entry_wait at ffffc0f22695a6d4
>   #6 [ffff80015ddef750] do_swap_page at ffffc0f2268d6378
>   #7 [ffff80015ddef7d0] handle_pte_fault at ffffc0f2268da688
>   #8 [ffff80015ddef870] __handle_mm_fault at ffffc0f2268da7f8
>   #9 [ffff80015ddef8b0] handle_mm_fault at ffffc0f2268dab48
> #10 [ffff80015ddef910] handle_fault at ffffc0f1c2aace18 [nvidia_uvm]
> #11 [ffff80015ddef950] uvm_populate_pageable_vma at ffffc0f1c2aacf24 [nvidia_uvm]
> #12 [ffff80015ddef990] migrate_pageable_vma_populate_mask at ffffc0f1c2aad8c0 [nvidia_uvm]
> #13 [ffff80015ddefab0] uvm_migrate_pageable at ffffc0f1c2ab0294 [nvidia_uvm]
> #14 [ffff80015ddefb90] service_ats_requests at ffffc0f1c2abf828 [nvidia_uvm]
> #15 [ffff80015ddefbb0] uvm_ats_service_faults at ffffc0f1c2ac02f0 [nvidia_uvm]
> #16 [ffff80015ddefd40] uvm_parent_gpu_service_non_replayable_fault_buffer at ffffc0f1c2a82e00 [nvidia_uvm]
> #17 [ffff80015ddefda0] non_replayable_faults_isr_bottom_half at ffffc0f1c2a3c3e4 [nvidia_uvm]
> #18 [ffff80015ddefe00] non_replayable_faults_isr_bottom_half_entry at ffffc0f1c2a3c590 [nvidia_uvm]
> #19 [ffff80015ddefe20] _main_loop at ffffc0f1c2a207c8 [nvidia_uvm]
> #20 [ffff80015ddefe70] kthread at ffffc0f2265d40dc
> 
> There is no workaround.
> 
> [Fix]
> 
> To make things less confusing, revert the incorrect backport, and backport
> "mm: migrate_device: use more folio in migrate_device_finalize()" to use the
> new upstream notations, and correctly backport "mm/migrate_device: don't add
> folio to be freed to LRU in migrate_device_finalize()". This approach was
> suggested and tested by Krister Johansen, and I think it is reasonable.
> 
> commit 58bf8c2bf47550bc94fea9cafd2bc7304d97102c
> Author: Kefeng Wang <wangkefeng.wang at huawei.com>
> Date:   Mon Aug 26 14:58:12 2024 +0800
> Subject: mm: migrate_device: use more folio in migrate_device_finalize()
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=58bf8c2bf47550bc94fea9cafd2bc7304d97102c
> 
> commit 41cddf83d8b00f29fd105e7a0777366edc69a5cf
> Author: David Hildenbrand <david at redhat.com>
> Date:   Mon Feb 10 17:13:17 2025 +0100
> Subject: mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize()
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=41cddf83d8b00f29fd105e7a0777366edc69a5cf
> 
> The first patch landed in 6.12-rc1 and the second patch in 6.14-rc4. Both are
> in plucky.
> 
> [Testcase]
> 
> There are a few ways to trigger the issue.
> 
> You can run the hmm selftests. Note that this requires a kernel built with
> CONFIG_TEST_HMM=m.
> 
> 1) Check out a kernel git tree
> 2) cd tools/testing/selftests/mm/
> 3) make
> 4) sudo ./test_hmm.sh
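
Since the failure mode is a hang rather than an error, the steps above can be
wrapped in a small watchdog for unattended runs. The following is a minimal
sketch, not part of the original report: the helper names config_has_module
and run_with_hang_check are hypothetical, and the time limit is an arbitrary
choice. It assumes bash, and that the wrapper itself runs as root so that
kill -0 can probe the child.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and time limits are illustrative assumptions.

# Return 0 if CONFIG_FILE builds OPTION as a module (a line "OPTION=m").
config_has_module() {
    local option="$1" config_file="$2"
    grep -qx "${option}=m" "$config_file"
}

# Run a command in the background and poll it once per second.
# Prints "hang" if it is still running after LIMIT seconds, "pass" if it
# exited with status 0, and "fail" for any other exit status.
# The child is deliberately not killed: a task blocked inside the kernel
# may not respond to signals, so this helper only reports.
run_with_hang_check() {
    local limit="$1"
    shift
    "$@" &
    local pid=$!
    local waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "hang"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    if wait "$pid"; then
        echo "pass"
    else
        echo "fail"
        return 1
    fi
}
```

On Ubuntu the running kernel's config is available as
/boot/config-$(uname -r), so a run from tools/testing/selftests/mm/ might look
like: config_has_module CONFIG_TEST_HMM "/boot/config-$(uname -r)" &&
run_with_hang_check 600 ./test_hmm.sh
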
> 
> You can also run NVIDIA tests like nvbandwidth, if your system has an NVIDIA GPU:
> https://github.com/NVIDIA/nvbandwidth
> 
> $ git clone https://github.com/NVIDIA/nvbandwidth.git
> $ cd nvbandwidth
> $ sudo ./debian_install.sh
> $ sudo ./nvbandwidth
> 
> A test package is available in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf416039-test
> 
> If you install it and run the hmm selftests, they should no longer hang.
> 
> [Where problems can occur]
> 
> This changes some core mm code for device memory from standard pages to using
> folios, and carries some additional risk because of this.
> 
> If a regression were to occur, it would primarily affect users of devices with
> internal memory, such as graphics cards, and quite possibly high end network
> cards.
> 
> The largest userbase affected by this regression is NVIDIA users, so it would
> be a bad idea to release with the broken implementation. Instead, we should
> respin and release with the fixed implementation.
> 
> David Hildenbrand (1):
>    mm/migrate_device: don't add folio to be freed to LRU in
>      migrate_device_finalize()
> 
> Kefeng Wang (1):
>    mm: migrate_device: use more folio in migrate_device_finalize()
> 
> Matthew Ruffell (1):
>    UBUNTU: SAUCE: Revert "mm/migrate_device: don't add folio to be freed
>      to LRU in migrate_device_finalize()"
> 
>   mm/migrate_device.c | 37 ++++++++++++++++++++-----------------
>   1 file changed, 20 insertions(+), 17 deletions(-)
> 

Acked-by: Stefan Bader <stefan.bader at canonical.com>