[SRU][J:linux][PATCH 1/1] userfaultfd: fix checks for huge PMDs

Emil Renner Berthing emil.renner.berthing at canonical.com
Wed Jun 11 14:21:12 UTC 2025


Philip Cox wrote:
> From: Jann Horn <jannh at google.com>
>
> CVE-2024-46787
>
> Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
>
> The pmd_trans_huge() code in mfill_atomic() is wrong in three different
> ways depending on kernel version:
>
> 1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
>    the right two race windows) - I've tested this in a kernel build with
>    some extra mdelay() calls. See the commit message for a description
>    of the race scenario.
>    On older kernels (before 6.5), I think the same bug can even
>    theoretically lead to accessing transhuge page contents as a page table
>    if you hit the right 5 narrow race windows (I haven't tested this case).
> 2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
>    detecting PMDs that don't point to page tables.
>    On older kernels (before 6.5), you'd just have to win a single fairly
>    wide race to hit this.
>    I've tested this on 6.1 stable by racing migration (with a mdelay()
>    patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
>    VM, that causes a kernel oops in ptlock_ptr().
> 3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
>    to yank page tables out from under us (though I haven't tested that),
>    so I think the BUG_ON() checks in mfill_atomic() are just wrong.
>
> I decided to write two separate fixes for these (one fix for bugs 1+2, one
> fix for bug 3), so that the first fix can be backported to kernels
> affected by bugs 1+2.
>
> This patch (of 2):
>
> This fixes two issues.
>
> I discovered that the following race can occur:
>
>   mfill_atomic                other thread
>   ============                ============
>                               <zap PMD>
>   pmdp_get_lockless() [reads none pmd]
>   <bail if trans_huge>
>   <if none:>
>                               <pagefault creates transhuge zeropage>
>     __pte_alloc [no-op]
>                               <zap PMD>
>   <bail if pmd_trans_huge(*dst_pmd)>
>   BUG_ON(pmd_none(*dst_pmd))
>
> I have experimentally verified this in a kernel with extra mdelay() calls;
> the BUG_ON(pmd_none(*dst_pmd)) triggers.
>
> On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
> pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
> a BUG_ON(), since the page table access helpers are actually designed to
> deal with page tables concurrently disappearing; but on older kernels
> (<=6.4), I think we could probably theoretically race past the two
> BUG_ON() checks and end up treating a hugepage as a page table.
>
> The second issue is that, as Qi Zheng pointed out, there are other types
> of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
> (in particular, migration PMDs).
>
> On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
> PMD that contains a migration entry (which just requires winning a single,
> fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
> assumes that the PMD points to a page table.
>
> Breakage follows: First, the kernel tries to take the PTE lock (which will
> crash or maybe worse if there is no "struct page" for the address bits in
> the migration entry PMD - I think at least on X86 there usually is no
> corresponding "struct page" thanks to the PTE inversion mitigation, amd64
> looks different).
>
> If that didn't crash, the kernel would next try to write a PTE into what
> it wrongly thinks is a page table.
>
> As part of fixing these issues, get rid of the check for pmd_trans_huge()
> before __pte_alloc() - that's redundant, we're going to have to check for
> that after the __pte_alloc() anyway.
>
> Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.
>
> Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@google.com
> Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@google.com
> Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
> Signed-off-by: Jann Horn <jannh at google.com>
> Acked-by: David Hildenbrand <david at redhat.com>
> Cc: Andrea Arcangeli <aarcange at redhat.com>
> Cc: Hugh Dickins <hughd at google.com>
> Cc: Jann Horn <jannh at google.com>
> Cc: Pavel Emelyanov <xemul at virtuozzo.com>
> Cc: Qi Zheng <zhengqi.arch at bytedance.com>
> Cc: <stable at vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
> (backport from commit 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8)
> [philcox: replace calls to pmdp_get_lockless() with pmd_read_atomic()
>           in __mcopy_atomic()i due to function being renamed]

There is an extra "i" here, and I assume the function being patched was itself
also renamed, from __mcopy_atomic() to mfill_atomic().

/Emil

> Signed-off-by: Philip Cox <philip.cox at canonical.com>
> ---
>  mm/userfaultfd.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 98a9d0ef2d917..50b76c34dd20e 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -602,21 +602,23 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  		}
>
>  		dst_pmdval = pmd_read_atomic(dst_pmd);
> -		/*
> -		 * If the dst_pmd is mapped as THP don't
> -		 * override it and just be strict.
> -		 */
> -		if (unlikely(pmd_trans_huge(dst_pmdval))) {
> -			err = -EEXIST;
> -			break;
> -		}
>  		if (unlikely(pmd_none(dst_pmdval)) &&
>  		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
>  			err = -ENOMEM;
>  			break;
>  		}
> -		/* If an huge pmd materialized from under us fail */
> -		if (unlikely(pmd_trans_huge(*dst_pmd))) {
> +		dst_pmdval = pmd_read_atomic(dst_pmd);
> +		/*
> +		 * If the dst_pmd is THP don't override it and just be strict.
> +		 * (This includes the case where the PMD used to be THP and
> +		 * changed back to none after __pte_alloc().)
> +		 */
> +		if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
> +			     pmd_devmap(dst_pmdval))) {
> +			err = -EEXIST;
> +			break;
> +		}
> +		if (unlikely(pmd_bad(dst_pmdval))) {
>  			err = -EFAULT;
>  			break;
>  		}
> --
> 2.43.0
>
>
> --
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
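
For anyone following the diff above, the order of the checks after
__pte_alloc() can be modelled in userspace. This is only a rough sketch under
stated assumptions, not kernel code: enum pmd_kind and check_dst_pmd() are
hypothetical names invented for illustration, standing in for the real
pmd_present(), pmd_trans_huge(), pmd_devmap() and pmd_bad() helpers. Only the
ordering of the checks and the error codes (-EEXIST, -EFAULT) are taken from
the patch.

```c
#include <errno.h>

/* Hypothetical model of what a PMD entry may hold; not a kernel type. */
enum pmd_kind {
	PMD_KIND_NONE,       /* empty entry, e.g. zapped after __pte_alloc() */
	PMD_KIND_PAGE_TABLE, /* points to a regular page table */
	PMD_KIND_TRANS_HUGE, /* transparent huge page mapping */
	PMD_KIND_DEVMAP,     /* devmap huge mapping */
	PMD_KIND_MIGRATION,  /* non-present swap/migration entry */
	PMD_KIND_BAD,        /* present, but not a sane page table pointer */
};

/*
 * Mirrors the check order in the patched hunk: anything that is not
 * present, or that is a huge or devmap PMD, is rejected with -EEXIST
 * before pmd_bad() is consulted. Returns 0 only for a page table.
 */
static int check_dst_pmd(enum pmd_kind kind)
{
	/* Assumption of this model: none and migration entries are non-present. */
	int present = (kind != PMD_KIND_NONE && kind != PMD_KIND_MIGRATION);

	if (!present || kind == PMD_KIND_TRANS_HUGE || kind == PMD_KIND_DEVMAP)
		return -EEXIST;
	if (kind == PMD_KIND_BAD)
		return -EFAULT;
	return 0;
}
```

In this model the migration-entry case from the second issue and the
"changed back to none" case from the race diagram both end up on the
-EEXIST path, so neither would reach the page table lock.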


