[SRU Bionic 1/1] gup: document and work around "COW can break either way" issue

Wed Jan 13 10:54:06 UTC 2021

On Wed, Dec 16, 2020 at 10:25:57PM -0300, Thadeu Lima de Souza Cascardo wrote:
> From: Linus Torvalds <torvalds at linux-foundation.org>
> 
> Doing a "get_user_pages()" on a copy-on-write page for reading can be
> ambiguous: the page can be COW'ed at any time afterwards, and the
> direction of a COW event isn't defined.
> 
> Yes, whoever writes to it will generally do the COW, but if the thread
> that did the get_user_pages() unmapped the page before the write (and
> that could happen due to memory pressure in addition to any outright
> action), the writer could also just take over the old page instead.
> 
> End result: the get_user_pages() call might result in a page pointer
> that is no longer associated with the original VM, and is associated
> with - and controlled by - another VM having taken it over instead.
> 
> So when doing a get_user_pages() on a COW mapping, the only really safe
> thing to do would be to break the COW when getting the page, even when
> only getting it for reading.
> 
> At the same time, some users simply don't even care.
> 
> For example, the perf code wants to look up the page not because it
> cares about the page, but because the code simply wants to look up the
> physical address of the access for informational purposes, and doesn't
> really care about races when a page might be unmapped and remapped
> elsewhere.
> 
> This adds logic to force a COW event by setting FOLL_WRITE on any
> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> pointer as a result.
> 
> The current semantics end up being:
> 
>  - __get_user_pages_fast(): no change. If you don't ask for a write,
>    you won't break COW. You'd better know what you're doing.
> 
>  - get_user_pages_fast(): the fast-case "look it up in the page tables
>    without anything getting mmap_sem" now refuses to follow a read-only
>    page, since it might need COW breaking.  Which happens in the slow
>    path - the fast path doesn't know if the memory might be COW or not.
> 
>  - get_user_pages() (including the slow-path fallback for gup_fast()):
>    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
>    very similar semantics to FOLL_FORCE.
> 
> If it turns out that we want finer granularity (ie "only break COW when
> it might actually matter" - things like the zero page are special and
> don't need to be broken) we might need to push these semantics deeper
> into the lookup fault path.  So if people care enough, it's possible
> that we might end up adding a new internal FOLL_BREAK_COW flag to go
> with the internal FOLL_COW flag we already have for tracking "I had a
> COW".
> 
> Alternatively, if it turns out that different callers might want to
> explicitly control the forced COW break behavior, we might even want to
> make such a flag visible to the users of get_user_pages() instead of
> using the above default semantics.
> 
> But for now, this is mostly commentary on the issue (this commit message
> being a lot bigger than the patch, and that patch in turn is almost all
> comments), with that minimal "enable COW breaking early" logic using the
> existing FOLL_WRITE behavior.
> 
> [ It might be worth noting that we've always had this ambiguity, and it
>   could arguably be seen as a user-space issue.
> 
>   You only get private COW mappings that could break either way in
>   situations where user space is doing cooperative things (ie fork()
>   before an execve() etc), but it _is_ surprising and very subtle, and
>   fork() is supposed to give you independent address spaces.
> 
>   So let's treat this as a kernel issue and make the semantics of
>   get_user_pages() easier to understand. Note that obviously a true
>   shared mapping will still get a page that can change under us, so this
>   does _not_ mean that get_user_pages() somehow returns any "stable"
>   page ]
> 
> Reported-by: Jann Horn <jannh at google.com>
> Tested-by: Christoph Hellwig <hch at lst.de>
> Acked-by: Oleg Nesterov <oleg at redhat.com>
> Acked-by: Kirill Shutemov <kirill at shutemov.name>
> Acked-by: Jan Kara <jack at suse.cz>
> Cc: Andrea Arcangeli <aarcange at redhat.com>
> Cc: Matthew Wilcox <willy at infradead.org>
> Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
> (backported from commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f)
> [cascardo: fixed conflict due to missing commit ad415db817964e96df824e8bb1a861527f8012b6]
> [cascardo: fixed conflict due to missing commit b798bec4741bdd80224214fdd004c8e52698e425]
> [cascardo: fixed minor conflict on __get_user_pages_fast comment]
> [cascardo: fixup for absence of FOLL_PIN]
> [cascardo: use write=1 on s390's get_user_pages_fast too]
> CVE-2020-29374
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo at canonical.com>
> ---
>  arch/s390/mm/gup.c                      |  9 ++++-
>  drivers/gpu/drm/i915/i915_gem_userptr.c |  8 +++++
>  mm/gup.c                                | 44 +++++++++++++++++++++----
>  mm/huge_memory.c                        |  7 ++--
>  4 files changed, 57 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
> index 9bce54eac0b0..713cf80a740e 100644
> --- a/arch/s390/mm/gup.c
> +++ b/arch/s390/mm/gup.c
> @@ -283,9 +283,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  {
>  	int nr, ret;
>  
> +	/*
> +	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> +	 * because get_user_pages() may need to cause an early COW in
> +	 * order to avoid confusing the normal COW routines. So only
> +	 * targets that are already writable are safe to do by just
> +	 * looking at the page tables.
> +	 */
>  	might_sleep();
>  	start &= PAGE_MASK;
> -	nr = __get_user_pages_fast(start, nr_pages, write, pages);
> +	nr = __get_user_pages_fast(start, nr_pages, 1, pages);
>  	if (nr == nr_pages)
>  		return nr;
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index f1bd66b5f006..d87beef8ceb3 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -640,6 +640,14 @@ static int i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
>  				      GFP_KERNEL |
>  				      __GFP_NORETRY |
>  				      __GFP_NOWARN);
> +		/*
> +		 * Using __get_user_pages_fast() with a read-only
> +		 * access is questionable. A read-only page may be
> +		 * COW-broken, and then this might end up giving
> +		 * the wrong side of the COW..
> +		 *
> +		 * We may or may not care.
> +		 */
>  		if (pvec) /* defer to worker if malloc fails */
>  			pinned = __get_user_pages_fast(obj->userptr.ptr,
>  						       num_pages,
> diff --git a/mm/gup.c b/mm/gup.c
> index 12b9626b1a9e..ebf4a3482dee 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -61,13 +61,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  }
>  
>  /*
> - * FOLL_FORCE can write to even unwritable pte's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>  {
> -	return pte_write(pte) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> +	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
> +}
> +
> +/*
> + * A (separate) COW fault might break the page the other way and
> + * get_user_pages() would return the page from what is now the wrong
> + * VM. So we need to force a COW break at GUP time even for reads.
> + */
> +static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> +{
> +	return is_cow_mapping(vma->vm_flags) && (flags & (FOLL_GET));
>  }
>  
>  static struct page *follow_page_pte(struct vm_area_struct *vma,
> @@ -694,12 +703,18 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			if (!vma || check_vma_flags(vma, gup_flags))
>  				return i ? : -EFAULT;
>  			if (is_vm_hugetlb_page(vma)) {
> +				if (should_force_cow_break(vma, foll_flags))
> +					foll_flags |= FOLL_WRITE;
>  				i = follow_hugetlb_page(mm, vma, pages, vmas,
>  						&start, &nr_pages, i,
> -						gup_flags, nonblocking);
> +						foll_flags, nonblocking);
>  				continue;
>  			}
>  		}
> +
> +		if (should_force_cow_break(vma, foll_flags))
> +			foll_flags |= FOLL_WRITE;
> +
>  retry:
>  		/*
>  		 * If we have a pending SIGKILL, don't keep faulting pages and
> @@ -1796,6 +1811,10 @@ bool gup_fast_permitted(unsigned long start, int nr_pages, int write)
>  /*
>   * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
>   * the regular GUP. It will only return non-negative values.
> + *
> + * Careful, careful! COW breaking can go either way, so a non-write
> + * access can get ambiguous page results. If you call this function without
> + * 'write' set, you'd better be sure that you're ok with that ambiguity.
>   */
>  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			  struct page **pages)
> @@ -1823,6 +1842,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	 *
>  	 * We do not adopt an rcu_read_lock(.) here as we also want to
>  	 * block IPIs that come from THPs splitting.
> +	 *
> +	 * NOTE! We allow read-only gup_fast() here, but you'd better be
> +	 * careful about possible COW pages. You'll get _a_ COW page, but
> +	 * not necessarily the one you intended to get depending on what
> +	 * COW event happens after this. COW may break the page copy in a
> +	 * random direction.
>  	 */
>  
>  	if (gup_fast_permitted(start, nr_pages, write)) {
> @@ -1868,9 +1893,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  					(void __user *)start, len)))
>  		return -EFAULT;
>  
> +	/*
> +	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> +	 * because get_user_pages() may need to cause an early COW in
> +	 * order to avoid confusing the normal COW routines. So only
> +	 * targets that are already writable are safe to do by just
> +	 * looking at the page tables.
> +	 */
>  	if (gup_fast_permitted(start, nr_pages, write)) {
>  		local_irq_disable();
> -		gup_pgd_range(addr, end, write, pages, &nr);
> +		gup_pgd_range(addr, end, 1, pages, &nr);
>  		local_irq_enable();
>  		ret = nr;
>  	}
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b0a6ba0833e4..9f8fbb504d15 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1425,13 +1425,12 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  }
>  
>  /*
> - * FOLL_FORCE can write to even unwritable pmd's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
>  {
> -	return pmd_write(pmd) ||
> -	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
> +	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));

The comment above this function mentions FOLL_FORCE, but I see it
removed here. Is that removal intentional?

William Breathitt Gray

>  }
>  
>  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> -- 
> 2.27.0
> 
> 
> -- 
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <https://lists.ubuntu.com/archives/kernel-team/attachments/20210113/c7c87eac/attachment.sig>